
10x Genomics
Chromium De Novo Assembly

Supernova 2.0 performance

The Supernova 2.0 release includes a number of significant changes to the code, with corresponding changes in performance. We compared the performance of versions 1.2 and 2.0 with respect to 20 different datasets, including ones that had been run previously, customer datasets, and novel datasets created just for this purpose. Most of these, along with their assemblies, are available for download.

Synopsis of Supernova's current performance


Samples

We have tested Supernova on a wide variety of samples, from controls to wild-caught specimens. For each sample, we created a single Chromium Linked-Read library, which we sequenced and then assembled using both Supernova 1.2 and 2.0, without any tuning or specification of parameters, except for varying the number of input reads in a few cases (see below).

# | Sample | Description | Material | DNA Prep | Notes
1 | hgp | Human Genome project, male [1] | blood | MagAttract | control
2 | chm | equimolar mix of CHM1/CHM13 | cell line | MagAttract | control
3 | wfu | NA12878, European, female | cell line | MagAttract | control
4 | chi | HG00512, Chinese, male | cell line | MagAttract |
5 | yor | NA19240, Yoruba, female | cell line | MagAttract |
6 | yorm | NA19238, Yoruba, female | cell line | MagAttract |
7 | ash | NA24385, Ashkenazi, male | cell line | MagAttract |
8 | pr | HG00733, Puerto Rican, female | cell line | MagAttract |
9 | hummer | hummingbird [2] | tissue | KingFisher |
10 | fish | zebrafish SAT from ZIRC | tissue | Amplicon Express | control
11 | ruby | dog named Ruby | blood | MagAttract |
12 | grape | flame seedless grape [3] | leaves | grape protocol |
13 | maize | maize B73 | leaves | Amplicon Express | control
14 | chili | chili pepper [4] | leaves | modified CTAB |
15 | fly | fruit fly iso-1 x Canton-S [5] | one insect | salting out | control
16 | omoth | one moth collected in Pleasanton | one insect | salting out |
17 | pmoth | second moth collected in Pleasanton | one insect | salting out |
18 | cater | caterpillar collected in Pleasanton | one insect | salting out |
19 | aphid | aphid collected in Pleasanton | one insect | salting out |
20 | aedes | Aedes aegypti F1 ref cross [6] | one insect | salting out | control

1. Anonymous donor
2. Erich Jarvis, HHMI, bioRxiv
3. Doreen Ware, CSHL
4. Allen van Deynze, UCD, Hort Res 5
5. Bloomington Stock Center
6. Ben Matthews, Rockefeller Institute

We generally chose the number of input reads from the estimated genome size, so as to yield about 56x coverage. Where we did not know the genome size, a preliminary run with a guessed size was sufficient to obtain an estimate from Supernova. In a few cases, it was advantageous to raise the coverage above 56x; the actual number of reads used is available.
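The read-count arithmetic described above can be sketched as follows. This is a minimal illustration, not part of Supernova: the function names are hypothetical, and the 150-base read length is an assumption that should be replaced with the actual length for your sequencing run. Coverage is taken as total raw bases divided by the estimated genome size.

```python
def reads_for_coverage(genome_size_mb: float,
                       target_cov: float = 56.0,
                       read_len_bp: int = 150) -> int:
    """Estimate how many reads give roughly `target_cov` raw coverage.

    read_len_bp=150 is an assumed read length, not a value from this page.
    """
    total_bases = target_cov * genome_size_mb * 1e6
    return round(total_bases / read_len_bp)


def raw_coverage(n_reads: int, genome_size_mb: float,
                 read_len_bp: int = 150) -> float:
    """Total bases in all reads divided by the estimated genome size."""
    return n_reads * read_len_bp / (genome_size_mb * 1e6)


# Example: a hypothetical 3000 Mb genome at the default 56x target.
n = reads_for_coverage(3000)
print(n, raw_coverage(n, 3000))
```

If the genome size is only a guess, the same calculation can be repeated once a preliminary run has produced an estimate.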

Genome and data characteristics

Supernova calculates various metrics on different aspects of the input data and the genome it represents, such as genome size, repetitivity, heterozygosity, and molecule length. For these datasets, they vary widely, as shown below; brief descriptions of the metrics follow the table.

Sample gsize %rep het %hat mol_len p10 seq raw_cov
hgp 3274 8.1 1.42 0.09 139 234 X 55.0
chm 3212 6.5 1.30 0.10 79 139 X 56.0
wfu 3391 8.1 1.38 0.09 95 146 X 53.1
chi 3247 8.1 1.61 0.11 103 125 X 55.4
yor 3156 6.4 1.03 0.10 122 156 X 57.0
yorm 3288 7.4 1.13 0.10 119 132 X 54.7
ash 3124 7.2 1.39 0.11 119 140 X 57.6
pr 3399 8.5 1.44 0.11 103 146 X 53.0
hummer 1102 4.2 0.36 0.06 66 230 2500 61.2
fish 1680 12.6 0.30 0.47 89 93 Nova 54.3
ruby 2407 4.5 0.86 0.22 81 180 2500 54.0
grape 602 20.5 0.21 1.03 74 247 2500 47.0
maize 2219 35.8 7.01 0.03 81 175 X 64.0
chili 3215 6.7 0.29 0.25 45 88 X 61.1
fly 143 8.3 0.23 0.12 68 455 Nova 68.8
omoth 199 7.7 0.08 0.51 20 34 Nova 56.4
pmoth 330 6.0 0.17 0.20 22 40 Nova 72.8
cater 458 13.0 0.16 0.08 20 18 Nova 72.1
aphid 512 15.7 0.41 0.99 30 78 Nova 57.1
aedes 1323 17.6 0.41 0.04 70 69 Nova 62.9

Sample
A nickname for the sample, used in these charts.
gsize (est_genome_size)
The estimated genome size, in megabases (Mb).
%rep (repfrac)
Repeat Content Index (%): the percent of read kmers having depth ≥ twice the expected depth.
het (hetdist)
The estimated mean separation between heterozygous sites, in kilobases (kb).
%hat (high_AT_index)
High AT Index (%): the percent of kmers in the reads having ≥ 90% AT content. Locally extreme AT content is correlated with assembly gaps.
mol_len (lw_mean_mol_len)
The length-weighted mean molecule length, in kilobases (kb).
p10 (p10)
For an average point on the genome, the estimated number of molecules that extend 10 kb in both directions from that point, counting both alleles.
seq (likely_sequencers)
Abbreviated name of the sequencing instrument used.
raw_cov (raw_coverage)
Total number of bases in all of the sequence reads, before trimming off barcodes, divided by the estimated genome size.
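Two of the definitions above can be illustrated with a short sketch. It is not Supernova's implementation: the function names are hypothetical, the kmer size `k` is an illustrative parameter rather than the value Supernova uses, and the length-weighted mean is assumed to follow the usual convention in which each molecule is weighted by its own length.

```python
def high_at_index(reads, k, min_at_percent=90):
    """Percent of kmers in the reads whose AT content is >= min_at_percent.

    k is an illustrative kmer size; the value Supernova uses is not assumed.
    """
    total = 0
    high = 0
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            at_count = sum(1 for base in kmer if base in "AT")
            total += 1
            # Integer comparison avoids floating-point edge cases.
            if at_count * 100 >= min_at_percent * k:
                high += 1
    return 100.0 * high / total if total else 0.0


def length_weighted_mean(lengths):
    """Mean length where each molecule is weighted by its own length."""
    total = sum(lengths)
    return sum(x * x for x in lengths) / total if total else 0.0


print(high_at_index(["AAAAAAAAAA", "ACGTACGTAC"], k=5))
print(length_weighted_mean([10, 30]))
```

Weighting by length means that long molecules dominate the mean, which is why mol_len can sit well above the plain average molecule length.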

Contigs, phase blocks, and scaffolds are all longer

The following table adds N50 contig, phase block, and scaffold sizes. (In the metrics description table, these metrics are contig_N50, phase_block_N50 and scaffold_N50, respectively.) These all show a notable improvement in almost every case. For example, for the hgp sample, the N50 contig size rose from 120.9 to 162.0 kb, the N50 phase block size rose from 4.30 to 5.83 Mb, and the N50 scaffold size rose from 17.18 to 45.60 Mb, a greater than two-fold improvement.

(N50 sizes: contig in kb; phase block and scaffold in Mb)
Sample gsize %rep het %hat mol_len p10 seq raw_cov 1.2_contig 2.0_contig 1.2_phase 2.0_phase 1.2_scaff 2.0_scaff
hgp 3274 8.1 1.42 0.09 139 234 X 55.0 120.9 162.0 4.30 5.83 17.18 45.60
chm 3212 6.5 1.30 0.10 79 139 X 56.0 116.4 175.2 2.65 3.21 14.78 39.53
wfu 3391 8.1 1.38 0.09 95 146 X 53.1 120.1 165.5 2.79 3.15 18.31 39.92
chi 3247 8.1 1.61 0.11 103 125 X 55.4 113.7 156.1 2.60 3.12 15.51 38.17
yor 3156 6.4 1.03 0.10 122 156 X 57.0 119.2 167.4 9.76 14.15 15.23 47.78
yorm 3288 7.4 1.13 0.10 119 132 X 54.7 113.4 159.0 8.68 12.55 19.42 49.47
ash 3124 7.2 1.39 0.11 119 140 X 57.6 106.1 153.8 4.02 5.26 16.71 36.11
pr 3399 8.5 1.44 0.11 103 146 X 53.0 122.3 169.0 3.29 3.96 18.16 46.30
hummer 1102 4.2 0.36 0.06 66 230 2500 61.2 100.5 175.0 11.38 17.48 12.42 31.86
fish 1680 12.6 0.30 0.47 89 93 Nova 54.3 17.1 20.5 0.17 1.70 0.68 4.04
ruby 2407 4.5 0.86 0.22 81 180 2500 54.0 77.5 100.4 2.91 3.69 13.05 36.24
grape 602 20.5 0.21 1.03 74 247 2500 47.0 38.3 55.7 0.48 1.70 0.58 2.29
maize 2219 35.8 7.01 0.03 81 175 X 64.0 20.9 31.0 0.04 0.04 0.27 1.78
chili 3215 6.7 0.29 0.25 45 88 X 61.1 105.7 167.2 1.72 3.91 3.09 13.60
fly 143 8.3 0.23 0.12 68 455 Nova 68.8 113.7 166.5 5.00 13.68 9.12 20.49
omoth 199 7.7 0.08 0.51 20 34 Nova 56.4 37.8 63.2 0.23 0.68 0.24 0.69
pmoth 330 6.0 0.17 0.20 22 40 Nova 72.8 63.7 107.8 0.97 2.23 1.71 6.68
cater 458 13.0 0.16 0.08 20 18 Nova 72.1 21.7 32.8 0.07 0.17 0.06 0.06
aphid 512 15.7 0.41 0.99 30 78 Nova 57.1 75.4 104.7 0.98 4.48 1.04 5.00
aedes 1323 17.6 0.41 0.04 70 69 Nova 62.9 20.3 29.7 0.09 0.35 0.07 0.15
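For readers unfamiliar with the N50 statistic used in the table above, the following is a minimal sketch under the common definition: the largest length L such that sequences of length at least L account for at least half of the total. The function name is hypothetical, and Supernova's exact conventions (for example, minimum length cutoffs) are not assumed. The last lines check the hgp scaffold improvement quoted in the text.

```python
def n50(lengths):
    """Largest length L such that sequences of length >= L cover
    at least half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


print(n50([2, 3, 4, 5, 6]))

# Fold improvement in hgp scaffold N50 (Mb), using the table values.
fold = 45.60 / 17.18
print(round(fold, 2))
```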

Assembly accuracy and organization are also improved

The following table adds assembly accuracy and organization measures. Because the perfect stretch and misassembly estimate rely on a reference sequence, these two metrics are not generally available; we have calculated them here where we can. All three metrics are described below the table.

Sample gsize %rep het %hat mol_len p10 seq raw_cov 1.2_perf 2.0_perf 1.2_mis 2.0_mis 1.2_m10 2.0_m10
hgp 3274 8.1 1.42 0.09 139 234 X 55.0 22.77 26.79 1.13 0.44 2.41 1.89
chm 3212 6.5 1.30 0.10 79 139 X 56.0 . . 0.32 0.12 2.05 1.57
wfu 3391 8.1 1.38 0.09 95 146 X 53.1 20.04 21.74 1.02 0.69 2.08 1.59
chi 3247 8.1 1.61 0.11 103 125 X 55.4 . . 0.75 0.42 2.38 1.93
yor 3156 6.4 1.03 0.10 122 156 X 57.0 . . 0.41 0.26 2.18 1.73
yorm 3288 7.4 1.13 0.10 119 132 X 54.7 . . 0.38 0.18 2.51 1.99
ash 3124 7.2 1.39 0.11 119 140 X 57.6 . . 0.62 0.35 2.46 1.90
pr 3399 8.5 1.44 0.11 103 146 X 53.0 . . 0.46 0.24 2.24 1.84
hummer 1102 4.2 0.36 0.06 66 230 2500 61.2 . . . . 6.02 5.40
fish 1680 12.6 0.30 0.47 89 93 Nova 54.3 . . . . 31.62 25.18
ruby 2407 4.5 0.86 0.22 81 180 2500 54.0 . . . . 2.89 2.11
grape 602 20.5 0.21 1.03 74 247 2500 47.0 . . . . 26.67 15.26
maize 2219 35.8 7.01 0.03 81 175 X 64.0 15.82 30.55 2.14 1.40 26.38 9.85
chili 3215 6.7 0.29 0.25 45 88 X 61.1 . . . . 6.79 4.48
fly 143 8.3 0.23 0.12 68 455 Nova 68.8 29.27 37.10 0.62 0.09 7.06 5.66
omoth 199 7.7 0.08 0.51 20 34 Nova 56.4 . . . . 26.39 14.16
pmoth 330 6.0 0.17 0.20 22 40 Nova 72.8 . . . . 6.88 3.29
cater 458 13.0 0.16 0.08 20 18 Nova 72.1 . . . . 36.93 20.14
aphid 512 15.7 0.41 0.99 30 78 Nova 57.1 . . . . 12.28 6.92
aedes 1323 17.6 0.41 0.04 70 69 Nova 62.9 10.30 13.22 3.76 3.32 44.41 22.27

perf
This column provides the N50 perfect stretch, in kb. It can be computed, and is shown, only for samples that have a reference sequence derived from the same sample. It measures the N50 size of sequences in the reference that are perfectly mirrored in the assembly. Such ‘perfect stretches’ are terminated by either errors or gaps. In this context, transitioning from one allele to the other counts as an error. The N50 perfect stretch increased for all samples where it could be measured.
mis
This column shows the percent of the assembly that is misassembled. This can only be computed for assemblies having a reference sequence. As an example of the accounting, if a scaffold connects 5 Mb of one chromosome to 10 Mb of another chromosome, that counts as a 5 Mb error, which gets converted into a fraction by dividing by the assembly size. Errors of order and orientation are also included. The misassembly rate declined for all assemblies for which it could be measured.
m10 (m10)
This metric estimates the percent of genomic kmers that are either missing from the assembly entirely or present only in scaffolds shorter than 10 kb. Each kmer counts once regardless of its multiplicity in the genome, so this measure discounts repeats. It measures assembly disorganization, which decreased in all cases and markedly in some.
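The misassembly accounting described under mis can be sketched as below. This is an illustration only, not the method used to produce the table: the function names are hypothetical, and the rule for scaffolds spanning more than two chromosomes (everything outside the best-represented chromosome counts as error) is an assumption that generalizes the two-chromosome example in the text. Errors of order and orientation are not modeled here.

```python
from collections import defaultdict


def misjoin_error_mb(segments):
    """Error for one scaffold, given (chromosome, length_mb) segments.

    Assumption: bases outside the best-represented chromosome are errors.
    """
    per_chrom = defaultdict(float)
    for chrom, length_mb in segments:
        per_chrom[chrom] += length_mb
    if not per_chrom:
        return 0.0
    return sum(per_chrom.values()) - max(per_chrom.values())


def misassembly_percent(scaffolds, assembly_size_mb):
    """Total misjoin error as a percent of the assembly size."""
    total_error = sum(misjoin_error_mb(s) for s in scaffolds)
    return 100.0 * total_error / assembly_size_mb


# The example from the text: 5 Mb of one chromosome joined to 10 Mb of another.
scaffold = [("chrA", 5.0), ("chrB", 10.0)]
print(misjoin_error_mb(scaffold))
# 3000 Mb is a hypothetical assembly size used only for illustration.
print(misassembly_percent([scaffold], assembly_size_mb=3000.0))
```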

Computational performance of Supernova 2.0

The following table adds computational performance statistics. As shown, all but two assemblies were run on 256 GB servers. Memory use for Supernova has increased by about 10% since version 1.2, and run times have increased by 60% on average. However, because of targeted optimizations, the likelihood of the extreme run times experienced by some users of 1.2 should now be much lower.

Sample gsize %rep het %hat mol_len p10 seq raw_cov mem days
hgp 3274 8.1 1.42 0.09 139 234 X 55.0 256 3.2
chm 3212 6.5 1.30 0.10 79 139 X 56.0 256 2.8
wfu 3391 8.1 1.38 0.09 95 146 X 53.1 256 3.4
chi 3247 8.1 1.61 0.11 103 125 X 55.4 256 3.3
yor 3156 6.4 1.03 0.10 122 156 X 57.0 256 2.9
yorm 3288 7.4 1.13 0.10 119 132 X 54.7 256 3.1
ash 3124 7.2 1.39 0.11 119 140 X 57.6 256 3.2
pr 3399 8.5 1.44 0.11 103 146 X 53.0 256 3.4
hummer 1102 4.2 0.36 0.06 66 230 2500 61.2 256 1.1
fish 1680 12.6 0.30 0.47 89 93 Nova 54.3 256 2.3
ruby 2407 4.5 0.86 0.22 81 180 2500 54.0 256 1.9
grape 602 20.5 0.21 1.03 74 247 2500 47.0 256 0.7
maize 2219 35.8 7.01 0.03 81 175 X 64.0 512 7.9
chili 3215 6.7 0.29 0.25 45 88 X 61.1 512 3.4
fly 143 8.3 0.23 0.12 68 455 Nova 68.8 256 0.1
omoth 199 7.7 0.08 0.51 20 34 Nova 56.4 256 0.2
pmoth 330 6.0 0.17 0.20 22 40 Nova 72.8 256 0.3
cater 458 13.0 0.16 0.08 20 18 Nova 72.1 256 0.8
aphid 512 15.7 0.41 0.99 30 78 Nova 57.1 256 0.5
aedes 1323 17.6 0.41 0.04 70 69 Nova 62.9 256 2.4

mem (mem_peak)
The amount of memory (RAM) in GB on the server used for the assembly.
days (etime_h)
The total number of days elapsed during the assembly.

All assemblies were carried out at 10x Genomics on 28-core servers with “Intel Xeon CPU E5-2697 v3 @ 2.6GHz” processors.

All assemblies were run twice to confirm exact reproducibility of results.