10x Genomics
Chromium De Novo Assembly
Supernova 2.0, printed on 11/12/2024
Assembly Statistics
Upon successful completion of a Supernova pipeline, a number of useful
statistics about the input data and the assembly are logged in
outs/summary.csv and in the similar but more complete
outs/assembly/stats/summary.json.
We define below many of the statistics contained there. Please also see
this document on molecule length statistics.
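As a minimal sketch of how these metrics might be read programmatically, the Python below loads summary.json and pulls out a few values. It assumes that the top level of the file is a JSON object keyed by metric name, which this page does not state; the helper names `load_summary` and `pick_metrics` are hypothetical.

```python
import json


def load_summary(path):
    """Load a Supernova summary.json file into a dict.

    Assumption: the top level of the file is a JSON object whose keys
    are metric names. Adjust if the real layout differs.
    """
    with open(path) as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


def pick_metrics(summary, names):
    """Return the requested metrics; metrics that are absent map to None."""
    return {name: summary.get(name) for name in names}


# Usage (path taken from the text above):
# summary = load_summary("outs/assembly/stats/summary.json")
# print(pick_metrics(summary, ["est_genome_size", "scaffold_N50"]))
```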
In cases where metrics refer to kmers without specifying the size, the value of k is 48.
Some of the metrics refer to the base graph. This is the directed graph
created initially by Supernova, whose edges represent unbranched paths in a
de Bruijn graph of k=48 kmers. It is also known as a unipath graph.
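To illustrate what "unbranched paths in a de Bruijn graph" means, here is a toy Python sketch that builds a de Bruijn graph from a few sequences and collapses each maximal unbranched path into a single edge sequence. It is not Supernova's implementation: it uses a small k for readability (Supernova uses k=48), ignores reverse complements and read errors, and skips isolated cycles. The function name `unipaths` is hypothetical.

```python
def unipaths(reads, k):
    """Return the sequences of maximal unbranched paths in a toy de Bruijn graph.

    Nodes are (k-1)-mers and each distinct kmer is an edge from its prefix
    to its suffix. Simplifications (assumptions of this sketch): forward
    strand only, no error correction, isolated cycles are not reported.
    """
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    out_edges = {}  # node -> list of successor nodes
    in_degree = {}  # node -> number of incoming edges
    for kmer in sorted(kmers):
        out_edges.setdefault(kmer[:-1], []).append(kmer[1:])
        in_degree[kmer[1:]] = in_degree.get(kmer[1:], 0) + 1

    def is_unbranched(node):
        # Exactly one way in and one way out: the path passes straight through.
        return in_degree.get(node, 0) == 1 and len(out_edges.get(node, [])) == 1

    paths = []
    for node in sorted(set(out_edges) | set(in_degree)):
        if is_unbranched(node):
            continue  # paths start only at branch points, sources, or sinks
        for nxt in out_edges.get(node, []):
            seq = node + nxt[-1]
            while is_unbranched(nxt):
                nxt = out_edges[nxt][0]
                seq += nxt[-1]
            paths.append(seq)
    return paths


# Two sequences that share a prefix and then diverge (k=3 for readability).
print(unipaths(["ATGCG", "ATGCA"], 3))  # ['ATGC', 'GCA', 'GCG']
```

In this example the shared prefix collapses into one edge (ATGC), and the graph branches at node GC into two short edges, one per variant.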
Information provided on the Supernova command line
Name | Description |
sample_id | Identifier of the sample. |
bcfrac | Fraction of barcodes in input reads to use. The bcfrac option is deprecated, so normally this will be 1. |
Metrics about the genome, computed from the data
Name | Description |
est_genome_size | Estimated genome size in bases, computed from the distribution of kmers. For tested control samples, genome size estimates appear to be accurate to within about 10%. However, it is theoretically possible that estimates could be 'way off', and we would like to see these cases. Single copy sex chromosomes are undercounted by a factor of two. This statistic could be confounded by microbiome sequences and contamination. |
repfrac | Genome repetitivity index: percent of read kmers, counted with multiplicity, whose depth exceeds twice the expected depth. Intended as an index of repetitivity, rather than a measure of ‘which fraction of the genome is repetitive’. This statistic could be confounded by microbiome sequences and contamination. |
hetdist | Mean distance in bases between heterozygous sites. May be overestimated in cases where alleles are so different that they assemble completely separately. |
high_AT_index | High AT index: predicted percent of kmers in genome that are ≥ 90% AT. Downbiased by presence in data: the probability that a true high AT kmer will be present in the data is less than that of an average true kmer. |
ploidy_histogram | Ploidy Histogram: For each base graph edge of length 1000-2000 kmers, we estimate its ploidy, meaning the number of times that the sequence defined by the edge appears exactly in the genome, with homologous copies counted separately. We make our estimate based on depth of read coverage, and normalize it to put a peak at 2.0, corresponding to the assumption that the genome is diploid. Thus, though the true ploidies are normally integers, the estimates are floating point numbers. We round them to one digit after the decimal point, then count the number of edges for the ploidy values 0.0, 0.1, …, 6.0. These are stored in a vector that we call the ploidy histogram. Note: Sometimes there will also be a visible peak at 1.0, typically arising from highly heterozygous regions or single copy sex chromosomes. This peak could be smaller or larger than the peak at 2.0. Presence of other peaks could be a sign either of peculiar input data or defects in the normalization algorithm. The ploidy data is used by Supernova to estimate the genome size and mediates joining so as to prevent misassembly. |
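The rounding and counting step of the ploidy histogram can be sketched in Python as follows. This is an illustration only: the normalization that Supernova uses to place the peak at 2.0 is not described here, so the sketch assumes a hypothetical `diploid_depth` parameter (the read depth corresponding to ploidy two) is already known, and it assumes that estimates above 6.0 are simply not counted.

```python
def ploidy_histogram(edge_depths, diploid_depth):
    """Count edges per rounded ploidy value 0.0, 0.1, ..., 6.0 (61 bins).

    edge_depths: read depth for each base graph edge under consideration.
    diploid_depth: depth assumed to correspond to ploidy 2.0 (hypothetical
    input; Supernova's own normalization is not reproduced here).
    """
    histogram = [0] * 61
    for depth in edge_depths:
        ploidy = round(2.0 * depth / diploid_depth, 1)  # one decimal digit
        index = int(round(ploidy * 10))                 # 0.0 -> 0, 6.0 -> 60
        if 0 <= index <= 60:  # assumption: out-of-range estimates are dropped
            histogram[index] += 1
    return histogram


# Illustrative depths: two edges at the diploid depth, one at half of it.
hist = ploidy_histogram([30.0, 30.0, 15.0, 31.0], diploid_depth=30.0)
print(hist[20], hist[10], hist[21])  # counts at ploidy 2.0, 1.0, and 2.1
```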
Non-barcode metrics about the data, computed from the data
Name | Description |
likely_sequencers | Illumina instrument model or models, inferred from flowcell id(s), with some uncertainty. |
nreads | Number of reads provided as input, after downsampling if requested. |
raw_coverage | Raw coverage. Total bases in all reads, before trimming off barcode sequences, divided by the estimated genome size. |
effective_coverage | Estimated effective coverage. This is the estimated deduplicated coverage of an average base on the genome, counting both alleles. The reported value is the mean for base graph edges of length 1000-2000 kmers that appear to have ploidy two (see ploidy_histogram). Changed in Supernova 2.0. |
effective_coverage_median | Estimated effective coverage, median definition. In Supernova 1.2, this was the effective coverage metric. It is included here for backward compatibility, and may be deleted in subsequent versions of Supernova. |
bases_per_read | Mean read length after removing the first 23 bases from the beginning of read one of each pair (the 16-base 10x barcode plus 7 additional bases). |
dup_perc | Percentage of read pairs that are called duplicates. Two read pairs are declared duplicates of each other if the placements of their first reads on the initial (K=48, de Bruijn) assembly graphs are identical, and the first 5 bases of their second reads are the same. (On this basis, read pairs form naturally into 'duplicate groups'.) Because the barcode is not considered in this comparison, read pairs having different barcodes may be called duplicates. Thus, duplication could be overestimated, especially in genomes with high repeat content. See also interdup_perc. |
interdup_perc | Of reads declared duplicates (see dup_perc), the percentage that occur in duplicate groups comprising more than one distinct barcode. |
median_ins_sz | Estimated median insert size of the library, as determined by read positions on the assembly graph. |
placed_frac | Fraction of reads placed uniquely on the final (phased) assembly. |
proper_pairs_perc | Of read pairs for which both reads are placed on the assembly, inferred percentage for which the reads have the correct orientation and separation. |
q30_r2_perc | Percentage of bases assigned quality ≥ 30 on read two. |
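The duplicate-group definition behind dup_perc and interdup_perc can be sketched in Python as below. This is a sketch under stated assumptions, not Supernova's code: the placement of a first read is represented by an arbitrary hashable value (here a hypothetical string), and in each duplicate group one read pair is treated as the original while the remaining pairs are counted as duplicates, a counting convention the table does not specify.

```python
from collections import defaultdict


def duplicate_metrics(read_pairs):
    """Return (dup_perc, interdup_perc) for an iterable of read pairs.

    Each read pair is a tuple (read1_placement, read2_sequence, barcode).
    Pairs are grouped by identical read-one placement plus identical
    first 5 bases of read two; the barcode is ignored when grouping.
    """
    groups = defaultdict(list)
    for placement, read2, barcode in read_pairs:
        groups[(placement, read2[:5])].append(barcode)

    total = sum(len(barcodes) for barcodes in groups.values())
    duplicates = 0      # pairs called duplicates
    multi_barcode = 0   # duplicates in groups with more than one barcode
    for barcodes in groups.values():
        extra = len(barcodes) - 1  # assumption: one pair per group is kept
        duplicates += extra
        if len(set(barcodes)) > 1:
            multi_barcode += extra

    dup_perc = 100.0 * duplicates / total if total else 0.0
    interdup_perc = 100.0 * multi_barcode / duplicates if duplicates else 0.0
    return dup_perc, interdup_perc


# Illustrative input; placements and barcodes are made-up labels.
pairs = [
    ("edge1:100", "ACGTAAA", "bc1"),
    ("edge1:100", "ACGTATT", "bc2"),  # same group as above, different barcode
    ("edge1:100", "TTTTTAA", "bc1"),
    ("edge2:5", "ACGTAGG", "bc3"),
    ("edge2:5", "ACGTACC", "bc3"),    # same group as above, same barcode
    ("edge3:9", "GGGGGCC", "bc1"),
]
print(duplicate_metrics(pairs))
```

The first two pairs show why duplication can be overestimated: they come from different barcodes, and so probably from different molecules, yet they fall into the same duplicate group.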
Barcode metrics about the data, computed from the data
Please see this document for more details on the lw_mean_mol_len and bridge metrics.
Name | Description |
lw_mean_mol_len | Estimated length-weighted mean of molecule lengths, in bases, inferred from data. |
p10 | For an average point on the genome, the estimated number of molecules that extend 10 kb in both directions from that point, counting both alleles. |
rpb_N50 | N50 number of reads per 10x barcode. |
valid_bc_perc | Percent of reads assigned a valid 10x barcode. |
bridge | Mean number of barcodes whose reads contain two genomic 48-mers separated by one of several fixed distances. This is a vector. |
bridge_50 | Mean number of barcodes whose reads contain two genomic 48-mers separated by 50 kb. |
bridge_1_50 | The ratio bridge_1 / bridge_50. |
Output metrics
Note that assembly sizes and scaffold lengths do not include Ns.
Name | Description |
assembly_size | Size of assembly in bases, counting only one allele, excluding scaffolds < 10 kb. |
edge_N50 | N50 size in bases of raw graph assembly edges. |
contig_N50 | N50 size of contigs in bases, excluding scaffolds < 10 kb. |
phase_block_N50 | N50 size of phase blocks in bases, excluding scaffolds < 10 kb. |
scaffold_N50 | N50 size of scaffolds in bases, excluding scaffolds < 10 kb. |
scaffolds_1kb_plus | Number of scaffolds that are at least 1 kb long. |
scaffolds_10kb_plus | Number of scaffolds that are at least 10 kb long. |
m10 | The estimated percent of genomic kmers that are either missing from the assembly entirely or present only in scaffolds shorter than 10 kb. Each kmer counts once regardless of its multiplicity in the genome and thus this measure discounts repeats. It measures assembly disorganization. How it is computed: the m10 statistic is computed as the percent of base graph kmers in edges ≥ 100 bases that are missing from scaffolds > 10 kb in the final assembly. This does not include kmers that are completely missing from the data, although that fraction is expected to be very small for genomes having typical overall GC content. The statistic could include some noise. |
checksum | Assembly checksum. Used to confirm deterministic behavior. |
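For readers unfamiliar with N50, the Python sketch below computes scaffold_N50 and the two scaffold counts from a list of scaffold sequences. It uses the standard N50 definition (the length L such that sequences of length ≥ L contain at least half of the total bases) and assumes that the 10 kb and 1 kb thresholds are applied to lengths with Ns removed; the text above says only that lengths do not include Ns, so that detail is an assumption. The function names are hypothetical, and this is not how Supernova itself computes the values.

```python
def n50(lengths):
    """Standard N50: largest L such that items of length >= L hold >= half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def scaffold_metrics(scaffolds, min_len=10_000):
    """Compute a few output metrics from scaffold sequences (strings).

    Assumption: thresholds are applied to the N-free length of each scaffold.
    """
    lengths = [len(s) - s.upper().count("N") for s in scaffolds]
    kept = [n for n in lengths if n >= min_len]  # excludes scaffolds < 10 kb
    return {
        "scaffold_N50": n50(kept),
        "scaffolds_1kb_plus": sum(n >= 1_000 for n in lengths),
        "scaffolds_10kb_plus": sum(n >= 10_000 for n in lengths),
    }


# Synthetic scaffolds of 40 kb, 30 kb, 20 kb (plus a 100-N gap), 5 kb, and 500 bases.
example = ["A" * 40_000, "C" * 30_000, "G" * 20_000 + "N" * 100, "T" * 5_000, "A" * 500]
print(scaffold_metrics(example))
```

In the example, the three scaffolds of at least 10 kb total 90 kb; the 40 kb scaffold alone covers less than half of that, and adding the 30 kb scaffold crosses the halfway point, so the N50 is 30 kb.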
Computational performance metrics
Name | Description |
mem_peak | Peak memory in GB: the maximum amount of memory used at any point by Supernova, as reported by the operating system. Because some Supernova stages base their memory usage on the total amount that is available, this statistic is not necessarily meaningful. |
etime_h | Wall clock time in hours for Supernova run. |
Auxiliary statistics
In addition to the metrics contained in the outs/summary.csv file,
the outs/assembly/stats/ folder contains more fine-grained information about the input data and the assembly as discussed below.
File | Contents |
histogram_reads_per_barcode.json | Histogram of the number of reads that share a common 10x barcode (bin size = 10). |
histogram_kmer_count.json | Histogram of the frequency of kmers amongst the reads, after removing potentially erroneous kmers based on quality scores, low multiplicity, or occurrence in only one barcode (histogram uses bin size = 1). |
kmer_spectrum.pdf | Plot of the histogram in histogram_kmer_count.json, truncated to kmer frequencies in the range 0 - 100. |
mol_length_dist.pdf | Plot of the predicted abundance of molecule lengths. See Molecule Length Calculation. |
histogram_edge.json | Histogram of assembly graph edge lengths (in 1 kb bins). |
histogram_contig.json | Histogram of contig lengths (in 1 kb bins). |
histogram_phase_block.json | Histogram of phase block lengths in the assembly (in 1 kb bins). |
histogram_scaffold.json | Histogram of scaffold lengths (in 10 kb bins). |