Long Ranger 2.1, printed on 11/23/2024
Long Ranger reports small variant calls in a VCF file, a standard format compatible with other tools. When appropriate, additional data produced by the Chromium platform are included in standard fields. However, in some cases we add fields to report data that are not yet accounted for by the spec.
Phasing results are encoded as per section 1.4.2 of the VCF standard, using genotype fields GT (genotype), PS (phase set), and PQ (phasing quality). Long Ranger also emits two non-standard tags containing additional data. The BX tag contains per-allele barcode information, and the JQ tag contains a phasing 'junction' quality value.
The GT (genotype) field encodes allele values separated by either / or |. The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele listed in ALT, and so on. For diploid calls, examples include 0/1, 1|0, and 1/2. A / indicates an unphased genotype, and a | indicates a phased genotype. For phased genotypes, the allele to the left of the bar is on haplotype 1, and the allele to the right of the bar is on haplotype 2.
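The GT encoding described above can be decoded with a few lines of Python. This is a minimal sketch, not part of Long Ranger: the function names `parse_gt` and `allele_bases` are hypothetical, and the code assumes a single separator type per genotype (no mixed `/` and `|`).

```python
def parse_gt(gt):
    """Split a VCF GT string into allele indices and a phased flag."""
    phased = "|" in gt
    sep = "|" if phased else "/"
    # "." marks a missing allele in VCF; keep it as None.
    alleles = [None if a == "." else int(a) for a in gt.split(sep)]
    return alleles, phased


def allele_bases(gt, ref, alts):
    """Map allele indices to sequences: 0 -> REF, 1 -> first ALT, and so on."""
    choices = [ref] + list(alts)
    alleles, phased = parse_gt(gt)
    return [None if a is None else choices[a] for a in alleles], phased
```

For a phased genotype, the first element of the returned list is the allele on haplotype 1 and the second is the allele on haplotype 2.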
PS (phase set) marks the set of variants that have been phased into a block. Variants with the same PS value are in the same phase block. Variants with different PS values are not phased with respect to one another, typically due to a lack of heterozygous SNPs or long-molecule coverage needed to extend a phase block. When evaluating phasing it is important to only consider phasing assertions within a single phase set. We use the recommended convention that the PS value is the position of the first variant in the phase set.
CHROM | POS | REF | ALT | GT | PS |
---|---|---|---|---|---|
chr1 | 1000 | A | C | 0\|1 | 1000 |
chr1 | 1010 | T | G | 1\|0 | 1000 |
chr1 | 2000 | C | T | 0\|1 | 2000 |
chr1 | 2005 | T | G | 0/1 | 2000 |
chr1 | 2008 | G | C | 0\|1 | 2000 |
In this example we have two phase blocks, denoted by PS=1000 and PS=2000. PS=1000 spans positions 1000-1010, and PS=2000 spans positions 2000-2008. In PS=1000, haplotype 1 contains the REF A allele at position 1000 and the ALT G allele at position 1010, while haplotype 2 contains the ALT C allele at position 1000 and the REF T allele at position 1010.
In PS=2000, haplotype 1 contains the REF alleles at positions 2000 and 2008, while haplotype 2 contains the ALT alleles. At position 2005, we have detected a variant but have not phased it, so we don't know which allele is on which haplotype.
PS=1000 and PS=2000 are different phase blocks, so we don't know if haplotype 1 in PS=1000 corresponds to haplotype 1 or haplotype 2 in PS=2000.
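The worked example can be reproduced by grouping records by phase set and reading each haplotype off the GT field. The sketch below is illustrative only: the `records` tuples mirror the table above, and `haplotypes_by_phase_set` is a hypothetical helper that assumes biallelic, diploid records.

```python
from collections import defaultdict

# Rows of the example table: (CHROM, POS, REF, ALT, GT, PS)
records = [
    ("chr1", 1000, "A", "C", "0|1", 1000),
    ("chr1", 1010, "T", "G", "1|0", 1000),
    ("chr1", 2000, "C", "T", "0|1", 2000),
    ("chr1", 2005, "T", "G", "0/1", 2000),
    ("chr1", 2008, "G", "C", "0|1", 2000),
]


def haplotypes_by_phase_set(records):
    """Group variants by (CHROM, PS) and list the allele on each haplotype."""
    blocks = defaultdict(lambda: {"hap1": [], "hap2": [], "unphased": []})
    for chrom, pos, ref, alt, gt, ps in records:
        block = blocks[(chrom, ps)]
        alleles = [ref, alt]  # index 0 is REF, index 1 is ALT
        if "|" in gt:
            left, right = (int(a) for a in gt.split("|"))
            block["hap1"].append((pos, alleles[left]))
            block["hap2"].append((pos, alleles[right]))
        else:
            # Unphased call: the haplotype assignment is unknown.
            block["unphased"].append(pos)
    return dict(blocks)


blocks = haplotypes_by_phase_set(records)
```

Haplotype labels are only comparable within one key of `blocks`; "hap1" under PS=1000 carries no information about "hap1" under PS=2000.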
The PQ (phasing quality) tag is a Phred-scaled probability that alleles are phased incorrectly in a heterozygous call. PQ is derived from the likelihood ratio of the maximum-likelihood phasing solution and an alternate solution where the phasing of this variant is flipped.
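Because PQ is Phred-scaled, it converts to an error probability with the standard relation P = 10^(-Q/10). The helpers below are a small sketch of that conversion only; the function names are hypothetical, and the code does not reproduce how Long Ranger computes the underlying likelihood ratio.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality to the probability of an error."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality."""
    return -10 * math.log10(p)
```

Under this scale, a PQ of 20 corresponds to a 1% chance that the variant is phased incorrectly, and a PQ of 30 to a 0.1% chance.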
The JQ (junction quality) tag is a 10x-specific addition. It contains the Phred-scaled probability that a large-scale phasing switch error occurs between this variant and the following variant. JQ is derived from the likelihood ratio of the best phasing solution and an alternate solution where all downstream variants are flipped. If flipping the downstream variants doesn't decrease the likelihood much, the JQ will be low. Phase blocks are broken at variants with JQ < 25.
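The block-breaking rule can be illustrated with a short sketch. It assumes the break falls immediately after a low-JQ variant, since JQ describes the junction to the following variant; the text does not spell out this detail, and `split_at_low_jq` is a hypothetical helper rather than Long Ranger code.

```python
JQ_BREAK_THRESHOLD = 25  # threshold stated in the text


def split_at_low_jq(variants, threshold=JQ_BREAK_THRESHOLD):
    """Split position-sorted (pos, jq) pairs into phase blocks.

    Assumption: a variant with jq < threshold ends its block, and the
    next variant starts a new one.
    """
    blocks, current = [], []
    for pos, jq in variants:
        current.append(pos)
        if jq < threshold:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks
```

For example, `split_at_low_jq([(100, 60), (200, 10), (300, 50)])` places positions 100 and 200 in one block and position 300 in the next.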
BX stores the 10x barcodes supporting each allele of the variant. It is encoded as a comma-delimited string of the form:
BC_STRINGref,BC_STRINGalt1,BC_STRINGalt2,...
Each BC_STRING entry stores the barcode and base QV of each read that supported the corresponding allele. A BC_STRING is a semicolon-delimited list of entries, one per barcode, and each entry is an underscore-delimited string:
BC1_QUAL1-1_QUAL1-2_...;BC2_QUAL2-1_QUAL2-2...
Here BC1 is the first barcode, and QUAL1-1 and QUAL1-2 are the observed Phred qualities of the bases that aligned to the variant position.
For example, a BX field that contains:
AAAA_40_38;CCCC_40,GGGG_39
encodes two BC_STRINGs: one for the reference allele (AAAA_40_38;CCCC_40) and one for the alternate allele (GGGG_39).
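The BX layout can be unpacked by splitting on the three delimiters in turn. The following is a minimal sketch under the format described above; `parse_bx` is a hypothetical name, and the code assumes barcodes never contain an underscore.

```python
def parse_bx(bx):
    """Parse a BX value into one dict per allele (REF first, then each ALT).

    Each dict maps a barcode to the list of base qualities observed for
    reads with that barcode at the variant position.
    """
    per_allele = []
    for bc_string in bx.split(","):          # one BC_STRING per allele
        barcodes = {}
        for entry in bc_string.split(";"):   # one entry per barcode
            if not entry:
                continue                     # allele with no supporting reads
            barcode, *quals = entry.split("_")
            barcodes[barcode] = [int(q) for q in quals]
        per_allele.append(barcodes)
    return per_allele


parsed = parse_bx("AAAA_40_38;CCCC_40,GGGG_39")
```

Here `parsed[0]` holds the barcodes supporting the reference allele and `parsed[1]` those supporting the first alternate allele.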
The Long Ranger pipeline uses a number of custom filters, which will show up in the FILTER field of the VCF. These use barcode and phasing information to improve the quality of variant calls. Here we give an overview of how those filters are implemented.
FILTER | Description |
---|---|
10X_QUAL_FILTER | A basic variant quality filter, tuned for 10x data. Heterozygous variants with QUAL < 15 and homozygous variants with QUAL < 50 will fail this filter. |
10X_ALLELE_FRACTION_FILTER | Filters heterozygous variants with allele fraction < 15%. |
10X_PHASING_INCONSISTENT | Flags heterozygous variants where the reads supporting each allele do not segregate cleanly onto the local haplotypes. The phasing algorithm compares the likelihoods of a background false-positive model and a sequencing error model to classify likely false positives. This is a powerful filter for reducing false-positive variant calls. Be aware that somatic or mosaic variants, where only a subset of the sample carries the variant, will be preferentially tagged with this filter. If you are interested in such variants, you may want to retain calls carrying this filter in your analysis. |
10X_HOMOPOLYMER_UNPHASED_INSERTION | A 10x-specific filter for unphased insertions in homopolymers of length >= 4. This class of variant calls is observed to be mostly false positives. |
10X_RESCUED_MOLECULE_HIGH_DIVERSITY | Filters variants that are supported primarily by reads that have been 'rescued' with barcode-aware alignment, where the mapped molecule has a high degree of divergence from the reference. This filter reduces false-positive variant calls in complex duplicated loci that tend to have missing copies in the reference genome. |
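The two threshold-based filters in the table are simple enough to sketch directly. The function below is an illustration of the stated thresholds only, not Long Ranger's implementation; `basic_10x_filters` and its parameters are hypothetical, and the allele fraction is assumed to be expressed as a value between 0 and 1.

```python
def basic_10x_filters(qual, is_het, allele_fraction):
    """Return the names of the threshold-based filters a call would fail."""
    failed = []
    # Heterozygous calls need QUAL >= 15; homozygous calls need QUAL >= 50.
    min_qual = 15 if is_het else 50
    if qual < min_qual:
        failed.append("10X_QUAL_FILTER")
    # The allele fraction filter applies to heterozygous calls only.
    if is_het and allele_fraction < 0.15:
        failed.append("10X_ALLELE_FRACTION_FILTER")
    return failed
```

An empty list corresponds to a call that passes both of these filters; the remaining filters depend on phasing and barcode information and are not modeled here.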