Long Ranger 2.1, printed on 11/23/2024
Long Ranger uses a new aligner called 'Lariat'. Lariat aligns all the linked reads for a single barcode simultaneously, with the prior knowledge that the reads arise from a small number of long (10kb - 200kb) molecules. This approach allows reads to be mapped to repetitive regions with modest copy number, such as segmental duplications. Lariat is based on the original RFA method developed by Alex Bishara, Yuling Liu et al. in Serafim Batzoglou’s lab at Stanford (Genome Research, 2015). Lariat generates candidate alignments by calling the BWA C API, then performs the RFA inference to select the final mapping position and MAPQ.
Long Ranger wraps standard short-read variant callers to generate SNP and small indel calls. By default FreeBayes is used, but Long Ranger can also be used with a customer-provided copy of GATK. The variant callers are invoked by Long Ranger with best-practices parameters, along with some parameter changes to optimize results for 10x libraries. After phasing, the variant caller is invoked separately in 'haploid mode' on reads from each phased haplotype. This step boosts the sensitivity of the variant calls by calling low-allele-fraction variants that are the dominant allele on one haplotype. Variants called in this phase will be tagged with HAPLOCALLED=1 in the INFO field.
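As an illustration of how the HAPLOCALLED=1 tag could be used downstream, here is a minimal sketch that checks the INFO column of a tab-separated VCF data line. The helper names (info_to_dict, is_haplocalled) and the example records are hypothetical and are not part of Long Ranger; only the HAPLOCALLED=1 tag itself comes from the text above.

```python
def info_to_dict(info_field):
    """Parse a VCF INFO column into a dict; flags without '=' map to True."""
    info = {}
    if info_field == ".":
        return info
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def is_haplocalled(vcf_line):
    """Return True if a VCF data line carries HAPLOCALLED=1 in INFO."""
    columns = vcf_line.rstrip("\n").split("\t")
    info = info_to_dict(columns[7])  # INFO is the 8th VCF column
    return info.get("HAPLOCALLED") == "1"


# Hypothetical example records (made up for illustration, not real output).
example_lines = [
    "chr1\t1000\t.\tA\tG\t50\tPASS\tDP=30;HAPLOCALLED=1\tGT\t0|1",
    "chr1\t2000\t.\tC\tT\t60\tPASS\tDP=42\tGT\t1|0",
]
haplocalled_flags = [is_haplocalled(line) for line in example_lines]
```

Header lines (those starting with '#') would need to be skipped before calling is_haplocalled on a real file.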
The POPULATE_INFO_FIELDS stage determines which barcodes are associated with each observed allele of each heterozygous SNP.
Long Ranger aligns the raw read sequence to the sequence of both alleles to determine which allele the read supports. The phasing algorithm finds a phasing configuration that optimizes a probabilistic model of the barcoding and read-generation process. The basic model is similar to the model in HASH (Bansal and Halpern, Genome Research, 2008), with improvements that account for false-positive variant calls, incorrect assignment of alleles to barcodes, and the possibility that a barcode carries two molecules on opposite haplotypes of the same locus.
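To make the idea of scoring a phasing configuration concrete, the following toy sketch searches all configurations of a few SNPs and keeps the one with the highest likelihood. It is not Long Ranger's model: it assumes a single molecule per barcode, a fixed per-observation error rate, and equal prior probability for the two haplotypes, and it omits the improvements listed above. All names and the toy data are hypothetical.

```python
import math
from itertools import product


def barcode_log_likelihood(observations, phasing, error_rate):
    """Log-likelihood of one barcode's allele observations.

    observations: dict mapping SNP index -> observed allele (0 or 1).
    phasing: tuple giving the allele carried by haplotype 0 at each SNP;
             haplotype 1 is assumed to carry the other allele.
    """
    hap_likelihoods = []
    for hap in (0, 1):
        p = 1.0
        for snp, allele in observations.items():
            expected = phasing[snp] if hap == 0 else 1 - phasing[snp]
            p *= (1 - error_rate) if allele == expected else error_rate
        hap_likelihoods.append(p)
    # The barcode's molecule is assumed equally likely to come from either haplotype.
    return math.log(0.5 * hap_likelihoods[0] + 0.5 * hap_likelihoods[1])


def best_phasing(barcodes, n_snps, error_rate=0.01):
    """Brute-force search; only feasible for a handful of SNPs."""
    best, best_ll = None, -math.inf
    # Fix the first SNP to 0: a configuration and its complement score the same.
    for rest in product((0, 1), repeat=n_snps - 1):
        phasing = (0,) + rest
        ll = sum(barcode_log_likelihood(b, phasing, error_rate) for b in barcodes)
        if ll > best_ll:
            best, best_ll = phasing, ll
    return best, best_ll


# Toy data: five barcodes observed over three heterozygous SNPs.
toy_barcodes = [
    {0: 0, 1: 1, 2: 0},
    {0: 1, 1: 0, 2: 1},
    {1: 1, 2: 0},
    {0: 0, 1: 1},
    {0: 1, 2: 0},  # conflicts with the others, e.g. a wrong allele assignment
]
toy_phasing, toy_ll = best_phasing(toy_barcodes, n_snps=3)
```

In this toy data, four barcodes agree on one configuration and one barcode conflicts with it; the search still recovers the majority configuration because a single mismatch costs less than several.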
While phasing the alleles of each SNP, we also determine the haplotype of each input molecule, and tag each read in the input BAM with 'HP' and 'PS' tags indicating the haplotype and phase set that the read came from. See our BAM documentation for details. Phased reads can be very valuable when analyzing more complex variation such as SVs, CNVs and somatic variation.
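As a small illustration of reading these tags, the sketch below pulls 'HP' and 'PS' out of a single SAM-formatted text line, relying only on the standard SAM layout of 11 mandatory columns followed by TAG:TYPE:VALUE optional fields. The function name and the example record are hypothetical, and the integer type of the tag values is an assumption; the BAM documentation is the authority on the exact tag definitions.

```python
def read_phase_tags(sam_line):
    """Return (haplotype, phase_set) for one SAM line, or None where a tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # optional fields follow the 11 mandatory columns
        tag, type_code, value = field.split(":", 2)
        if tag in ("HP", "PS"):
            # Assumption: both tags are integer-typed ('i'); keep strings otherwise.
            tags[tag] = int(value) if type_code == "i" else value
    return tags.get("HP"), tags.get("PS")


# Hypothetical records (made up for illustration).
phased_read = (
    "read1\t99\tchr1\t1000\t60\t10M\t=\t1200\t210\t"
    "ACGTACGTAC\tFFFFFFFFFF\tBX:Z:ACGT-1\tHP:i:1\tPS:i:1000"
)
unphased_read = "read2\t0\tchr1\t5000\t60\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF"

phased_tags = read_phase_tags(phased_read)
unphased_tags = read_phase_tags(unphased_read)
```

Grouping reads by the returned (haplotype, phase set) pair is one way to build per-haplotype pileups for downstream SV or CNV analysis.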
The large-scale SV caller looks for distant pairs of loci in the genome that share many more barcodes than would be expected by chance. This overlap indicates that the two loci, which are distant in the reference sequence, are nearby in the sample, and it generates a candidate SV. Candidate SVs are refined by comparing the layout of reads and barcodes around the event to the patterns expected in deletions, inversions, duplications, and translocations to identify the SV type and find the maximum-likelihood breakpoints.
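The barcode-sharing signal can be sketched as a comparison between the observed overlap and the overlap expected if barcodes landed on the two loci independently. This is a simplified illustration, not Long Ranger's statistical test: it assumes every barcode is equally likely to appear at any locus, and the function names and numbers are hypothetical.

```python
def expected_shared_barcodes(n_a, n_b, n_total):
    """Expected overlap if the two loci drew barcodes independently and uniformly."""
    return n_a * n_b / n_total


def barcode_overlap(barcodes_a, barcodes_b, n_total):
    """Return (observed shared barcodes, expected shared barcodes by chance)."""
    set_a, set_b = set(barcodes_a), set(barcodes_b)
    shared = len(set_a & set_b)
    expected = expected_shared_barcodes(len(set_a), len(set_b), n_total)
    return shared, expected


# Toy example: two distant loci with 100 barcodes each, 40 of them in common,
# out of a hypothetical 100,000 barcodes in the library.
locus_a = ["BC%d" % i for i in range(0, 100)]
locus_b = ["BC%d" % i for i in range(60, 160)]
shared, expected = barcode_overlap(locus_a, locus_b, n_total=100_000)
```

Here 40 barcodes are shared where roughly 0.1 would be expected by chance, which is the kind of excess that would make the pair of loci a candidate for further refinement.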
In whole-genome mode, Long Ranger calls deletion SVs in the 50bp-30kbp size range. Long Ranger uses haplotype-specific coverage drops and discordant read pairs to identify potential deletions. A local assembly of phased reads, or a probabilistic model of phased coverage and discordant reads, is used to confirm the event and determine the breakpoints.
In targeted mode, Long Ranger calls heterozygous and homozygous deletion SVs ranging in size from a single exon up to 50kb. By looking for haplotype-specific drops in coverage, Long Ranger can detect deletions without seeing any discordant read pairs. Detecting heterozygous deletions requires sufficient coverage of phased reads, which in turn requires covered heterozygous variants in the vicinity of the event.
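The notion of a haplotype-specific coverage drop can be illustrated with a simple scan over per-window phased coverage. This is only a sketch under stated assumptions: coverage is already binned into equal windows and split by haplotype, and the thresholds (drop_fraction, min_other) are arbitrary illustrative values, not Long Ranger parameters. Long Ranger itself confirms events with local assembly or a probabilistic model, which this sketch does not attempt.

```python
def haplotype_specific_drops(cov_hap1, cov_hap2, drop_fraction=0.2, min_other=5):
    """Find runs of windows where one haplotype loses coverage and the other does not.

    Returns a list of (start, end, haplotype) with a half-open window range
    [start, end) and the haplotype (1 or 2) whose coverage dropped.
    """
    calls = []
    for hap, dropped, other in ((1, cov_hap1, cov_hap2), (2, cov_hap2, cov_hap1)):
        start = None
        for i, (d, o) in enumerate(zip(dropped, other)):
            # A window counts as a drop only if the other haplotype is well covered.
            is_drop = o >= min_other and d <= drop_fraction * o
            if is_drop and start is None:
                start = i
            elif not is_drop and start is not None:
                calls.append((start, i, hap))
                start = None
        if start is not None:  # run extends to the last window
            calls.append((start, len(dropped), hap))
    return calls


# Toy per-window coverage: haplotype 1 loses coverage in windows 3 to 5.
toy_hap1 = [10, 11, 9, 1, 0, 0, 10, 12]
toy_hap2 = [10, 9, 11, 10, 12, 9, 11, 10]
toy_calls = haplotype_specific_drops(toy_hap1, toy_hap2)
```

Requiring adequate coverage on the other haplotype reflects the point above: without enough phased reads, a heterozygous drop cannot be distinguished from a region that is simply poorly covered.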