Long Ranger 2.1, printed on 11/14/2024
The longranger pipeline outputs an indexed BAM file containing position-sorted, aligned reads. Each read in this BAM file has Chromium barcode and phasing information attached. The following assumes basic familiarity with the BAM format. More details on the SAM/BAM standard are available at the hts-specs website. Long Ranger follows standardized tags and also adds some tags that are in the process of being standardized. 10x also provides a bamtofastq tool to convert BAM files produced by longranger back to FASTQs that can be used to re-run longranger.
Chromium barcode information for each read is stored as TAG fields:
Tag | Type | Description |
---|---|---|
BX | Z | Chromium barcode sequence that is error-corrected and confirmed against a list of known-good barcode sequences. Use this for analysis. |
BC | Z | Sample index (I7) read. |
QT | Z | Sample index (I7) read quality. Phred scores as reported by sequencer. |
RX | Z | Raw Chromium barcode sequence. This read is subject to sequencing errors. Do not use for analysis. |
QX | Z | Raw Chromium barcode read quality. Phred scores as reported by sequencer. |
TR | Z | Sequence of the 7 trimmed bases following the barcode sequence at the start of R1. Can be used to reconstruct the original R1 sequence. |
TQ | Z | Quality values of the 7 trimmed bases following the barcode sequence at the start of R1. Can be used to reconstruct the original R1 quality values. |
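
As a rough sketch of how the TR and TQ tags could be used, the following Python function rebuilds an R1 sequence and quality string from tag values that have already been read from a BAM record. The function names and arguments are illustrative and not part of Long Ranger. The sketch assumes the record is the R1 mate, that it carries its full sequence, and that the original R1 was laid out as raw barcode (RX), then the trimmed bases (TR), then the remaining read. Because BAM stores reverse-strand alignments reverse-complemented, those are flipped back first.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reconstruct_r1(seq, qual, rx, qx, tr, tq, is_reverse=False):
    """Rebuild (sequence, quality) of R1 as it came off the sequencer.

    seq, qual  : SEQ and QUAL of the BAM record (hypothetical inputs)
    rx, qx     : raw barcode sequence and quality (RX, QX tags)
    tr, tq     : trimmed bases and their qualities (TR, TQ tags)
    is_reverse : True if the record is aligned to the reverse strand
    """
    if is_reverse:
        # Undo the orientation change applied when the read was aligned.
        seq = reverse_complement(seq)
        qual = qual[::-1]
    # Assumed layout: barcode, then trimmed bases, then the rest of the read.
    return rx + tr + seq, qx + tq + qual
```

The bamtofastq tool mentioned above remains the supported route for regenerating FASTQs; this sketch only illustrates what the tags contain.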
The BX tag includes a suffix with a dash separator followed by a number:
AGAATGGTCTGCATCG-1
This number denotes what we call a GEM group, and is used to virtualize barcodes in order to achieve a higher effective barcode diversity when combining samples generated from separate GEM chip channel runs. Normally, this number will be "1" across all barcodes when analyzing a sample generated from a single GEM chip channel. It can either be left in place and treated as part of a unique barcode identifier, or explicitly parsed out to leave only the barcode sequence itself.
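For readers who want to parse the suffix out explicitly, a minimal Python sketch follows. The helper name `split_barcode` is hypothetical, and the sketch assumes the BX value always ends in a dash followed by an integer GEM group, as in the example above.

```python
def split_barcode(bx: str):
    """Split a BX value such as 'AGAATGGTCTGCATCG-1' into
    (barcode_sequence, gem_group)."""
    # rpartition splits on the last dash, so only the suffix is removed.
    sequence, sep, gem_group = bx.rpartition("-")
    if not sep or not sequence or not gem_group.isdigit():
        raise ValueError(f"unexpected BX format: {bx!r}")
    return sequence, int(gem_group)


barcode, gem_group = split_barcode("AGAATGGTCTGCATCG-1")
print(barcode, gem_group)
```

When samples from several GEM chip channels are combined, keeping the pair together (or simply keeping the full BX string) avoids treating identical barcode sequences from different channels as the same barcode.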
The following tags will also be present on reads that were confidently assigned to a haplotype.
Tag | Type | Description |
---|---|---|
PC | i | Phred-scaled confidence that this read was phased correctly. |
PS | i | Phase set containing this read. This corresponds to the phase set (PS) field in the VCF file. The value is the position of the first SNP in the phase block. |
HP | i | Haplotype of the molecule that generated the read. |
MI | i | Global molecule identifier for the molecule that generated this read. |
Phase sets, defined in the VCF standard, are regions within which identified haplotypes are mutually consistent. As a result, HP tags are only comparable between reads that share a common PS. By definition, adjacent phase sets lack sufficient Linked-Reads to determine the relationship between their haplotypes.
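
One way to respect this constraint in downstream code is to key reads on the (PS, HP) pair rather than on HP alone. The sketch below is illustrative only: `group_by_haplotype` is a hypothetical helper, the input is assumed to be an iterable of (read name, tag dictionary) pairs produced by whatever BAM reader is in use, and the tag values in the example are made up.

```python
from collections import defaultdict


def group_by_haplotype(reads):
    """Group read names by (phase set, haplotype).

    Reads missing PS or HP were not confidently phased and are skipped.
    """
    groups = defaultdict(list)
    for name, tags in reads:
        if "PS" in tags and "HP" in tags:
            groups[(tags["PS"], tags["HP"])].append(name)
    return dict(groups)


# Made-up example records: two phase sets, one unphased read.
example = [
    ("read1", {"PS": 1000, "HP": 1}),
    ("read2", {"PS": 1000, "HP": 2}),
    ("read3", {"PS": 52000, "HP": 1}),
    ("read4", {}),
]
print(group_by_haplotype(example))
```

In this example, read1 and read3 both carry HP 1 but land in different groups, because their phase sets differ and their haplotype labels are therefore not comparable.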
The Lariat aligner used by Long Ranger uses the long-range information carried in the barcodes to improve mapping into duplicated regions of the genome. Lariat emits extra non-standard tags that indicate how the alignment results were affected by the molecule inference process. If Lariat finds strong evidence that a molecule must exist at a particular locus, it can boost the MAPQ of ambiguous reads that have an alignment inside the molecule by roughly 40 MAPQ points.
For more details, see the paper that Lariat is based on: Read clouds uncover variation in complex regions of the human genome (Bishara et al.).
Tag | Type | Description |
---|---|---|
AS | i | Defined in the SAM spec. The alignment score of the read against the genome sequence at the mapping location selected for this read. The score includes match, indel, clipping, and mate-pairing penalties, but excludes any molecule scoring. Note that the scaling and offset of this field differ from BWA. Perfect alignments will have a score of 0, and alignment penalties will reduce the score below 0. |
XS | i | The alignment score of the read against the genome sequence (see the AS tag) at the second-best mapping for this read. Because Lariat also considers molecule scoring when selecting the best mapping, it may not choose the one with the best reported alignment score. For this reason, XS may be greater than AS. |
AM | A | 1 if this alignment is in a long molecule, 0 otherwise. Alignments in long molecules will have their MAPQ boosted above alternative alignments not in molecules. |
XM | A | 1 if second best alignment is in a long molecule, 0 otherwise. |
XT | i | Indicates whether a tandem duplication affects this alignment. 1 if the second-best alignment is in the same molecule as the best alignment, 0 otherwise. |
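
To make the table concrete, here is a small Python sketch that parses optional fields in SAM text form (TAG:TYPE:VALUE) and summarizes the Lariat tags. The helper names and the example values are hypothetical. Note that AM and XM have type A (a single character), so they are compared against the character '1', while XT has type i and is compared as an integer.

```python
def parse_tags(fields):
    """Parse SAM optional fields such as 'AS:i:-12' into a dict.

    Integer (type i) values are converted to int; all other values
    are kept as strings.
    """
    tags = {}
    for field in fields:
        tag, type_code, value = field.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    return tags


def describe_alignment(tags):
    """Summarize how molecule inference relates to this alignment."""
    has_scores = "AS" in tags and "XS" in tags
    return {
        "in_molecule": tags.get("AM") == "1",
        "runner_up_in_molecule": tags.get("XM") == "1",
        "tandem_duplication": tags.get("XT") == 1,
        # True when the second-best mapping has the better alignment
        # score, meaning molecule scoring decided the choice.
        "chosen_despite_lower_score": has_scores and tags["XS"] > tags["AS"],
    }


# Made-up optional fields for a single read.
tags = parse_tags(["AS:i:-12", "XS:i:-5", "AM:A:1", "XM:A:0", "XT:i:0"])
print(describe_alignment(tags))
```

In this made-up record the chosen alignment scores lower than the runner-up (-12 versus -5) but lies inside a long molecule, which is the situation the XS description above refers to.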