Long Ranger 1.3, printed on 11/14/2024
The principal output of the longranger run pipeline includes aligned reads with barcode and phasing information in BAM format, phased SNPs and indels in VCF format, and SV calls and candidates in BEDPE format. These are all standard file formats designed to interoperate with existing tools, and the additional information produced by the GemCode Platform is included as standards-compliant fields when appropriate.
The output of the SV calling code is BEDPE, a format similar to BED that describes pairs of genomic regions. Long Ranger uses this format to describe pairs of breakpoints that define a structural variant.
The BEDPE contains one SV per line with the following tab-delimited columns:
chrom1 - chromosome of the first breakpoint
start1 - start position of the first breakpoint
end1 - end position of the first breakpoint
chrom2 - chromosome of the second breakpoint
start2 - start position of the second breakpoint
end2 - end position of the second breakpoint
name - a unique string identifying the SV
quality - Phred-like quality score
strand1 - strand of the first breakpoint (not currently used; always '+')
strand2 - strand of the second breakpoint (not currently used; always '+')
filter - a semicolon-delimited list of filters that were applied to the SV, or a single period (.) if the SV was not filtered out
info - extra information about the SV, or a single period (.)
Long Ranger defines each breakpoint as a region rather than a single position because sequencing parameters such as depth and target pull-down limit the resolution of breakpoint detection.
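As an illustration of the twelve-column layout above, the following is a minimal Python sketch that splits one BEDPE data line into named fields. The names `SVRecord` and `parse_bedpe_line` are hypothetical and not part of Long Ranger, the example line is invented, and the sketch assumes that positions are integers and that the quality score can be read as a number.

```python
from typing import NamedTuple


class SVRecord(NamedTuple):
    """One SV call, with fields named after the BEDPE columns (hypothetical type)."""
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int
    name: str
    quality: float  # assumption: the Phred-like score parses as a number
    strand1: str
    strand2: str
    filter: str
    info: str


def parse_bedpe_line(line: str) -> SVRecord:
    """Split one tab-delimited BEDPE data line into its 12 columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError("expected 12 columns, found %d" % len(fields))
    c1, s1, e1, c2, s2, e2, name, qual, st1, st2, flt, info = fields
    return SVRecord(c1, int(s1), int(e1), c2, int(s2), int(e2),
                    name, float(qual), st1, st2, flt, info)


# Invented example line, for illustration only.
example = "chr1\t1000\t1200\tchr1\t50000\t50300\tcall_1\t25\t+\t+\t.\t.\n"
record = parse_bedpe_line(example)
print(record.chrom1, record.start1, record.end1, record.name)
```

The filter and info columns are kept as raw strings here so that they can be interpreted separately.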
The filter field (column 11) is a semicolon-delimited string of filters that the SV failed to pass. The following filters may have been applied:
Filter | Description |
---|---|
BLACK_DIST | At least one breakpoint is within 10Kb of the blacklist (see also the BLACK_DIST1 and BLACK_DIST2 info fields below). |
BLACK_FRAC | The SV has >10% of base pairs overlapping the blacklist (see also the BLACK_FRAC info field below). |
SEG_DUP | The SV breakpoints are within 10Kb of copies of the same segmental duplication. |
NMATES | Both breakpoints of the SV participate in multiple (>5) SVs. This is an indication of low-complexity regions or barcode coalescence. |
LOW_MAPQ | Average MAPQ of reads in the call region < 40. Suggests potential alignment problems leading to a false positive call. |
DEPTH_DROP | Depth drop that is inconsistent with the presence of a deletion. Suggests alignment problems or coverage unevenness. |
HIGH_BC_COV | Barcode coverage on either breakpoint > 3 times the average barcode coverage genomewide. Suggests alignment problems leading to read pileups. |
TOO_MANY_FILTERED_BCS | More than 30% of the barcodes supporting the call have been associated with calls filtered by one or more of the other filters. |
The SV blacklist and segmental duplication list are included in the refdata-hg19 package required by Long Ranger. These lists define gaps and other ambiguous regions of the reference genome that have been found to raise spurious SV candidates and calls.
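The filter column described above can be interpreted with a few lines of Python. This is a sketch only: `parse_filters` and `passed_all_filters` are hypothetical helper names, and the input strings in the usage lines are invented examples built from the filter names in the table.

```python
from typing import List


def parse_filters(filter_field: str) -> List[str]:
    """Return the filters listed in column 11; an empty list means none."""
    if filter_field == ".":
        # A single period means the SV was not filtered out.
        return []
    # Filters are semicolon-delimited; skip empty pieces defensively.
    return [name for name in filter_field.split(";") if name]


def passed_all_filters(filter_field: str) -> bool:
    """True when the filter column lists no filters."""
    return not parse_filters(filter_field)


# Invented example values, for illustration only.
print(parse_filters("LOW_MAPQ;SEG_DUP"))
print(passed_all_filters("."))
```

Returning a list rather than a boolean keeps the individual filter names available, which helps when only some filters (for example the blacklist-related ones) should exclude a call from downstream analysis.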
The info field (column 12) is a semicolon-delimited string of key=value pairs. A single period (.) in the value indicates that the value is missing (e.g. because the corresponding info key does not apply to this entry of the BEDPE file). The following keys may be defined for a given SV:
Key | Description |
---|---|
BCOV | Number of linked-read sets supporting the SV |
BLACK1 | If the first breakpoint of the SV is too close to a blacklist element, this will be the type of the element (e.g. centromere, gap). |
BLACK2 | If the second breakpoint of the SV is too close to a blacklist element, this will be the type of the element (e.g. centromere, gap). |
BLACK_DIST1 | Distance between the first breakpoint and the blacklist |
BLACK_DIST2 | Distance between the second breakpoint and the blacklist |
BLACK_FRAC | Fraction of the SV length that overlaps the blacklist |
MATCHES | Comma-separated list of ground-truth SVs that match the BEDPE entry. Always missing (.), unless a ground-truth list of SV calls is provided to the longranger run pipeline. |
NBCS1 | Number of linked-read sets overlapping the first breakpoint |
NBCS2 | Number of linked-read sets overlapping the second breakpoint |
NMATES1 | Number of SVs involving the first breakpoint. A large number usually suggests a false positive. |
NMATES2 | Number of SVs involving the second breakpoint. A large number usually suggests a false positive. |
NOOV | Rough estimate of the number of linked-read sets that oppose the presence of the SV (e.g. linked-read sets from the haplotype that does not carry the SV). |
NPAIRS | Number of read-pairs supporting the SV |
NSPLIT | Number of split reads supporting the SV |
RP_LR | Log-likelihood ratio score of read-pair support. Higher values correspond to stronger read-pair evidence. |
SEG_DUP | Comma-separated list of segmental duplications that overlap the breakpoints of the SV |
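
The key=value layout of the info column can be unpacked as sketched below. `parse_info` is a hypothetical helper name, the example string is invented, and values are left as strings because the table above does not fix a type for each key; mapping a missing value (.) to `None` is a choice made for this sketch.

```python
from typing import Dict, Optional


def parse_info(info_field: str) -> Dict[str, Optional[str]]:
    """Turn the info column (column 12) into a dict of key -> value."""
    if info_field == ".":
        # A single period means there is no extra information.
        return {}
    result: Dict[str, Optional[str]] = {}
    for item in info_field.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # A period as the value (or no '=' at all) is treated as missing.
        result[key] = None if (not sep or value == ".") else value
    return result


# Invented example value, for illustration only.
info = parse_info("NPAIRS=12;NSPLIT=3;MATCHES=.;SEG_DUP=.")
print(info["NPAIRS"], info["MATCHES"])
```

Callers that need numbers can convert individual keys explicitly, for example `int(info["NPAIRS"])`, after checking that the value is not `None`.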