Cell Ranger4.0, printed on 12/22/2024
Cell Ranger 4.0 includes a new dedicated cell calling algorithm that is applied specifically to identify cells from Targeted Gene Expression datasets. The Targeted Gene Expression cell calling method relies on identification of a transition point based on the shape of the barcode rank plot to separate cell-associated barcodes from background partitions, and is designed to successfully classify cells for a wide variety of potential target gene panels.
Note: This method is distinct from the cell calling algorithm used for non-targeted Whole Transcriptome Analysis (WTA) Chromium single cell datasets, and it does not employ the EmptyDrops approach (Lun et al., 2018) due to the distinct features of Targeted Gene Expression datasets.
The algorithm has three key steps:
In the first step, an approach analogous to the original Cell Ranger cell calling algorithm is used to identify the primary mode of high RNA content cells, using a cutoff based on the total UMI count (from both targeted and non-targeted genes, but excluding any filtered UMIs) for each barcode. Cell Ranger takes as input the expected number of recovered cells, N (see --expect-cells
). Let m be the 99th percentile of the top N barcodes by total UMI counts. All barcodes with total UMI counts m/10 or greater are called as cells in the first pass.
In the second step, the algorithm fits a smoothed cubic spline curve to the log-transformed barcode rank plot (defined as the curve whose y-axis is the log of the UMI counts per barcode, and whose x-axis is the log of the number of barcodes with UMI counts greater than or equal to the associated y-value). Next, the algorithm looks for a potential transition point where the gradient (or first derivative value) of this curve has its minimum value, within a defined range of allowed barcode ranks considered for additional cell calling (beyond the number of cells called in the first step). Then, any additional barcodes whose total UMI counts exceed the cutoff value associated with the transition point are added to the set of positive cell calls. This second step retains additional cells from the region of the barcode rank plot preceding the steep ‘cliff’ at the transition point separating cell-associated barcodes from background partitions.
In the third step, any barcodes called as cells whose total targeted UMI counts are zero (such that all of their total UMI counts derive from non-targeted genes) are removed from the final set of positive cell calls. This last step ensures that all identified cells contain positive targeted UMI counts that will be useful for subsequent analysis steps.
For Targeted Gene Expression data, targeted UMI filtering is performed after the usual UMI Counting step. This additional filtering is only active for sequencing libraries with very high depth, in which spurious molecules can be observed in a very small fraction of reads. Targeted UMI filtering can be disabled if desired using --no-target-umi-filter
, although this is not recommended.
The targeted UMI filtering threshold is computed based on the distribution of number of supporting reads per UMI (RPU) over UMIs counted towards targeted genes. The threshold is equal to the 90th percentile RPU value, multiplied by 0.01 and rounded up. For practical reasons, this threshold is computed using a random subset of valid barcodes comprising at least 10% of all barcodes or 1 million reads (whichever is greater). Then, after aligning and annotating the full set of reads, each targeted UMI with read count strictly lower than the targeted UMI threshold is removed. The UMIs removed by this filter will not appear in any Cell Ranger outputs except in the BAM file, where the associated reads are indicated using the xf
tag.
Note that when the 90th percentile RPU is lower than 100 reads, the filtering threshold is computed as 1, meaning that no UMIs are filtered. In this case, the targeted UMI filtering threshold appears as N/A in the Web Summary.
In order to specifically recover molecules from targeted genes of interest in gene expression libraries, a set of bait oligonucleotides are designed with sequence complementarity for each gene in a targeted panel. Baits designed for the Targeted Gene Expression assay are 120 base pair (bp) biotinylated oligonucleotides that are used within a hybridization-capture workflow.
Targeted Gene Expression baits were designed with the following broad design principles:
Baits are tiled across the full length of all transcripts annotated in the GRCh38-2020-A reference. The transcript annotations in this reference are based on the GENCODE comprehensive set. See release notes for documentation on this new reference.
For each transcript within a gene, bait sequences are designed starting at the 5’ and 3’ ends and progresses inwards until the center of the transcript is reached. Baits are typically 120bp in length (see the Optimizing Specificity section for more details on exceptions) and are aimed to tile at 1x coverage. If the transcript length is not a multiple of 120bp, a small coverage gap appears in the center of the transcript (away from the annotated 3’ or 5’ ends). In order to prevent the unnecessary selection of redundant baits from regions of identical sequence shared across transcripts from the same gene, a De Bruijn graph data structure is used. By querying the De Bruijn graph constructed using all 120bp subsequences derived from each transcript, the algorithm ensures that any potential new bait sequence is rejected if it would be redundant with an existing previously-designed bait from the same target gene.
A subtle consequence of this procedure is that the ordering of transcripts can, to some extent, impact the exact set of baits designed for a given gene. We utilize the rankings provided by the APPRIS database (Annotating principal splice isoforms) to order transcripts such that higher confidence transcript annotations are tiled first. Other measures are taken to make sure ordering is deterministic when ties occur in categorizations from the APPRIS.
Introns are not included in our tiling of annotated transcripts, only UTRs and coding regions. As a consequence, intronic reads from targeted genes will be mostly depleted in Targeted Gene Expression libraries.
There are many repetitive elements in the transcriptome that, if overlapped by baits, could result in reduced enrichment in Targeted Gene Expression libraries, particularly within large target gene panels. Before designing baits, K-mers with many near-identical occurrences in the genome are identified so that they can be avoided during bait design.
If a bait intended to be designed at a given position contains one of the K-mers above, a combination of shifting the bait position up or downstream and removing portions of the bait sequence is used to attempt to retain a functional bait that will not overlap the problematic sequence. If this is not possible, the algorithm will not design a bait at this location.
Approximately 2-3% of baits within the pre-designed panels are shorter than 120bp because of the procedure above.
Via the 10x Genomics Custom Panel Designer users can also design baits for entirely custom exogenous sequences. These sequences follow the same design procedure as endogenous genes, including the procedure described in the Optimizing Specificity section. Each custom sequence is treated independently. No check for redundancy is performed between submitted sequences, as would normally occur for transcripts within the same gene. Similarly, no check for further homology between submitted sequences and endogenous sequences is performed.
In order to assign a UMI to a particular gene, Cell Ranger has to be able to confidently identify which gene a given read came from. A small number of genes share a substantial proportion of their sequence content with other genes in the genome, which in extreme cases makes it difficult to assign the corresponding UMI count to a single gene. These genes may have reduced counts in the resulting expression matrices.
The bait design algorithm intentionally avoids designing baits that overlap repetitive sequences as described in the Optimizing Specificity section above. A very small number of genes have low coverage of baits around annotated transcript ends due to the presence of repetitive sequences. These genes are flagged with poor coverage within the last 360bp of the 3’ or 5’ ends of annotated transcripts, where the majority of signal for most genes is expected to be localized in 3'/5' gene expression libraries. While the baits that are provided for these genes may work well, enrichment of these genes may be less robust depending on the particular sample and location of library molecules within annotated transcripts.
Genes with very low mappability or low bait coverage at annotated transcript ends are noted using the mappability_flag
column in the gene metadata file provided for predesigned and customized panels. These genes will also trigger a warning if added to customized panels via the 10x Genomics Custom Panel Designer.
If a gene you want to add to your panel has triggered a mappability warning (or equivalently has the value TRUE
in the mappability_flag
column of the gene metadata file), the following steps are recommended:
Check if this gene has near-zero counts in existing whole transcriptome datasets. If so, this warning may indicate the gene is very poorly mappable and Cell Ranger is not able to assign many reads to this gene. Targeting this gene may not yield the desired results.
If you have existing whole transcriptome data, you can use IGV or another genome browser to view coverage within this gene alongside the bait locations (as defined in the panel's BED12 file). If the baits appear to cover the regions where reads have aligned, then enrichment is likely to work well. If baits do not overlap the main regions of coverage, enrichment may not work well for this gene and you may want to consider whether to include this gene on your panel. See our documentation on BED12 files for more information.
In a Targeted Gene Expression sample, it is assumed that there are two classes of genes: genes from the target panel that are enriched by targeting, and non-targeted genes that constitute the background. Since the number of reads per gene in the sample before targeting is not known, standard approaches to directly calculate read enrichments are not available. Instead, the mean number of reads per UMI (or Mean Reads per UMI) for each gene is calculated, which serves as a proxy for read enrichment. The mean number of reads per UMI for a gene is closely related to the sequencing saturation for that gene (mean reads per UMI = 1/(1 - sequencing saturation)
, also see GEX Metrics). Given there are a finite number of UMIs per gene in a given sample, enrichment will tend to recover more UMIs from those genes as well as more PCR duplicates arising from those UMIs. Therefore enriched genes are likely to have greater sequencing saturation and greater mean reads per UMI.
In order to assess whether targeted genes were enriched in a given sample, the mean reads per UMI values are first calculated for all genes (both targeted and non-targeted genes). Only UMIs in cell-associated barcodes are used and only genes with at least 10 UMIs in cell barcodes are considered for this analysis. Cell Ranger then fits a two-component Gaussian Mixture Model to the mean reads per UMI distribution, grouping genes into one of two classes: enriched genes (those with higher mean reads per UMI) and non-enriched genes (those with lower mean reads per UMI). The numbers of targeted and non-targeted genes that are considered enriched by this method are shown in the Targeted Gene Expression Run Summary Targeted Enrichment dashboard.
Cell Ranger may sometimes be unable to confidently assign genes as enriched or non-enriched using this method, particularly when there are too few targeted and/or non-targeted genes meeting the criteria above, or if sequencing saturation is too low. In these rare cases, enrichment calculations may be disabled, and the metrics reporting the number of genes enriched will be N/A
.
The computation of gene enrichments in targeted-compare is analogous to that described above, with one modification. When running targeted-compare, the parent sample read counts are known, allowing for direct calculation of read enrichments. Cell Ranger therefore uses the same Gaussian Mixture Model described above, substituting Mean Reads per UMI with Read Enrichments. Read Enrichments per Gene are calculated as (Number of reads in the Targeted Sample)/(Number of reads in the Parent Sample), after rescaling both samples to the same number of read pairs from Gene Expression libraries.