
10x Genomics
Chromium Single Cell Multiome ATAC + Gene Exp.

Feature-Barcode Matrices

The cellranger-arc count pipeline outputs two types of feature-barcode matrices, described in the table below. Each matrix contains two types of features, Gene Expression and Peaks. For a Gene Expression feature, each element in the matrix is the number of UMIs associated with the corresponding feature (row) and barcode (column). For a Peaks feature, each element in the matrix is the number of cut sites associated with the corresponding feature (row) and barcode (column).

Type | Description
Raw feature-barcode matrix | The columns (barcodes) consist of every barcode that could theoretically have been observed. The matrix contains non-zero counts for background and cell-associated barcodes and zero counts for unobserved barcodes.
Filtered feature-barcode matrix | The columns (barcodes) of the matrix are restricted to the detected cellular barcodes.

Both matrices described above are sparse; in other words, a large number of their entries are zero. Each matrix is stored in two sparse-matrix formats: the text-based Market Exchange Format (MEX), described below, and the HDF5 format, described here. The MEX directory also contains gzipped TSV files with the features and barcode sequences that correspond to the row and column indices, respectively. For example, the MEX output may look like:

$ cd /home/jdoe/runs/sample345/outs
$ tree filtered_feature_bc_matrix
filtered_feature_bc_matrix
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz

0 directories, 3 files
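
To make the sparse layout concrete, the following is a minimal sketch of a pure-Python reader for the text inside matrix.mtx.gz. It is not the pipeline's own code. It assumes the standard Matrix Market coordinate layout: a `%%MatrixMarket` banner, optional comment lines starting with `%`, a size line giving the number of rows, columns, and non-zero entries, and then one 1-based `row column value` triple per line. The function name `read_mex_triplets` and the toy counts are hypothetical.

```python
import gzip
import io


def read_mex_triplets(handle):
    """Parse Matrix Market coordinate text into (shape, {(row, col): value}).

    Indices in the file are 1-based; they are converted to 0-based here.
    """
    banner = handle.readline()
    if not banner.startswith("%%MatrixMarket matrix coordinate"):
        raise ValueError("not a Matrix Market coordinate file")
    # Skip comment lines, which start with '%'.
    line = handle.readline()
    while line.startswith("%"):
        line = handle.readline()
    # The first non-comment line holds: rows, columns, non-zero entries.
    n_rows, n_cols, n_entries = (int(x) for x in line.split())
    entries = {}
    for _ in range(n_entries):
        row, col, value = handle.readline().split()
        entries[(int(row) - 1, int(col) - 1)] = int(value)
    return (n_rows, n_cols), entries


# Toy 3-feature x 2-barcode matrix with three non-zero counts (made-up values).
toy_mex = (
    "%%MatrixMarket matrix coordinate integer general\n"
    "% hypothetical example, not pipeline output\n"
    "3 2 3\n"
    "1 1 5\n"
    "3 1 2\n"
    "2 2 7\n"
)
# Compress in memory to mimic reading matrix.mtx.gz in text mode.
compressed = gzip.compress(toy_mex.encode("ascii"))
with gzip.open(io.BytesIO(compressed), mode="rt") as handle:
    shape, entries = read_mex_triplets(handle)
```

Only the non-zero entries are stored, so a matrix that is mostly zeros stays small on disk. In practice a library reader is preferable; this sketch only illustrates the layout.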

For a Gene Expression feature: The first and second columns of the features.tsv.gz file store the gene ID and name, respectively, as defined in the reference GTF. If no gene_name field is present in the reference GTF, the gene name is set to the gene ID. The third column identifies the feature type, which is Gene Expression for these rows. The fourth, fifth, and sixth columns store the chromosome, start, and end positions of the TSS determined by cellranger-arc for this gene, in 0-based BED format.

TSS for each gene: The TSS is defined as the 0-based coordinate of the 5'-most position of a transcript. For each gene, transcripts are restricted to those that have the GENCODE basic tag. If a gene does not have a transcript with this tag, then all associated transcripts are selected. For each gene, we define the TSS as the minimum region that spans the TSSs of all selected transcripts.
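
The selection rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the cellranger-arc implementation: the `Transcript` record, its field names, and the helper functions are hypothetical; transcripts are assumed to be given as 0-based, half-open intervals; and the returned region is assumed to follow the same half-open BED convention, so a gene with a single TSS yields an interval of length 1.

```python
from typing import NamedTuple


class Transcript(NamedTuple):
    """Hypothetical transcript record using 0-based, half-open coordinates."""
    start: int
    end: int
    strand: str          # "+" or "-"
    tags: frozenset      # e.g. frozenset({"basic"})


def tss_position(tx):
    """0-based coordinate of the 5'-most base of a transcript."""
    return tx.start if tx.strand == "+" else tx.end - 1


def gene_tss_region(transcripts):
    """Smallest half-open interval covering the TSSs of the selected transcripts."""
    # Prefer transcripts carrying the basic tag; fall back to all transcripts.
    basic = [tx for tx in transcripts if "basic" in tx.tags]
    selected = basic if basic else list(transcripts)
    positions = [tss_position(tx) for tx in selected]
    return min(positions), max(positions) + 1


# Made-up transcripts for one plus-strand gene; the third lacks the basic tag.
example = [
    Transcript(100, 900, "+", frozenset({"basic"})),
    Transcript(122, 900, "+", frozenset({"basic"})),
    Transcript(50, 900, "+", frozenset()),
]
region = gene_tss_region(example)
```

Here the untagged transcript starting at position 50 is ignored because tagged transcripts exist, so the region spans only the two tagged TSSs.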

For a Peaks feature: The first two columns of the features.tsv.gz file both store the peak ID, which is the location of the peak, denoted as "contig:start-end". The third column identifies the feature type, which is Peaks for these rows. The fourth, fifth, and sixth columns store the chromosome, start, and end positions of the peak in 0-based BED format.
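
Because the peak ID encodes the location as text, it can be split back into its parts when needed. The helper below is a small sketch; the name `parse_peak_id` is hypothetical. It splits on the last colon so that contig names that themselves contain a colon would still parse.

```python
def parse_peak_id(peak_id):
    """Split 'contig:start-end' into (contig, start, end) with integer positions."""
    # Split on the last ':' so any colon inside the contig name is preserved.
    contig, _, span = peak_id.rpartition(":")
    start, _, end = span.partition("-")
    if not contig or not start or not end:
        raise ValueError(f"unexpected peak ID: {peak_id!r}")
    return contig, int(start), int(end)


peak = parse_peak_id("chr13:48301972-48306754")
```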

Below is a minimal example features.tsv.gz file showing data collected for 3 genes and 2 peaks.

$ zcat filtered_feature_bc_matrix/features.tsv.gz | head -n 5
ENSG00000139687         RB1                     Gene Expression chr13   48303725        48303747
ENSG00000141510         TP53                    Gene Expression chr17   7675492         7687550
ENSG00000012048         BRCA1                   Gene Expression chr17   43125314        43125483
chr13:48301972-48306754 chr13:48301972-48306754 Peaks           chr13   48301972        48306754
chr17:76751362-76755140 chr17:76751362-76755140 Peaks           chr17   76751362        76755140
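
Since gene and peak rows share one file, a common first step is to record which row indices belong to each feature type. The sketch below does this with the standard library only. The function name `index_features_by_type` is hypothetical, and the two sample rows are copied from the example above and gzipped in memory to mimic features.tsv.gz.

```python
import csv
import gzip
import io
from collections import defaultdict


def index_features_by_type(handle):
    """Map each feature type to the 0-based row indices that carry it."""
    rows_by_type = defaultdict(list)
    for row_index, fields in enumerate(csv.reader(handle, delimiter="\t")):
        feature_type = fields[2]  # third column: "Gene Expression" or "Peaks"
        rows_by_type[feature_type].append(row_index)
    return dict(rows_by_type)


# Two rows from the example listing, joined with tabs.
sample = (
    "ENSG00000139687\tRB1\tGene Expression\tchr13\t48303725\t48303747\n"
    "chr13:48301972-48306754\tchr13:48301972-48306754\tPeaks\tchr13\t48301972\t48306754\n"
)
compressed = gzip.compress(sample.encode("ascii"))
# Text mode ("rt") is required because csv.reader expects str, not bytes.
with gzip.open(io.BytesIO(compressed), mode="rt", newline="") as handle:
    rows_by_type = index_features_by_type(handle)
```

The resulting index lists can then be used to select the Gene Expression or Peaks rows of the loaded matrix.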

Barcode sequences correspond to column indices.

$ zcat filtered_feature_bc_matrix/barcodes.tsv.gz | head -n 5
AAACAGCCAAATATCC-1
AAACAGCCAGGAACTG-1
AAACAGCCAGGCTTCG-1
AAACCAACACCTGCTC-1
AAACCAACAGATTCAT-1

Each barcode sequence includes a suffix with a dash separator followed by a number:

AAACAGCCAAATATCC-1

More details on the barcode sequence format are available in the GEX BAM section. Note that the barcodes correspond to the 10x barcode sequence of the Gene Expression library. To learn more about the pairing between ATAC and GEX barcodes, see Barcode Translation.
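
If the sequence and the numeric suffix are needed separately, the barcode string can be split at the last dash. This is a minimal sketch; the helper name `split_barcode` is hypothetical, and it assumes every barcode ends in a dash followed by digits, as in the example above.

```python
def split_barcode(barcode):
    """Split 'SEQUENCE-N' into (sequence, numeric suffix)."""
    sequence, dash, suffix = barcode.rpartition("-")
    if not dash or not sequence or not suffix.isdigit():
        raise ValueError(f"unexpected barcode format: {barcode!r}")
    return sequence, int(suffix)


sequence, suffix = split_barcode("AAACAGCCAAATATCC-1")
```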

Both R and Python can read the MEX format, and keeping the data as a sparse matrix makes manipulation more efficient.

Loading Matrices into R

The R package Matrix supports loading MEX format data, and can be easily used to load the sparse feature-barcode matrix, as shown in the example code below.

library(Matrix)

# Directory holding the MEX output; adjust to your own run.
matrix_dir <- "/opt/sample345/outs/filtered_feature_bc_matrix/"
barcode.path <- paste0(matrix_dir, "barcodes.tsv.gz")
features.path <- paste0(matrix_dir, "features.tsv.gz")
matrix.path <- paste0(matrix_dir, "matrix.mtx.gz")

# Read the sparse counts (features x barcodes).
mat <- readMM(file = matrix.path)

# Read the row (feature) and column (barcode) annotations.
feature.names <- read.delim(features.path,
                            header = FALSE,
                            stringsAsFactors = FALSE)
barcode.names <- read.delim(barcode.path,
                            header = FALSE,
                            stringsAsFactors = FALSE)

# Label columns with barcodes and rows with feature IDs.
colnames(mat) <- barcode.names$V1
rownames(mat) <- feature.names$V1

Loading Matrices into Python

The csv, os, gzip and scipy.io modules can be used to load a feature-barcode matrix into Python as shown below.

import csv
import gzip
import os
import scipy.io

matrix_dir = "/opt/sample345/outs/filtered_feature_bc_matrix"

# Read the sparse counts (features x barcodes); mmread accepts the gzipped file.
mat = scipy.io.mmread(os.path.join(matrix_dir, "matrix.mtx.gz"))

# Open the gzipped TSVs in text mode ("rt"); csv.reader needs str, not bytes.
features_path = os.path.join(matrix_dir, "features.tsv.gz")
with gzip.open(features_path, mode="rt") as f:
    feature_rows = list(csv.reader(f, delimiter="\t"))
feature_ids = [row[0] for row in feature_rows]
gene_names = [row[1] for row in feature_rows]
feature_types = [row[2] for row in feature_rows]

barcodes_path = os.path.join(matrix_dir, "barcodes.tsv.gz")
with gzip.open(barcodes_path, mode="rt") as f:
    barcodes = [row[0] for row in csv.reader(f, delimiter="\t")]