Cell Ranger ATAC 1.2, printed on 11/14/2024
The reference data for Cell Ranger ATAC pipelines consists of the reference genome sequence and its associated genome annotation, which includes gene and transcript coordinates. The genome sequences and annotations can be obtained from reputable, well-established consortia such as NCBI, GENCODE, Ensembl and ENCODE. We provide pre-built single and mixed species references described in the next section, as well as a command-line tool mkref to build references that are not pre-built.
Single species pre-built references (hg19, b37, GRCh38, mm10) can be built using mkref with recognized keyword input arguments. However, there is no practical need to generate these references again and we strongly recommend you download the pre-built references directly (see advanced section for more details). We do not support building of custom mixed species references via mkref in the Cell Ranger ATAC 1.2.0 pipelines.
We provide the following pre-built references on the downloads page.
Standard single species reference packages:
Note that we do not use the decoy and alternate contigs in any analysis steps in the pipeline.
Standard multi-species reference packages:
These are made by taking the union of reference sequences and annotations from individual single species pre-built references.
Note that the contig names are prefixed by species build. For example, chr1 from hg19 is labelled as hg19_chr1 inside the hg19_and_mm10 build.
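The prefixing convention can be illustrated with a short Python sketch. The helper names `prefix_contig` and `split_prefixed_contig` are hypothetical and not part of Cell Ranger ATAC; they only mirror the `<build>_<contig>` naming described above.

```python
from typing import List, Tuple


def prefix_contig(build: str, contig: str) -> str:
    """Return the contig name as it appears in a multi-species reference."""
    return f"{build}_{contig}"


def split_prefixed_contig(name: str, builds: List[str]) -> Tuple[str, str]:
    """Split a prefixed contig name back into (build, contig)."""
    for build in builds:
        prefix = build + "_"
        if name.startswith(prefix):
            return build, name[len(prefix):]
    raise ValueError(f"{name!r} does not start with a known build prefix")


print(prefix_contig("hg19", "chr1"))
print(split_prefixed_contig("mm10_chrX", ["hg19", "mm10"]))
```

Passing the list of known builds explicitly avoids guessing where the prefix ends when a contig name itself contains an underscore.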
cellranger-atac 1.2.0 supports building single species references using mkref.
Parameter | Function |
---|---|
GENOME | (Required) Name of the genome reference. New reference will be built as a new directory named GENOME under the current working directory. |
--config | (Optional for standard references) Configuration file to build a custom reference. Ignored when GENOME is one of the standard references: hg19, b37, GRCh38 or mm10. |
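As a rough sketch of how the two parameters combine, the Python snippet below assembles the mkref command line. The function `build_mkref_command` is a hypothetical helper for illustration only; the command, the GENOME argument, and the `--config` option are the ones listed in the table above.

```python
from typing import List, Optional

# Keywords that mkref recognizes as standard references.
STANDARD_REFERENCES = {"hg19", "b37", "GRCh38", "mm10"}


def build_mkref_command(genome: str, config: Optional[str] = None) -> List[str]:
    """Assemble the argument list for `cellranger-atac mkref`."""
    if genome in STANDARD_REFERENCES:
        # --config is ignored for standard references, so it is left out.
        return ["cellranger-atac", "mkref", genome]
    if not config:
        raise ValueError("a custom reference requires a --config file")
    return ["cellranger-atac", "mkref", genome, "--config", config]


print(" ".join(build_mkref_command("fly_BDGP6", "fly_BDGP6.config")))
```

The resulting list could be passed to `subprocess.run` from the directory in which the new reference folder should be created.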
Building a custom reference requires a configuration file that specifies the sources of the genome sequences and annotations, as well as the contigs present in the genome (more on this in the configuration file requirements). The following is an example config file, fly_BDGP6.config, for building a reference for Drosophila melanogaster.
    {
        GENOME_FASTA_INPUT: "ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.25_FB2018_06/fasta/dmel-all-chromosome-r6.25.fasta.gz",
        GENE_ANNOTATION_INPUT: "ftp://ftp.ensembl.org/pub/release-95/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.95.gtf.gz",
        MOTIF_INPUT: "http://jaspar.genereg.net/download/CORE/JASPAR2020_CORE_insects_non-redundant_pfms_jaspar.txt",
        ORGANISM: "Drosophila melanogaster",
        PRIMARY_CONTIGS: ["2L", "2R", "3L", "3R", "4", "X", "Y"],
        NON_NUCLEAR_CONTIGS: ["mitochondrion_genome"]
    }
To build the reference, run mkref:
    $ cd /home/jdoe/ref
    $ cellranger-atac mkref fly_BDGP6 --config fly_BDGP6.config
    Non-standard genome name detected, building custom reference...
    >>> Creating reference for fly_BDGP6 <<<
    Creating new reference folder at /home/jdoe/ref/fly_BDGP6
    Downloading fasta files from source...
    done
    Generating samtools index...
    done
    Generating pyfasta indexes...
    Number of contigs: 1870
    Total genome size: 143726002
    done
    Downloading gene annotation files from source...
    done
    Writing TSS and transcripts bed file...
    Parsed 23541 unique TSS and 28827 unique transcripts.
    done
    Generating bwa index (may take over an hour for a 3Gb genome)...
    [bwa_index] Pack FASTA... 1.23 sec
    [bwa_index] Construct BWT for the packed sequence...
    [BWTIncCreate] textLength=287452004, availableWord=32225820
    [BWTIncConstructFromPacked] 10 iterations done. 53158068 characters processed.
    [BWTIncConstructFromPacked] 20 iterations done. 98205524 characters processed.
    [BWTIncConstructFromPacked] 30 iterations done. 138239796 characters processed.
    [BWTIncConstructFromPacked] 40 iterations done. 173818340 characters processed.
    [BWTIncConstructFromPacked] 50 iterations done. 205436596 characters processed.
    [BWTIncConstructFromPacked] 60 iterations done. 233534948 characters processed.
    [BWTIncConstructFromPacked] 70 iterations done. 258504820 characters processed.
    [BWTIncConstructFromPacked] 80 iterations done. 280694052 characters processed.
    [bwt_gen] Finished constructing BWT in 84 iterations.
    [bwa_index] 94.16 seconds elapse.
    [bwa_index] Update BWT... 0.93 sec
    [bwa_index] Pack forward-only FASTA... 0.74 sec
    [bwa_index] Construct SA from BWT and Occ... 33.75 sec
    [main] Version: 0.7.17-r1188
    [main] CMD: bwa index /home/jdoe/ref/fly_BDGP6/fasta/genome.fa
    [main] Real time: 131.225 sec; CPU: 130.816 sec
    done
    Downloading pfm files from source...
    done
    Finishing up...
    >>> Reference successfully created! <<<
Indexing is the computational bottleneck in building references for Cell Ranger ATAC. Indexing a typical human 3Gb FASTA file often takes up to 8 core hours and requires 32 GB of memory.
To build a custom reference, you must supply a configuration file like the Drosophila example shown in the Building with mkref section. The example file is written in "human readable" JSON format, though strictly formatted JSON is also acceptable. The table below lists the required input keys for the configuration file. Each key takes a value that must satisfy the type constraint given in the second column. Some values also have format requirements, for example when the value is a URL or a file path pointing to a file.
Required Input Keys | Type Requirements | Format Requirements |
---|---|---|
GENOME_FASTA_INPUT | valid URL or file path | |
GENE_ANNOTATION_INPUT | valid URL or file path | |
MOTIF_INPUT | valid URL or file path. Use "" to indicate that it is not available. | |
PRIMARY_CONTIGS | list | Must be enclosed in brackets `[]`, with each contig name in double quotes "". PRIMARY_CONTIGS cannot be an empty list. |
NON_NUCLEAR_CONTIGS | list | Must be enclosed in brackets `[]`, with each contig name in double quotes "". Use empty brackets `[]` to specify an empty list. |
ORGANISM | string | Can be left empty as "". If provided, it is displayed in the summary HTML file. |
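The type constraints in the table can be checked before running mkref. The following Python sketch is not part of Cell Ranger ATAC: `validate_config` is a hypothetical helper, it works on a config that has already been loaded into a dictionary, and it assumes that the FASTA and annotation inputs must be non-empty strings. It does not verify that a URL or path actually resolves. The example dictionary reuses the Drosophila values shown earlier and is then written out as strictly formatted JSON.

```python
import json
from typing import Any, Dict, List


def _is_string_list(value: Any) -> bool:
    """True if value is a list whose items are all strings."""
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def validate_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a mkref config dictionary."""
    problems = []
    for key in ("GENOME_FASTA_INPUT", "GENE_ANNOTATION_INPUT"):
        # Assumption: these inputs may not be left empty.
        if not isinstance(cfg.get(key), str) or not cfg[key]:
            problems.append(f"{key} must be a non-empty URL or file path")
    # MOTIF_INPUT may be "" to mark motifs as not available.
    if not isinstance(cfg.get("MOTIF_INPUT"), str):
        problems.append('MOTIF_INPUT must be a URL, a file path, or ""')
    primary = cfg.get("PRIMARY_CONTIGS")
    if not _is_string_list(primary) or not primary:
        problems.append("PRIMARY_CONTIGS must be a non-empty list of strings")
    if not _is_string_list(cfg.get("NON_NUCLEAR_CONTIGS")):
        problems.append("NON_NUCLEAR_CONTIGS must be a list of strings (may be empty)")
    if not isinstance(cfg.get("ORGANISM"), str):
        problems.append('ORGANISM must be a string (may be "")')
    return problems


fly_config = {
    "GENOME_FASTA_INPUT": "ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.25_FB2018_06/fasta/dmel-all-chromosome-r6.25.fasta.gz",
    "GENE_ANNOTATION_INPUT": "ftp://ftp.ensembl.org/pub/release-95/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.95.gtf.gz",
    "MOTIF_INPUT": "http://jaspar.genereg.net/download/CORE/JASPAR2020_CORE_insects_non-redundant_pfms_jaspar.txt",
    "ORGANISM": "Drosophila melanogaster",
    "PRIMARY_CONTIGS": ["2L", "2R", "3L", "3R", "4", "X", "Y"],
    "NON_NUCLEAR_CONTIGS": ["mitochondrion_genome"],
}

print(validate_config(fly_config))
print(json.dumps(fly_config, indent=4))  # strictly formatted JSON
```

Returning a list of messages rather than raising on the first failure lets all problems in a config be reported in one pass.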
A single species reference compatible with the Cell Ranger ATAC pipelines has the following file structure:
    $ tree /home/jdoe/ref
    /home/jdoe/ref
    ├── fasta
    │   ├── contig-defs.json [required, input]
    │   ├── genome.fa [required, input, for pre-built references, sources: NCBI]
    │   ├── genome.fa.amb [required, derived from genome.fa using samtools faidx, bwa, pysam]
    │   ├── genome.fa.ann [required, derived from genome.fa using samtools faidx, bwa, pysam]
    │   ├── genome.fa.bwt [required, derived from genome.fa using samtools faidx, bwa, pysam]
    │   ├── genome.fa.fai [required, derived from genome.fa using samtools faidx, bwa, pysam]
    │   ├── genome.fa.flat [required, derived from genome.fa using samtools faidx, bwa, pysam]
    │   ├── genome.fa.gdx [required, derived from genome.fa using samtools faidx, bwa, pysam]
    │   ├── genome.fa.pac [required, derived from genome.fa using samtools faidx, bwa, pysam]
    │   └── genome.fa.sa [required, derived from genome.fa using samtools faidx, bwa, pysam]
    ├── genes
    │   ├── genes.gtf [required, input, GENCODE sources for pre-built references: hg19, b37, GRCh38 and mm10]
    │   └── regulatory.gff [pre-built references only, Ensembl sources: hg19, b37, GRCh38 and mm10]
    ├── genome [required, input]
    ├── metadata.json [required, input]
    └── regions
        ├── blacklist.bed [pre-built references only, ENCODE sources: hg19, b37, GRCh38, mm10]
        ├── ctcf.bed [pre-built references only]
        ├── dnase.bed [pre-built references only, ENCODE sources: hg19, b37, mm10, Anshul Kundaje's pipeline: GRCh38]
        ├── enhancer.bed [pre-built references only, source: Ensembl regulatory build release 95]
        ├── promoter.bed [pre-built references only, source: Ensembl regulatory build release 95]
        ├── motifs.pfm [optional, input, source for pre-built references: JASPAR vertebrate non-redundant collection]
        ├── transcripts.bed [required for 1.1 and later references, derived from transcript coordinates in genes.gtf]
        └── tss.bed [required, derived from first nt position of each transcript in genes.gtf]
The required files mentioned above are the minimal set of files needed to create a directory structure compatible with Cell Ranger ATAC pipelines. Some required files are specified as inputs in the config file described in the configuration file requirements section. Other required files are derived by processing a required input file. The regulatory and functional domain files such as promoter.bed are present only in the pre-built references. The transcripts.bed file is a derived file that is not present in 1.0 references, but the 1.2.0 pipelines are backwards compatible with old 1.0 references. Note that mkref recognizes four keywords (hg19, b37, mm10, GRCh38), and running cellranger-atac mkref with one of them as the GENOME argument creates the corresponding pre-built reference.
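A reference directory can be checked against the required entries of the file structure with a few lines of Python. This is a minimal sketch, not a tool shipped with Cell Ranger ATAC: `missing_required_files` and `REQUIRED_FILES` are hypothetical names, the list is transcribed from the entries marked as required in the tree above, and only the presence of each path is tested, not its contents.

```python
import tempfile
from pathlib import Path
from typing import List

# Index files derived from genome.fa, as listed in the tree above.
INDEX_SUFFIXES = [".amb", ".ann", ".bwt", ".fai", ".flat", ".gdx", ".pac", ".sa"]

REQUIRED_FILES = (
    ["fasta/contig-defs.json", "fasta/genome.fa"]
    + ["fasta/genome.fa" + suffix for suffix in INDEX_SUFFIXES]
    + [
        "genes/genes.gtf",
        "genome",
        "metadata.json",
        "regions/transcripts.bed",  # required for 1.1 and later references only
        "regions/tss.bed",
    ]
)


def missing_required_files(ref_dir: str) -> List[str]:
    """Return the required relative paths that do not exist under ref_dir."""
    root = Path(ref_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).exists()]


# Demonstration on an empty temporary directory: every entry is reported.
with tempfile.TemporaryDirectory() as tmp:
    for rel in missing_required_files(tmp):
        print("missing:", rel)
```

For a 1.0 reference, a missing regions/transcripts.bed entry is expected and can be ignored, since the 1.2.0 pipelines remain compatible with those references.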