29 Jan 2021 • on alevin-fry

Processing data quickly and in small memory with alevin-fry

The alevin-fry pipeline provides many different options for processing your single-cell RNA-seq data, with different choices making different sets of simplifying assumptions about what type of processing is appropriate. As the developers of this tool, we are still actively exploring the effects that different choices (which imply different assumptions) have on data pre-processing and downstream analysis. However, in this tutorial, the focus is on processing data quickly and in a small amount of memory. Thus, we will run the pipeline in a configuration that makes the most simplifying assumptions about how the data should be processed. In this configuration, the pipeline is adopting certain computational simplifications that are argued for in Melsted et al. 2019, specifically the use of pseudoalignment to map reads to the target transcriptome, the omission of small-edit distance UMI collapse (only identical UMIs are collapsed).

Preliminaries and overview of the pipeline

For this pipeline, you will need salmon (>= v1.4.0), and alevin-fry (>= v0.1.0). As always, it’s best to use most recent versions of these tools as possible.

In this pipeline, we will be processing the PBMC10k v3 data made available by 10x Genomics.

The general flow of the pipeline will be to first download the reference sequence to build an index for alevin (this need only be done once per reference annotation, and you can use the same index to map reads for any datasets sequenced in this organism); alevin itself is capable of using a number of different types of reference sequences, potentially considering different types of “decoy” sequences during mapping. However, “decoy”-aware mapping is not yet supported in the alevin-fry pipeline, so here we will simply make use of the reference transcriptome. Then, we will download the raw sequencing reads, and map the reads to the reference transcriptome with alevin (here, we will be using --sketch mode, which maps the reads by pseudoalignment with additional structural constraints, though one can also choose to selectively-align the reads). Then, given the output produced by alevin, we will use alevin-fry to determine the set of barcodes that correspond to high-confidence cells, to collate the mapping data by corrected barcode, and finally to quantify the data into a cell by gene matrix.

Obtaining the reference transcriptome and building an index

For this example we will make use of the Human reference (GRCh38) dataset provided by 10x genomics. This will make it easy for one to later compare the generated counts to those produced by Cell Ranger using the same reference. First, let’s download the data and decompress it.

$ wget https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
$ tar xzf refdata-gex-GRCh38-2020-A.tar.gz

Given the genome and annotations in the 10x reference package, we’ll extract a transcriptome from the reference genome and GTF file that encodes the transcripts. For this we’ll use the excellent gffread tool.

$ gffread -w refdata-gex-GRCh38-2020-A/transcriptome.fa -g refdata-gex-GRCh38-2020-A/fasta/genome.fa refdata-gex-GRCh38-2020-A/genes/genes.gtf

Now, we will make the working directory for our analysis and build the index:

$ mkdir -p af_tutorial
$ cd af_tutorial
$ salmon index -t ../refdata-gex-GRCh38-2020-A/transcriptome.fa -i grch38_idx -p 16

The -p 16 flag tells salmon to use multiple threads during indexing. While not all steps of indexing are multi-threaded, many are, and this will speed up index construction substantially (this step should take around 1 minute).

Obtaining the read data

Now we will download the sequencing data. This is a tagged-end single-cell dataset that makes use of version 3 of the 10x chromium chemistry. First download the raw FASTQ files (that you can obtain here).

$ wget https://cg.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_10k_v3/pbmc_10k_v3_fastqs.tar
$ tar xf pbmc_10k_v3_fastqs.tar

now you should be able to list the files as such, and see the following:

$ ls pbmc_10k_v3_fastqs/
10xv3_whitelist.txt                  pbmc_10k_v3_S1_L001_R1_001.fastq.gz  pbmc_10k_v3_S1_L002_I1_001.fastq.gz  pbmc_10k_v3_S1_L002_R2_001.fastq.gz
pbmc_10k_v3_S1_L001_I1_001.fastq.gz  pbmc_10k_v3_S1_L001_R2_001.fastq.gz  pbmc_10k_v3_S1_L002_R1_001.fastq.gz

Obtaining the transcript to gene file

now we will generate the transcript to gene mapping file needed for running alevin-fry.

$ awk '{if($3=="transcript") {OFS="\t"; print $14, $10} }' ../refdata-gex-GRCh38-2020-A/genes/genes.gtf | sed 's/[;\"]//g' > t2g.tsv

Mapping the data to obtain a RAD file

Now that we have downloaded the data and indexed the reference, we are ready to map our reads. We can do this as follows:

$ salmon alevin -i grch38_idx -p 16 -l IU --chromiumV3 --sketch \
-1 pbmc_10k_v3_fastqs/pbmc_10k_v3_S1_L001_R1_001.fastq.gz pbmc_10k_v3_fastqs/pbmc_10k_v3_S1_L002_R1_001.fastq.gz \
-2 pbmc_10k_v3_fastqs/pbmc_10k_v3_S1_L001_R2_001.fastq.gz pbmc_10k_v3_fastqs/pbmc_10k_v3_S1_L002_R2_001.fastq.gz \
-o pbmc_10k_v3_map
--tgMap t2g.tsv

This will produce an output folder called pbmc_10k_v3_map that contains all of the information that we need to process the mapped reads ^(¹).

Processing the mapped reads

Generating a permit list of filtered cells

The next step in the pipeline is to generate a “permit list” of barcodes that are believed to belong to real cells. Many of the barcodes that appear in the mapping file will contain only a handful of reads, and therefore are very unlikely to correspond to droplets containing a cell. Given that alevin-fry works by converting the raw mapping file into a mapping file where read mappings are collated by corrected barcode and chunked within the file on a per-cell basis, providing a filtered permit list is an important step in the subsequent processing of the data.

The alevin-fry tool provides a number of different ways to generate a permit list (here we will use the knee method). One can also force it to extract a given number (k) of cells (it will sort and take the k most frequent barcode), or to give an expected number of cells (where it will estimate the number of barcodes to include using the same algorithm as STARsolo). Finally, you can also provide it with an explicit list of filtered cell barcodes. It is important that, if you provide an explicit list, that it correspond to the filtered cell barcode list and not e.g. the full list of potentially valid barcodes (like the 10xv3_whitelist.txt in our pbmc_10k_v3_fastqs directory) ^(²).

As mentioned above, here we will generate the permit list using the knee method, which looks at the cumulative frequency histogram of the counts of barcodes and attempts to determine the knee of this curve to decide on the number of filtered barcodes to process. The iterative knee estimation procedure is the knee finding method adopted by UMI-tools with some modifications to the constants used. We can generate the permit list using the following command:

$ alevin-fry generate-permit-list -d fw -k -i pbmc_10k_v3_map -o pbmc_10k_v3_quant

Collating the file

The next command, collate, actually serves two distinct purposes simultaneously. First, this command will correct the barcodes according to the information written out during the generate-permit-list phase. That is, for each read in the mapping file, collate will determine if the barcode is either one of the true cell barcode that was selected by the generate-permit-list command, or if the barcode can be corrected to such a barcode (is within an edit distance of 1 of a selected barcode). At the same time the barcodes are corrected, they are also collated, ensuring that all of the mappings that correspond to reads (and therefore UMIs) within the same cell are placed subsequently in the output file. This allows the next command (quant) to read in all the mapped reads corresponding to a given cell at one time so that the UMIs can be resolved, the gene expression estimates for that cell determined, and any used memory be freed or made available for other work.

$ alevin-fry collate -t 16 -i pbmc_10k_v3_quant -r pbmc_10k_v3_map

Quantifying UMIs per-gene and per-cell

The last pre-processing step we will complete with alevin-fry is to generate a cell-by-gene count matrix from the collated mapping file. Just as with the initial mapping and permit-listing, there are a number of different options one can choose for the algorithm used to resolve UMIs. Here we will apply a relatively simple-algorithm. Within each corrected and filtered cell barcode, the reads are grouped by their UMIs, with exact duplicate UMIs being collapsed. UMIs that map uniquely to a single gene are assigned to that gene. If a UMI maps to more than one gene, then the frequency of occurrence of this UMI is evaluated with respect to each gene to which it maps. If a unique most-frequent mapping exists, the UMI is assigned to that gene; otherwise it is discarded.

$ alevin-fry quant -t 16 -i pbmc_10k_v3_quant -o pbmc_10k_v3_quant --tg-map t2g.tsv --resolution cr-like --use-mtx

The output

After running this command, the resulting quantification information can be found in pbmc_10k_v3_quant/alevin/. Within this directory, there are 3 files; quants_mat.mtx, quants_mat_cols.txt, and quants_mat_rows.txt which correspond, respectively, to the count matrix, the gene names for each column of this matrix, and the corrected, filtered cell barcodes for each row of this matrix.

The requirement to pass the transcript-to-gene file during mapping is dropped after salmon 1.4.0. So, if you are using a more recent version of salmon, you can leave off that flag from this step. ↩
Originally, the fry pipeline was not designed to perform unfiltered quantification (i.e. correcting to a list of all possible barcodes generated by a technology, like e.g. the 10x v3 permit list) and that functionality was not supported. However, as of v0.2.0 alevin-fry has beta support for making use of an external permit list. So, for example, if you wanted to use the external permit list to quantify the present (rather than just the high-quality cells), this can be done by passing the 10x permit list to the generate-permit-list command using the --unfiltered-pl argument. ↩