16 Jul 2021 • on alevin-fry

A quick start guide to a portable and simplified alevin-fry pipeline

Here, we show how to perform a complete analysis on the 1k PBMCs from a Healthy Donor data from 10X Genomics. This run through includes all steps, even extracting the splici sequence and building the salmon index, which you typically would not do per-sample. To make this sample as easy as possible to follow, we have bundled all of the required software and utilities in a singularity container that we use in the commands below.

Download input data and singularity container

First, create a working directory with sufficient space to download all of the input data and to hold the output (50GB should be sufficient). We alias this directory and use the alias below so that you can easily set it to something else if you want and still copy and paste the later commands.

$ mkdir af_test_workdir
$ export AF_SAMPLE_DIR=$PWD/af_test_workdir

Then, we download all of our test data. This consist of the human reference genome and annotation (based, in this case, on the CellRanger 3.0 reference annotation) and the FASTQ files from the PBMC1k (v3) healthy donor samples.

$ cd $AF_SAMPLE_DIR
$ mkdir -p human_CR_3.0/fasta
$ mkdir -p human_CR_3.0/genes
$ wget -v -O human_CR_3.0/fasta/genome.fa -L https://umd.box.com/shared/static/3kuh1lc03bxg1d3hi1jfloez7zoutfjc
$ wget -v -O human_CR_3.0/genes/genes.gtf -L https://umd.box.com/shared/static/tvyg43710ufuuvp8mnuoanowm6xmkbjk
$ mkdir -p data/pbmc_1k_v3_fastqs
$ wget -v -O data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz -L https://umd.box.com/shared/static/bmhtt9db8ojhmbkb6d98mt7fnsdhsymm
$ wget -v -O data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz -L https://umd.box.com/shared/static/h8ymvs2njqiygfsu50jla2uce6p6etke
$ wget -v -O data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz -L https://umd.box.com/shared/static/hi8mkx1yltmhnl9kn22n96xtic2wqm5i
$ wget -v -O data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz -L https://umd.box.com/shared/static/4sn4pio63kk7pku52eo3xg9ztf5tq1ul

Finally, we’ll download the singularity image that contains all of the software we’ll need to do our processing.

$ singularity pull docker://combinelab/usefulaf:latest

Info about the singularity container

The singularity container we just downloaded above contains a recent release of simpleaf, which itself pulls in recent versions of salmon, alevin-fry, and pyroe (versions 1.9.0, 0.8.0, and 0.6.2 as of this writing), as well as an installation of R, including the fishpond package and all of its dependencies.

To build the reference index (and quantify) we’ll use simpleaf. This is a rust script written around salmon, alevin-fry, and pyroe that simplifies processing by grouping together related commands, using a fixed directory structure for processing, and also by simplifying some different options that are otherwise exposed by salmon and alevin-fry. If you would like to run the “raw” commands, the Singularity image contains salmon and alevin-fry and pyroe in the path, so you can explore more detailed options.

Building the splici reference and index

To build our reference index (this will both extract the splici fasta and transcript to gene mapping, and build the salmon index on it), use the following command:

$ singularity exec --cleanenv \
--bind $AF_SAMPLE_DIR:/workdir \
usefulaf_latest.sif \
simpleaf index \
-f /workdir/human_CR_3.0/fasta/genome.fa \
-g /workdir/human_CR_3.0/genes/genes.gtf \
-r 91 \
-d \
-o /workdir/human_CR_3.0_splici

Quantifying the sample

Given the constructed index (which will be written by the above command to $AF_SAMPLE_DIR/human_CR_3.0_splici/index), the next step is to quantify the sample against this index. This can be done with the following command, which will run salmon to generate the RAD file in sketch mode, perform unfiltered permit-list generation — automatically downloading the appropriate external permit-list — collate the RAD file and quantify the gene counts using the cr-like strategy):

$ singularity exec --cleanenv \
--bind $AF_SAMPLE_DIR:/workdir \
usefulaf_latest.sif \
simpleaf quant \
-1 /workdir/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz,/workdir/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz \
-2 /workdir/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz,/workdir/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz \
-i /workdir/human_CR_3.0_splici/index \
-c 10xv3 -r cr-like -d fw -u \
-m /workdir/human_CR_3.0_splici/ref/splici_fl86_t2g_3col.tsv
-o /workdir/quants/pbmc1k_v3

Output

At the end of this process, the directory $AF_SAMPLE_DIR/quants/pbmc1k_v3/af_quant will have the final output of running alevin-fry’s quant command. The alevin subdirectory will include a file specifying the row names (cell barcode), the column names (unspliced, spliced and ambiguous genes) and the counts (in MTX coordinate format). You can load these counts up in your favorite analysis environment to explore further.

R : In R, you can make use of the R loadFry() from the fishpond package, and read the input with the command:

m <- fishpond::loadFry("$AF_SAMPLE_DIR/quants/pbmc1k_v3/quant", outputFormat=list("spliced"=c("S","A"), "unspliced"=c("U"))

where $AF_SAMPLE_DIR is appropriately replaced by the path to the working directory we chose at the start of this exercise. This will return a SingleCellExperiment object containing the counts for this experiment.

Python : In python, you can make use of the python load_fry() function from the pyroe package. To read the input you can use a command like the following:

m = load_fry("$AF_SAMPLE_DIR/quants/pbmc1k_v3/quant", output_format={'X' : ['S', 'A'], 'unspliced' : ['U']})

which will provide the spliced (S+A) and unspliced (U) counts as separte layers of the returned AnnData object. If you wanted to simply sum the spliced and unspliced counts (as would be done in the output of e.g. CellRanger7) you could use the command:

m = load_fry("$AF_SAMPLE_DIR/quants/pbmc1k_v3/quant", output_format="snRNA")

m = load_fry("$AF_SAMPLE_DIR/quants/pbmc1k_v3/quant", output_format={'X' : ['S', 'A', 'U']})

where, again, $AF_SAMPLE_DIR is appropriately replaced by the path to the working directory we chose at the start of this exercise. This will return a scanpy AnnData object with the counts.

From here, you can use your favorite downstream analysis packages (like Seurat in R or scanpy in python) to perform quality control, filtering and analysis of your data.