Here, we show how to perform a complete analysis of the 1k PBMCs from a Healthy Donor dataset from 10x Genomics. This run-through includes all steps, even extracting the splici sequence and building the salmon index, which you typically would not do per-sample. To make this example as easy as possible to follow, we have bundled all of the required software and utilities in a singularity container that we use in the commands below.
Download input data and singularity container
First, create a working directory with sufficient space to download all of the input data and to hold the output (50GB should be sufficient). We assign this directory to an environment variable and use that variable below, so that you can easily point it at a different location and still copy and paste the later commands.
$ mkdir af_test_workdir
$ export AF_SAMPLE_DIR=$PWD/af_test_workdir
Then, we download all of our test data. This consists of the human reference genome and annotation (based, in this case, on the CellRanger 3.0 reference annotation) and the FASTQ files from the PBMC1k (v3) healthy donor sample.
$ cd $AF_SAMPLE_DIR
$ mkdir -p human_CR_3.0/fasta
$ mkdir -p human_CR_3.0/genes
$ wget -v -O human_CR_3.0/fasta/genome.fa -L https://umd.box.com/shared/static/3kuh1lc03bxg1d3hi1jfloez7zoutfjc
$ wget -v -O human_CR_3.0/genes/genes.gtf -L https://umd.box.com/shared/static/tvyg43710ufuuvp8mnuoanowm6xmkbjk
$ mkdir -p data/pbmc_1k_v3_fastqs
$ wget -v -O data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz -L https://umd.box.com/shared/static/bmhtt9db8ojhmbkb6d98mt7fnsdhsymm
$ wget -v -O data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz -L https://umd.box.com/shared/static/h8ymvs2njqiygfsu50jla2uce6p6etke
$ wget -v -O data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz -L https://umd.box.com/shared/static/hi8mkx1yltmhnl9kn22n96xtic2wqm5i
$ wget -v -O data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz -L https://umd.box.com/shared/static/4sn4pio63kk7pku52eo3xg9ztf5tq1ul
Finally, we’ll download the singularity image that contains all of the software we’ll need to do our processing. This will place the image file usefulaf_latest.sif in the current directory; it is referenced by name in the commands below.
$ singularity pull docker://combinelab/usefulaf:latest
Info about the singularity container
The singularity container we just downloaded contains a recent release of simpleaf, which itself pulls in recent versions of salmon, alevin-fry, and pyroe (versions 1.9.0, 0.8.0, and 0.6.2 as of this writing), as well as an installation of R, including the fishpond package and all of its dependencies.
To build the reference index (and quantify) we’ll use simpleaf. This is a Rust program written around salmon, alevin-fry, and pyroe that simplifies processing by grouping together related commands, using a fixed directory structure for processing, and condensing some of the options that are otherwise exposed by salmon and alevin-fry. If you would like to run the “raw” commands, the singularity image contains salmon, alevin-fry, and pyroe in the path, so you can explore the more detailed options.
Building the splici reference and index
To build our reference index (this will both extract the splici fasta and transcript-to-gene mapping, and build the salmon index on it), use the following command:
$ singularity exec --cleanenv \
--bind $AF_SAMPLE_DIR:/workdir \
usefulaf_latest.sif \
simpleaf index \
-f /workdir/human_CR_3.0/fasta/genome.fa \
-g /workdir/human_CR_3.0/genes/genes.gtf \
-r 91 \
-d \
-o /workdir/human_CR_3.0_splici
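Once indexing finishes, it can be useful to sanity-check the three-column transcript-to-gene mapping that simpleaf index writes (the splici_fl86_t2g_3col.tsv file passed to simpleaf quant below). Here is a minimal Python sketch, assuming pandas is installed and that the third column holds the splicing status of each transcript (S for spliced, U for unspliced):

import os
import pandas as pd

# Load the three-column t2g file written by `simpleaf index`
# (assumed columns: transcript ID, gene ID, splicing status).
base = os.environ["AF_SAMPLE_DIR"]
t2g = pd.read_csv(
    f"{base}/human_CR_3.0_splici/ref/splici_fl86_t2g_3col.tsv",
    sep="\t",
    names=["transcript_id", "gene_id", "status"],
)

# Each status should be S (spliced) or U (unspliced).
print(t2g["status"].value_counts())
print(f"{t2g['gene_id'].nunique()} genes, {len(t2g)} transcripts")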
Quantifying the sample
Given the constructed index (which will be written by the above command to $AF_SAMPLE_DIR/human_CR_3.0_splici/index), the next step is to quantify the sample against this index. This can be done with the following command, which will run salmon to generate the RAD file in sketch mode, perform unfiltered permit-list generation (automatically downloading the appropriate external permit-list), collate the RAD file, and quantify the gene counts using the cr-like strategy:
$ singularity exec --cleanenv \
--bind $AF_SAMPLE_DIR:/workdir \
usefulaf_latest.sif \
simpleaf quant \
-1 /workdir/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz,/workdir/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz \
-2 /workdir/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz,/workdir/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz \
-i /workdir/human_CR_3.0_splici/index \
-c 10xv3 -r cr-like -d fw -u \
-m /workdir/human_CR_3.0_splici/ref/splici_fl86_t2g_3col.tsv \
-o /workdir/quants/pbmc1k_v3
Output
At the end of this process, the directory $AF_SAMPLE_DIR/quants/pbmc1k_v3/af_quant will have the final output of running alevin-fry’s quant command. The alevin subdirectory will include files specifying the row names (cell barcodes), the column names (unspliced, spliced, and ambiguous genes), and the counts (in MTX coordinate format). You can load these counts in your favorite analysis environment to explore further.
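If you prefer to inspect the raw output directly before using a loader, the matrix can be read with standard tools. Below is a minimal Python sketch, assuming the alevin subdirectory follows the usual alevin-fry layout (quants_mat.mtx for the counts, quants_mat_rows.txt for the barcodes, quants_mat_cols.txt for the column names):

import os
import scipy.io as sio

alevin_dir = os.path.join(os.environ["AF_SAMPLE_DIR"], "quants/pbmc1k_v3/af_quant/alevin")

# Counts are stored in MatrixMarket (MTX) coordinate format,
# with one row per cell barcode and one column per (gene, splicing status).
mat = sio.mmread(os.path.join(alevin_dir, "quants_mat.mtx")).tocsr()
barcodes = [line.strip() for line in open(os.path.join(alevin_dir, "quants_mat_rows.txt"))]
columns = [line.strip() for line in open(os.path.join(alevin_dir, "quants_mat_cols.txt"))]

print(mat.shape, len(barcodes), len(columns))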
R: In R, you can make use of the loadFry() function from the fishpond package, and read the input with the command:
m <- fishpond::loadFry("$AF_SAMPLE_DIR/quants/pbmc1k_v3/af_quant", outputFormat=list("spliced"=c("S","A"), "unspliced"=c("U")))
where $AF_SAMPLE_DIR is appropriately replaced by the path to the working directory we chose at the start of this exercise. This will return a SingleCellExperiment object containing the counts for this experiment.
Python: In python, you can make use of the load_fry() function from the pyroe package. To read the input you can use a command like the following:
from pyroe import load_fry
m = load_fry("$AF_SAMPLE_DIR/quants/pbmc1k_v3/af_quant", output_format={'X' : ['S', 'A'], 'unspliced' : ['U']})
which will provide the spliced (S+A) and unspliced (U) counts as separate layers of the returned AnnData object. If you wanted to simply sum the spliced and unspliced counts (as would be done in the output of, e.g., CellRanger 7) you could use the command:
m = load_fry("$AF_SAMPLE_DIR/quants/pbmc1k_v3/af_quant", output_format="snRNA")
or
m = load_fry("$AF_SAMPLE_DIR/quants/pbmc1k_v3/af_quant", output_format={'X' : ['S', 'A', 'U']})
where, again, $AF_SAMPLE_DIR is appropriately replaced by the path to the working directory we chose at the start of this exercise. This will return a scanpy AnnData object with the counts.
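With the first invocation above, the spliced (S+A) counts land in m.X and the unspliced counts in the layer named by the output_format key, so you can inspect them directly:

print(m)                             # AnnData summary: n_obs (cells) x n_vars (genes)
print(m.X.sum())                     # total spliced (S+A) counts
print(m.layers["unspliced"].sum())   # total unspliced (U) counts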
From here, you can use your favorite downstream analysis packages (like Seurat in R or scanpy in python) to perform quality control, filtering, and analysis of your data.
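For example, a minimal scanpy sketch (assuming the AnnData object m returned by load_fry above; the thresholds here are illustrative, not recommendations):

import scanpy as sc

# Basic quality control: drop sparsely covered cells and rarely detected genes.
sc.pp.filter_cells(m, min_genes=200)
sc.pp.filter_genes(m, min_cells=3)

# Library-size normalization followed by log transform.
sc.pp.normalize_total(m, target_sum=1e4)
sc.pp.log1p(m)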