Simpleaf is a command line toolkit written in rust that exposes a unified and simplified interface for processing scRNA-seq datasets using the alevin-fry pipeline. It encapsulates the whole alevin-fry pipeline into two steps:
simpleaf index
indexes the provided reference or makes an expanded reference and indexes it.simpleaf quant
maps the sequencing reads in provided FASTQ files against the indexed reference and quantifies the mapping records to generate a gene count matrix.
For more information about the processing pipeline conceptually, and the alevin-fry commands that correspond to each simpleaf command, see the raw data processing chapter in the Single-cell best practices book.
In this tutorial, we will show you how to generate a single-cell count matrix from the raw input data, the FASTQ files, using simpleaf, as well as describe some of the different options that simpleaf provides and what they do.
Installation
The recommended way of using simpleaf is from a conda environment. While you can also access it from a docker/singularity image or compile it manually.
# Remove piscem if on Apple silicon
conda create -n af -y -c bioconda -c conda-forge simpleaf piscem
conda activate af
For Apple silicon computers, e.g., those with an M-series chip, conda does not currently build most packages, including simpleaf, natively. Therefore, you will need to tell the conda in your native shell to use x86 packages by following these instructions, or open your Terminal using Rosetta and install simpleaf using conda x86_64.
As for
piscem
, it is not available on conda for Apple silicon. However, you can still download pre-built executables or compile it from source, and give its path to simpleaf as we showed in this section. Furthermore, because it is an improved alternative of the default pufferfish index, you can choose to avoid it by removing the --use-piscem
argument in the example simpleaf index
command.
If docker/singularity is an option to you, you can call simpleaf with piscem from a usefulaf image as discussed in section The Docker Hub Container.
Download FASTQ Files
We create a working directory with sufficient space to download all the input data and hold the output (50GB should be sufficient). We alias this directory and use the alias below so that you can easily set it to something else and still copy and paste the later shell commands.
export AF_SAMPLE_DIR=$PWD/af_test_workdir
mkdir $AF_SAMPLE_DIR
cd $AF_SAMPLE_DIR
Then, we download the FASTQ files from the PBMC1k (v3) healthy donor samples.
FASTQ_DIR="$AF_SAMPLE_DIR/data/pbmc_1k_v3_fastqs"
mkdir -p $FASTQ_DIR
wget -qO- https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_fastqs.tar | tar xf - --strip-components=1 -C $FASTQ_DIR
A Quick Simpleaf Run Through Example
Step 0 : Configuration
In order to operate properly, simpleaf requires that you set the environment variable ALEVIN_FRY_HOME
. It will use the directory pointed to by this variable to cache useful information. To configure simpleaf, please run the following shell commands. Please check section Configure simpleaf if you want to specify the path to the executable of its dependencies.
# define env variable
export ALEVIN_FRY_HOME="$AF_SAMPLE_DIR/af_home"
# simpleaf configuration
simpleaf set-paths
Besides, if you are running simpleaf
on your PC or laptop, you should not give simpleaf
more threads (via --threads
or -t
for short) than the number of high performance cores you have, because, upon testing, some of its dependencies run extremely slow on M1 chip’s efficiency cores. For Apple M1 chips, you should at most use 8 threads, --threads 8
.
Furthermore, you might need to update the maximum number of file descriptors that a process can have to a large number, becuase the index builders used by simpleaf, piscem and cuttlefish, sometimes create a large number of intermediate files. To do this, you can run the following shell command.
ulimit -n 2048
Step 1 : Build the reference index
Simpleaf maps the sequencing reads recorded in FASTQ files against a reference index. The index usually needs only to be built once for one set of the reference genome and gene annotations. In this example, we show the shell command of building a piscem spliced+intronic (splici) index, which contains the spliced transcripts and intronic sequences, for Human reference (GRCh38) dataset version 2020-A from 10x Genomics. If you want to build a pufferfish index instead of a piscem index, please remove the --use-piscem
argument when running simpleaf index
. For more options about indexing and mapping, check secion Indexing and Mapping Options.
cd $AF_SAMPLE_DIR
REF_DIR="$AF_SAMPLE_DIR/data/refdata-gex-GRCh38-2020-A"
mkdir -p $REF_DIR
IDX_DIR="$AF_SAMPLE_DIR/human-2020-A_splici"
# Download reference genome and gene annotations
wget -qO- https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz | tar xzf - --strip-components=1 -C $REF_DIR
# simpleaf index
simpleaf index \
--output $IDX_DIR \
--fasta $REF_DIR/fasta/genome.fa \
--gtf $REF_DIR/genes/genes.gtf \
--rlen 91 \
--threads 16 \
--use-piscem # remove this if missing piscem
Notice that we need a read length argument --rlen
(or -r
for short) in addition to the genome and gene annotation file for building a splici index. This value can usually be found in the dataset description or log. The read length of our example dataset can be found on its webpage. For 10x Chromium 3’ protocols, if library preparation follows 10x Genomics’ recommendataions, --rlen
is usually set as 91 for v3 chemistry, and 98 for v2 chemistry. We recommend checking and using the exact read length of the processed dataset for the read length argument --rlen
. For other options of reference building, check section References.
Step 2 : Quantify the sample
Given a reference index, a count matrix can be generated from the FASTQ files of a scRNA-seq dataset by the simpleaf quant
program. In the following shell code chunk, in addition to the simpleaf quant
program, we show how to get the expected order of the FASTQ filenames in a directory (and its subdirectory) recursively and store them in the reads1
and reads2
shell variables. If typing those filenames manually, please note that FASTQ filenames have to be comma-separated.
# Define filename pattern
reads1_pat="_R1_"
reads2_pat="_R2_"
# Obtain and sort filenames
reads1="$(find -L ${FASTQ_DIR} -name "*$reads1_pat*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd, -)"
reads2="$(find -L ${FASTQ_DIR} -name "*$reads2_pat*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd, -)"
# simpleaf quantfication
simpleaf quant \
--reads1 $reads1 \
--reads2 $reads2 \
--threads 16 \
--index $IDX_DIR/index \
--chemistry 10xv3 --resolution cr-like \
--expected-ori fw --unfiltered-pl \
--t2g-map $IDX_DIR/index/t2g_3col.tsv \
--output $AF_SAMPLE_DIR/pbmc1k_quant
The above simpleaf quant
shell command will run piscem
to generate the RAD, perform unfiltered permit-list generation — automatically downloading the appropriate external permit-list — collate the RAD file, and quantify the gene counts using the cr-like
UMI resolution. More advanced usages could be found in this section.
Step 3 : Load output
At the end of this process, the count matrix is in the directory $AF_SAMPLE_DIR/pbmc1k_quant/af_quant
. Below, we show how to load the count matrix in R or Python.
R : In R, we use the loadFry()
function from the fishpond
Bioconductor package to load the count matrix as a SingleCellExperiment object, which can be easily turned into a seurat object.
# Run in R
library(SingleCellExperiment)
library(Seurat)
# define count matrix format
custom_format <- list("counts" = c("S", "A"),
"unspliced" = c("U"))
# load count matrix
sce <- fishpond::loadFry("pbmc1k_quant/af_quant",
outputFormat = custom_format)
# create seurat object
seurat_obj <- CreateSeuratObject(counts(sce))
This will return a SingleCellExperiment
object containing the counts for this experiment. The counts
field contains the total counts of spliced and ambiguous UMIs, and the unspliced
field contains unspliced counts.
CellRanger7 includes both spliced and unspliced counts in the gene count matrix of single-cell and single-nucleus RNA-seq. If using a splici (default) or spliceu index (see section Indexing and Mapping Options for details), simpleaf returns the counts of spliced and unspliced UMIs separately. So it is trivial to mimic CellRanger7’s behavior: just pass outputFormat="snRNA"
to loadFry
, or use the following custom format:
# Run in R
custom_format <- list("counts" = c("U","S","A"))
sce <- fishpond::loadFry("pbmc1k_quant/af_quant",
outputFormat = custom_format)
For other pre-defined output formats and materials about the custom format, please see its reference page.
Python : In python, you can make use of the load_fry()
python function from the pyroe package (A dependency of simpleaf) to load the count matrix as a scanpy’s AnnData
object.
# Run in python
from pyroe import load_fry
#define count matrix format
custom_format = {'X' : ['S', 'A'],
'unspliced' : ['U']}
# load count matrix
adata = load_fry("pbmc1k_quant/af_quant",
output_format = custom_format)
For analysis approaches that utilize both the spliced (S+A) and unspliced (U) counts, one can run the above code or specify output_format="velocity"
or "scrna"
to make the spliced and unspliced counts as separate layers of the returned AnnData
object. If you wanted to simply sum the spliced and unspliced counts (as would be done in the output of, e.g., CellRanger7), you could use the command:
# Run in python
# same as pre-defined format "all", "total" and "scrna"
custom_format = {'X' : ['U','S','A']}
adata = load_fry("pbmc1k_quant/af_quant",
output_format = custom_format)
From here, you can use your favorite downstream analysis packages (like Seurat in R or Scanpy in python) to perform quality control, filtering, and analysis of your data.
Advanced Usage
1. Indexing and Mapping Options
Salmon with pufferfish index is used by simpleaf defaultly if you don’t specify --use-piscem
. Additionally, simpleaf supports indexing and mapping using the piscem. Piscem is an improved index for the compacted reference de Bruijn graph that admits the same set of operations as the pufferfish index used by salmon, but that is substantially more space efficient. For example, the full spliceu indices of the human 2020-A reference, which contains all nascent and spliced transcripts, are 1.8GB. This allows the piscem single-cell mapper to operate in a very low memory footprint, ≤3GB for all the data evaluated in He et al., 2023. Currently, piscem supports 10x Genomics Chromium 3’ v2 and v3 protocols and any custom geometry.
Piscem is available on bioconda x86_64 and will be installed when installing simpleaf. For Non-conda users and Apple silicon users, piscem can be compiled and then specified in simpleaf set-paths
or downloaded from the GitHub release page. See section Set Up Simpleaf Manually for details.
One can tell simpleaf to use piscem instead of salmon by specifying the --use-piscem
argument when running the simpleaf index
program. simpleaf quant
will figure out that piscem, instead of salmon, should be used by reading the log JSON file from the simpleaf index
run, as we showed in the quick run through example.
However, suppose the piscem index was built by directly calling the piscem build
program instead of simpleaf index
. In that case, one has to specify the --use-piscem
when running simpleaf quant
, since there will be no log JSON files for simpleaf to process. Furthermore, one needs to give the path to the piscem index prefix to the --index
argument instead of the path to the index directory because piscem adopts a different (better) parsing strategy. For example, if calling simpleaf index --use-piscem
, the piscem index directory will be in the following structure:
$ ls $IDX_DIR/index
piscem_idx_cfish.json piscem_idx.ectab piscem_idx.sshash t2g_3col.tsv
piscem_idx.ctab piscem_idx.refinfo simpleaf_index.json
Then, the path to the piscem index prefix should be $REF_DIR/index/piscem_idx
. Again, if the piscem index was built by simpleaf index
, simpleaf quant
will automatically figure out the prefix of the index, so one just needs to specify the index folder. In this case, $REF_DIR/index/
. In conclusion,
- If the piscem index was generated by
simpleaf index
, you should use--index $REF_DIR/index/
. - If the piscem index was generated by
piscem build
, you should use--index $REF_DIR/index/piscem_idx
2. References
Currently, simpleaf index
provides three strategies to build a reference index:
- If passing a genome FASTA file to
--fasta
(or-f
for short) and a gene annotation GTF file to--gtf
(or-g
for short), simpleaf will make the spliced + intronic (splici) reference, which contains the spliced transcripts and intronic sequences of each gene, and build a splici index on it. This is what we did in the above example. Notice that the read length argument--rlen
is required for splici reference construction. However, its effect on the quantification result is minor, as discussed in Supplementary section S4 in the alevin-fry manuscript.simpleaf index \ --use-piscem \ --output $IDX_DIR \ --fasta $REF_DIR/fasta/genome.fa \ --gtf $REF_DIR/genes/genes.gtf \ --threads 16 \ --rlen 91 # required by splici reference construction
- If passing
--ref-type spliced+unspliced
or--ref-type spliceu
in addition to the above arguments, simpleaf will make the spliceu (spliced + unspliced) reference, which contains the spliced transcripts and the unspliced transcript of each gene, and build a spliceu index on it. Notice that here the unspliced transcript of a gene contains the whole genomic interval of the gene defined in the GTF file. Also note, as a result, that the read length parameter--rlen
is no longer required to make the spliced+unspliced reference. More details about the spliceu reference can be found in He et al., 2023.simpleaf index \ --use-piscem \ --output $IDX_DIR \ --fasta $REF_DIR/fasta/genome.fa \ --gtf $REF_DIR/genes/genes.gtf \ --threads 16 \ --ref-type spliced+unspliced # or spliceu
- If passing the
--ref-seq
argument with a path to an existing reference FASTA file, for example, the transcript sequences downloaded from GENCODE, simpleaf will directly build a reference index for the sequences in the FASTA file. For example, we can directly pass the splici reference FASTA file created from the above example:simpleaf index \ --use-piscem \ --output $IDX_DIR \ --threads 16 \ --ref-seq $IDX_DIR/ref/splici_fl86.fa
3. The Docker Hub Container
For users with difficulty installing conda packages or piscem, we provide the usefulaf docker hub container, which bundles all of the required software and utilities, including piscem, for running simpleaf. The usefulaf container is compatible with docker and singularity. The following will show how to perform the above example in a singularity image.
First, we download the usefulaf container that contains all of the software (including piscem) we’ll need to do our processing.
singularity pull docker://combinelab/usefulaf:latest
As everything has been installed and configured in the usefulaf container, one can direct call simpleaf
programs without worrying about the paths and configuration. Overall, to run commands in a singularity image, You give singularity exec
the command you want to execute:
singularity exec --cleanenv \
--bind $AF_SAMPLE_DIR:/workdir \
usefulaf_latest.sif \
ANY_COMMAND_YOU_WANT_TO_RUN
As is the case when you run natively, before executing the index operation, you should ensure that you have set the file handle limit sufficiently high (e.g. by running
ulimit -n 2048
). This will be inherited within the singularity environment.
If you are using docker instead of singularity, then you should pass the option
--ulimit nofile=2048:2048
to the docker run
invocation.For example, to build the same splici index as in the above example, use the following shell command:
singularity exec --cleanenv \
--bind $AF_SAMPLE_DIR:/workdir \
usefulaf_latest.sif \
simpleaf index \
--fasta /workdir/data/refdata-gex-GRCh38-2020-A/fasta/genome.fa \
--gtf /workdir/data/refdata-gex-GRCh38-2020-A/genes/genes.gtf \
--rlen 91 \
--output /workdir/human-2020-A_splici \
--threads 16 \
--use-piscem # piscem is in the image
To quantify the example dataset as in the above example, use the following shell command:
singularity exec --cleanenv \
--bind $AF_SAMPLE_DIR:/workdir \
usefulaf_latest.sif \
simpleaf quant \
--reads1 /workdir/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz,/workdir/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz \
--reads2 /workdir/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz,/workdir/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz \
--index /workdir/human-2020-A_splici/index \
--chemistry 10xv3 --resolution cr-like --expected-ori fw --unfiltered-pl \
--t2g-map /workdir/human-2020-A_splici/index/t2g_3col.tsv \
--threads 16 \
--output /workdir/pbmc1k_quant
4. Set Up Simpleaf Manually
If neither conda nor usefulaf container works, you can also manually set up simpleaf and its dependencies and then configure the path to the executable of the dependencies in the simpleaf set-paths
program. Before start, please ensure that you have rust and pip in your local environment.
Simpleaf and alevin-fry
As simpleaf and alevin-fry are 100% in rust, they can be installed from crate.io by running cargo install simpleaf alevin-fry
. By doing so, simpleaf and alevin-fry will be in your path, and you won’t need to worry about setting their paths.
For people enjoying compiling source code, you can run the following shell commands:
For simpleaf:
cd $AF_SAMPLE_DIR
# clone GitHub repo
git clone https://github.com/COMBINE-lab/simpleaf.git && cd simpleaf
# build from source
cargo build --release
# add to PATH environment variable
export PATH="$AF_SAMPLE_DIR/simpleaf/target/release/simpleaf:$PATH"
# print help message
simpleaf --help
cd ..
For alevin-fry:
cd $AF_SAMPLE_DIR
# clone GitHub repo
git clone https://github.com/COMBINE-lab/alevin-fry.git && cd alevin-fry
# build from source
cargo build --release
# print help message
target/release/alevin-fry --help
cd ..
piscem
For Apple silicon users, we provide the pre-built executable of piscem to avoid any problems you might encounter during compliation. Here we show how to set up piscem 0.4.3, released on Feb 13, 2023. Please check the latest released version and replace the link in the following wget
command accordingly.
# For apple silicon users
cd $AF_SAMPLE_DIR
mkdir piscem
wget -qO- https://github.com/COMBINE-lab/piscem/releases/download/v0.4.3/piscem-v0.4.3-aarch64-apple-darwin.tar.xz | tar -Jxf - --strip-components=1 -C piscem
chmod +x piscem/piscem
As piscem relies on cuttlefish and piscem-cpp, which are written in C++, one needs to clone piscem GitHub repository with submodules to install it.
cd $AF_SAMPLE_DIR
# clone repo with submodules
git clone --recursive https://github.com/COMBINE-lab/piscem.git
cd piscem
# call rust compiler
cargo build --release
# print help message
target/release/piscem --help
cd ..
If you got some weird error messages talking about cmake, it means you have to specify your C++17
-compatible C++ compiler.
# call rust compiler
CXX=<path_to_c++_compiler> CC=<path_to_c_compiler> cargo build --release
pyroe
The pyroe python package is used for constructing extended references. It could be installed via pip using pip install pyroe[scanpy]
or being compiled by running the following shell command, where scanpy is an extra dependency and is only required if using the load_fry()
python function.
cd $AF_SAMPLE_DIR
git clone https://github.com/COMBINE-lab/pyroe.git && cd pyroe
# pip install
pip install ".[scanpy]"
# print help page
bin/pyroe --help
cd ..
salmon
The latest version of salmon binary is provided here. One can download it directly to avoid compiling it (a giant beast). At the time we wrote this tutorial, the latest version of salmon was salmon v1.9.0, released on Jun 23, 2022. Here we show how to obtain its binary:
cd $AF_SAMPLE_DIR
mkdir salmon
# download and decompress salmon binary
wget -qO- https://github.com/COMBINE-lab/salmon/releases/download/v1.9.0/salmon-1.9.0_linux_x86_64.tar.gz | tar xzf - --strip-components=1 -C salmon
# give permission
chmod +x salmon/bin/salmon
salmon/bin/salmon --help
To use other versions of salmon, just replace the link to the executable in the above command with yours.
Configure simpleaf
To tell simpleaf that its dependencies are ready to use, we need to give the path to the executables in simpleaf set-paths
. If the paths to some dependencies are in your PATH
environment variable, you can just skip setting them, as simpleaf set-paths
will check your PATH
to find them. Here we assume that alevin-fry was installed by calling cargo install alevin-fry
and its path is in the PATH
variable, and everything else was compiled from source code, and their path is not in your PATH
.
cd $AF_SAMPLE_DIR
# simpleaf configuration
simpleaf set-paths \
--pyroe $AF_SAMPLE_DIR/pyroe/bin/pyroe \
--salmon $AF_SAMPLE_DIR/salmon/bin/salmon \
--piscem $AF_SAMPLE_DIR/piscem/target/release/piscem
If alevin-fry is not in your PATH
, you can add its path to the above command.
cd $AF_SAMPLE_DIR
# simpleaf configuration
simpleaf set-paths \
--pyroe $AF_SAMPLE_DIR/pyroe/bin/pyroe \
--salmon $AF_SAMPLE_DIR/salmon/bin/salmon \
--alevin-fry $AF_SAMPLE_DIR/alevin-fry/target/release/alevin-fry \
--piscem $AF_SAMPLE_DIR/piscem/target/release/piscem
Summary
In conclusion, simpleaf drastically reduces the number of shell commands one must run to generate a gene count matrix from scRNA-seq FASTQ files. It can be quickly accessed from conda and the usefulaf docker/singularity image.