24 Jan 2023 • on alevin-fry

Generating a scRNA-seq count matrix with simpleaf

Simpleaf is a command line toolkit written in rust that exposes a unified and simplified interface for processing scRNA-seq datasets using the alevin-fry pipeline. It encapsulates the whole alevin-fry pipeline into two steps:

simpleaf index indexes the provided reference or makes an expanded reference and indexes it.
simpleaf quant maps the sequencing reads in provided FASTQ files against the indexed reference and quantifies the mapping records to generate a gene count matrix.

For more information about the processing pipeline conceptually, and the alevin-fry commands that correspond to each simpleaf command, see the raw data processing chapter in the Single-cell best practices book.

In this tutorial, we will show you how to generate a scRNA-seq count matrix from the raw input data – FASTQ files – using simpleaf, as well as describe some of the different options that simpleaf provides and what they do.

Installation

The recommended way of using simpleaf is from a conda environment. While you can also access it from a docker/singularity image or compile it manually.

conda create -n af -y -c bioconda -c conda-forge simpleaf
conda activate af

Download FASTQ Files

We create a working directory with sufficient space to download all the input data and hold the output (50GB should be sufficient). We alias this directory and use the alias below so that you can easily set it to something else and still copy and paste the later shell commands.

export AF_SAMPLE_DIR=$PWD/af_test_workdir
mkdir $AF_SAMPLE_DIR
cd $AF_SAMPLE_DIR

Then, we download the FASTQ files from the PBMC1k (v3) healthy donor samples.

FASTQ_DIR="$AF_SAMPLE_DIR/data/pbmc_1k_v3_fastqs"
mkdir -p $FASTQ_DIR

wget -qO- https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_fastqs.tar | tar xf - --strip-components=1 -C $FASTQ_DIR

A Quick Simpleaf Run Through Example

Step 0 : Configuration

In order to operate properly, simpleaf requires that you set the environment variable ALEVIN_FRY_HOME. It will use the directory pointed to by this variable to cache useful information. To configure simpleaf, please run the following shell commands. Please check section Configure simpleaf if you want to specify the path to the executable of its dependencies.

# define env variable
export ALEVIN_FRY_HOME="$AF_SAMPLE_DIR/af_home"

# simpleaf configuration
simpleaf set-paths

# Increase the number of open files for piscem's intemediate files
ulimit -n 2048

Step 1 : Build the reference index

Simpleaf maps the sequencing reads recorded in FASTQ files against a reference index. The index usually needs only to be built once for one set of the reference genome and gene annotations. In this example, we show the shell command of building a piscem spliced+intronic (splici) index, which contains the spliced transcripts and intronic sequences, for Human reference (GRCh38) dataset version 2024-A from 10x Genomics. For more options about indexing and mapping, check secion Indexing and Mapping Options.

cd $AF_SAMPLE_DIR
REF_DIR="$AF_SAMPLE_DIR/data/refdata-gex-GRCh38-2024-A"
mkdir -p $REF_DIR
IDX_DIR="$AF_SAMPLE_DIR/human-2024-A_splici"

# Download reference genome and gene annotations
wget -qO- https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2024-A.tar.gz | tar xzf - --strip-components=1 -C $REF_DIR

# simpleaf index
simpleaf index \
--output $IDX_DIR \
--fasta $REF_DIR/fasta/genome.fa \
--gtf $REF_DIR/genes/genes.gtf.gz \
--threads 8 \
--work-dir ./workdir.noindex

This will generate a piscem index in the $IDX_DIR/index directory. The index will be used in the next step, simpleaf quant, to map the sequencing reads in the FASTQ files against the reference. You can see that in the above command, genes.gtf.gz file is compressed but genome.fa is not. In general, simpleaf can work on both compressed and uncompressed files.

Step 2 : Quantify the sample

Given a reference index, a count matrix can be generated from the FASTQ files of a scRNA-seq dataset by the simpleaf quant program. In the following shell code chunk, in addition to the simpleaf quant program, we show how to get the expected order of the FASTQ filenames in a directory (and its subdirectory) recursively and store them in the reads1 and reads2 shell variables. If typing those filenames manually instead, please note that FASTQ filenames have to be comma-separated.

# Define filename pattern
reads1_pat="_R1_"
reads2_pat="_R2_"

# Obtain and sort filenames
reads1="$(find -L ${FASTQ_DIR} -name "*$reads1_pat*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd, -)"
reads2="$(find -L ${FASTQ_DIR} -name "*$reads2_pat*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd, -)"

# simpleaf quantfication
simpleaf quant \
--reads1 $reads1 \
--reads2 $reads2 \
--threads 16 \
--index $IDX_DIR/index \
--chemistry 10xv3 --resolution cr-like \
--unfiltered-pl --anndata-out \
--output $AF_SAMPLE_DIR/pbmc1k_quant

Step 3 : Load output

simpleaf quant writes the count matrix to the directory $AF_SAMPLE_DIR/pbmc1k_quant/af_quant. There are two versions of the count matrix, (i) the count matrix in a mtx format, and the corresponding column names and row names in two separated files and (ii), an h5ad version containing the count matrix and all its affilicated information from the simpleaf quant run. Below, we show how to load the count matrix in R or Python.

R : In R, you can load the h5ad file using anndata, zellkonverter or SeuratDisk, or the count matrix using the loadFry() function from the fishpond Bioconductor package to load the count matrix as a SingleCellExperiment object, which can be easily turned into a seurat object.

# Run in R
library(SingleCellExperiment)
library(Seurat)

# define count matrix format
custom_format <- list("counts" = c("U", "S", "A"),
                      "spliced" = c("S"),
                      "unspliced" = c("U"),
                      "ambiguous" = c("A"))

# load count matrix
sce <- fishpond::loadFry("pbmc1k_quant/af_quant",
                         outputFormat = custom_format)

# create seurat object
seurat_obj <- CreateSeuratObject(counts(sce))

This will return a SingleCellExperiment object containing the counts for this experiment. The counts field contains the total counts of unspliced, spliced and ambiguous UMIs. Each type of counts is also stored in a separate layer.

For other pre-defined output formats and materials about the custom format, please see its reference page.

Python : In python, you can directly load the quants.h5ad file using scanpy.

# Run in python
import scanpy

adata = scanpy.read_h5ad("pbmc1k_quant/af_quant/quants.h5ad")

Simpleaf uses the sum of spliced, unspliced and ambgious counts as the X layer, and store each type of counts as a separate layer.

From here, you can use your favorite downstream analysis packages (like Seurat in R or Scanpy in python) to perform quality control, filtering, and analysis of your data.

Advanced Usage

1. Indexing and Mapping Options

Simpleaf has two underlying indexing and mapping engines, defaultly, piscem, and Salmon if specifying –no-piscem. Piscem is an improved index for the compacted reference de Bruijn graph that admits the same set of operations as the pufferfish index used by salmon, but that is substantially more space efficient. For example, the full spliceu indices of the human 2024-A reference, which contains all nascent and spliced transcripts, are 1.8GB. This allows the piscem single-cell mapper to operate in a very low memory footprint, ≤3GB for all the data evaluated in He et al., 2023.

Piscem is available on bioconda x86_64 and will be installed when installing simpleaf. For Non-conda users and Apple silicon users, piscem can be compiled and then specified in simpleaf set-paths or downloaded from the GitHub release page. See section Set Up Simpleaf Manually for details.

Suppose the piscem index was built by directly calling the salmon index program instead of simpleaf index. In that case, one needs to give the path to the piscem index prefix to the --index argument instead of the path to the index directory because piscem adopts a different (better) parsing strategy. For example, a piscem index directory is in the following structure:

$ ls $IDX_DIR/index
piscem_idx_cfish.json  piscem_idx.ectab    piscem_idx.sshash    t2g_3col.tsv
piscem_idx.ctab        piscem_idx.refinfo  simpleaf_index.json

Then, the path to the piscem index prefix should be $REF_DIR/index/piscem_idx. Again, if the piscem index was built by simpleaf index, simpleaf quant will automatically figure out the prefix of the index, so one just needs to specify the index folder. In this case, $REF_DIR/index. In conclusion,

If the piscem index was generated by simpleaf index, you should use --index $REF_DIR/index/.
If the piscem index was generated by piscem build, you should use --index $REF_DIR/index/piscem_idx

2. References

Currently, simpleaf index provides three strategies to build a reference index:

If passing a genome FASTA file to --fasta (or -f for short) and a gene annotation GTF file to --gtf (or -g for short), simpleaf will make the spliced + intronic (splici) reference, which contains the spliced transcripts and intronic sequences of each gene, and build a splici index on it. This is what we did in the above example.
```
simpleaf index \
--use-piscem \
--output $IDX_DIR \
--fasta $REF_DIR/fasta/genome.fa \
--gtf $REF_DIR/genes/genes.gtf \
--threads 16 
```
If passing --ref-type spliced+unspliced or --ref-type spliceu in addition to the above arguments, simpleaf will make the spliceu (spliced + unspliced) reference, which contains the spliced transcripts and the unspliced transcript of each gene, and build a spliceu index on it. Notice that here the unspliced transcript of a gene contains the whole genomic interval of the gene defined in the GTF file.
```
simpleaf index \
--use-piscem \
--output $IDX_DIR \
--fasta $REF_DIR/fasta/genome.fa \
--gtf $REF_DIR/genes/genes.gtf \
--threads 16 \
--ref-type spliced+unspliced # or spliceu
```
If passing the --ref-seq argument with a path to an existing reference FASTA file, for example, the transcript sequences downloaded from GENCODE, simpleaf will directly build a reference index for the sequences in the FASTA file. For example, we can directly pass the splici reference FASTA file created from the above example:
```
simpleaf index \
--use-piscem \
--output $IDX_DIR \
--threads 16 \
--ref-seq $IDX_DIR/ref/splici_fl86.fa
```
Probeset-based references are also supported. This is useful when processing reads genearted from feature barcoding, such as CITE-seq or 10X feature barcoding, or from probe sets, such as the 10X Flex Assay. Because feature barcoding reference usually has a different format with probe sets, we provide separate flags, --feature-csv and --probe-csv, to take them. For standardization, the provided CSV files must follow the format defined on 10x Genomics website. The feature barcoding CSV file format is defined at here. Currently, Currently, only three columns are used: id, name, and sequence. The probe set CSV file format is defined here. It should contain four mandatory columns: gene_id, probe_seq, probe_id, and included (TRUE or FALSE), and an optional column: region (spliced or unspliced).
```
simpleaf index \
--use-piscem \
--output $IDX_DIR \
--threads 16 \
--feature-csv path/to/features.csv \
# --probe-csv patj/to/probes.csv 
```

3. Custom Chemistries

Simpleaf currently supports the following chemistries: 10X Chromium 3’ and 5’ assays, 10xv2, 10xv2-5p, 10xv3, 10xv3-5, 10xv4-3p, and smartseq, and 10x Visium assays (reads from probes or transcripts), visiumv1, visiumv1-probe, visiumv4, visiumv4-probe, visiumv5, and visiumv5-probe. If your chemistry is not listed here, you can either specify the geometry specifications as we did in the runthrough, or you can add your custom chemistries using the simpleaf chemistry subcommands. This command provides many subcommands for chemistry oerations, including add, remove, clean, loolup, and fetch. For more information about how to use this command, please see the simpleaf chemistry documentation. Here, we show how to add the 10X Chromium V3 chemistry as a custom chemistry named mychem to simpleaf.

simpleaf chemistry add \
    --name mychem \
    --geometry "1{b[16]u[12]x:}2{r:}" \
    --expected-ori fw \
    --remote-url https://umd.box.com/shared/static/vc9zd4qyjj581gvtolw5kj638wmg4f3s

To note that if you have a local file instead of a remote URL, you can use the --local-url argument instead of --remote-url. Once added, you can call it from --chemistry mychem in the simpleaf quant command.

3. The Docker Hub Container

For users with difficulty installing conda packages or piscem, simpleaf also has its dedicated docker container (from biocontainers) and singularity container from the Galaxy project available. The following will show how to perform the above example in a singularity image.

First, we pull the singularity container that contains all of the software we’ll need to do our processing.

singularity pull https://depot.galaxyproject.org/singularity/simpleaf:0.19.4--ha6fb395_0

docker pull quay.io/biocontainers/simpleaf:0.19.4--ha6fb395_0

4. Set Up Simpleaf Manually

If neither conda nor usefulaf container works, you can also manually set up simpleaf and its dependencies and then configure the path to the executable of the dependencies in the simpleaf set-paths program. Before start, please ensure that you have rust and pip in your local environment.

Simpleaf and alevin-fry

As simpleaf and alevin-fry are 100% in rust, they can be installed from crate.io by running cargo install simpleaf alevin-fry. By doing so, simpleaf and alevin-fry will be in your path, and you won’t need to worry about setting their paths.

For people enjoying compiling source code, you can run the following shell commands:

For simpleaf:

cd $AF_SAMPLE_DIR
# clone GitHub repo
git clone https://github.com/COMBINE-lab/simpleaf.git && cd simpleaf

# build from source
cargo build --release

# add to PATH environment variable
export PATH="$AF_SAMPLE_DIR/simpleaf/target/release/simpleaf:$PATH"

# print help message
simpleaf --help
cd ..

For alevin-fry:

cd $AF_SAMPLE_DIR
# clone GitHub repo
git clone https://github.com/COMBINE-lab/alevin-fry.git && cd alevin-fry

# build from source
cargo build --release

# print help message
target/release/alevin-fry --help
cd ..

piscem

As piscem relies on cuttlefish and piscem-cpp, which are written in C++, one needs to clone piscem GitHub repository with submodules to install it.

cd $AF_SAMPLE_DIR

# clone repo with submodules
git clone --recursive https://github.com/COMBINE-lab/piscem.git 
cd piscem

# call rust compiler
cargo build --release

# print help message
target/release/piscem --help
cd ..

If you got some weird error messages talking about cmake, it means you have to specify your C++17-compatible C++ compiler.

# call rust compiler
CXX=<path_to_c++_compiler> CC=<path_to_c_compiler> cargo build --release

pyroe

The pyroe python package is used for constructing extended references. It could be installed via pip using pip install pyroe[scanpy] or being compiled by running the following shell command, where scanpy is an extra dependency and is only required if using the load_fry() python function.

cd $AF_SAMPLE_DIR

git clone https://github.com/COMBINE-lab/pyroe.git && cd pyroe

# pip install
pip install ".[scanpy]"

# print help page
bin/pyroe --help
cd ..

salmon

The latest version of salmon binary is provided here. One can download it directly to avoid compiling it (a giant beast). At the time we wrote this tutorial, the latest version of salmon was salmon v1.10.1, released on March 11, 2023. Here we show how to obtain its binary:

cd $AF_SAMPLE_DIR
mkdir salmon

# download and decompress salmon binary 
wget -qO- https://github.com/COMBINE-lab/salmon/archive/refs/tags/v1.10.1.tar.gz | tar xzf - --strip-components=1 -C salmon

# give permission
chmod +x salmon/bin/salmon

salmon/bin/salmon --help

To use other versions of salmon, just replace the link to the executable in the above command with yours.

Configure simpleaf

To tell simpleaf that its dependencies are ready to use, we need to give the path to the executables in simpleaf set-paths. If the paths to some dependencies are in your PATH environment variable, you can just skip setting them, as simpleaf set-paths will check your PATH to find them. Here we assume that alevin-fry was installed by calling cargo install alevin-fry and its path is in the PATH variable, and everything else was compiled from source code, and their path is not in your PATH.

cd $AF_SAMPLE_DIR

# simpleaf configuration
simpleaf set-paths \
--pyroe $AF_SAMPLE_DIR/pyroe/bin/pyroe \
--salmon $AF_SAMPLE_DIR/salmon/bin/salmon \
--piscem $AF_SAMPLE_DIR/piscem/target/release/piscem

If alevin-fry is not in your PATH, you can add its path to the above command.

cd $AF_SAMPLE_DIR

# simpleaf configuration
simpleaf set-paths \
--pyroe $AF_SAMPLE_DIR/pyroe/bin/pyroe \
--salmon $AF_SAMPLE_DIR/salmon/bin/salmon \
--alevin-fry $AF_SAMPLE_DIR/alevin-fry/target/release/alevin-fry \
--piscem $AF_SAMPLE_DIR/piscem/target/release/piscem

Summary

In conclusion, simpleaf drastically reduces the number of shell commands one must run to generate a gene count matrix from scRNA-seq FASTQ files. It can be quickly accessed from conda and the usefulaf docker/singularity image.