Simpleaf is a command-line toolkit written in Rust that exposes a unified, simplified interface for processing scRNA-seq datasets using the alevin-fry pipeline.
In this tutorial, you will learn how to use simpleaf to process spatial transcriptomics data. You can find more information about the available commands in the official simpleaf documentation. If you are new to alevin-fry and simpleaf, or want to learn about processing scRNA-seq data, check out the other tutorials.
Installation
The easiest way to get started with simpleaf is by installing it in a conda environment:
conda create -n simpleaf -y -c bioconda -c conda-forge simpleaf
conda activate simpleaf
Configuration
simpleaf needs the environment variable ALEVIN_FRY_HOME to point to a directory where it can store information. Create a working directory and set the environment variable like so:
# Make a working directory
AF_SAMPLE_DIR="$PWD/af_test_workdir"
mkdir -p "$AF_SAMPLE_DIR"
cd "$AF_SAMPLE_DIR"
# Define the required env variable
export ALEVIN_FRY_HOME="$AF_SAMPLE_DIR/af_home"
# Simpleaf configuration
simpleaf set-paths
# Increase the number of open files for piscem
ulimit -n 2048
If you use simpleaf often, you can point ALEVIN_FRY_HOME to a folder in your home directory and add the variable to your shell profile so that the environment is set up automatically. For example, if you use bash, you can append the export line to your ~/.bashrc file:
echo 'export ALEVIN_FRY_HOME="$HOME/af_home"' >> ~/.bashrc
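A quick way to confirm the configuration above took effect is to report the state of each piece before moving on. The following is a minimal sketch, not part of simpleaf itself; the variable names af_home_status, simpleaf_status, and open_files are chosen here for illustration only.

```shell
# Sketch: report whether the pieces configured above are in place.
# The *_status variable names are illustrative, not used by simpleaf.
if [ -n "${ALEVIN_FRY_HOME:-}" ]; then
  af_home_status="set to $ALEVIN_FRY_HOME"
else
  af_home_status="not set"
fi

# Check that the simpleaf executable is reachable from this shell
if command -v simpleaf >/dev/null 2>&1; then
  simpleaf_status="found at $(command -v simpleaf)"
else
  simpleaf_status="not found on PATH"
fi

# Current limit on open files for this shell session
open_files="$(ulimit -n)"

echo "ALEVIN_FRY_HOME: $af_home_status"
echo "simpleaf: $simpleaf_status"
echo "open file limit: $open_files"
```

If any line reports "not set" or "not found", activate the conda environment and export the variable again in the current shell.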
Downloading the data
For this tutorial, we will download a small spatial transcriptomics dataset from the 10x Genomics website. The dataset was generated using a 10X Visium V5 slide and the Human Whole Transcriptome probe set. Note that although this is one of the smallest Visium datasets, the compressed FASTQ tarball is still 12 GB. Please make sure you have about 40 GB of free disk space to complete this tutorial.
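Since the download is large, it may be worth checking the free space in the working directory first. The sketch below assumes a POSIX df and uses the 40 GB figure mentioned above; the variable names required_gb, avail_kb, and avail_gb are illustrative.

```shell
# Sketch: compare free space in the current directory with the ~40 GB needed
required_gb=40

# df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
# the fourth column of the second line is the available space
avail_kb="$(df -Pk . | awk 'NR==2 {print $4}')"
avail_gb=$((avail_kb / 1024 / 1024))

if [ "$avail_gb" -lt "$required_gb" ]; then
  echo "Only ${avail_gb} GB free; about ${required_gb} GB is recommended" >&2
else
  echo "${avail_gb} GB free; this should be enough"
fi
```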
First, we download the FASTQ files.
# Download the fastq files
DATA_DIR="$AF_SAMPLE_DIR/data"
FASTQ_DIR="$DATA_DIR/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_fastqs"
mkdir -p "$FASTQ_DIR"
wget -qO- https://cf.10xgenomics.com/samples/spatial-exp/2.1.0/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_fastqs.tar \
| tar xf - --strip-components=1 -C "$FASTQ_DIR"
Second, we download the probe set file. The FASTA file and the transcript-to-gene mapping file are derived from it when we build the index with --probe-csv in the next section.
PROBES_FILE="$DATA_DIR/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_probe_set.csv"
# Download the probe set file
wget -qO "$PROBES_FILE" https://cf.10xgenomics.com/samples/spatial-exp/2.1.0/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_probe_set.csv
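To make the idea of a probe-derived FASTA file and transcript-to-gene mapping concrete, here is a toy sketch using awk. It is not how simpleaf does the conversion internally. The three-column layout (gene_id, probe_seq, probe_id), the gene names, and the sequences are all made up for illustration; the real probe set file may carry extra columns and comment lines.

```shell
# Toy probe set CSV with a hypothetical layout (gene_id,probe_seq,probe_id).
# All names and sequences below are invented for illustration.
toy_csv="$(mktemp)"
cat > "$toy_csv" <<'EOF'
#toy_comment_line
gene_id,probe_seq,probe_id
GENE_A,ACGTACGTAC,GENE_A|probe1
GENE_B,TTGCAAGGCC,GENE_B|probe1
EOF

# FASTA: one record per probe, named by probe_id; skip comments and the header
toy_fasta="$(awk -F, '!/^#/ && $1 != "gene_id" {print ">" $3 "\n" $2}' "$toy_csv")"

# Mapping: probe_id <TAB> gene_id, so each probe is assigned to its gene
toy_t2g="$(awk -F, -v OFS='\t' '!/^#/ && $1 != "gene_id" {print $3, $1}' "$toy_csv")"

printf '%s\n' "$toy_fasta"
printf '%s\n' "$toy_t2g"
rm -f "$toy_csv"
```

In this picture each probe plays the role of a transcript, and the mapping file lets counts from several probes be collapsed to one gene.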
Running simpleaf index
Using the fetched probe_set.csv, we can now run the simpleaf index command to generate a reference index. Because piscem requires a working directory for storing intermediate files that are read and written frequently, you might want to set --work-dir to a scratch folder (e.g., /tmp/workdir.noindex) if you are on a cloud computing platform. On such platforms, compute and storage are often separate resources, so I/O to network-attached storage can be much slower than I/O to local storage.
simpleaf index --probe-csv "$PROBES_FILE" --work-dir ./workdir.noindex --output simpleaf_index
The resulting index will be stored in the simpleaf_index/index directory. We will use this index in the next step to generate the count matrix.
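For the scratch-folder setup mentioned above, one option is to let mktemp create a unique directory under the system's temporary location. This is a sketch under the assumption that TMPDIR (or /tmp) lives on fast local storage on your machine; SCRATCH_ROOT and WORK_DIR are names chosen here, and the simpleaf command is left commented out so the snippet has no side effects beyond creating the directory.

```shell
# Sketch: create a unique scratch directory for piscem's intermediate files.
# Assumes TMPDIR (or /tmp) is on fast local storage; adjust for your platform.
SCRATCH_ROOT="${TMPDIR:-/tmp}"
WORK_DIR="$(mktemp -d "$SCRATCH_ROOT/simpleaf_workdir.XXXXXX")"
echo "Intermediate files would be written to: $WORK_DIR"

# Then pass it to the index command from this tutorial, for example:
# simpleaf index --probe-csv "$PROBES_FILE" --work-dir "$WORK_DIR" --output simpleaf_index

# Remove the scratch directory once the index has been built:
# rm -rf "$WORK_DIR"
```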
Running simpleaf quant
Next, run the simpleaf quant command to generate the count matrix. The --output option specifies the output directory for the count matrix. The --anndata-out flag tells simpleaf to also write the count matrix in the AnnData format, which is a common format for storing single-cell data.
# Define the reads1 and reads2 patterns
reads1_pat="_R1_"
reads2_pat="_R2_"
# Obtain and sort filenames
reads1="$(find -L ${FASTQ_DIR} -name "*$reads1_pat*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd, -)"
reads2="$(find -L ${FASTQ_DIR} -name "*$reads2_pat*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd, -)"
# Run simpleaf quant
simpleaf quant --chemistry visiumv5-probe \
--unfiltered-pl \
--reads1 "$reads1" \
--reads2 "$reads2" \
--index simpleaf_index/index \
--output simpleaf_quant \
--resolution cr-like \
--anndata-out
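The comma-separated lists passed to --reads1 and --reads2 need to contain the same number of files in matching order. The sketch below replays the find, sort, and paste pattern from the command above on empty placeholder files and checks the pairing. The file names and the toy_* variables are hypothetical; on real data you would apply the same check to $reads1 and $reads2 before launching simpleaf quant.

```shell
# Toy directory with empty placeholder FASTQ files (hypothetical names)
toy_dir="$(mktemp -d)"
touch "$toy_dir/sample_S1_L001_R1_001.fastq.gz" "$toy_dir/sample_S1_L001_R2_001.fastq.gz" \
      "$toy_dir/sample_S1_L002_R1_001.fastq.gz" "$toy_dir/sample_S1_L002_R2_001.fastq.gz"

# Build sorted, comma-separated lists the same way as in the tutorial
toy_r1="$(find -L "$toy_dir" -name "*_R1_*" -type f | sort | paste -sd, -)"
toy_r2="$(find -L "$toy_dir" -name "*_R2_*" -type f | sort | paste -sd, -)"

# Count the entries in each list
n_r1="$(printf '%s\n' "$toy_r1" | tr ',' '\n' | wc -l | tr -d ' ')"
n_r2="$(printf '%s\n' "$toy_r2" | tr ',' '\n' | wc -l | tr -d ' ')"

# Swapping the read tag in list 1 should reproduce list 2 if files pair up in order
expected_r2="$(printf '%s\n' "$toy_r1" | sed 's/_R1_/_R2_/g')"
if [ "$n_r1" -eq "$n_r2" ] && [ "$expected_r2" = "$toy_r2" ]; then
  pairing="ok"
else
  pairing="mismatch"
fi
echo "R1 files: $n_r1, R2 files: $n_r2, pairing: $pairing"
rm -rf "$toy_dir"
```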
The count matrix produced by simpleaf quant is stored in the simpleaf_quant/af_quant/alevin directory, and you can use it for downstream analysis. The quants.h5ad file contains the count matrix in the AnnData format, along with all the associated information you need to start your analysis. If you want to start with the matrix file, quants_mat.mtx, you can find the row names (cell barcodes and their corresponding x and y coordinates) in quants_mat_rows.txt, and the gene IDs in quants_mat_cols.txt. The first few lines of the quants_mat_rows.txt file look like this:
GGCGCCTACGAATGAA 69 83
CAATCTGCAATGAACA 36 50
CGCTGTATCCGGTTCT 80 102
GTCTCCATTAGATGAA 37 53
GACTGATTGGCCATGT 38 48
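As a small illustration of working with this file, the sketch below summarizes the coordinate range using the five example rows shown above. It assumes the columns are whitespace-separated, with the barcode first, then x, then y; the temporary file and variable names are illustrative. On real output you would point awk at simpleaf_quant/af_quant/alevin/quants_mat_rows.txt instead.

```shell
# Sketch: summarize the coordinates in a rows file.
# Uses the five example rows from this tutorial as stand-in input.
rows_file="$(mktemp)"
cat > "$rows_file" <<'EOF'
GGCGCCTACGAATGAA 69 83
CAATCTGCAATGAACA 36 50
CGCTGTATCCGGTTCT 80 102
GTCTCCATTAGATGAA 37 53
GACTGATTGGCCATGT 38 48
EOF

# Number of barcodes with both coordinates present
n_barcodes="$(awk 'NF >= 3 {n++} END {print n + 0}' "$rows_file")"

# Minimum and maximum of each coordinate, printed as: xmin xmax ymin ymax
coord_range="$(awk '{x = $2 + 0; y = $3 + 0}
  NR == 1 {xmin = xmax = x; ymin = ymax = y}
  {if (x < xmin) xmin = x; if (x > xmax) xmax = x
   if (y < ymin) ymin = y; if (y > ymax) ymax = y}
  END {print xmin, xmax, ymin, ymax}' "$rows_file")"

echo "barcodes: $n_barcodes"
echo "xmin xmax ymin ymax: $coord_range"
rm -f "$rows_file"
```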
Conclusion
Because simpleaf supports permit lists with auxiliary information, you can easily use simpleaf to process sequencing-based spatial transcriptomics data, including 10X Visium assays and Stereo-seq.