Simpleaf is a command-line toolkit written in Rust that exposes a unified, simplified interface for processing scRNA-seq datasets using the alevin-fry pipeline.
In this tutorial, you will learn how to use simpleaf to process spatial transcriptomics data. You can find more information about the available commands in the official simpleaf documentation. If you are new to alevin-fry and simpleaf, or want to learn about processing scRNA-seq data, check out the other tutorials.
Installation
The easiest way to get started with simpleaf is by installing it in a conda environment:
conda create -n simpleaf -y -c bioconda -c conda-forge simpleaf
conda activate simpleaf
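As a quick check that the installation worked, the sketch below looks for the simpleaf executable on your PATH and prints its version. The SIMPLEAF_FOUND variable is only an illustrative name; simpleaf itself does not read it.

```shell
# Check whether simpleaf is reachable in the current shell
if command -v simpleaf >/dev/null 2>&1; then
  SIMPLEAF_FOUND="yes"   # illustrative variable, not used by simpleaf
  simpleaf --version
else
  SIMPLEAF_FOUND="no"
  echo "simpleaf not found on PATH; is the conda environment active?" >&2
fi
```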
Configuration
simpleaf needs the environment variable ALEVIN_FRY_HOME to point to a directory where it can store information. Create a working directory and set the environment variable like so:
# Make a working directory
AF_SAMPLE_DIR="$PWD/af_test_workdir"
mkdir -p "$AF_SAMPLE_DIR"
cd "$AF_SAMPLE_DIR"
# Define the required env variable
export ALEVIN_FRY_HOME="$AF_SAMPLE_DIR/af_home"
# Simpleaf configuration
simpleaf set-paths
# Increase the number of open files for piscem
ulimit -n 2048
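Both the exported variable and the ulimit setting apply only to the current shell session, so they need to be set again in every new terminal. The following sketch reports what the current shell sees. It assumes that ALEVIN_FRY_HOME already exists as a directory once configuration is done; AF_HOME_STATUS and OPEN_FILES_LIMIT are illustrative names only.

```shell
# Report the settings that simpleaf and piscem depend on in this shell
if [ -n "${ALEVIN_FRY_HOME:-}" ] && [ -d "$ALEVIN_FRY_HOME" ]; then
  AF_HOME_STATUS="ok"        # illustrative variable
  echo "ALEVIN_FRY_HOME=$ALEVIN_FRY_HOME"
else
  AF_HOME_STATUS="missing"
  echo "ALEVIN_FRY_HOME is unset or not a directory" >&2
fi

# Current soft limit on open files (may print "unlimited")
OPEN_FILES_LIMIT="$(ulimit -n)"
echo "open files limit: $OPEN_FILES_LIMIT"
```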
Downloading the data
For this tutorial, we will download a spatial transcriptomics dataset (not a particularly small one) from the 10x Genomics website. The dataset was generated using a 10X Visium V5 slide and the Human Whole Transcriptome probe set.
First, we download the FASTQ files.
# Download the fastq files
DATA_DIR="$AF_SAMPLE_DIR/data"
FASTQ_DIR="$DATA_DIR/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_fastqs"
mkdir -p $FASTQ_DIR
wget -qO- https://cf.10xgenomics.com/samples/spatial-exp/2.1.0/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_fastqs.tar \
| tar xf - --strip-components=1 -C $FASTQ_DIR
Second, we download the probe set file and generate a FASTA file and a transcript-to-gene mapping file from it.
PROBES_FILE="$DATA_DIR/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_probe_set.csv"
# Download the probe set file
wget -qO $PROBES_FILE https://cf.10xgenomics.com/samples/spatial-exp/2.1.0/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_probe_set.csv
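# Skip the 6 header lines, then number the probes of each gene
awk -F',' 'NR>6 {
  bc[$1]++;                            # running probe count for this gene
  tid=$1"_"bc[$1];                     # probe ID, e.g. ENSG00000000003_1
  print ">"tid"\n"$2 > "data/ref.fa";  # FASTA record: probe ID and sequence
  print tid"\t"$1 > "data/t2g.tsv"     # probe ID to gene ID mapping
}' "$PROBES_FILE"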
Below are the first few lines of ref.fa:
>ENSG00000000003_1
GGTGACACCACAACAATGCAACGTATTTTGGATCTTGTCTACTGCATGGC
>ENSG00000000003_2
TCTGCATCTCTCTGTGGAGTACAATCTTCAAGTTTACAGCAACTCTTAGG
>ENSG00000000003_3
AAAGCTGTTCTTAATCTCATGTCTGAAAACAAATCCTACGATGGCAGCGA
And the first few lines of t2g.tsv:
ENSG00000000003_1 ENSG00000000003
ENSG00000000003_2 ENSG00000000003
ENSG00000000003_3 ENSG00000000003
ENSG00000000005_1 ENSG00000000005
ENSG00000000005_2 ENSG00000000005
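To see what the awk step does without downloading anything, here is a minimal sketch that runs the same numbering logic on a toy probe file and checks that the FASTA and the mapping file stay in sync. The toy file contents, the TOY_DIR location, and the gene names are made up for illustration. Instead of skipping a fixed number of lines, this variant skips comment lines and the column header, assuming the header's first column is named gene_id.

```shell
# Toy probe set (hypothetical contents) written to a temporary directory
TOY_DIR="$(mktemp -d)"
cat > "$TOY_DIR/probes.csv" <<'EOF'
#probe_set_file_format=toy
gene_id,probe_seq,probe_id
GENE_A,ACGTACGT,GENE_A|probe1
GENE_A,TTGCATGC,GENE_A|probe2
GENE_B,GGCCAATT,GENE_B|probe1
EOF

# Same per-gene numbering as in the tutorial, but header lines are
# skipped by pattern rather than by line count
awk -F',' -v fa="$TOY_DIR/ref.fa" -v t2g="$TOY_DIR/t2g.tsv" '
  /^#/ || $1 == "gene_id" { next }
  {
    n[$1]++
    tid = $1 "_" n[$1]
    print ">" tid "\n" $2 > fa
    print tid "\t" $1 > t2g
  }' "$TOY_DIR/probes.csv"

# Every FASTA record should have exactly one row in the mapping file
N_FASTA="$(grep -c '^>' "$TOY_DIR/ref.fa")"
N_T2G="$(wc -l < "$TOY_DIR/t2g.tsv" | tr -d ' ')"
echo "FASTA records: $N_FASTA, t2g rows: $N_T2G"
```

The same two counting commands can be pointed at data/ref.fa and data/t2g.tsv to confirm that the real files contain matching numbers of entries.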
Running simpleaf index
Using the generated ref.fa, we can now run the simpleaf index command to generate a reference index. piscem requires a working directory for intermediate files that are read and written frequently. If you are on a cloud computing platform, you might want to set --work-dir to a scratch folder (e.g., /tmp/workdir.noindex) to reduce the I/O burden. The reason is that, in cloud computing, CPUs and storage are often separate resources, and I/O on remote storage can be much slower than on local storage.
simpleaf index --ref-seq data/ref.fa --work-dir ./workdir.noindex --output simpleaf_index
The resulting index will be stored in the simpleaf_index/index directory. We will use this index in the next step to generate the count matrix.
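If you want to place the working directory on scratch storage, one possible way to pick a location is sketched below. It assumes that TMPDIR (or /tmp as a fallback) is on fast local storage on your machine, which you should verify for your platform. SCRATCH_ROOT and WORK_DIR are illustrative names.

```shell
# Pick a scratch location: honor TMPDIR if set, otherwise fall back to /tmp
SCRATCH_ROOT="${TMPDIR:-/tmp}"

# Create a fresh, uniquely named working directory under it
WORK_DIR="$(mktemp -d "$SCRATCH_ROOT/workdir.noindex.XXXXXX")"
echo "Using scratch work dir: $WORK_DIR"

# Then pass it to the index command from this tutorial, e.g.:
# simpleaf index --ref-seq data/ref.fa --work-dir "$WORK_DIR" --output simpleaf_index
```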
Running simpleaf quant
Next, run the simpleaf quant command to generate the count matrix. The --index and --t2g-map options point to the index and the transcript-to-gene mapping file generated above, and the --output option specifies the output directory for the count matrix.
# Define the reads1 and reads2 patterns
R1_pat="R1"
R2_pat="R2"
# Collect the matching FASTQ files as sorted, comma-separated lists
R1="$(find -L "${FASTQ_DIR}" -name "*${R1_pat}*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd, -)"
R2="$(find -L "${FASTQ_DIR}" -name "*${R2_pat}*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd, -)"
# Run simpleaf quant
simpleaf quant --chemistry visiumv5-probe \
--unfiltered-pl \
--reads1 $R1 \
--reads2 $R2 \
--index simpleaf_index/index \
--t2g-map data/t2g.tsv \
--output simpleaf_quant \
--resolution cr-like
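Since simpleaf quant receives the read files as two comma-separated lists, it can be worth confirming that the lists pair up before starting a long run. The sketch below demonstrates such a check on a toy directory of empty placeholder files. The file names are hypothetical, and the stricter _R1_ and _R2_ patterns assume Illumina-style file names.

```shell
# Toy directory with empty placeholder files (hypothetical names)
TOY_FASTQ_DIR="$(mktemp -d)"
touch "$TOY_FASTQ_DIR/sample_S1_L001_R1_001.fastq.gz" \
      "$TOY_FASTQ_DIR/sample_S1_L001_R2_001.fastq.gz" \
      "$TOY_FASTQ_DIR/sample_S1_L002_R1_001.fastq.gz" \
      "$TOY_FASTQ_DIR/sample_S1_L002_R2_001.fastq.gz"

# Build sorted, comma-separated lists, as in the tutorial
TOY_R1="$(find -L "$TOY_FASTQ_DIR" -name "*_R1_*" -type f | sort | paste -sd, -)"
TOY_R2="$(find -L "$TOY_FASTQ_DIR" -name "*_R2_*" -type f | sort | paste -sd, -)"

# Count the entries in each list
N_R1="$(printf '%s\n' "$TOY_R1" | tr ',' '\n' | grep -c .)"
N_R2="$(printf '%s\n' "$TOY_R2" | tr ',' '\n' | grep -c .)"

# Each R1 file should have its R2 mate at the same position in the list
EXPECTED_R2="$(printf '%s' "$TOY_R1" | sed 's/_R1_/_R2_/g')"
if [ "$N_R1" -eq "$N_R2" ] && [ "$EXPECTED_R2" = "$TOY_R2" ]; then
  PAIRING="ok"
  echo "OK: $N_R1 R1 files and $N_R2 R2 files, in matching order"
else
  PAIRING="mismatch"
  echo "Mismatch between R1 and R2 lists" >&2
fi
```

Replacing TOY_FASTQ_DIR with the real FASTQ directory applies the same check to the downloaded data.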
The count matrix produced by simpleaf quant will be stored in the simpleaf_quant/af_quant/alevin directory, and you can use it for downstream analysis.
Spatial coordinate information is stored alongside each barcode in the simpleaf_quant/af_quant/alevin/quants_mat_rows.txt file. Here are the first few lines:
GGCGCCTACGAATGAA 69 83
CAATCTGCAATGAACA 36 50
CGCTGTATCCGGTTCT 80 102
GTCTCCATTAGATGAA 37 53
GACTGATTGGCCATGT 38 48
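As a small illustration of how this file can be inspected, the sketch below summarizes the two coordinate columns with awk. It works on a temporary copy of the five lines shown above rather than on the real output, and it assumes the columns are whitespace-separated. It does not assume which column is the row and which is the column coordinate.

```shell
# Temporary file holding the five example rows from above
TOY_ROWS="$(mktemp)"
printf '%s\t%s\t%s\n' \
  GGCGCCTACGAATGAA 69 83 \
  CAATCTGCAATGAACA 36 50 \
  CGCTGTATCCGGTTCT 80 102 \
  GTCTCCATTAGATGAA 37 53 \
  GACTGATTGGCCATGT 38 48 > "$TOY_ROWS"

# Count barcodes and report the range of each coordinate column
COORD_SUMMARY="$(awk '
  NF == 3 {
    n++
    if (n == 1 || $2 + 0 < min2) min2 = $2 + 0
    if (n == 1 || $2 + 0 > max2) max2 = $2 + 0
    if (n == 1 || $3 + 0 < min3) min3 = $3 + 0
    if (n == 1 || $3 + 0 > max3) max3 = $3 + 0
  }
  END { printf "spots=%d col2=[%d,%d] col3=[%d,%d]", n, min2, max2, min3, max3 }
' "$TOY_ROWS")"
echo "$COORD_SUMMARY"
# prints: spots=5 col2=[36,80] col3=[48,102]
```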
Conclusion
Because simpleaf supports permit lists with auxiliary information, you can easily use simpleaf to process sequencing-based spatial transcriptomics data, including 10X Visium assays and Stereo-seq.