30 Dec 2024 • on simpleaf

Generating a spatial transcriptomics count matrix with simpleaf

Simpleaf is a command-line toolkit written in Rust that exposes a unified, simplified interface for processing scRNA-seq datasets using the alevin-fry pipeline.

In this tutorial, you will learn how to use simpleaf to process spatial transcriptomics data. You can find more information about the available commands in the official simpleaf documentation. If you are new to alevin-fry and simpleaf or want to learn about processing scRNA-seq data, check out the other tutorials.

Installation

The easiest way to get started with simpleaf is by installing it in a conda environment:

conda create -n simpleaf -y -c bioconda -c conda-forge simpleaf
conda activate simpleaf

Configuration

simpleaf needs the environment variable ALEVIN_FRY_HOME to point to a directory where it can store information. Create a working directory and set the environment variable like so:

# Make a working directory
AF_SAMPLE_DIR="$PWD/af_test_workdir"
mkdir -p "$AF_SAMPLE_DIR"
cd "$AF_SAMPLE_DIR"

# Define the required env variable
export ALEVIN_FRY_HOME="$AF_SAMPLE_DIR/af_home"

# Simpleaf configuration
simpleaf set-paths

# Increase the number of open files for piscem
ulimit -n 2048
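
Before moving on, it can help to confirm that the shell is set up the way this section describes. The sketch below is not part of simpleaf: check_af_env is a hypothetical helper name, and the threshold of 2048 simply mirrors the ulimit value used above.

```shell
# Hypothetical helper (not part of simpleaf): sanity-check the shell setup.
# The body runs in a subshell, so its variables do not leak into your session.
# Exit status is 0 if everything looks fine and 1 otherwise.
check_af_env() (
  status=0

  # ALEVIN_FRY_HOME must be set and non-empty
  if [ -z "${ALEVIN_FRY_HOME:-}" ]; then
    echo "ALEVIN_FRY_HOME is not set" >&2
    status=1
  fi

  # The open-file limit should be at least the value used in this tutorial
  limit="$(ulimit -n)"
  if [ "$limit" != "unlimited" ] && [ "$limit" -lt 2048 ]; then
    echo "open-file limit is $limit; consider running: ulimit -n 2048" >&2
    status=1
  fi

  # simpleaf should be on PATH once the conda environment is active
  if ! command -v simpleaf >/dev/null 2>&1; then
    echo "simpleaf not found on PATH; is the conda environment active?" >&2
    status=1
  fi

  exit "$status"
)

check_af_env || echo "Environment check reported problems (see messages above)."
```

If the check reports a problem, re-run the corresponding command from the block above in the same shell session.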

If you use simpleaf often, you can point ALEVIN_FRY_HOME at a permanent folder in your home directory and export it from your shell profile so that the environment is set up automatically. For example, if you use bash, the following command appends the export line to your ~/.bashrc file:

echo 'export ALEVIN_FRY_HOME="$HOME/af_home"' >> ~/.bashrc

Downloading the data

For this tutorial, we will download a small spatial transcriptomics dataset from the 10x Genomics website. The dataset was generated using a 10X Visium V5 slide and the Human Whole Transcriptome probe set. Note that although this is one of the smallest Visium datasets, the FASTQ tarball is still 12 GB. Please make sure you have about 40 GB of free disk space to complete this tutorial.
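
To check the disk space requirement before starting the download, a small sketch like the following can be used. The helper name free_gb is hypothetical, and the 40 GB threshold is just the figure quoted above.

```shell
# Hypothetical helper: print the free space (in whole GB) on the filesystem
# that holds the given directory (defaults to the current directory).
free_gb() {
  # df -P prints one line per filesystem; -k reports sizes in 1 KiB blocks.
  # Column 4 of the second line is the available space.
  df -Pk "${1:-.}" | awk 'NR == 2 { printf "%d\n", $4 / (1024 * 1024) }'
}

required_gb=40   # figure quoted in this tutorial
available_gb="$(free_gb .)"
if [ "$available_gb" -lt "$required_gb" ]; then
  echo "Only ${available_gb} GB free; this tutorial suggests about ${required_gb} GB." >&2
else
  echo "${available_gb} GB free; enough for this tutorial."
fi
```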

First, we download the FASTQ files.

# Download the fastq files
DATA_DIR="$AF_SAMPLE_DIR/data"
FASTQ_DIR="$DATA_DIR/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_fastqs"
mkdir -p "$FASTQ_DIR"

wget -qO- https://cf.10xgenomics.com/samples/spatial-exp/2.1.0/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_fastqs.tar \
  | tar xf - --strip-components=1 -C "$FASTQ_DIR"

Second, we download the probe set file. simpleaf will build the FASTA file and the transcript-to-gene mapping file from it during the indexing step.

PROBES_FILE="$DATA_DIR/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_probe_set.csv"

# Download the probe set file
wget -qO "$PROBES_FILE" https://cf.10xgenomics.com/samples/spatial-exp/2.1.0/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_probe_set.csv
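
Once the download finishes, a quick look at the probe set file can catch a truncated or empty download. The sketch below assumes the file follows the 10x probe set CSV layout (metadata lines starting with #, then one header row, with the gene ID in the first column); summarize_probes is a hypothetical helper name, not a simpleaf command.

```shell
# Hypothetical helper: summarize a probe set CSV.
# Assumes metadata lines start with '#', followed by one header row,
# and that the first column holds the gene ID.
summarize_probes() {
  grep -v '^#' "$1" | awk -F, '
    NR == 1 { next }                 # skip the header row
    { probes++; genes[$1] = 1 }      # count rows and distinct first-column values
    END {
      n = 0
      for (g in genes) n++
      printf "probes=%d genes=%d\n", probes, n
    }'
}

# Only runs when PROBES_FILE is defined and the file exists
if [ -f "${PROBES_FILE:-}" ]; then
  summarize_probes "$PROBES_FILE"
fi
```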

Running simpleaf index

Using the fetched probe_set.csv file, we can now run the simpleaf index command to generate a reference index. piscem needs a working directory for intermediate files that are read and written frequently. On a cloud computing platform, where compute and storage are often separate resources and I/O can be much slower than on local storage, you might want to set --work-dir to a scratch folder (e.g., /tmp/workdir.noindex) to avoid a high I/O burden.

simpleaf index --probe-csv "$PROBES_FILE" --work-dir ./workdir.noindex --output simpleaf_index
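
If you do want a scratch location for the working directory, one possible approach is sketched below. It assumes that $TMPDIR (or /tmp) sits on fast local storage on your machine; make_scratch_workdir and WORK_DIR are hypothetical names used only for illustration.

```shell
# Hypothetical helper: create a fresh scratch directory for piscem's
# intermediate files under $TMPDIR (falling back to /tmp) and print its path.
make_scratch_workdir() {
  mktemp -d "${TMPDIR:-/tmp}/simpleaf_workdir.XXXXXX"
}

WORK_DIR="$(make_scratch_workdir)"
echo "Scratch work directory: $WORK_DIR"

# Same command as above, only with --work-dir pointing at the scratch directory.
# Guarded so that it runs only when simpleaf and the probe file are available.
if command -v simpleaf >/dev/null 2>&1 && [ -f "${PROBES_FILE:-}" ]; then
  simpleaf index --probe-csv "$PROBES_FILE" --work-dir "$WORK_DIR" --output simpleaf_index
fi
```

The scratch directory only holds intermediate files, so it can be deleted once indexing has finished.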

The resulting index will be stored in the simpleaf_index/index directory. We will use this index in the next step to generate the count matrix.

Running simpleaf quant

Next, run the simpleaf quant command to generate the count matrix. The --output option specifies the output directory for the count matrix. The --anndata-out flag tells simpleaf to also write the count matrix in the AnnData format, which is a common format for storing single-cell data.

# Define the reads1 and reads2 patterns
reads1_pat="_R1_"
reads2_pat="_R2_"

# Obtain and sort filenames
reads1="$(find -L "${FASTQ_DIR}" -name "*${reads1_pat}*" -type f | sort | paste -sd, -)"
reads2="$(find -L "${FASTQ_DIR}" -name "*${reads2_pat}*" -type f | sort | paste -sd, -)"

# Run simpleaf quant
simpleaf quant --chemistry visiumv5-probe \
               --unfiltered-pl \
               --reads1 "$reads1" \
               --reads2 "$reads2" \
               --index simpleaf_index/index \
               --output simpleaf_quant \
               --resolution cr-like \
               --anndata-out
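
Because the R1 and R2 files are passed as two comma-separated lists, it is worth confirming that the lists pair up; a check like the one below is most useful right before the simpleaf quant call. It is only a sketch: count_csv_items and check_read_pairs are hypothetical helper names, and the check compares list lengths only, not the individual file names.

```shell
# Hypothetical helper: count the entries in a comma-separated list.
count_csv_items() {
  if [ -z "$1" ]; then
    echo 0
  else
    printf '%s\n' "$1" | awk -F, '{ print NF }'
  fi
}

# Hypothetical helper: succeed only if both lists are non-empty and equally long.
check_read_pairs() {
  n1="$(count_csv_items "$1")"
  n2="$(count_csv_items "$2")"
  [ "$n1" -gt 0 ] && [ "$n1" -eq "$n2" ]
}

if check_read_pairs "${reads1:-}" "${reads2:-}"; then
  echo "Found matching R1/R2 file lists."
else
  echo "R1/R2 lists are empty or differ in length; check FASTQ_DIR and the patterns." >&2
fi
```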

The resulting count matrix will be stored in the simpleaf_quant/af_quant/alevin directory, and you can use it for downstream analysis. The quants.h5ad file contains the count matrix in the AnnData format, along with all the associated information you need to start your analysis. If you want to start with the matrix file, quants_mat.mtx, you can find the row names (cell barcodes and the corresponding xy coordinates) in quants_mat_rows.txt, and the gene IDs in quants_mat_cols.txt. The first few lines in the quants_mat_rows.txt file look like this:

GGCGCCTACGAATGAA        69      83
CAATCTGCAATGAACA        36      50
CGCTGTATCCGGTTCT        80      102
GTCTCCATTAGATGAA        37      53
GACTGATTGGCCATGT        38      48
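
As a quick sanity check on the rows file, the sketch below counts the barcodes and reports the coordinate range. It assumes the layout shown above (whitespace-separated barcode, then x, then y); summarize_rows and ROWS_FILE are hypothetical names.

```shell
# Hypothetical helper: report the number of barcodes and the coordinate range
# in a rows file with whitespace-separated columns: barcode, x, y.
summarize_rows() {
  awk '
    NR == 1 { xmin = xmax = $2 + 0; ymin = ymax = $3 + 0 }
    {
      x = $2 + 0; y = $3 + 0          # force numeric comparison
      if (x < xmin) xmin = x
      if (x > xmax) xmax = x
      if (y < ymin) ymin = y
      if (y > ymax) ymax = y
    }
    END {
      if (NR > 0)
        printf "barcodes=%d x=[%d,%d] y=[%d,%d]\n", NR, xmin, xmax, ymin, ymax
    }' "$1"
}

ROWS_FILE="simpleaf_quant/af_quant/alevin/quants_mat_rows.txt"
if [ -f "$ROWS_FILE" ]; then
  summarize_rows "$ROWS_FILE"
fi
```

For the five example lines above alone, this would print barcodes=5 x=[36,80] y=[48,102].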

Conclusion

Because simpleaf supports permit lists with auxiliary information, you can easily use simpleaf to process sequencing-based spatial transcriptomics data, including 10X Visium assays and Stereo-seq.