alevin-fry-tutorials


Tutorials for using the alevin-fry single-cell RNA-seq pipeline


Generating a spatial transcriptomics count matrix with simpleaf

Simpleaf is a command-line toolkit written in Rust that exposes a unified, simplified interface for processing scRNA-seq datasets using the alevin-fry pipeline.

In this tutorial, you will learn how to use simpleaf to process spatial transcriptomics data. You can find more information about the available commands in the official simpleaf documentation. If you are new to alevin-fry and simpleaf or want to learn about processing scRNA-seq data, check out the other tutorials.

Installation

The easiest way to get started with simpleaf is by installing it in a conda environment:

conda create -n simpleaf -y -c bioconda -c conda-forge simpleaf
conda activate simpleaf

Configuration

simpleaf needs the environment variable ALEVIN_FRY_HOME to point to a directory where it can store information. Create a working directory and set the environment variable like so:

# Make a working directory
AF_SAMPLE_DIR="$PWD/af_test_workdir"
mkdir -p "$AF_SAMPLE_DIR"
cd "$AF_SAMPLE_DIR"

# Define the required env variable
export ALEVIN_FRY_HOME="$AF_SAMPLE_DIR/af_home"

# Simpleaf configuration
simpleaf set-paths

# Increase the number of open files for piscem
ulimit -n 2048
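
Note that `ulimit -n 2048` sets the limit to 2048 even in a shell that already allows more open files. The following is a minimal sketch of a guard that only raises the soft limit when it is too low. The helper name `ensure_open_files` is hypothetical and not part of simpleaf or piscem.

```shell
# Hypothetical helper (not part of simpleaf): raise the soft open-file
# limit to at least $1, leaving it alone if it is already high enough.
ensure_open_files() {
  eof_want="$1"
  eof_cur="$(ulimit -S -n)"
  # "unlimited" or an already sufficient value needs no change
  if [ "$eof_cur" = "unlimited" ] || [ "$eof_cur" -ge "$eof_want" ]; then
    return 0
  fi
  # This can fail when the hard limit is below the requested value
  ulimit -S -n "$eof_want"
}

ensure_open_files 2048 || echo "could not raise the open-file limit to 2048" >&2
```

Because the helper changes only the soft limit, the hard limit of the session stays as it was.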

Downloading the data

For this tutorial, we will download a moderately sized spatial transcriptomics dataset from the 10x Genomics website. The dataset was generated using a 10X Visium V5 slide and the Human Whole Transcriptome probe set.

First, we download the FASTQ files.

# Download the fastq files
DATA_DIR="$AF_SAMPLE_DIR/data"
FASTQ_DIR="$DATA_DIR/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_fastqs"
mkdir -p $FASTQ_DIR

wget -qO- https://cf.10xgenomics.com/samples/spatial-exp/2.1.0/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_fastqs.tar \
  | tar xf - --strip-components=1 -C $FASTQ_DIR

Second, we download the probe set file and generate a FASTA file and a transcript-to-gene mapping file from it.

PROBES_FILE="$DATA_DIR/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_probe_set.csv"

# Download the probe set file
wget -qO $PROBES_FILE https://cf.10xgenomics.com/samples/spatial-exp/2.1.0/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1/CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_probe_set.csv

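# Skip the first 6 lines (metadata and column header). Column 1 is the gene ID
# and column 2 the probe sequence; probes of a gene are named <gene>_1, <gene>_2, ...
awk -F',' 'NR>6 {
  bc[$1]++;
  tid=$1"_"bc[$1];
  print ">"tid"\n"$2 > "data/ref.fa";
  print tid"\t"$1 > "data/t2g.tsv"
}' "$PROBES_FILE"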

Below are the first few lines of ref.fa:

>ENSG00000000003_1
GGTGACACCACAACAATGCAACGTATTTTGGATCTTGTCTACTGCATGGC
>ENSG00000000003_2
TCTGCATCTCTCTGTGGAGTACAATCTTCAAGTTTACAGCAACTCTTAGG
>ENSG00000000003_3
AAAGCTGTTCTTAATCTCATGTCTGAAAACAAATCCTACGATGGCAGCGA

And the first few lines of t2g.tsv:

ENSG00000000003_1       ENSG00000000003
ENSG00000000003_2       ENSG00000000003
ENSG00000000003_3       ENSG00000000003
ENSG00000000005_1       ENSG00000000005
ENSG00000000005_2       ENSG00000000005
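
Before indexing, it can be worth confirming that the two generated files agree. The following is a minimal sketch, assuming the FASTA headers and the first column of the mapping file use the same transcript names, as in the excerpts above. The helper name `check_ref_t2g` is hypothetical.

```shell
# Hypothetical helper: succeed only if every FASTA record name in $1
# appears exactly once in the first column of the tab-separated file $2.
check_ref_t2g() {
  awk -F'\t' '
    FILENAME == ARGV[1] { seen[$1]++; next }   # mapping rows: count each transcript ID
    /^>/ {                                     # FASTA header lines
      n++
      if (seen[substr($0, 2)] != 1) bad++
    }
    END {
      printf "fasta_records=%d unmatched=%d\n", n, bad
      if (n == 0 || bad > 0) exit 1
    }
  ' "$2" "$1"
}

# Run the check only when the files from the previous step exist
if [ -f data/ref.fa ] && [ -f data/t2g.tsv ]; then
  check_ref_t2g data/ref.fa data/t2g.tsv || echo "ref.fa and t2g.tsv disagree" >&2
fi
```

The function prints a one-line summary and returns a non-zero status when a FASTA record has no mapping row, has more than one, or when the FASTA file contains no records.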

Running simpleaf index

Using the generated ref.fa, we can now run the simpleaf index command to build a reference index. piscem requires a working directory for intermediate files that are read and written frequently. If you are on a cloud computing platform, you might want to set --work-dir to a scratch folder on local storage (e.g., /tmp/workdir.noindex) to avoid a high I/O burden: in cloud computing, CPUs and storage are often separate resources, so I/O to remote storage can be much slower than to local storage.

simpleaf index --ref-seq data/ref.fa --work-dir ./workdir.noindex --output simpleaf_index

The resulting index will be stored in the simpleaf_index/index directory. We will use this index in the next step to generate the count matrix.
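
One way to follow the scratch-folder advice is to create a fresh directory under the system temporary location and pass it as --work-dir. This is a minimal sketch under two assumptions: that `$TMPDIR` (or /tmp) sits on local storage, and that an existing empty directory is acceptable as the working directory. The simpleaf call is skipped when simpleaf or data/ref.fa is not available.

```shell
# Create a unique scratch directory under $TMPDIR (falls back to /tmp)
WORK_DIR="$(mktemp -d "${TMPDIR:-/tmp}/simpleaf_work.XXXXXX")"
echo "scratch directory: $WORK_DIR"

# Only run the indexing step when simpleaf and the reference are available
if command -v simpleaf >/dev/null 2>&1 && [ -f data/ref.fa ]; then
  simpleaf index --ref-seq data/ref.fa --work-dir "$WORK_DIR" --output simpleaf_index
fi
```

The scratch directory can be deleted once indexing has finished.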

Running simpleaf quant

Next, run the simpleaf quant command to generate the count matrix. The --index and --t2g-map options take the index and the transcript-to-gene mapping file built in the previous steps, and the --output option specifies the output directory for the count matrix.

# Define the reads1 and reads2 patterns
R1_pat="R1"
R2_pat="R2"
R1="$(find -L "${FASTQ_DIR}" -name "*${R1_pat}*" -type f | sort | paste -sd, -)"
R2="$(find -L "${FASTQ_DIR}" -name "*${R2_pat}*" -type f | sort | paste -sd, -)"

# Run simpleaf quant
simpleaf quant --chemistry visiumv5-probe \
               --unfiltered-pl \
               --reads1 "$R1" \
               --reads2 "$R2" \
               --index simpleaf_index/index \
               --t2g-map data/t2g.tsv \
               --output simpleaf_quant \
               --resolution cr-like
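
Since the two read lists are paired by position, a quick check that both are non-empty and equally long can catch a wrong pattern before a long run. The following is a minimal sketch; the helper names `collect_fastqs` and `count_items` are hypothetical, and the patterns are the ones used above.

```shell
# Hypothetical helper: print a comma-separated, sorted list of the files
# under directory $1 whose names contain the pattern $2.
collect_fastqs() {
  find -L "$1" -type f -name "*$2*" | sort | paste -sd, -
}

# Hypothetical helper: count the entries in a comma-separated list.
count_items() {
  if [ -z "$1" ]; then
    echo 0
  else
    echo "$1" | awk -F',' '{ print NF }'
  fi
}

# Run the check only when the FASTQ directory from the download step exists
if [ -n "${FASTQ_DIR:-}" ] && [ -d "$FASTQ_DIR" ]; then
  R1="$(collect_fastqs "$FASTQ_DIR" "R1")"
  R2="$(collect_fastqs "$FASTQ_DIR" "R2")"
  n1="$(count_items "$R1")"
  n2="$(count_items "$R2")"
  if [ "$n1" -eq 0 ] || [ "$n1" -ne "$n2" ]; then
    echo "R1/R2 file lists are empty or differ in length ($n1 vs $n2)" >&2
  fi
fi
```

Sorting both lists keeps the files of each lane in the same position in R1 and R2, provided the file names differ only in the read label.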

The resulting count matrix will be stored in the simpleaf_quant/af_quant/alevin directory, and you can use it for downstream analysis.

Spatial coordinate information is stored alongside each barcode in the simpleaf_quant/af_quant/alevin/quants_mat_rows.txt file. Here are the first few lines:

GGCGCCTACGAATGAA        69      83
CAATCTGCAATGAACA        36      50
CGCTGTATCCGGTTCT        80      102
GTCTCCATTAGATGAA        37      53
GACTGATTGGCCATGT        38      48
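
A quick summary of this file can serve as a sanity check before downstream analysis. The following is a minimal sketch that counts barcodes and reports the range of the two auxiliary columns. It assumes whitespace-separated columns as in the excerpt above; the tutorial does not state which column is which axis, so the output labels them only by position. The helper name `summarize_rows` is hypothetical.

```shell
# Hypothetical helper: print the barcode count and the value range of
# columns 2 and 3 of a quants_mat_rows.txt-style file ($1).
summarize_rows() {
  awk '
    { v2 = $2 + 0; v3 = $3 + 0 }                      # force numeric comparison
    NR == 1 { min2 = max2 = v2; min3 = max3 = v3 }    # initialise from first row
    {
      if (v2 < min2) min2 = v2; if (v2 > max2) max2 = v2
      if (v3 < min3) min3 = v3; if (v3 > max3) max3 = v3
    }
    END { printf "barcodes=%d col2=%d..%d col3=%d..%d\n", NR, min2, max2, min3, max3 }
  ' "$1"
}

# Run the summary only when the quantification output exists
ROWS_FILE="simpleaf_quant/af_quant/alevin/quants_mat_rows.txt"
if [ -f "$ROWS_FILE" ]; then
  summarize_rows "$ROWS_FILE"
fi
```

Applied to only the five lines shown above, the function prints `barcodes=5 col2=36..80 col3=48..102`.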

Conclusion

Because simpleaf supports permit lists with auxiliary information, you can easily use simpleaf to process sequencing-based spatial transcriptomics data, including 10X Visium assays and Stereo-seq.