30 Dec 2024 • on simpleaf

Operating on the Registry of Custom Chemistries in Simpleaf

Simpleaf is a command line toolkit written in rust that exposes a unified and simplified interface for processing scRNA-seq datasets using the alevin-fry pipeline.

In this tutorial, we will focus on the simpleaf chemistry commands for operations on the registry of simpleaf’s custom chemistries. The detailed description of each command can be found at simpleaf’s online documentation. For more information about using alevin-fry and simpleaf to process single-cell data, including spatial transcriptomics data, please refer to other tutorials.

Installation

The recommended way of using simpleaf is from a conda environment.

# Remove piscem if on Apple silicon
conda create -n simpleaf -y -c bioconda -c conda-forge simpleaf
conda activate simpleaf

Configuration

simpleaf requires the environment variable ALEVIN_FRY_HOME to be set to a directory where it can cache useful information. To configure simpleaf, please run the following shell commands.

# make a working directory
AF_SAMPLE_DIR=$PWD/af_test_workdir
mkdir $AF_SAMPLE_DIR
cd $AF_SAMPLE_DIR

# define the required env variable
export ALEVIN_FRY_HOME="$AF_SAMPLE_DIR/af_home"
# simpleaf configuration
simpleaf set-paths

Custom Chemistries in Simpleaf

In the single-cell world, a “chemistry” is defined as the library configuration of a single-cell assay, i.e., the position of the cell barcode, unique molecule identifier, and biological sequence relative to the sequenced reads. Single-cell protocols usually have their own unique chemistries. If you are unsure about the chemistry of your data, the Single Cell Genomics Library Structure repository describes in detail the chemistries of dozens of single-cell assays.

Chemistry definitions are essential for simpleaf to process any single-cell data. In simpleaf’s quantification stage, there are three types of chemistry definitions that the chemistry argument, --chemistry, can take:

the name of an internally defined chemistry,
a custom geometry description, or
the name of a registered custom geometry.

Internally, simpleaf only defines a few popular chemistries, such as

10xv2: the 10x Genomics Chromium 3’ v2 protocol
10xv3: the 10x Genomics Chromium 3’ v3 protocol
10xv4: the 10x Genomics Chromium 3’ v4 protocol
visiumv1: the 10x Genomics Visium Spatial Gene Expression Slide (v1, 6.5 mm)
visiumv1-probe: Same as visiumv1, but using a Visium probe set to target protein-coding genes instead of priming the polyA stretches of arbitrary RNAs
visiumv4: the 10x Genomics Visium Spatial Gene Expression Slide (v2, 6.5 mm)
visiumv4-probe: Same as visiumv4, but using a Visium probe set
visiumv5: the 10x Genomics Visium Spatial Gene Expression Slide (v2, 11 mm)
visiumv5-probe: Same as visiumv5, but using a Visium probe set

However, users can flexibly define and register any custom chemistries with the help of the simpleaf chemistry commands, and the Sequence Fragment Geometry Description Language. In the following text, we will show how to define and register a custom chemistry for the visiumv1 protocol.

Define the geometry description for custom chemistries

The most important part of defining a custom chemistry is to describe the geometry of the single-cell assay. In simpleaf, this is done by writing a geometry description in the Sequence Fragment Geometry Description Language. At a high level:

b[N]: barcode of length N
u[N]: UMI of length N
r: or r[N]: biological sequence, either everything left (r:) or a fixed number of bases (for example, r[50])
f[...]: fixed sequence (e.g., an adapter or known base sequence)
x: or x[N]: the sequence to discard, either everything left (x:) or a fixed number of bases (for example, x[10])

Here we show examples for writing the geometry description for some 10x Genomics protocols.

10X Chromium 3’ 3 protocol

Let’s start with a widely used protocol, 10x Genomics Chromium 3’ v3 protocol, 10xv3. A geometry usually consists of the following components: the barcode definition (b), the UMI definition (u), and the biological sequence definition. Sometimes we need to define some fixed fragment sequences (f) or the sequence that should be excluded (x).

When describing a geometry, we first need to figure out which component is at what position on which read, and what’s the length of each component. For reads from the 10x Genomics Chromium 3’ v3 protocol, the cell barcode is 16 bp long and is at the beginning of read1. The UMI is 12 bp long and is adjacent to the cell barcode. The entire read2 represents a cDNA fragment, i.e., the biological sequence.

Next, we want to assemble the information into a geometry description. We use {} to denote the content of each read, and use [] to denote the length of each component or : to denote “anything left.” Combining all these together, for the 10x Genomics Chromium 3’ v2 protocol (as described here), the geometry description is as follows: 1{b[16]u[12]x:}2{r:}, where

1{} and 2{} denote read1 and read2, respectively.
b[16] denotes the barcode takes the first 16 bases of read1.
u[12] denotes the UMI takes the next 12 bases after the barcode.
x: denotes anything left in read1 should be discarded. If using the recommended sequencing format, i.e., read1s are sequenced for 28 cycles (barcode+UMI), we can ignore the x: part, because the entire read1 is used for cell barcode and UMI.
r: denotes anything left in read2 should be treated as biological sequence. In this case, as it appears at the beginning of read2, it means the entire read2 is the biological sequence.

10X Chromium 5’ v3 protocol

The 10X Chromium 5’ v3 protocol, 10xv3-5p, as described here, does a round of reverse transcription at the beginning of library preparation. Regarding the geometry description, if using the recommended sequencing format, i.e., read1s are sequenced for 28 cycles, the geometry description is the same as 10xv3: 1{b[16]u[12]}2{r:}.

However, when read1s and read2s are sequenced for the same number of cycles, such as 91 or 150, and we want to utilize the extra bases in read1s during mapping, we need to consider the bases after the UMI in read1. Specifically, after the UMI sequence, there is a fixed fragment sequence, AAAGAATATACCC, and the rest of read1 is a biological sequence representing the 5’ end of the RNA. In this case, the geometry description should be 1{b[16]u[12]f[AAAGAATATACCC]r:}2{r:}, where the f[] denotes the fixed fragment sequence, and the r: in the read1 description denotes the biological sequence in read1. Interestingly, because there is a biological sequence in both read1 and read2, if we want to use the extra bases in read1s, we need to carefully define the direction, i.e., the expected orientation, of the reads. We will explain this later in detail in the expected orientation section.

In summary, if we want to follow the “default” 10X way to process 10xv3-5p data, we can use the same geometry description as 10xv3, 1{b[16]u[12]}2{r:}. If we want to utilize the extra bases in read1s, we should use 1{b[16]u[12]f[AAAGAATATACCC]r:}2{r:}, together with a carefully set direction.

10X Visium protocols

There are two types of reads that can be generated using the 10X Visium V1 protocol, visiumv1. The first type is the same as 10xv3, by priming the polyA stretches of arbitrary RNAs. The geometry description for this type of reads is 1{b[16]u[12]}2{r:}. The second type is generated by using a Visium probe set to target protein-coding genes. Because all probes are 50 bp in length, we only take the first 50 bases in read2 as the biological sequence and discard the rest. Therefore, the geometry description for this type of reads is 1{b[16]u[12]}2{r[50]}.

Expected Orientation

In the above examples, we have mentioned the expected orientation of the reads. The expected orientation, --expected-ori or -e, is a parameter that tells simpleaf, in the quantification stage, which read directions should be used to generate gene counts. It can be set to fw (forward), rc (reverse complement), or both (both forward and reverse complement). It describes the direction relative to the first (most upstream) mappable biological sequence.

For example, for a read from 10xv3 or a 10X Visium protocol, because its read2 is generated by sequencing the cDNA strand that is reversely transcribed from the corresponding mRNA (i.e., read2 is in the same direction as the mRNA), and it is the first (and only) mappable sequence, we should set the expected orientation to forward, fw.

For a read from 10xv3-5p, because its read2 is generated by sequencing the cDNA strand that is in the same direction as the mRNA, when we use only the read2 for mapping, we should set the expected orientation to reverse complement, rc. However, if we have extra bases in read1 and we want to use it for mapping, because the biological sequence in read1 is in the same direction as the mRNA (the reverse complement of read2), we should still set the expected orientation to forward, fw, to reflect the direction of the first mappable sequence, the biological sequence in read1.

Barcode Permit List

In the quantification stage, simpleaf can take a barcode permit list to correct observed barcodes against the permitted barcodes in the list. This cell barcode correction strategy is recommended for all protocols that have a permit list.

A barcode permit list is a tab-separated file without a header, where each line contains a permitted barcode in the first column and, optionally, some auxiliary information in the following columns. Usually, the permit list file for single-cell RNA-seq assays has a single column for the permitted barcode.

The permit list files for spatial transcriptomics assays (often called “coordinate files”) usually have two additional columns for x and y coordinates of the spot. For example, the following is a snippet of a permit list file for the 10X Genomics Visium Spatial Gene Expression Slide (v1, 6.5 mm) protocol:

AAACAACGAATAGTTC	17	1
AAACAAGTATCTCCCA	103	51
AAACAATCTACTAGCA	44	4
AAACACCAATAACTGC	20	60
AAACAGAGCGACTCCT	95	15
AAACAGCTTTCAGAAG	10	44
AAACAGGGTCTATATT	14	48
AAACAGTGTTCCTGGG	44	74
AAACATGGTGAGAGGA	1	63
AAACATTTCCCGGATT	98	62
AAACCACTACACAGAT	118	4
AAACCCGAACGAAATC	116	46
AAACCGGAAATGTTAA	125	55

Simpleaf Chemistry Commands

The chemistry command provides functionality to manage and inspect custom chemistries in simpleaf’s registry of recognized custom chemistries. It supports the following operations:

Add new custom chemistries.
Remove existing custom chemistries.
Add or refresh chemistry definitions from the upstream repository.
Lookup details of a specific chemistry.
Download corresponding permit lists for chemistries.
Search for unused permit lists and remove them from the cache.

Register a Custom Chemistry in Simpleaf

Once we figure out the geometry description, the expected orientation of the reads, and the permit list information of our custom chemistry, we can register it in simpleaf using the simpleaf chemistry add command.

Before we start, it is a good idea to refresh the local registry of chemistries, in case our local one is outdated.

simpleaf chemistry add \
--name 10xv3-test \
--geometry "1{b[16]u[12]}2{r:}" \
--expected-ori fw \
--remote-url "https://umd.box.com/shared/static/vc9zd4qyjj581gvtolw5kj638wmg4f3s" \
--version 0.1.0

In the above command, we have specified the name of the custom chemistry 10xv3-test, the geometry description 1{b[16]u[12]}2{r:}, and the expected orientation fw. We also versioned the chemistry as 0.1.0. If your chemistry has a corresponding permit list file and you want simpleaf to use it, there are two ways to tell simpleaf where to find the permit list file. One is to specify the --remote-url argument, which is a remote URL to download the permit list file. The other is to specify the --local-url argument, which is a path to the permit list file on the local machine. If either of these two arguments is specified, simpleaf will download or copy the permit list file to the defined ALEVIN_FRY_HOME directory. This also means that if your chemistry has an extra large permit list file, such as those generated from the mask files of Stereo-seq using their ST_BarcodeMap tool, you might want to pass them directly to the quantification command to avoid this copying step.

Update a Custom Chemistry in Simpleaf

Note that the versioning we use in simpleaf chemistry is for the custom chemistry definition itself, not for the version or revision of the protocol (like 10x Genomics Chromium 3’ V2 and V3). The purpose of versioning is to allow users to update the chemistry definition by providing a newer version number, for example, 0.2.0. In the following command, we update the geometry description of the chemistry 10xv3-test we defined above to add the x: part for read1. Please be sure to provide all information when you update the chemistry definition, including the permit list file, if any.

simpleaf chemistry add \
--name 10xv3-test \
--geometry "1{b[16]u[12]x:}2{r:}" \
--expected-ori fw \
--remote-url "https://umd.box.com/shared/static/vc9zd4qyjj581gvtolw5kj638wmg4f3s" \
--version 0.2.0

Remove Custom Chemistries in Simpleaf

If we want to remove a custom chemistry from the registry, we can use the simpleaf chemistry remove command. We just need to provide the name of the custom chemistry we want to remove, or a regex pattern that can match the chemistries’ names. If we are unsure about the behavior of the regex pattern, we can use the --dry-run argument to see which chemistries will be removed.

simpleaf chemistry remove --name 10xv3-test --dry-run

Once we are sure about the removal, we can remove the custom chemistry by removing the --dry-run argument.

simpleaf chemistry remove --name 10xv3-test

Look Up Custom Chemistries in Simpleaf

If we want to look up the details of a custom chemistry, we can use the simpleaf chemistry info command. We just need to provide the name of the custom chemistry we want to look up.

simpleaf chemistry lookup --name 10xv3-test

The --name argument can also take a regex pattern that can match the names of chemistries if we want to look up multiple chemistries at once. For example, if we want to look up all chemistries, just do .*

simpleaf chemistry lookup --name ".*"

You might notice that, for some internally defined chemistries, their geometry description is __builtin, which means they are built-in chemistries in simpleaf and their details are not stored in the registry.

Fetch the Permit List of Custom Chemistries in Simpleaf

If we want to fetch the permit list of a custom chemistry, we can use the simpleaf chemistry fetch command. We just need to provide the name of the custom chemistry from which we want to fetch the permit list. For example, if we want to fetch the permit list of all chemistries that have a defined remote permit list, we can use the following command.

simpleaf chemistry fetch --name ".*"

You might notice that for some chemistries, simpleaf will report that their permit list exists, even if we started with an empty registry. This is because some chemistries share the same permit list file, and simpleaf only downloads the permit list file once, identifying them by their hash value in the future.

Refresh the Local Registry of Chemistries in Simpleaf

The simpleaf team is constantly updating the registry of chemistries. If you want to refresh your local registry of chemistries, you can use the simpleaf chemistry refresh command.

simpleaf chemistry refresh

Summary

In conclusion, simpleaf provides the simpleaf chemistry commands to manage and inspect custom chemistries in its registry of recognized custom chemistries. We can use these commands to add new custom chemistries, remove existing custom chemistries, add or refresh chemistry definitions from the upstream repository, look up details of a specific chemistry, download corresponding permit lists for chemistries, and search for unused permit lists and remove them from the cache.