alevin-fry-tutorials


Tutorials for using the alevin-fry single-cell RNA-seq pipeline


How to run many different sc-RNA-seq protocols using salmon alevin's custom geometry

Author: Gaurav Sharma, Computational Biologist at Ocean Genomics
This project has been made possible by the team at Ocean Genomics, and by a grant from the Chan Zuckerberg Initiative.

Nota Bene: The expanded custom geometry support described in this document is currently in a pull request and is expected to be included in salmon in the near future. For the time being, to use or test this feature, you will have to build salmon from the custom_geometry branch. This document will be updated once the feature has been integrated into a tagged release of salmon.

The number of single-cell RNA sequencing technologies keeps growing, and each new technology brings a protocol whose reads must be handled differently. To support a wide variety of protocols, salmon allows users to define a protocol in terms of its cellular barcode and UMI geometry. This tutorial explains how to use this feature. Henceforth, the terms barcode and umi refer to the cell/nucleus barcode and the UMI, respectively.

Types of custom protocols

There are two kinds of custom geometry protocols that are supported:

  1. Fixed length: When the lengths and positions of the barcode and UMI are known
  2. Variable length: When the lengths of the barcode and UMI may vary, but the limits are known. Furthermore, fixed sequences in the reads are supported.

1. Fixed length protocols

Use this feature for single-cell protocols in which the barcode, umi and biological read have fixed lengths and known positions. Three flags need to be provided:

  • --bc-geometry
  • --umi-geometry
  • --read-geometry

Each flag takes an argument that specifies the barcode, umi and read geometry, respectively. The argument syntax is the same for all three flags: ReadNumber[PositionList]. The ReadNumber can be either 1 or 2. For PositionList, here are some examples:

  • [1-5]
  • [1-10, 21-30]
  • [1-end]
  • [11-15, 21-25, 31-end]

As the examples show, PositionList is a comma-separated list of 1-based, inclusive position ranges. The keyword end denotes that the feature (barcode, umi or biological read) extends to the end of the read.
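To make the position syntax concrete, here is a minimal Python sketch of how a ReadNumber[PositionList] argument could be interpreted. It is an illustration only, not salmon's parser; the function names parse_geometry and extract, and the toy read, are hypothetical.

```python
import re

def parse_geometry(spec):
    """Parse 'ReadNumber[PositionList]' into (read_number, [(start, end), ...]).

    Positions stay 1-based and inclusive; end is None for the 'end' keyword.
    """
    match = re.fullmatch(r"([12])\[([^\]]+)\]", spec.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a valid geometry: {spec!r}")
    read_number = int(match.group(1))
    ranges = []
    for part in match.group(2).split(","):
        start, end = part.split("-")
        ranges.append((int(start), None if end == "end" else int(end)))
    return read_number, ranges

def extract(sequence, ranges):
    """Concatenate the 1-based inclusive ranges taken from sequence."""
    pieces = []
    for start, end in ranges:
        # A 1-based inclusive range [start, end] is the 0-based slice [start-1:end];
        # end=None slices through to the end of the read.
        pieces.append(sequence[start - 1:end])
    return "".join(pieces)

# Toy read (hypothetical): 10 A's, 10 C's, 10 G's, 5 T's.
toy_read = "A" * 10 + "C" * 10 + "G" * 10 + "T" * 5
read_number, ranges = parse_geometry("1[1-10, 21-30]")
print(read_number, ranges)        # 1 [(1, 10), (21, 30)]
print(extract(toy_read, ranges))  # AAAAAAAAAAGGGGGGGGGG
```

The sketch strips spaces before parsing, but on the command line it is safer to write the argument without spaces so that it is passed as a single token.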

Example

  • If you are working with SPLiT-seq data, here’s how you would specify the flags:
--bc-geometry 2[11-18,49-56,87-94] --umi-geometry 2[1-10] --read-geometry 1[1-end]

This means that the barcode occurs in read 2 from positions 11 to 18, 49 to 56 and 87 to 94. The final barcode would be a concatenation of the sequences in these positions. Similarly, the umi occurs in read 2 from positions 1 to 10 and the biological read is in read 1, spanning it entirely.

The full command will be something like:

salmon alevin -i <index> -l A -1 <fastq1> -2 <fastq2> -p <#Threads> --bc-geometry 2[11-18,49-56,87-94] --umi-geometry 2[1-10] --read-geometry 1[1-end] -o <output_path> --justAlign

Note, however, that alevin supports SPLiT-seq directly: to quantify SPLiT-seq data you can simply use the --splitseqV1 or --splitseqV2 switch, for versions 1 and 2 of the protocol, respectively.
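When the custom-geometry command above is launched from a script, passing it as an argument list avoids any shell interpretation of the square brackets. The following Python sketch assembles that list; the function name build_alevin_command and the file paths are placeholders, and the flags are only those shown in this tutorial.

```python
import shlex

def build_alevin_command(index, fastq1, fastq2, output, threads=8):
    """Return the salmon alevin argument list for the SPLiT-seq geometry above."""
    return [
        "salmon", "alevin",
        "-i", index,
        "-l", "A",
        "-1", fastq1,
        "-2", fastq2,
        "-p", str(threads),
        "--bc-geometry", "2[11-18,49-56,87-94]",
        "--umi-geometry", "2[1-10]",
        "--read-geometry", "1[1-end]",
        "-o", output,
        "--justAlign",
    ]

# Placeholder paths; replace them with your own index, FASTQ files and output directory.
command = build_alevin_command("index_dir", "reads_1.fastq.gz", "reads_2.fastq.gz", "out_dir")

# shlex.join quotes the bracketed arguments, giving a string that is safe to paste into a shell.
print(shlex.join(command))

# To actually run it (requires salmon on PATH):
#   import subprocess; subprocess.run(command, check=True)
```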

2. Custom geometry protocol

The fixed length flags cover many protocols, but some protocols cannot be described with them. For those, you can use the --custom-geometry flag. It applies when:

  • The barcode or umi spans both reads
  • The barcode or umi has variable length
  • There is a fixed sequence in the read(s)

Let’s understand the argument that can be supplied to the --custom-geometry flag through an example:

1{b[10-12]f[ATCATC]u[10]}2{r}

Here, the first read starts with a barcode of variable length 10-12 bp, followed by a fixed sequence “ATCATC”, followed by a umi of length 10 bp. The second read is all biological sequence.

So, the argument takes the form Read1{DescriptionList}Read2{DescriptionList}, where Read1 is 1 and Read2 is 2. DescriptionList is a list of descriptions of the barcode, umi, biological read, fixed sequence and excluded bases. Each description is a description identifier followed by its accompanying information in square brackets. The possible description identifiers are:

  1. b for barcode. Usage: b[Length] or b[Length range]. E.g., b[10] or b[9-11].
  2. u for umi. Usage: u[Length] or u[Length range]. E.g., u[8] or u[8-10].
  3. f for fixed sequence. Usage: f[sequence]. E.g., f[CAGAGC].
  4. x for exclude. Usage: x[Length] or x[Length range]. E.g., x[4] or x[3-5].
  5. r for biological read. It is special as it is not followed by square brackets.
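One way to build intuition for this grammar is to translate a description list into a regular expression. The Python sketch below does that for the example 1{b[10-12]f[ATCATC]u[10]}2{r}. It is an illustration under assumptions, not salmon's implementation: the function names, the [ACGTN] alphabet and the toy read are all hypothetical.

```python
import re

# One description: b, u or x with a length or length range, f with a sequence, or a bare r.
DESCRIPTION = re.compile(r"([bux])\[(\d+)(?:-(\d+))?\]|f\[([ACGT]+)\]|(r)")

def read_description_to_regex(description_list):
    """Turn e.g. 'b[10-12]f[ATCATC]u[10]' into (regex string, capture group labels)."""
    pattern, labels, pos = "^", [], 0
    while pos < len(description_list):
        m = DESCRIPTION.match(description_list, pos)
        if m is None:
            raise ValueError(f"cannot parse {description_list[pos:]!r}")
        kind, low, high, fixed, bio = m.groups()
        if fixed:
            pattern += fixed                       # fixed sequence: matched literally
        elif bio:
            pattern += "(.*)"                      # biological read: the rest of the read
            labels.append("r")
        else:
            quantifier = f"{{{low},{high}}}" if high else f"{{{low}}}"
            if kind == "x":
                pattern += f"[ACGTN]{quantifier}"  # excluded bases: matched, not captured
            else:
                pattern += f"([ACGTN]{quantifier})"
                labels.append(kind)
        pos = m.end()
    return pattern, labels

def parse_custom_geometry(geometry):
    """Split '1{...}2{...}' and translate the description list of each read."""
    m = re.fullmatch(r"1\{([^{}]*)\}2\{([^{}]*)\}", geometry)
    if m is None:
        raise ValueError(f"not a valid custom geometry: {geometry!r}")
    return {1: read_description_to_regex(m.group(1)),
            2: read_description_to_regex(m.group(2))}

parsed = parse_custom_geometry("1{b[10-12]f[ATCATC]u[10]}2{r}")
pattern, labels = parsed[1]
print(pattern)  # ^([ACGTN]{10,12})ATCATC([ACGTN]{10})

# Toy read 1 (hypothetical): an 11 bp barcode, the fixed sequence, then a 10 bp umi.
toy_read1 = "ACGTACGTACG" + "ATCATC" + "TTTTTTTTTT"
print(dict(zip(labels, re.match(pattern, toy_read1).groups())))
# {'b': 'ACGTACGTACG', 'u': 'TTTTTTTTTT'}
```

The fixed sequence acts as an anchor: the regex engine tries barcode lengths within the allowed range until the fixed sequence lines up, which is how a variable length barcode can be resolved.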

Example

If you’re working with sci-RNA-seq3 data, here’s how you would specify the flag:

--custom-geometry 1{b[9-10]f[CAGAGC]u[8]b[10]}2{r}

This says that read 1 starts with a variable length barcode of length 9-10 bp, followed by a fixed sequence “CAGAGC”, followed by umi and barcode of lengths 8 and 10 bp, respectively. The second read is all biological sequence.

When a protocol has a variable length barcode or umi (see note 1 at the end of this document), padding sequences are added to make all output barcodes the same length and to avoid spurious matches between barcodes. For example, if a protocol has barcodes of length 3-4 bp, and two of the barcodes are ATG and ATGA, then after processing with salmon alevin using the --custom-geometry flag they become ATGAC and ATGAA, respectively. So, for a variable length barcode, the output barcode length is one more than the maximum barcode length. The padding is taken from the nucleotides A, C, G, T, added in this order until the lengths are the same.
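The padding rule can be sketched in a few lines of Python. This reproduces the ATG / ATGA example above, assuming that the padding is a prefix of "ACGT" just long enough to reach the maximum length plus one. The function name pad_barcode is hypothetical, and because the behaviour for length ranges wider than 3 bp is not described here, the sketch rejects them.

```python
PADDING = "ACGT"

def pad_barcode(barcode, max_length):
    """Pad barcode to max_length + 1 with a prefix of 'ACGT' (assumed rule)."""
    missing = max_length + 1 - len(barcode)
    if not 1 <= missing <= len(PADDING):
        raise ValueError(f"cannot pad {barcode!r} to length {max_length + 1}")
    return barcode + PADDING[:missing]

print(pad_barcode("ATG", 4))   # ATGAC
print(pad_barcode("ATGA", 4))  # ATGAA
```

Without the extra base, padding ATG with a single A would give ATGA, which collides with the genuine 4 bp barcode ATGA. Padding every barcode to one more than the maximum length keeps the two distinct.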

Comparison between the two types

There are some key differences between the flags for fixed length protocols (type 1) and custom geometry protocols (type 2):

  1. Custom geometry protocols use just one flag, --custom-geometry, instead of three.
  2. While positions need to be specified for type 1, type 2 needs lengths instead.
  3. Variable length and fixed sequences are not supported in type 1.
  4. Type 2 can be used for all type 1 protocols, but not vice-versa.

Regarding point 4, it is important to note that type 2 uses regex parsing, unlike type 1, and is therefore about 30% slower. So if a protocol can be described with type 1, prefer type 1 when faster quantification is desired. Type 2 is expected to become faster in future updates.
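Points 2 and 4 can be illustrated by rewriting the type 1 SPLiT-seq positions from earlier as type 2 lengths. The Python sketch below performs that conversion for one read, filling the gaps between features with x (exclude) descriptions. The helper name positions_to_descriptions is hypothetical, and the resulting string follows the grammar described in this tutorial but has not been verified against salmon.

```python
def positions_to_descriptions(features):
    """Convert (identifier, start, end) features on one read into a description list.

    Positions are 1-based and inclusive, as in the type 1 flags.
    """
    descriptions, cursor = [], 1
    for identifier, start, end in sorted(features, key=lambda f: f[1]):
        if start < cursor:
            raise ValueError("features overlap")
        if start > cursor:
            # Bases between two features are skipped with an exclude description.
            descriptions.append(f"x[{start - cursor}]")
        descriptions.append(f"{identifier}[{end - start + 1}]")
        cursor = end + 1
    return "".join(descriptions)

# SPLiT-seq read 2, from --bc-geometry 2[11-18,49-56,87-94] and --umi-geometry 2[1-10].
splitseq_read2 = [("u", 1, 10), ("b", 11, 18), ("b", 49, 56), ("b", 87, 94)]
geometry = "1{r}2{" + positions_to_descriptions(splitseq_read2) + "}"
print(geometry)  # 1{r}2{u[10]b[8]x[30]b[8]x[30]b[8]}
```

The reverse conversion is not always possible: a variable length description such as b[9-10] has no single set of positions, which is why type 1 cannot express every type 2 protocol.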

Final thoughts

The field of single-cell transcriptomics is moving fast. The aforementioned features are provided with the goal of supporting almost all imaginable protocols. While salmon alevin already supports the major protocols, it takes some time to add support for a new one. This is where the features in this tutorial come in handy. The --custom-geometry flag makes it possible to run a wide variety of protocols, so users can process their data even when no explicit flag for their protocol exists.

  1. The author is not aware of any protocols that use variable length umi.