Mudskipper Single Cell

Transcript quantification from genomic alignments of single cell RNA-Seq reads

Author: Ehsan Haghshenas, Computational Biologist at Ocean Genomics This project has been made possible by the team at Ocean Genomics, and by a grant from the Chan Zuckerberg Initiative.


Transcript quantification for single-cell RNA-Seq (scRNA-Seq) data can be done using alevin-fry. It requires the output of salmon alevin which maps reads against the transcriptome and generates an alignment file in RAD format.

However, sometimes, it may make sense to align scRNA-Seq reads to genome (in a spliced fashion) instead of transcriptome for some tasks, or genomic alignments of the reads may already be available. Yet, alevin-fry requires alignments to be given with respect to the transcriptome. Still, it would be much more efficient to avoid starting the process from scratch (i.e. mapping reads using salmon alevin). Ideally, we could take advantage of gene annotations and somehow transform genomic alignments into transcriptomic alignments.

Mudskipper is a new tool that enables this very feature. It takes gene annotations and genomic alignments, and converts the input alignment records directly into transcript-coordinate alignments. mudskipper can output transcriptome-coordinate alignment in BAM format as well as in RAD format. The RAD file generated by mudskipper can be used directly by alevin-fry for the purpose of transcript quantification.

:pencil2: *Notes on mudskipper input requirements: * mudskipper sc command makes a few assumptions about the inputs:

  1. The input BAM/SAM contains alignments of short single-cell RNA-Seq reads, i.e. alignments of long RNA sequences are neither tested nor expected.
  2. The input BAM/SAM contains spliced alignments against reference genome.
  3. The targets of input BAM/SAM matches those in the input annotation file.

Experimental data example

In this tutorial, we first make a genomic-coordinate BAM file from a scRNA-Seq run, imagining we had, perhaps, done this for another task. Then, we use mudskipper to generate a new alignment file in RAD format that contains alignments against the reference transcriptome. We demonstrate this using the single-cell sample accessible through SRA under accession identifier SRR6376040. Note that this is a mouse sample and we should align it against the reference mouse genome. If you already have a genome-coordinate single-cell BAM/SAM file, you can skip the next section and move to the next section (Genome-coordinate to transcriptome-coordinate conversion using mudskipper).

Preparing the input data

We download raw scRNA-Seq reads directly from the Sequence Read Archive (SRA):

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR637/000/SRR6376040/SRR6376040_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR637/000/SRR6376040/SRR6376040_2.fastq.gz

This sample was sequenced using 10X genomics Chromium protocol (V2 chemistry). In order to align these short scRNA-Seq reads against the mouse reference genome, we need to use a splice-aware mapper. Among different options, in this tutorial we use the STAR aligner which provides STARsolo algorithm designed for processing single-cell samples.

We first need to create an index for STAR. Here, we download the reference mouse genome and the corresponding GTF file from GENCODE (vM25):

wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/GRCm38.primary_assembly.genome.fa.gz
gunzip GRCm38.primary_assembly.genome.fa.gz
wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.annotation.gtf.gz
gunzip gencode.vM25.annotation.gtf.gz

STAR will create the index by running the following command (using 8 threads):

STAR --runThreadN 8 --runMode genomeGenerate --genomeDir star_index --genomeFastaFiles GRCm38.primary_assembly.genome.fa --sjdbGTFfile gencode.vM25.annotation.gtf

Note that the above command requires about 32 GB of memory (RAM). It is also a time consuming process so you may want to increase the number of threads passed via --runThreadN to speed up the process if your machine has more cores.

The STAR aligner enables mapping of droplet scRNA-Seq data (e.g. 10X Chromium) via STARsolo. For this, we will need a permit-list (sometimes called a whitelist) of cell barcodes that we can get from 10X Genomics github page as follows (as instructued by STARsolo manual here). Note that the permit-list must be compatible with the 10X chemistry version, so we will download the permit-list for the second version:

wget https://github.com/10XGenomics/cellranger/raw/master/lib/python/cellranger/barcodes/737K-august-2016.txt

In order to map scRNA-Seq reads against the reference index we just created, we can use the following command:

STAR --runThreadN 8 --genomeDir star_index --readFilesCommand zcat --outFileNamePrefix mappings_genomic. --outSAMtype BAM Unsorted --chimSegmentMin 20 --chimOutType WithinBAM SoftClip --readFilesIn SRR6376040_2.fastq.gz SRR6376040_1.fastq.gz --soloType CB_UMI_Simple --soloCBwhitelist 737K-august-2016.txt --outSAMattributes NH HI AS nM CR CY UR UY

It is important to take extra care for two options of the above command. First, STAR expects the file containing cDNA reads first and then the file containing barcode. Here, since we have 10X reads, SRR6376040_2.fastq.gz should be passed first. Secondly, STAR does not store cell barcode and UMI tags by default. Therefore, we need to choose appropriate tags to keep and pass them via --outSAMattributes option.

The above command will map our scRNA-Seq reads onto the reference mouse genome and store alignments in the mappings_genomic.Aligned.out.bam file. We can confirm that this BAM contains genomic-coordinate alignments by looking at the list of targets in the header of the output BAM file:

@HD     VN:1.4
@SQ     SN:chr1 LN:195471971
@SQ     SN:chr2 LN:182113224
@SQ     SN:chr3 LN:160039680
@SQ     SN:chr4 LN:156508116
@SQ     SN:chr5 LN:151834684
@SQ     SN:chr6 LN:149736546
@SQ     SN:chr7 LN:145441459
@SQ     SN:chr8 LN:129401213
@SQ     SN:chr9 LN:124595110
@SQ     SN:chr10        LN:130694993
...

This alignment file is not ready for transcript quantification by alevin-fry. alevin-fry expects scRNA-Seq reads to be aligned against transcriptome and in RAD format. In the following section, we show how we can use mudskipper to prepare a new alignment file suitable for transcript quantification by alevin-fry.

Genome-coordinate to transcriptome-coordinate conversion using mudskipper

Building mudskipper

Mudskipper is an open-source tool written in Rust, which is freely available at https://github.com/OceanGenomics/mudskipper. If Rust is not installed on your machine, you need to install it by following the instructions on Rust language official webiste. When you have Rust ready, download the repository:

git clone git@github.com:OceanGenomics/mudskipper.git
# or alternatively by downloading the zip file and decompressing it
# wget -O mudskipper.zip https://github.com/OceanGenomics/mudskipper/archive/main.zip
# unzip -d mudskipper mudskipper.zip

and compile the code by running:

cargo build --release --manifest-path mudskipper/Cargo.toml

This will create the binary file of mudskipper at mudskipper/target/release/mudskipper.

Running mudskipper

Now everything is ready to convert the genome-coordinate BAM file mappings_genomic.Aligned.out.bam to a transcriptome-coordinate RAD file.

./mudskipper/target/release/mudskipper sc --alignment mappings_genomic.Aligned.out.bam --gtf gencode.vM25.annotation.gtf --rad --out mudskipper_out --threads 8

Note that we are passing --rad which asks mudskipper to output alignments in RAD format instead of BAM, which is the default format.

Single-cell gene quantification using alevin-fry

Using mudskipper, we created an alignment file compatible for transcript quantification using alevin-fry. More specifically, we now have a RAD file mudskipper_out/map.rad that contains alignment of scRNA-Seq reads against the mouse reference transcriptome. That means we can use alevin-fry directly (without running salmon alevin) to perform quantification as follows:

# create transcript to gene map
awk '{if($3=="transcript") {OFS="\t"; print $12, $10} }' gencode.vM25.annotation.gtf | sed 's/[;\"]//g' > txp2gene.tsv
# quantification by running alevin-fry
alevin-fry generate-permit-list -d fw -k -i mudskipper_out -o mudskipper_quant
alevin-fry collate -t 8 -i mudskipper_quant -r mudskipper_out
alevin-fry quant -t 8 -i mudskipper_quant -o mudskipper_quant --tg-map txp2gene.tsv --resolution cr-like --use-mtx

This will generate mudskipper_quant/alevin folder that contains the quantification results. More specifically, there are 3 files in this folder: quants_mat.mtx, quants_mat_cols.txt, and quants_mat_rows.txt which correspond, respectively, to the count matrix, the gene names for each column of this matrix, and the corrected, filtered cell barcodes for each row of this matrix.

Comaprison with quantification via salmon alevin + alevin-fry

alevin-fry was originally developed to perform quantification from RAD files generated by salmon alevin directly by mapping short scRNA-Seq reads. In this tutorial we used it to do quantification from a transcriptome-coordinate RAD file generated by mudskipper from a genome-coordinate BAM file. Ideally, we would like to see quantification via the second approach to be as close as possible to the original approach via salmon alevin. However, in reality this ideal case does not happen due to the difference between mapping short reads to the transcriptome and mapping them to the genome in a splice-aware fashion.

To compare these two approaches we can calculate correlation between gene expression estimates. To do this, we first need to obtain quantification results via the original approach:

# download reference transcritpome
wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.transcripts.fa.gz
gunzip gencode.vM25.transcripts.fa.gz
# create an index for salmon
salmon index -t gencode.vM25.transcripts.fa -i salmon_index -p 8 --gencode
# mapping reads against the reference using salmon alevin
salmon alevin -i salmon_index -p 8 -l IU --chromium --sketch -1 SRR6376040_1.fastq.gz -2 SRR6376040_2.fastq.gz -o salmon_alevin_map --tgMap txp2gene.tsv
# quantification by running alevin-fry
alevin-fry generate-permit-list -d fw -k -i salmon_alevin_map -o salmon_alevin_quant
alevin-fry collate -t 8 -i salmon_alevin_quant -r salmon_alevin_map
alevin-fry quant -t 8 -i salmon_alevin_quant -o salmon_alevin_quant --tg-map txp2gene.tsv --resolution cr-like --use-mtx

We can now compare quantification results in mudskipper_quant/alevin/quants_mat.mtx with salmon_alevin_quant/alevin/quants_mat.mtx. Since the quantification results are stored as a count matrix, we need to calculate correlation of counts for each cell (represented by a filtered cell barcode). We can then look at the histogram of these correlations.

As it can be seen in this plot, most correlation coefficients are centered around 0.93 which shows a high correlation between the quantification from RAD file generated by mudskipper and the quantification from RAD file generated by salmon alevin.

Note that mudskipper only converts alignments from genomic coordinates into transcriptomic coordinates. It does not perform any alignment whatsoever. In other words, if a true alignment is not presented in the input BAM file, it will be also missing from the output RAD file since mudskipper is not even aware of such alignment.

Source code

mudskipper is an open-source project which is implemented in Rust. The source code is available at https://github.com/OceanGenomics/mudskipper.