18 Dec 2021 • on mudskipper

Transcript quantification from genomic alignments of bulk RNA-Seq reads with mudskipper

Author: Ehsan Haghshenas, Computational Biologist at Ocean Genomics This project has been made possible by the team at Ocean Genomics, and by a grant from the Chan Zuckerberg Initiative.

Salmon can perform transcript quantification directly from RNA-Seq reads in its mapping-based mode. In this mode, during the quantification process, RNA-Seq reads are mapped to the reference transcript to find out which transcripts each read may have arisen from. In order to facilitate this, an index of the reference transcripts is required (see here for how this index is built).

However, if the RNA-Seq reads have been already aligned to the reference transcripts using an alignment tool, there is no need to run salmon in the mapping-based mode, as we already know the alignments for each read. Instead, salmon allows users to run quantification in alignment-based mode, in which already existing alignment of reads in BAM/SAM format are used.

The only catch is that, in the alignment-based mode, salmon expects that RNA-Seq reads have been aligned directly to the transcriptome rather than to the genome. But alignment of RNA-Seq reads against the genome are used by many tools for different tasks. What can we do if we already have such alignments?

Obviously, it is always possible to extract raw reads from the genome-coordinate BAM/SAM and run salmon in the mapping-based mode. But that incurs the substantial work of re-mapping the reads again. Can we do better and avoid another round of a time-consuming mapping step?

To address this, we have developped a new tool called mudskipper that converts genome-coordinate alignment records directly into transcript-coordinate alignments. In other words, mudskipper translates alignment targets and respective coordinates from genomic to transcriptomic coordinates. The output of mudskipper is a BAM file ready to be used in alignment-based mode of salmon for transcript quantification.

:pencil2: *Notes on mudskipper input requirements: * mudskipper bulk command makes a few assumptions about the inputs:

The input BAM/SAM contains alignments of short RNA-Seq reads, i.e. alignments of long RNA sequences are neither tested nor expected.
The input BAM/SAM contains spliced alignments against reference genome.
The targets of input BAM/SAM matches those in the input annotation file.

Experimental data example

Assuming we have a BAM/SAM file containing alignments of some RNA-Seq read against the reference genome, mudskipper generate a new alignment file in BAM format that contains alignment of those reads against the reference transcriptome.

In this tutorial, for the sake of demonstration, in the next section we will download bulk RNA-Seq reads from a human lymphoblastoid cell line sample (from the 1000 Genome Project) and align them against the reference genome using STAR. If you already have a genome-coordinate BAM/SAM file, you can skip the next section and move to the next section (Genome-coordinate to transcriptome-coordinate conversion using mudskipper).

Preparing input data

We download raw bulk RNA-Seq reads directly from European Bioinformatics Institute (EBI):

wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR188/ERR188079/NA20778.4.M_120208_1_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR188/ERR188079/NA20778.4.M_120208_1_2.fastq.gz

In order to align short RNA-Seq reads against the human reference genome, we need to use a splice-aware mapper. Among different options, in this tutorial we use STAR aligner.

We first need to create an index for STAR. Here we download the reference genome and the corresponding GTF file from GENCODE (v35):

wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/GRCh38.primary_assembly.genome.fa.gz
gunzip GRCh38.primary_assembly.genome.fa.gz
wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/gencode.v35.annotation.gtf.gz
gunzip gencode.v35.annotation.gtf.gz

STAR will create the index by running the following command (using 8 threads):

STAR --runThreadN 8 --runMode genomeGenerate --genomeDir star_index --genomeFastaFiles GRCh38.primary_assembly.genome.fa --sjdbGTFfile gencode.v35.annotation.gtf

Note that the above command requires about 36 GB of memory (RAM). It is also a time consuming process so you may want to increase the number of threads passed via --runThreadN to speed up the process if your machine has more cores.

In order to map RNA-Seq reads against the reference we just created, we can use the following command:

STAR --runThreadN 8 --genomeDir star_index --readFilesCommand zcat --outFileNamePrefix mappings_genomic. --outSAMtype BAM Unsorted --chimSegmentMin 20 --chimOutType WithinBAM SoftClip --readFilesIn NA20778.4.M_120208_1_1.fastq.gz NA20778.4.M_120208_1_2.fastq.gz

This will map our RNA-Seq reads onto the reference genome and store alignments in the mappings_genomic.Aligned.out.bam file. We can confirm that this BAM contains genomic-coordinate alignments by looking at the list of targets in the header of the output BAM file:

@HD     VN:1.4
@SQ     SN:chr1 LN:248956422
@SQ     SN:chr2 LN:242193529
@SQ     SN:chr3 LN:198295559
@SQ     SN:chr4 LN:190214555
@SQ     SN:chr5 LN:181538259
@SQ     SN:chr6 LN:170805979
@SQ     SN:chr7 LN:159345973
@SQ     SN:chr8 LN:145138636
@SQ     SN:chr9 LN:138394717
@SQ     SN:chr10        LN:133797422
...

This alignment file is not ready for transcript quantification by salmon. salmon expects RNA-Seq reads to be aligned against transcriptome, i.e. the BAM header should show reference transcripts as targets. In the following section, we show how we can use mudskipper to prepare a new alignment file suitable for transcript quantification by salmon.

Genome-coordinate to transcriptome-coordinate conversion using `mudskipper`

Building `mudskipper`

mudskipper is an open-source tool written in Rust, which is freely available at https://github.com/OceanGenomics/mudskipper. If Rust is not installed on your machine, you need to install it by following the instructions on Rust language official webiste. When you have Rust ready, download the repository:

git clone git@github.com:OceanGenomics/mudskipper.git
# or alternatively by downloading the zip file and decompressing it
# wget -O mudskipper.zip https://github.com/OceanGenomics/mudskipper/archive/main.zip
# unzip -d mudskipper mudskipper.zip

and compile the code by running:

cargo build --release --manifest-path mudskipper/Cargo.toml

This will create the binary file of mudskipper at mudskipper/target/release/mudskipper.

Running `mudskipper`

Now everything is ready to convert the genome-coordinate BAM file mappings_genomic.Aligned.out.bam to a transcriptome-coordinate BAM.

./mudskipper/target/release/mudskipper bulk --alignment mappings_genomic.Aligned.out.bam --gtf gencode.v35.annotation.gtf --out mappings_transcriptomic.bam --threads 8

Transcript quantification using `salmon`

We utilized mudskipper to create an alignment file compatible for transcript quantification using salmon. More specifically, we now have a BAM file mappings_transcriptomic.bam that contains alignment of RNA-Seq reads against human reference transcriptome. That means we can use the alignment-based mode of salmon to perform quantification as follows:

wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/gencode.v35.transcripts.fa.gz
salmon quant --libType a --alignments mappings_transcriptomic.bam --targets gencode.v35.transcripts.fa.gz --output quant_res --threads 8 --gencode

:pencil2: *Notes on salmon input requirements: * salmon assumes that the alignments passed via --alignments option are in a random order. This assumption is fundamental to the streaming inference method that salmon uses for transcript-level quantification. Note that mudskipper does not change the order of alignments. That means in order to make sure salmon performs the quantification properly, the user should ensure that the genomic BAM/SAM file passed to mudskipper in the first place is not sorted by target and/or position.

Comaprison with mapping-based quantification

Since aligning RNA-Seq short reads to the genome should be in a splice-aware fashion, the utilized algorithms and heuristics for this task are also different from alignment to the transcriptome. It is then not unexpected that the quantification results obtained from reads directly aligned to the transcriptome deviate slightly from the quantification on the alignments projected to the transcriptome from genomic alignments.

To check the amount of this deviation, we will compare the quantification estimates obtained above with those obtained by running salmon in mapping-based mode. In order to do this, we need an index for salmon. We can create an index with decoys as follows:

grep "^>" GRCh38.primary_assembly.genome.fa | cut -d" " -f1 | sed -e 's/>//g' > decoys.txt
cat <(zcat gencode.v35.transcripts.fa.gz) GRCh38.primary_assembly.genome.fa > gentrome.fa
salmon index -t gentrome.fa -d decoys.txt -p 8 -i salmon_index --gencode

Now we can run salmon in mapping-based mode by invoking the command below:

salmon quant --libType a -i salmon_index/ -1 NA20778.4.M_120208_1_1.fastq.gz -2 NA20778.4.M_120208_1_2.fastq.gz --output quant_res_mappingBased --threads 8

Now, we can compare the transcript expression values in quant_res/quant.sf which are obtained from alignments projected to the transcriptome using mudskipper with the transcript expression estimates in quant_res_mappingBased/quant.sf which are calculated via mapping-based mode of salmon. Note that these values are in units of TPM and therefore are already normalized (within the sample).

The plot on the left is the scatter plot of expression values. The plot on the right is similar but with log-scale axes. As it can be seen from this plot, there is a high correlation between the expression values (Pearson correlation r=0.9917, p-value $\approx$ 0). Note that reproducing these results might lead to slightly different numbers due to the stochastic nature of the salmon quantification algorithm.

It is also worth mentioning that mudskipper can only project alignments that are aligned to annotated regions of the genome by STAR. In other words, if for any reason, STAR drops a true alignment of some RNA-Seq reads, mudskipper will have no information about that alignment.

Source code

mudskipper is open-source and is implemented in Rust. The source code is available at https://github.com/OceanGenomics/mudskipper.

Experimental data example

Preparing input data

Genome-coordinate to transcriptome-coordinate conversion using mudskipper

Building mudskipper

Running mudskipper