Salmon can perform transcript quantification directly from RNA-Seq reads in its mapping-based mode. In this mode, during the quantification process, RNA-Seq reads are mapped to the reference transcript to find out which transcripts each read may have arisen
from. In order to facilitate this, an index of the reference transcripts is required (see here for how this index is built).
However, if the RNA-Seq reads have been already aligned to the reference transcripts using an alignment tool, there is no need to run
salmon in the mapping-based mode, as we already know the alignments for each read. Instead,
salmon allows users to run quantification in alignment-based mode, in which already existing alignment of reads in BAM/SAM format are used.
The only catch is that, in the alignment-based mode,
salmon expects that RNA-Seq reads have been aligned directly to the transcriptome rather than to the genome. But alignment of RNA-Seq reads against the genome are used by many tools for different tasks. What can we do if we already have such alignments?
Obviously, it is always possible to extract raw reads from the genome-coordinate BAM/SAM and run
salmon in the mapping-based mode. But that incurs the substantial work of re-mapping the reads again. Can we do better and avoid another round of a time-consuming mapping step?
To address this, we have developped a new tool called
mudskipper that converts genome-coordinate alignment records directly into transcript-coordinate alignments. In other words,
mudskipper translates alignment targets and respective coordinates from genomic to transcriptomic coordinates. The output of
mudskipper is a BAM file ready to be used in alignment-based mode of
salmon for transcript quantification.
:pencil2: *Notes on
mudskipper input requirements: *
mudskipper bulk command makes a few assumptions about the inputs:
- The input BAM/SAM contains alignments of short RNA-Seq reads, i.e. alignments of long RNA sequences are neither tested nor expected.
- The input BAM/SAM contains spliced alignments against reference genome.
- The targets of input BAM/SAM matches those in the input annotation file.
Experimental data example
Assuming we have a BAM/SAM file containing alignments of some RNA-Seq read against the reference genome,
mudskipper generate a new alignment file in BAM format that contains alignment of those reads against the reference transcriptome.
In this tutorial, for the sake of demonstration, in the next section we will download bulk RNA-Seq reads from a human lymphoblastoid cell line sample (from the 1000 Genome Project) and align them against the reference genome using
STAR. If you already have a genome-coordinate BAM/SAM file, you can skip the next section and move to the next section (Genome-coordinate to transcriptome-coordinate conversion using
Preparing input data
We download raw bulk RNA-Seq reads directly from European Bioinformatics Institute (EBI):
wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR188/ERR188079/NA20778.4.M_120208_1_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR188/ERR188079/NA20778.4.M_120208_1_2.fastq.gz
In order to align short RNA-Seq reads against the human reference genome, we need to use a splice-aware mapper. Among different options, in this tutorial we use
We first need to create an index for
STAR. Here we download the reference genome and the corresponding GTF file from GENCODE (v35):
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/GRCh38.primary_assembly.genome.fa.gz gunzip GRCh38.primary_assembly.genome.fa.gz wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/gencode.v35.annotation.gtf.gz gunzip gencode.v35.annotation.gtf.gz
STAR will create the index by running the following command (using 8 threads):
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir star_index --genomeFastaFiles GRCh38.primary_assembly.genome.fa --sjdbGTFfile gencode.v35.annotation.gtf
Note that the above command requires about 36 GB of memory (RAM). It is also a time consuming process so you may want to increase the number of threads passed via
--runThreadN to speed up the process if your machine has more cores.
In order to map RNA-Seq reads against the reference we just created, we can use the following command:
STAR --runThreadN 8 --genomeDir star_index --readFilesCommand zcat --outFileNamePrefix mappings_genomic. --outSAMtype BAM Unsorted --chimSegmentMin 20 --chimOutType WithinBAM SoftClip --readFilesIn NA20778.4.M_120208_1_1.fastq.gz NA20778.4.M_120208_1_2.fastq.gz
This will map our RNA-Seq reads onto the reference genome and store alignments in the
mappings_genomic.Aligned.out.bam file. We can confirm that this BAM contains genomic-coordinate alignments by looking at the list of targets in the header of the output BAM file:
@HD VN:1.4 @SQ SN:chr1 LN:248956422 @SQ SN:chr2 LN:242193529 @SQ SN:chr3 LN:198295559 @SQ SN:chr4 LN:190214555 @SQ SN:chr5 LN:181538259 @SQ SN:chr6 LN:170805979 @SQ SN:chr7 LN:159345973 @SQ SN:chr8 LN:145138636 @SQ SN:chr9 LN:138394717 @SQ SN:chr10 LN:133797422 ...
This alignment file is not ready for transcript quantification by
salmon expects RNA-Seq reads to be aligned against transcriptome, i.e. the BAM header should show reference transcripts as targets. In the following section, we show how we can use
mudskipper to prepare a new alignment file suitable for transcript quantification by
Genome-coordinate to transcriptome-coordinate conversion using
mudskipper is an open-source tool written in Rust, which is freely available at https://github.com/OceanGenomics/mudskipper. If Rust is not installed on your machine, you need to install it by following the instructions on Rust language official webiste. When you have Rust ready, download the repository:
git clone firstname.lastname@example.org:OceanGenomics/mudskipper.git # or alternatively by downloading the zip file and decompressing it # wget -O mudskipper.zip https://github.com/OceanGenomics/mudskipper/archive/main.zip # unzip -d mudskipper mudskipper.zip
and compile the code by running:
cargo build --release --manifest-path mudskipper/Cargo.toml
This will create the binary file of
Now everything is ready to convert the genome-coordinate BAM file
mappings_genomic.Aligned.out.bam to a transcriptome-coordinate BAM.
./mudskipper/target/release/mudskipper bulk --alignment mappings_genomic.Aligned.out.bam --gtf gencode.v35.annotation.gtf --out mappings_transcriptomic.bam --threads 8
Transcript quantification using
mudskipper to create an alignment file compatible for transcript quantification using
salmon. More specifically, we now have a BAM file
mappings_transcriptomic.bam that contains alignment of RNA-Seq reads against human reference transcriptome. That means we can use the alignment-based mode of
salmon to perform quantification as follows:
wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/gencode.v35.transcripts.fa.gz salmon quant --libType a --alignments mappings_transcriptomic.bam --targets gencode.v35.transcripts.fa.gz --output quant_res --threads 8 --gencode
:pencil2: *Notes on
salmon input requirements: *
salmon assumes that the alignments passed via
--alignments option are in a random order. This assumption is fundamental to the streaming inference method that
salmon uses for transcript-level quantification. Note that
mudskipper does not change the order of alignments. That means in order to make sure
salmon performs the quantification properly, the user should ensure that the genomic BAM/SAM file passed to
mudskipper in the first place is not sorted by target and/or position.
Comaprison with mapping-based quantification
Since aligning RNA-Seq short reads to the genome should be in a splice-aware fashion, the utilized algorithms and heuristics for this task are also different from alignment to the transcriptome. It is then not unexpected that the quantification results obtained from reads directly aligned to the transcriptome deviate slightly from the quantification on the alignments projected to the transcriptome from genomic alignments.
To check the amount of this deviation, we will compare the quantification estimates obtained above with those obtained by running
salmon in mapping-based mode. In order to do this, we need an index for
salmon. We can create an index with decoys as follows:
grep "^>" GRCh38.primary_assembly.genome.fa | cut -d" " -f1 | sed -e 's/>//g' > decoys.txt cat <(zcat gencode.v35.transcripts.fa.gz) GRCh38.primary_assembly.genome.fa > gentrome.fa salmon index -t gentrome.fa -d decoys.txt -p 8 -i salmon_index --gencode
Now we can run
salmon in mapping-based mode by invoking the command below:
salmon quant --libType a -i salmon_index/ -1 NA20778.4.M_120208_1_1.fastq.gz -2 NA20778.4.M_120208_1_2.fastq.gz --output quant_res_mappingBased --threads 8
Now, we can compare the transcript expression values in
quant_res/quant.sf which are obtained from alignments projected to the transcriptome using
mudskipper with the transcript expression estimates in
quant_res_mappingBased/quant.sf which are calculated via mapping-based mode of
salmon. Note that these values are in units of TPM and therefore are already normalized (within the sample).
The plot on the left is the scatter plot of expression values. The plot on the right is similar but with log-scale axes. As it can be seen from this plot, there is a high correlation between the expression values (Pearson correlation r=0.9917, p-value $\approx$ 0). Note that reproducing these results might lead to slightly different numbers due to the stochastic nature of the
salmon quantification algorithm.
It is also worth mentioning that
mudskipper can only project alignments that are aligned to annotated regions of the genome by
STAR. In other words, if for any reason,
STAR drops a true alignment of some RNA-Seq reads,
mudskipper will have no information about that alignment.
mudskipper is open-source and is implemented in Rust. The source code is available at https://github.com/OceanGenomics/mudskipper.