Fast is good, but fast and accurate is better!
The accuracy of transcript quantification from RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model adopted. After investigating how mapping and alignment influence the accuracy of transcript quantification in both simulated and experimental data, as well as their effect on subsequent differential expression analysis, we designed the selective alignment method, which overcomes the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment. Here we give a short tutorial on how to index your genome and transcriptome together to get accurate quantification estimates.
Downloading Reference
We first download the reference transcriptome and genome that will be used to build the salmon index. As an example, we use the GENCODE mouse reference:
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/gencode.vM23.transcripts.fa.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/GRCm38.primary_assembly.genome.fa.gz
Installing Salmon
Although there are multiple ways to obtain salmon (e.g. a binary from GitHub or a docker image), we are going to install it through conda. Assuming a conda environment is already set up, we can install salmon with the following command:
conda install --channel bioconda salmon
Make sure you have the latest version of salmon (v1.0 as of November 1st, 2019) by running salmon --version
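The version check can be wrapped in a small guard so that a missing installation gives a clear message instead of a "command not found" error. This is only a minimal sketch; the variable name salmon_report and the wording of the message are our own choices, not part of salmon.

```shell
# Report the installed salmon version, or a clear message if salmon is not on PATH.
# (salmon_report is a hypothetical variable name used only for this sketch.)
if command -v salmon >/dev/null 2>&1; then
    salmon_report=$(salmon --version 2>&1)
else
    salmon_report="salmon not found on PATH; activate the conda environment first"
fi
echo "$salmon_report"
```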
Preparing metadata
Salmon indexing requires the names of the genome targets (the decoys), which can be extracted from the FASTA header lines with grep:
grep "^>" <(gunzip -c GRCm38.primary_assembly.genome.fa.gz) | cut -d " " -f 1 > decoys.txt
sed -i.bak -e 's/>//g' decoys.txt
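To see what these two commands produce, it can help to run the same steps on a tiny, made-up FASTA first. The sketch below assumes a toy genome with two sequences; the directory salmon_toy and the file names are hypothetical, and the sed step is applied in the pipe rather than in place.

```shell
# Build a tiny, made-up genome FASTA (hypothetical file names) and compress it.
mkdir -p salmon_toy
printf '>chr1 1\nACGTACGT\n>chr2 2\nTTGGCCAA\n' | gzip -c > salmon_toy/toy_genome.fa.gz

# Same extraction as above: keep header lines, take the first
# space-delimited field, then strip the leading '>'.
gunzip -c salmon_toy/toy_genome.fa.gz \
    | grep "^>" \
    | cut -d " " -f 1 \
    | sed -e 's/>//g' > salmon_toy/toy_decoys.txt

# One decoy name per line: chr1, then chr2.
cat salmon_toy/toy_decoys.txt
```

Each line of the resulting file is a bare sequence name, which is the form decoys.txt should have for the real genome as well.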
Along with the list of decoys, salmon also needs a single reference file containing the transcriptome concatenated with the genome. NOTE: the genome targets (decoys) must come after the transcriptome targets in this file:
cat gencode.vM23.transcripts.fa.gz GRCm38.primary_assembly.genome.fa.gz > gentrome.fa.gz
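Concatenating two gzip files directly yields a valid multi-member gzip file, so no decompression is needed. A quick way to convince yourself, and to confirm the ordering, is the toy check below. It is a sketch under the assumption of one made-up transcript and one made-up chromosome; all file names in gentrome_toy are hypothetical.

```shell
# Create a toy transcriptome and a toy genome (hypothetical names), each gzipped.
mkdir -p gentrome_toy
printf '>tx1\nACGT\n'  | gzip -c > gentrome_toy/tx.fa.gz
printf '>chr1\nGGCC\n' | gzip -c > gentrome_toy/genome.fa.gz

# Transcriptome first, genome (decoys) second, exactly as in the real command.
cat gentrome_toy/tx.fa.gz gentrome_toy/genome.fa.gz > gentrome_toy/gentrome.fa.gz

# Check that the concatenated file is still a readable gzip stream.
gzip -t gentrome_toy/gentrome.fa.gz

# List the sequence names in file order: the transcript comes before the decoy.
gunzip -c gentrome_toy/gentrome.fa.gz | grep "^>" | sed -e 's/>//' > gentrome_toy/order.txt
cat gentrome_toy/order.txt
```

The same grep on the real gentrome.fa.gz should list all transcript names before the first chromosome name.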
Salmon Indexing
We now have all the ingredients ready for the salmon recipe. We can run the salmon indexing step as follows:
salmon index -t gentrome.fa.gz -d decoys.txt -p 12 -i salmon_index --gencode
NOTE: the --gencode flag removes the extra metadata, separated by |, from the target headers of the GENCODE reference. You can skip it if you are using other references.
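To illustrate the effect on a single header, the sketch below trims a GENCODE-style name at the first | character using standard shell tools. It only mimics the outcome and is not how salmon implements the flag; the header itself is made up for illustration, and gencode_toy_name.txt is a hypothetical file name.

```shell
# A made-up header in the GENCODE layout (fields separated by '|').
header='>TX0001.1|GENE0001.1|-|-|Abc-201|Abc|1000|protein_coding|'

# Drop the leading '>' and keep only the text before the first '|'.
printf '%s\n' "$header" | sed -e 's/^>//' | cut -d '|' -f 1 > gencode_toy_name.txt

# The trimmed transcript name: TX0001.1
cat gencode_toy_name.txt
```

With the flag enabled, transcript names in the index take this short form instead of the full pipe-delimited header.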
IPython Notebook
Prefer to read an IPython notebook? Check out the gist here.