Fast is good, but fast and accurate is better!
The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. After investigating the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis, we designed the selective alignment method, which overcomes the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment. Here we give a short tutorial on how to index your genome and transcriptome to get accurate quantification estimates.
We first download the reference transcriptome and genome for the salmon index. As an example, we download the GENCODE mouse reference:
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/gencode.vM23.transcripts.fa.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/GRCm38.primary_assembly.genome.fa.gz
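Large FTP downloads occasionally get cut off, and a truncated archive tends to show up only later as a confusing indexing error. The following is a minimal sketch of an integrity check with `gzip -t`. It runs on a small stand-in archive (`toy.fa.gz` is a made-up name) so it can be tried anywhere; in practice you would point it at the two downloaded files instead.

```shell
# Stand-in archive so the example runs anywhere (toy.fa.gz is hypothetical;
# in practice, test the two files downloaded above).
workdir="$(mktemp -d)"
printf '>chrToy\nACGTACGT\n' | gzip -c > "$workdir/toy.fa.gz"

# gzip -t decompresses without writing output and exits non-zero on damage.
if gzip -t "$workdir/toy.fa.gz"; then
    status="ok"
else
    status="corrupt"
fi
echo "toy.fa.gz: $status"

# A deliberately truncated copy should fail the same test.
head -c 10 "$workdir/toy.fa.gz" > "$workdir/truncated.fa.gz"
if gzip -t "$workdir/truncated.fa.gz" 2>/dev/null; then
    truncated_status="ok"
else
    truncated_status="corrupt"
fi
echo "truncated.fa.gz: $truncated_status"
```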
Although there are multiple ways to get salmon (e.g., a binary from GitHub or a Docker image), we are going to install it through conda. Assuming a conda environment is already set up, we can install salmon with the following command:
conda install --channel bioconda salmon
Make sure you have the latest version of salmon (v1.0 as of November 1st, 2019) by checking the version reported by your installation.
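One way to script the version check is sketched below. The helper `version_ge` is a hypothetical name introduced here, it relies on `sort -V` from GNU coreutils, and the parsing assumes that salmon reports its version as the last field of a line such as `salmon 1.0.0`. If salmon is not on the PATH, the sketch only says so.

```shell
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Relies on `sort -V` (version sort), available in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="1.0.0"

if command -v salmon >/dev/null 2>&1; then
    # Assumes the version is the last field of the reported line.
    installed="$(salmon --version 2>&1 | awk 'NR == 1 { print $NF }')"
    if version_ge "$installed" "$required"; then
        echo "salmon $installed is recent enough"
    else
        echo "salmon $installed is older than $required; consider updating"
    fi
else
    echo "salmon not found on PATH"
fi
```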
Salmon indexing requires the names of the genome targets, which can be extracted with grep:
grep "^>" <(gunzip -c GRCm38.primary_assembly.genome.fa.gz) | cut -d " " -f 1 > decoys.txt
sed -i.bak -e 's/>//g' decoys.txt
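To see what the decoy extraction produces, here is a small sketch on a toy genome. The file name and the two chromosome names are made up, and the pipeline is a pipe-only variant of the commands above (no process substitution, no in-place edit), so the result is the same list of names, one per line.

```shell
# Toy genome with two made-up chromosomes; headers carry extra text after the name.
workdir="$(mktemp -d)"
printf '>chrA 1\nACGT\n>chrB 2\nTTGA\n' | gzip -c > "$workdir/toy.genome.fa.gz"

# Keep header lines, keep the first space-separated field, drop the leading ">".
gunzip -c "$workdir/toy.genome.fa.gz" \
    | grep "^>" \
    | cut -d " " -f 1 \
    | sed 's/^>//' > "$workdir/decoys.txt"

# Prints the two names, chrA and chrB, one per line.
cat "$workdir/decoys.txt"
```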
Along with the list of decoys, salmon also needs the transcriptome and genome concatenated into a single reference file for indexing. NOTE: the genome targets (decoys) should come after the transcriptome targets in the reference.
cat gencode.vM23.transcripts.fa.gz GRCm38.primary_assembly.genome.fa.gz > gentrome.fa.gz
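Concatenating gzip files with `cat` works because a sequence of gzip members decompresses as a single stream, in the order given. The sketch below, on made-up toy files, checks the ordering requirement by confirming that the last names in the combined reference are exactly the decoys. It is an optional sanity check, not part of the salmon workflow itself.

```shell
# Toy inputs (all names are made up for illustration).
workdir="$(mktemp -d)"
printf '>tx1\nACGT\n>tx2\nGGCA\n' | gzip -c > "$workdir/toy.transcripts.fa.gz"
printf '>chrA\nACGTGGCA\n' | gzip -c > "$workdir/toy.genome.fa.gz"
printf 'chrA\n' > "$workdir/decoys.txt"

# Transcripts first, genome (decoys) second.
cat "$workdir/toy.transcripts.fa.gz" "$workdir/toy.genome.fa.gz" > "$workdir/gentrome.fa.gz"

# List target names in file order.
gunzip -c "$workdir/gentrome.fa.gz" \
    | grep "^>" | cut -d " " -f 1 | sed 's/^>//' > "$workdir/names.txt"

# The tail of the name list should match decoys.txt exactly.
n_decoys="$(wc -l < "$workdir/decoys.txt" | tr -d ' ')"
if tail -n "$n_decoys" "$workdir/names.txt" | cmp -s - "$workdir/decoys.txt"; then
    order="decoys last"
else
    order="unexpected order"
fi
echo "$order"
```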
We now have all the ingredients for the salmon recipe and can run the indexing step as follows:
salmon index -t gentrome.fa.gz -d decoys.txt -p 12 -i salmon_index --gencode
The --gencode flag removes the extra metadata, separated by |, from the target headers of the GENCODE reference. You can skip it when using other references.
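To illustrate the effect on target names, the sketch below trims a header laid out in the GENCODE style down to the text before the first `|`. The header is made up, and the `cut`/`sed` pipeline only imitates the resulting name; it is not how salmon implements the flag.

```shell
# Made-up header in the GENCODE "|"-separated layout (not a real transcript).
header='>TX0001.1|GENE0001.1|-|-|Toy-201|Toy|1000|protein_coding|'

# Keep only the text before the first "|" and drop the leading ">".
name="$(printf '%s\n' "$header" | cut -d '|' -f 1 | sed 's/^>//')"
echo "$name"
```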
Prefer to read an IPython notebook? Check out the gist here.