Author: Hossein Asghari, Computational Biologist at Ocean Genomics

This project has been made possible by the team at Ocean Genomics and by a grant from the Chan Zuckerberg Initiative.
In this tutorial we introduce a new feature of Salmon for quantifying transcript expression from Oxford Nanopore Technologies (ONT) long reads.
Over the last year, Oxford Nanopore Technologies (ONT) and Ocean Genomics, Inc. have worked together to make the transcript quantifier Salmon more accurate when quantifying ONT sequencing reads. To do this, we incorporated a new error model specific to ONT reads that integrates with Salmon's comprehensive bias models. The new model improves the accuracy of ONT long-read quantification and makes the pairing of Salmon and ONT long reads more useful for the key use case of quantification.
The new quantification model for ONT in Salmon has two core parts. First, the log-likelihood of each alignment is adjusted according to the number of substitutions in the alignment and the length of the soft-clipped regions (the unaligned bases at the start and end of each read). Second, Salmon's length correction is suppressed, as it does not apply to long-read sequencing: unlike short reads, long reads should cover the entire transcript molecule, and a single transcript molecule does not result in multiple sequenced fragments, so no length correction is needed when quantifying long reads.
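The toy sketch below illustrates only the idea behind the first part; it is not Salmon's actual implementation, and the per-base penalty probabilities p_sub and p_clip are hypothetical placeholders.

import math

# Toy illustration only -- NOT Salmon's internal code. It shows how an
# alignment's log-likelihood can be down-weighted by its substitution count
# and soft-clipped length, using hypothetical per-base error probabilities.
def adjusted_log_likelihood(base_loglik, n_substitutions, softclip_len,
                            p_sub=0.05, p_clip=0.10):
    penalty = n_substitutions * math.log(p_sub) + softclip_len * math.log(p_clip)
    return base_loglik + penalty

# Example: an alignment with many substitutions and soft-clipped bases
# receives a lower (more negative) log-likelihood than a cleaner alignment.
print(adjusted_log_likelihood(-10.0, n_substitutions=12, softclip_len=30))
print(adjusted_log_likelihood(-10.0, n_substitutions=2, softclip_len=0))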
Salmon for ONT sequencing reads quantifies transcript expression starting from a BAM alignment file (for example, one generated by minimap2 by aligning the sequencing reads to the transcriptome). In addition to the usual switches used with short-read data, only one additional switch, --ont, is required for long-read quantification. For example:
salmon quant --ont -p 16 -t <ref.fasta> -l U -a <reads.bam> -o <output_directory>
where -p gives the number of threads to use, -l is the library type (U for "unstranded"), -t gives the FASTA file of the reference target transcripts, -a gives the BAM file with the alignments, and -o gives the output directory in which the results are stored.
Here are a few notes that you should keep in mind when using Salmon for quantifying ONT long reads:
- To use the ONT error model, it is important that the input BAM is NOT sorted by alignment position (i.e., do not use samtools sort). Although this requirement applies to Salmon in general, Salmon with --ont includes extra checks for it and refuses to run, reporting an error, if the BAM file is coordinate-sorted. A quick way to check an existing BAM is shown in the sketch after this list.
- The ONT bias model in Salmon assumes that adapters have been removed from the input sequences. The ONT reads must therefore be properly adapter-trimmed (e.g., with ONT's pychopper tool) before they are passed to Salmon; otherwise the soft-clipping error model does not apply.
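If you are unsure whether an existing BAM is coordinate-sorted, one way to check (a sketch using the pysam package, which is not required by Salmon itself) is to inspect the SO field of the BAM header:

import pysam

def is_coordinate_sorted(bam_path):
    # Read only the header and look at the HD/SO field.
    with pysam.AlignmentFile(bam_path, "rb", check_sq=False) as bam:
        return bam.header.to_dict().get("HD", {}).get("SO") == "coordinate"

if is_coordinate_sorted("SRR6058582.bam"):
    print("BAM is coordinate-sorted; regenerate it unsorted before running salmon quant --ont.")
else:
    print("BAM is not coordinate-sorted; it can be passed to salmon quant --ont.")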
Real data examples
In this tutorial we showcase two examples that illustrate this improvement: the quantification correlates better with spike-in data (RNA sequences of known concentration), and the agreement between Illumina short reads and ONT long reads from the same sample also improves.
Spike-in data
Spike-in RNAs are synthetic RNA transcripts with known sequence and quantity that can be used as controls in RNA-Seq experiments. Typically, a spike-in dataset contains multiple synthetic RNA sequences at different concentrations.
Preparing the input data
The sample used in this example is an Oxford Nanopore direct-RNA spike-in run (SIRV ERCC Mix1) that is publicly available on SRA under accession number SRR6058582. The ERCC Mix1 spike-in set contains 92 synthetic sequences with concentrations between 0.01 and 30,000 attomoles/µl. The concentrations and synthetic sequences can be downloaded from Lexogen (see the wget commands below). The quantification will be done on the concatenation of the GENCODE v31 human transcriptome from the European Bioinformatics Institute (EBI) and the synthetic sequences:
wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.transcripts.fa.gz
wget --user-agent="Mozilla" https://www.lexogen.com/wp-content/uploads/2021/06/SIRV_Set3_Norm_sequence-design-overview_20210507a.xlsx
wget --user-agent="Mozilla" https://www.lexogen.com/wp-content/uploads/2021/06/SIRV_Set3_Norm_Sequences_20210507.zip
unzip SIRV_Set3_Norm_Sequences_20210507.zip
sed -n '/>SIRV1/q;p' SIRV_Set3_Norm_Sequences_20210507/SIRV_ERCC_multi-fasta_20210507.fasta > ERCC.fa
cat <( zcat gencode.v31.transcripts.fa.gz ) ERCC.fa > merged_txps.fa
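As an optional sanity check, the short sketch below counts the records in merged_txps.fa to confirm that the 92 spike-in sequences were appended to the GENCODE transcripts. It assumes the spike-in headers start with "ERCC"; adjust the prefix if your ERCC.fa uses different names.

def count_records(fasta_path, prefix=None):
    # Count FASTA headers, optionally restricted to those starting with `prefix`.
    n = 0
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">") and (prefix is None or line[1:].startswith(prefix)):
                n += 1
    return n

total = count_records("merged_txps.fa")
spikeins = count_records("merged_txps.fa", prefix="ERCC")
print("{} records in total, {} spike-in records (92 expected)".format(total, spikeins))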
Now, we can build a minimap2 index on the merged transcriptome (human transcripts + spike-in sequences):
minimap2 -t 8 -Y -I 1000G -d minimap2_spike_index merged_txps.fa
where -t specifies 8 threads, -Y uses soft clipping for supplementary alignments, and -I splits the index every 1000G bases of the target reference (large enough to keep the whole reference in a single index part). The minimap2 index will be written to minimap2_spike_index.
Quantification of spike-in data
First, we align SRR6058582 with minimap2 against the human transcriptome + spike-in index we built earlier:
minimap2 -t 8 -a -x map-ont -p 1.0 -N 100 minimap2_spike_index SRR6058582.fastq | samtools view -Sb > SRR6058582.bam
where -a specifies SAM output, -x selects the preset for ONT long reads, -p sets the minimum secondary-to-primary score ratio to 1.0, -N limits the number of reported secondary alignments to 100, and SRR6058582.fastq contains the input ONT spike-in long reads in FASTQ format. We use samtools to convert the SAM output into BAM and store it in SRR6058582.bam.
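If you want to confirm the effect of the -p and -N settings before quantifying, the optional sketch below (using pysam, not part of the Salmon workflow itself) tallies primary, secondary, supplementary and unmapped records in the BAM:

import pysam
from collections import Counter

counts = Counter()
with pysam.AlignmentFile("SRR6058582.bam", "rb") as bam:
    for aln in bam:  # sequential iteration; no index needed
        if aln.is_unmapped:
            counts["unmapped"] += 1
        elif aln.is_secondary:
            counts["secondary"] += 1
        elif aln.is_supplementary:
            counts["supplementary"] += 1
        else:
            counts["primary"] += 1
print(dict(counts))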
Next, we run the Salmon quantification command twice, once with the ONT error model (--ont) and once without an error model, and then compare the results.
For transcript quantification using the ONT error model, run the following command:
salmon quant --ont -p 8 -t merged_txps.fa -l U -a SRR6058582.bam -o res_ONT_EM
where --ont enables the ONT error model in the quantification, -t gives the target reference sequence, -l specifies the library type as unstranded, -a gives the BAM file with the alignments, and the results will be stored in res_ONT_EM.
To run the Salmon quantification without an error model, simply pass --noErrorModel instead of --ont:
salmon quant --noErrorModel -p 8 -t merged_txps.fa -l U -a SRR6058582.bam -o res_no_EM
which uses the same settings and alignment input, except that no error model is used during the process. The results are stored in res_no_EM.
Now that we have quantification results with the ONT error model and without an error model, we can compare them with the known concentrations of the spike-in RNAs, which serve as the ground truth here. We measure similarity with the Jensen–Shannon divergence (JSD), where lower values indicate more similar distributions; it is easy to compute in Python with the scipy package. The following Python script loads the known concentrations from the Lexogen spreadsheet into gt_data, loads the TPM values from the Salmon quantification into qs_data, and calculates the JSD:
import sys
import xlrd
import pandas as pd
from scipy.spatial import distance


def load_quant(quant_fname):
    # Read the Salmon quant.sf table and return the TPM column as a list.
    colnames = ['Name', 'Length', 'EffectiveLength', 'TPM', 'NumReads']
    data = pd.read_csv(quant_fname, sep='\t', header=0, names=colnames, nrows=92)
    print(data)
    return data.TPM.tolist()


def load_excel(gt_fname):
    # Load the known ERCC concentrations from the Lexogen spreadsheet.
    # Note: xlrd >= 2.0 no longer reads .xlsx files; use xlrd 1.2 or adapt this to openpyxl.
    workbook = xlrd.open_workbook(gt_fname)
    sheet = workbook.sheet_by_name("ERCCs")
    required_data = []
    for rownum in range(sheet.nrows):
        row_vals = sheet.row_values(rownum)
        # Skip the header rows and rows without a concentration value.
        if rownum < 5 or row_vals[2] == '':
            continue
        required_data.append(row_vals[4])
    print('Finished loading {} conc values.'.format(len(required_data)))
    return required_data


def calc_jsd(quant_fname, gt_fname):
    qs_data = load_quant(quant_fname)
    gt_data = load_excel(gt_fname)
    print('Finished loading {} quant values.'.format(len(qs_data)))
    jsd = distance.jensenshannon(qs_data, gt_data)
    return jsd


def usage():
    print('Usage:\npython {} quant.sf gt.xlsx'.format(sys.argv[0]))


def main():
    args = sys.argv[1:]
    if len(args) != 2:
        usage()
        exit(1)
    quant_fname = args[0]
    gt_fname = args[1]
    jsd = calc_jsd(quant_fname, gt_fname)
    print('JSD = {:.2f}'.format(jsd))


if __name__ == '__main__':
    main()
Save the Python script above as calc_jsd.py and run it separately for the ONT error model and the no-error-model quantification results:
#For the ONT model:
python3 calc_jsd.py res_ONT_EM/quant.sf SIRV_Set3_Norm_sequence-design-overview_20210507a.xlsx
#No error model:
python3 calc_jsd.py res_no_EM/quant.sf SIRV_Set3_Norm_sequence-design-overview_20210507a.xlsx
The JSD value with --noErrorModel is 0.25, while with the ONT error model it is 0.09. In other words, the ONT error model substantially reduces the divergence of the spike-in RNA quantification from the known concentrations, bringing the quantification closer to the ground truth.
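Besides the JSD, you can also check how well the quantified TPMs rank the spike-ins by computing a Spearman correlation. The sketch below reuses load_quant and load_excel from calc_jsd.py (and therefore the same assumptions about row ordering in quant.sf and the spreadsheet):

from scipy.stats import spearmanr

from calc_jsd import load_quant, load_excel  # the script saved above

# Rank correlation between quantified TPMs and the known spike-in concentrations.
tpm = load_quant("res_ONT_EM/quant.sf")
conc = load_excel("SIRV_Set3_Norm_sequence-design-overview_20210507a.xlsx")
rho, pval = spearmanr(tpm, conc)
print("Spearman rho = {:.2f} (p = {:.2e})".format(rho, pval))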
ONT vs. Illumina short reads
In the second example, we compare the quantification results of ONT long reads with those of Illumina short reads from the same biological sample.
Preparing the input data
In this experiment, we use 10 ONT long-read RNA samples (accession numbers ERR3218366 to ERR3218375) and 4 Illumina paired-end short-read samples (accession numbers ERR3218280 to ERR3218283), all publicly available on SRA. All of these are sequencing replicates of the same biological sample.
To start, we build a Salmon index to quantify the Illumina short-read samples in Salmon's selective-alignment mode. You can download the GENCODE v31 reference transcriptome from the European Bioinformatics Institute (EBI):
wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.transcripts.fa.gz
Now, build the Salmon and minimap2 indices on the downloaded transcriptome:
salmon index -t gencode.v31.transcripts.fa.gz -i salmon_index -p 8
minimap2 -t 8 -Y -I 1000G -d minimap2_index gencode.v31.transcripts.fa.gz
where -p in the Salmon command and -t in the minimap2 command specify 8 threads, -Y uses soft clipping for supplementary alignments, and -I splits the minimap2 index every 1000G bases of the target reference. The Salmon index will be stored in the salmon_index directory and the minimap2 index will be written to minimap2_index. These indices are built once and are reused for all of the short-read and long-read samples, respectively.
Comparing Illumina and ONT quantification
First, we run the quantification step for all the Illumina short-read samples. For example, for ERR3218280:
salmon quant -i salmon_index -l A -1 ERR3218280_R1.fastq.gz -2 ERR3218280_R2.fastq.gz -o ERR3218280_res -p 8 --gcBias --seqBias --posBias --thinningFactor 64
where we use the Salmon index we just built, let Salmon infer the library type automatically (-l A), and enable the GC, sequence and positional bias models in the calculation.
Now that the short-read transcript quantification results are ready, we can compute the long-read quantifications and compare the two. To do so, we align the long reads with minimap2 against the human transcriptome index we built earlier in this tutorial. For ERR3218366, for example, run:
minimap2 -t 8 -a -x map-ont -p 1.0 -N 100 minimap2_index ERR3218366.fastq.gz | samtools view -Sb > ERR3218366.bam
where -a specifies SAM output, -x selects the preset for ONT long reads, -p sets the minimum secondary-to-primary score ratio to 1.0, -N limits the number of reported secondary alignments to 100, and the alignment results are stored in BAM format in ERR3218366.bam.
Next, we quantify the transcripts once with the ONT error model and once without an error model to see the difference. Here human.fa refers to the GENCODE v31 reference transcriptome downloaded above (e.g., the decompressed gencode.v31.transcripts.fa.gz). For ERR3218366, you would run the following commands:
salmon quant --ont -p 8 -t human.fa -l U -a ERR3218366.bam -o ERR3218366_ONT_EM
salmon quant --noErrorModel -p 8 -t human.fa -l U -a ERR3218366.bam -o ERR3218366_no_EM
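Processing all ten long-read replicates means repeating the alignment and both quantification runs for each accession. The sketch below scripts this with Python's subprocess module; the accession range, file names and reference path follow the commands above, so adjust them to your own layout.

import subprocess

# Long-read accessions ERR3218366 to ERR3218375, aligned and quantified as above.
accessions = ["ERR32183{}".format(i) for i in range(66, 76)]
for acc in accessions:
    # Align with minimap2 and convert to BAM (unsorted, as required by Salmon).
    subprocess.run(
        "minimap2 -t 8 -a -x map-ont -p 1.0 -N 100 minimap2_index {acc}.fastq.gz "
        "| samtools view -Sb > {acc}.bam".format(acc=acc),
        shell=True, check=True)
    # Quantify once with the ONT error model and once without an error model.
    for flag, suffix in [("--ont", "ONT_EM"), ("--noErrorModel", "no_EM")]:
        subprocess.run(
            ["salmon", "quant", flag, "-p", "8",
             "-t", "human.fa", "-l", "U",
             "-a", "{}.bam".format(acc), "-o", "{}_{}".format(acc, suffix)],
            check=True)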
Since these samples are replicates of the same biological sample, we calculate the Jensen–Shannon divergence (JSD) for every pair of one long-read and one short-read sample (10x4 = 40 pairs) for each long-read error model, and report the average JSD. In addition, we calculate the JSD for all pairs of long-read quantifications with the same error model (10x9/2 = 45 pairs) and for all pairs of short-read quantifications (4x3/2 = 6 pairs).
We calculate the JSD values as in the previous example, using the scipy package in Python. For any pair of quantification results, we first load the TPM values into qs1_data and qs2_data, and then calculate the JSD:
import sys
import pandas as pd
from scipy.spatial import distance


def load_quant(quant_fname):
    # Read a Salmon quant.sf table and return the TPM column as a list.
    colnames = ['Name', 'Length', 'EffectiveLength', 'TPM', 'NumReads']
    data = pd.read_csv(quant_fname, sep='\t', header=0, names=colnames, nrows=92)
    print(data)
    return data.TPM.tolist()


def calc_jsd(quant1_fname, quant2_fname):
    qs1_data = load_quant(quant1_fname)
    print('Finished loading {} quant values.'.format(len(qs1_data)))
    qs2_data = load_quant(quant2_fname)
    print('Finished loading {} quant values.'.format(len(qs2_data)))
    jsd = distance.jensenshannon(qs1_data, qs2_data)
    return jsd


def usage():
    print('Usage:\npython {} quant1.sf quant2.sf'.format(sys.argv[0]))


def main():
    args = sys.argv[1:]
    if len(args) != 2:
        usage()
        exit(1)
    quant1_fname = args[0]
    quant2_fname = args[1]
    jsd = calc_jsd(quant1_fname, quant2_fname)
    print('JSD = {:.2f}'.format(jsd))


if __name__ == '__main__':
    main()
Save the Python script above as calc_jsd_qvq.py and run it as follows:
python calc_jsd_qvq.py SAMPLE1/quant.sf SAMPLE2/quant.sf
where SAMPLE1 and SAMPLE2 are the two samples that you would like to compare.
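To reproduce the averages reported in the table below, you can loop over all pairs programmatically. The sketch below averages the JSD over the 10x4 long-read vs. short-read pairs, reusing calc_jsd from calc_jsd_qvq.py; the output directory names (e.g. ERR3218366_ONT_EM, ERR3218280_res) follow the commands above, so adjust them to your layout. The within-group averages can be computed the same way with itertools.combinations.

from itertools import product
from statistics import mean

from calc_jsd_qvq import calc_jsd  # the script saved above

long_read_dirs = ["ERR32183{}_ONT_EM".format(i) for i in range(66, 76)]
short_read_dirs = ["ERR32182{}_res".format(i) for i in range(80, 84)]

# JSD for every long-read / short-read pair (10 x 4 = 40 pairs), then the mean.
jsds = [calc_jsd("{}/quant.sf".format(lr), "{}/quant.sf".format(sr))
        for lr, sr in product(long_read_dirs, short_read_dirs)]
print("Average long-read (ONT error model) vs. short-read JSD = {:.3f}".format(mean(jsds)))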
The table below shows the average JSD values for the pairs described above:
| | No error model | ONT error model | Short reads |
|---|---|---|---|
| No error model | 0.433 | - | 0.561 |
| ONT error model | - | 0.393 | 0.547 |
| Short reads | 0.561 | 0.547 | 0.117 |
Note: Lower JSD values indicate more similar distributions.
Since these samples all come from the same biological sample, the ideal JSD would be zero, meaning that the quantification results are identical. Short reads typically have higher base accuracy and therefore yield more accurate quantification, and as the table shows, the short-read samples have a much lower JSD among themselves than the long-read samples do. Looking at the long-read results, the ONT error model yields a lower JSD both among the long-read samples themselves and in comparison to the short reads. Thus, the ONT error model provides better results when processing ONT long-read samples with Salmon.