RNA Velocity with alevin
Charlotte Soneson
2020-03-22
RNA velocity (La Manno et al., 2018) allows investigation of the dynamic patterns in single-cell RNA-seq data sets, via estimation of the abundance of intronic (“unprocessed”) and exonic (“processed”) RNA in each cell, and modelling of these via a system of ordinary differential equations. From a quantification point of view, RNA velocity analysis requires the generation of two count matrices, representing the processed and unprocessed RNA. In this tutorial, we show how these matrices can be generated using alevin and tximeta (Love et al., 2020). For a more detailed study of the impact of quantification on RNA velocity estimates and interpretation, see Soneson et al., 2020.
The sections below contain code chunks executed in different languages, indicated with different background colors. R code is represented by grey boxes, shell commands by blue boxes, and python code by yellow boxes. We start by loading the required R packages.
suppressPackageStartupMessages({
    library(Biostrings)
    library(BSgenome)
    library(eisaR)
    library(GenomicFeatures)
    library(SummarizedExperiment)
    library(tximeta)
    library(rjson)
    library(reticulate)
    library(SingleCellExperiment)
    library(scater)
})Step 1. Generate reference fasta files
In order to quantify both exonic and intronic abundances with alevin, we need to provide a reference fasta file with both types of sequences. Several tools implement the extraction of transcript and intron sequences from the genome sequence. Here, we will use the eisaR package, but equivalent functionality is available in the BUSpaRse package, and can be replicated e.g. using basic functions from the GenomicFeatures and BSgenome packages.
We start by downloading the reference genome and the corresponding gtf file from the Gencode website.
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M24/gencode.vM24.annotation.gtf.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M24/GRCm38.primary_assembly.genome.fa.gz
gunzip GRCm38.primary_assembly.genome.fa.gzNext, we load the eisaR package and extract a GRanges object containing the genomic coordinates of each annotated transcript and intron. In this example, we use the ‘separate’ approach to define introns separately for each transcript, and add a flank length of 90nt to each intron. We note that the length of the flanking sequence should be chosen depending on the RNA read length and the desired amount of overlap with an intron that is required to consider a read potentially intronic. For more details on the different options, we refer to the help of the getFeatureRanges() function.
gtf <- "gencode.vM24.annotation.gtf.gz"
grl <- eisaR::getFeatureRanges(
  gtf = gtf,
  featureType = c("spliced", "intron"), 
  intronType = "separate", 
  flankLength = 90L, 
  joinOverlappingIntrons = FALSE, 
  verbose = TRUE
)grl[4:6]## GRangesList object of length 3:
## $ENSMUST00000161581.1
## GRanges object with 2 ranges and 5 metadata columns:
##       seqnames          ranges strand |              exon_id
##          <Rle>       <IRanges>  <Rle> |          <character>
##   [1]     chr1 3466587-3466687      + | ENSMUSE00000869502.1
##   [2]     chr1 3513405-3513553      + | ENSMUSE00000864479.1
##       exon_rank        transcript_id              gene_id        type
##       <integer>          <character>          <character> <character>
##   [1]         1 ENSMUST00000161581.1 ENSMUSG00000089699.1        exon
##   [2]         2 ENSMUST00000161581.1 ENSMUSG00000089699.1        exon
##   -------
##   seqinfo: 22 sequences (1 circular) from an unspecified genome; no seqlengths
## 
## $ENSMUST00000192183.1
## GRanges object with 1 range and 5 metadata columns:
##       seqnames          ranges strand |              exon_id
##          <Rle>       <IRanges>  <Rle> |          <character>
##   [1]     chr1 3531795-3532720      + | ENSMUSE00001343235.1
##       exon_rank        transcript_id              gene_id        type
##       <integer>          <character>          <character> <character>
##   [1]         1 ENSMUST00000192183.1 ENSMUSG00000103147.1        exon
##   -------
##   seqinfo: 22 sequences (1 circular) from an unspecified genome; no seqlengths
## 
## $ENSMUST00000193244.1
## GRanges object with 1 range and 5 metadata columns:
##       seqnames          ranges strand |              exon_id
##          <Rle>       <IRanges>  <Rle> |          <character>
##   [1]     chr1 3680155-3681788      + | ENSMUSE00001341983.1
##       exon_rank        transcript_id              gene_id        type
##       <integer>          <character>          <character> <character>
##   [1]         1 ENSMUST00000193244.1 ENSMUSG00000102348.1        exon
##   -------
##   seqinfo: 22 sequences (1 circular) from an unspecified genome; no seqlengthsAfter defining the genomic positions of all features of interest, we can extract the sequences of these, and write to a fasta file for later indexing with Salmon.
genome <- Biostrings::readDNAStringSet(
    "GRCm38.primary_assembly.genome.fa"
)
names(genome) <- sapply(strsplit(names(genome), " "), .subset, 1)
seqs <- GenomicFeatures::extractTranscriptSeqs(
  x = genome, 
  transcripts = grl
)
Biostrings::writeXStringSet(
    seqs, filepath = "gencode.vM24.annotation.expanded.fa"
)To enable reading the estimated abundances with tximeta, automatically recognizing the underlying transcriptome, we write the expanded annotation to a GTF file. This will later be used to create a linked transcriptome for tximeta.
eisaR::exportToGtf(
  grl, 
  filepath = "gencode.vM24.annotation.expanded.gtf"
)Since alevin quantifies spliced and unspliced features jointly, we will also need to split the imported abundances by feature type. The splitting needs to be done in such a way that we can still match up a spliced feature with the corresponding unspliced feature. To help with this, the metadata of the GRanges object contains a data frame with corresponding spliced and unspliced gene IDs.
head(metadata(grl)$corrgene)##                spliced                 intron
## 1 ENSMUSG00000102693.1 ENSMUSG00000102693.1-I
## 2 ENSMUSG00000064842.1 ENSMUSG00000064842.1-I
## 3 ENSMUSG00000102851.1 ENSMUSG00000102851.1-I
## 4 ENSMUSG00000089699.1 ENSMUSG00000089699.1-I
## 5 ENSMUSG00000103147.1 ENSMUSG00000103147.1-I
## 6 ENSMUSG00000102348.1 ENSMUSG00000102348.1-Iwrite.table(
    metadata(grl)$corrgene, 
    file = "gencode.vM24.annotation.expanded.features.tsv",
    row.names = FALSE, col.names = TRUE, quote = FALSE, sep = "\t"
)Finally, we create a text file mapping transcript and intron identifiers to the corresponding gene identifiers.
df <- eisaR::getTx2Gene(
    grl, filepath = "gencode.vM24.annotation.expanded.tx2gene.tsv"
)Step 2. Index the reference features
After creating the fasta file with transcript and intron sequences as above, we can index it using Salmon. Here, we add the full genome sequence as decoy sequences (see Srivastava et al., 2019 for more details). This is recommended in order to avoid reads truly originating from intergenic regions being assigned to a suboptimal transcriptome location. However, the effect of including decoys is typically smaller when both transcripts and introns are being quantified than for ‘regular’ gene expression quantification, since in the former case a larger fraction of the genome is already covered by the features of interest.
grep ">" GRCm38.primary_assembly.genome.fa | cut -d ">" -f 2 | cut -d " " -f 1 > GRCm38.primary_assembly.genome.chrnames.txt
salmon index \
-t <(cat gencode.vM24.annotation.expanded.fa GRCm38.primary_assembly.genome.fa) \
-i gencode.vM24.annotation.expanded.sidx --gencode -p 32 \
-d GRCm38.primary_assembly.genome.chrnames.txtAs mentioned in the previous section, we also create a linked transcriptome with tximeta. This allows tximeta to recognize the reference annotation when reading the alevin quantification, and automatically annotate the resulting SummarizedExperiment object.
tximeta::makeLinkedTxome(
  indexDir = "gencode.vM24.annotation.expanded.sidx", 
  source = "GENCODE", genome = "GRCm38", 
  organism = "Mus musculus", release = "M24", 
  fasta = "gencode.vM24.annotation.expanded.fa", 
  gtf = "gencode.vM24.annotation.expanded.gtf", 
  write = TRUE, jsonFile = "gencode.vM24.annotation.expanded.json"
)rjson::fromJSON(file = "gencode.vM24.annotation.expanded.json")## [[1]]
## [[1]]$index
## [1] "gencode.vM24.annotation.expanded.sidx"
## 
## [[1]]$source
## [1] "GENCODE"
## 
## [[1]]$organism
## [1] "Mus musculus"
## 
## [[1]]$release
## [1] "M24"
## 
## [[1]]$genome
## [1] "GRCm38"
## 
## [[1]]$fasta
## [1] "gencode.vM24.annotation.expanded.fa"
## 
## [[1]]$gtf
## [1] "gencode.vM24.annotation.expanded.gtf"
## 
## [[1]]$sha256
## [1] "d99e4f1eb30d801f6bb787feeb0349e8e27b8913fa9c17f369895584475d69e8"Step 3. Quantify with alevin
After generating the index, we quantify exonic and intronic abundances with alevin. We use an example data set from Hermann et al., 2018. The following code downloads the bam file from SRA and uses the bamtofastq utility to convert it into a set of FASTQ files (note that this can take a long time to execute):
wget https://sra-pub-src-1.s3.amazonaws.com/SRR6459157/AdultMouse_Rep3_possorted_genome_bam.bam.1
mv AdultMouse_Rep3_possorted_genome_bam.bam.1 AdultMouse_Rep3_possorted_genome_bam.bam
bamtofastq --reads-per-fastq=500000000 AdultMouse_Rep3_possorted_genome_bam.bam FASTQtmp
mv FASTQtmp/Ad-Ms-Total-Sorted_20k_count_MissingLibrary_1_HK2GNBBXX/bamtofastq_S1_L006_I1_001.fastq.gz AdultMouseRep3_S1_L001_I1_001.fastq.gz
mv FASTQtmp/Ad-Ms-Total-Sorted_20k_count_MissingLibrary_1_HK2GNBBXX/bamtofastq_S1_L006_R1_001.fastq.gz AdultMouseRep3_S1_L001_R1_001.fastq.gz
mv FASTQtmp/Ad-Ms-Total-Sorted_20k_count_MissingLibrary_1_HK2GNBBXX/bamtofastq_S1_L006_R2_001.fastq.gz AdultMouseRep3_S1_L001_R2_001.fastq.gzNext, we run alevin to quantify the exonic and intronic abundances based on the index generated above.
salmon alevin -l ISR -i gencode.vM24.annotation.expanded.sidx \
-1 AdultMouseRep3_S1_L001_R1_001.fastq.gz \
-2 AdultMouseRep3_S1_L001_R2_001.fastq.gz \
-o alevin_out -p 36 --tgMap gencode.vM24.annotation.expanded.tx2gene.tsv \
--chromium --dumpFeatures --expectCells 1850Step 4. Import abundances into R with tximeta
The tximeta package can be used to import the alevin quantifications into R, and generate a SummarizedExperiment object. We first load the linked transcriptome generated above, and then read the alevin output.
tximeta::loadLinkedTxome("gencode.vM24.annotation.expanded.json")
txi <- tximeta::tximeta(coldata = data.frame(
  names = "AdultMouseRep3",
  files = "alevin_out/alevin/quants_mat.gz", 
  stringsAsFactors = FALSE
), type = "alevin")The txi object contains a single assay (‘counts’) containing both spliced and unspliced abundances. In order to calculate RNA velocities, we need to split this into two matrices, one with spliced and one with unspliced abundances, with corresponding rows. This can be done using the splitSE() function from tximeta, providing the data frame linking spliced and unspliced gene identifiers that we created above. Note that tximeta version 1.5.30 or later is required for this step.
cg <- read.delim("gencode.vM24.annotation.expanded.features.tsv",
                 header = TRUE, as.is = TRUE)
## Rename the 'intron' column 'unspliced' to make assay names compatible with scVelo
colnames(cg)[colnames(cg) == "intron"] <- "unspliced"
txis <- tximeta::splitSE(txi, cg, assayName = "counts")At this point, the txis object contains all the information required for RNA velocity analysis. However, in order to illustrate the direct application of scVelo below, we convert the txis object to a SingleCellExperiment object and perform basic normalization and dimension reduction.
txis <- as(txis, "SingleCellExperiment")
assays(txis) <- list(
    counts = assay(txis, "spliced"),
    spliced = assay(txis, "spliced"),
    unspliced = assay(txis, "unspliced")
)
txis <- scater::logNormCounts(txis)
txis <- scater::runPCA(txis)
txis <- scater::runTSNE(txis, dimred = "PCA")txis## class: SingleCellExperiment 
## dim: 54448 2323 
## metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
## assays(4): counts spliced unspliced logcounts
## rownames(54448): ENSMUSG00000102693.1 ENSMUSG00000064842.1 ...
##   ENSMUSG00000064369.1 ENSMUSG00000064372.1
## rowData names(0):
## colnames(2323): ACATACGTCCTCGCAT ACACCAATCTTCGGTC ...
##   AACGTTGTCATCGATG ATTGGTGGTTACCGAT
## colData names(1): sizeFactor
## reducedDimNames(2): PCA TSNE
## altExpNames(0):print(sum(assay(txis, "spliced")))## [1] 48984893print(sum(assay(txis, "unspliced")))## [1] 3229763Step 5. Run scVelo
Once we have the spliced and unspliced abundance estimates, we can run, e.g., scVelo (Bergen et al., 2019) to estimate and visualize RNA velocities. Note that the code below represents only a simple example application to illustrate the process; please consult the scVelo manual for more details and updated recommendations.
First, we extract the required components of the SingleCellExperiment object in order to create an AnnData object for use in python. Note that the AnnData object is transposed compared to the SingleCellExperiment, and requires rows to represent cells and columns to represent genes.
spliced <- t(assay(txis, "spliced"))
unspliced <- t(assay(txis, "unspliced"))
tsne <- reducedDim(txis, "TSNE")
pca <- reducedDim(txis, "PCA")Then, we import the required python modules and initialize an AnnData object with the spliced and unspliced count matrices. For later visualization purposes, we also include the PCA and tSNE embeddings.
import scanpy as sc## /tungstenfs/groups/gbioinfo/Appz/easybuild/software/Anaconda3/5.3.0/envs/scvelo_0.2.0/lib/python3.7/site-packages/anndata/_core/anndata.py:21: FutureWarning: pandas.core.index is deprecated and will be removed in a future version.  The public classes are available in the top-level namespace.
##   from pandas.core.index import RangeIndeximport sys
import numpy as np
import anndataadata = anndata.AnnData(X = r.spliced, 
                        layers = dict(spliced = r.spliced, unspliced = r.unspliced),
                        obsm = dict(X_pca = r.pca, X_tsne = r.tsne))Next, we use scVelo to select highly variable genes to use as the basis for the velocity calculations. We estimate the velocities using the dynamical approach, and visualize the results in the tSNE embedding calculated above.
import scvelo as scv
import matplotlib
import pandas as pd
matplotlib.use('AGG')
scv.settings.set_figure_params('scvelo')
scv.utils.show_proportions(adata)## Abundance of ['spliced', 'unspliced']: [0.94 0.06]scv.pp.filter_genes(adata, min_shared_counts = 30)## Filtered out 46324 genes that are detected in less than 30 counts (shared).scv.pp.normalize_per_cell(adata, enforce = True)## Normalized count data: X, spliced, unspliced.scv.pp.filter_genes_dispersion(adata, n_top_genes = 2000)
scv.pp.log1p(adata)
scv.pp.moments(adata, n_pcs = 30, n_neighbors = 30)## computing neighbors
##     finished (0:00:03) --> added 
##     'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
## computing moments based on connectivities
##     finished (0:00:00) --> added 
##     'Ms' and 'Mu', moments of spliced/unspliced abundances (adata.layers)scv.tl.recover_dynamics(adata)## recovering dynamics
## 
... 1%
... 2%
... 3%
... 4%
... 6%
... 7%
... 8%
... 9%
... 10%
... 11%
... 12%
... 13%
... 14%
... 16%
... 17%
... 18%
... 19%
... 21%
... 22%
... 23%
... 24%
... 25%
... 26%
... 27%
... 29%
... 30%
... 31%
... 32%
... 33%
... 33%
... 34%
... 36%
... 36%
... 37%
... 39%
... 40%
... 41%
... 43%
... 44%
... 45%
... 47%
... 48%
... 49%
... 50%
... 51%
... 53%
... 53%
... 54%
... 56%
... 57%
... 58%
... 59%
... 60%
... 61%
... 63%
... 64%
... 65%
... 67%
... 68%
... 70%
... 71%
... 73%
... 74%
... 75%
... 76%
... 77%
... 78%
... 79%
... 80%
... 81%
... 82%
... 83%
... 85%
... 86%
... 87%
... 88%
... 89%
... 91%
... 93%
... 94%
... 95%
... 96%
... 97%
... 98%
... 98%
... 100%
    finished (0:04:36) --> added 
##     'fit_pars', fitted parameters for splicing dynamics (adata.var)scv.tl.velocity(adata, mode = 'dynamical')## computing velocities
##     finished (0:00:01) --> added 
##     'velocity', velocity vectors for each individual cell (adata.layers)scv.tl.velocity_graph(adata)## computing velocity graph
## 
... 100%
    finished (0:00:01) --> added 
##     'velocity_graph', sparse matrix with cosine correlations (adata.uns)scv.pl.velocity_embedding_stream(adata, basis='X_tsne')## computing velocity embedding
##     finished (0:00:00) --> added
##     'velocity_tsne', embedded velocity vectors (adata.obsm)Session info
sessionInfo()## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /tungstenfs/groups/gbioinfo/Appz/easybuild/software/OpenBLAS/0.3.7-GCC-8.3.0/lib/libopenblas_skylakex-r0.3.7.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils    
## [7] datasets  methods   base     
## 
## other attached packages:
##  [1] scater_1.16.0               ggplot2_3.3.0              
##  [3] SingleCellExperiment_1.10.1 reticulate_1.15            
##  [5] rjson_0.2.20                tximeta_1.6.2              
##  [7] SummarizedExperiment_1.18.1 DelayedArray_0.14.0        
##  [9] matrixStats_0.56.0          GenomicFeatures_1.40.0     
## [11] AnnotationDbi_1.50.0        Biobase_2.48.0             
## [13] eisaR_1.0.0                 BSgenome_1.56.0            
## [15] rtracklayer_1.48.0          GenomicRanges_1.40.0       
## [17] GenomeInfoDb_1.24.0         Biostrings_2.56.0          
## [19] XVector_0.28.0              IRanges_2.22.2             
## [21] S4Vectors_0.26.1            BiocGenerics_0.34.0        
## 
## loaded via a namespace (and not attached):
##  [1] ProtGenerics_1.20.0           bitops_1.0-6                 
##  [3] bit64_0.9-7                   progress_1.2.2               
##  [5] httr_1.4.1                    tools_4.0.0                  
##  [7] irlba_2.3.3                   R6_2.4.1                     
##  [9] vipor_0.4.5                   colorspace_1.4-1             
## [11] DBI_1.1.0                     lazyeval_0.2.2               
## [13] withr_2.2.0                   gridExtra_2.3                
## [15] tidyselect_1.1.0              prettyunits_1.1.1            
## [17] bit_1.1-15.2                  curl_4.3                     
## [19] compiler_4.0.0                BiocNeighbors_1.6.0          
## [21] scales_1.1.1                  askpass_1.1                  
## [23] rappdirs_0.3.1                stringr_1.4.0                
## [25] digest_0.6.25                 Rsamtools_2.4.0              
## [27] rmarkdown_2.1                 pkgconfig_2.0.3              
## [29] htmltools_0.4.0               dbplyr_1.4.3                 
## [31] fastmap_1.0.1                 ensembldb_2.12.1             
## [33] limma_3.44.1                  rlang_0.4.6                  
## [35] RSQLite_2.2.0                 DelayedMatrixStats_1.10.0    
## [37] shiny_1.4.0.2                 jsonlite_1.6.1               
## [39] BiocParallel_1.22.0           dplyr_0.8.5                  
## [41] BiocSingular_1.4.0            RCurl_1.98-1.2               
## [43] magrittr_1.5                  GenomeInfoDbData_1.2.3       
## [45] Matrix_1.2-18                 ggbeeswarm_0.6.0             
## [47] munsell_0.5.0                 Rcpp_1.0.4.6                 
## [49] viridis_0.5.1                 lifecycle_0.2.0              
## [51] stringi_1.4.6                 yaml_2.2.1                   
## [53] edgeR_3.30.0                  zlibbioc_1.34.0              
## [55] BiocFileCache_1.12.0          AnnotationHub_2.20.0         
## [57] grid_4.0.0                    blob_1.2.1                   
## [59] promises_1.1.0                crayon_1.3.4                 
## [61] lattice_0.20-41               hms_0.5.3                    
## [63] locfit_1.5-9.4                knitr_1.28                   
## [65] pillar_1.4.4                  biomaRt_2.44.0               
## [67] XML_3.99-0.3                  glue_1.4.1                   
## [69] BiocVersion_3.11.1            evaluate_0.14                
## [71] BiocManager_1.30.10           png_0.1-7                    
## [73] vctrs_0.3.0                   httpuv_1.5.2                 
## [75] gtable_0.3.0                  openssl_1.4.1                
## [77] purrr_0.3.4                   assertthat_0.2.1             
## [79] xfun_0.14                     rsvd_1.0.3                   
## [81] mime_0.9                      xtable_1.8-4                 
## [83] AnnotationFilter_1.12.0       later_1.0.0                  
## [85] viridisLite_0.3.0             tibble_3.0.1                 
## [87] beeswarm_0.2.3                GenomicAlignments_1.24.0     
## [89] memoise_1.1.0                 tximport_1.16.0              
## [91] ellipsis_0.3.1                interactiveDisplayBase_1.26.2scv.__version__## '0.2.0' 
					