My Profile Photo

alevin-fry-tutorials


Tutorials for using the alevin-fry single-cell RNA-seq pipeline


An introduction to RNA-velocity using alevin-fry

Recently, RNA-velocity estimation has becomes increasingly popular tool in single-cell RNA seq analysis. In this post, we will discuss an additional advantage brought by the Unspliced-Spliced-Ambiguous (USA) mode introduced in alevin-fry 0.3.0. That is, the solution presented in that approach for controlling the spurious mapping to spliced transcripts of sequenced fragments arising from introns (in the absence of full decoy) basically gives us the preprocessing results we need to perform an RNA-velocity analysis “for free”. Here we provide an end-to-end tutorial describing how to perform an RNA-velocity analysis for a 10X Chromium dataset. In this tutorial, we will show the whole analysis pipeline, starting from the raw FASTQ files to the gorgeous velocity plots (generated by scVelo) that you may like to include in your next analysis or paper.

Preliminaries and overview of the pipeline

For this pipeline, you will need salmon (>= v1.4.0), and alevin-fry (>= v0.3.0). As always, it’s best to use as recent versions of these tools as possible. To prepare the reference spliced+intron (splici) transcriptome, you will need a recent version of R and the packages eisaR, Biostrings and BSgenome, as well as gffread. To infer the RNA-velocity, you will need scVelo python package and its dependencies installed.

Downloading Dataset

In this tutorial, we will use a mouse pancreas experiment data introduced by Bastidas-Ponce et al. 2019, GEO accession GSE132188, as a real world example. A subset of this dataset is included in the scVelo python package as a built-in example dataset (below we will call it the scVelo dataset). Here we will start with the raw FASTQ file downloaded from GEO, and walk through the RNA-velocity analysis pipeline until the generation of the velocity graph using scVelo.

To download the dataset from GEO, make sure that there is at least 30GB free space in your disk, and open your terminal to run:

$ mkdir af_tutorial_velocity
$ cd af_tutorial_velocity
$ mkdir data
$ cd data
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR920/004/SRR9201794/SRR9201794_1.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR920/004/SRR9201794/SRR9201794_2.fastq.gz

Preparing the splici reference

Following our USA mode tutorial, we build the splici reference over the mouse transcriptome.

Obtaining the reference sequeneces and GTF

First we obtain the pre-built reference sequences and GTF from 10X Genomics. You can also choose to build it on your own by following this note.

$ wget http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-mm10-2.1.0.tar.gz
$ tar xvf refdata-cellranger-mm10-2.1.0.tar.gz

and we’ll launch into R to do the actual processing

$ R

click to see the R function for making splici transcriptome, this function needs to be pasted into R and will be used in the following R code:

CLICK ME!


After pasted the above function into R, please run the following code in R:

suppressPackageStartupMessages({
  library(eisaR)
  library(Biostrings)
  library(BSgenome)
  library(stringr)
  library(GenomicFeatures)
})

ref_path <- file.path("refdata-cellranger-mm10-2.1.0")
gtf_path = file.path(ref_path, "genes/genes.gtf")
genome_path = file.path(ref_path, "fasta/genome.fa")
read_length = 151
flank_trim_length = 5

output_dir = paste0("transcriptome_splici_fl", read_length - flank_trim_length)

make_splici_txome(gtf_path=gtf_path, 
                  genome_path=genome_path, 
                  read_length=read_length, 
                  flank_trim_length=flank_trim_length, 
                  output_dir=output_dir)

Building the salmon/alevin splici index

First we get the Entrez gene ID to gene name mapping.

$ gffread refdata-cellranger-mm10-2.1.0/genes/genes.gtf -o refdata-cellranger-mm10-2.1.0/genes/genes.gff
$ grep "gene_name" refdata-cellranger-mm10-2.1.0/genes/genes.gff | cut -f9 | cut -d';' -f2,3 | sed 's/=/ /g' | sed 's/;/ /g' | cut -d' ' -f2,4 | sort | uniq > geneid_to_name.txt

Now, we will build the index:

$ cd ..
$ salmon index -t data/transcriptome_splici_fl146/transcriptome_splici_fl146.fa -i mm10_splici_idx -p 16

The -p 16 flag tells salmon to use multiple threads during indexing. While not all steps of indexing are multi-threaded, many are, and this will speed up index construction substantially.

Mapping the data to obtain a RAD file

Now that we have downloaded the data and indexed the splici reference, we are ready to map our reads. We will be using sketch mode which uses pseudoalignment with structural constraints to generate a RAD file (a chunk-oriented binary (Reduced Alignment Data) file storing the information necessary for subsequent phases of processing). We can do this as follows:

$ salmon alevin -i mm10_splici_idx -p 16 -l ISR --chromium --sketch \
-1 data/SRR9201794_1.fastq.gz \
-2 data/SRR9201794_2.fastq.gz \
-o pancreas_map \
--tgMap data/t2g.tsv

Note: If you are using salmon 1.5.0 or greater, you can leave off the --tgMap argument in the command above. This will produce an output folder called pancreas_map that contains all of the information that we need to process the mapped reads (1).

Processing the mapped reads

Keep following our USA-mode tutorial, we then run alevin-fry to generate permit list, collate corrected cells and do quantification.

$ alevin-fry generate-permit-list -d fw -k -i pancreas_map -o pancreas_quant
$ alevin-fry collate -t 16 -i pancreas_quant -r pancreas_map
$ alevin-fry quant -t 16 -i pancreas_quant -o pancreas_quant_res --tg-map data/transcriptome_splici_fl146/transcriptome_splici_fl146_t2g_3col.tsv --resolution cr-like --use-mtx

The output

After running this command, the resulting quantification information can be found in pbmc_10k_v3_quant_res/. Crucially, this entire directory will be important for the functions we provide below, so please keep its structure intact. Within this directory, there are some metadata files, including one actually named meta_info.json. If you look in this file, you’ll see something like the following:

{
  "alt_resolved_cell_numbers": [],
  "dump_eq": false,
  "num_genes": 86076,
  "num_quantified_cells": 11736,
  "resolution_strategy": "CellRangerLike",
  "usa_mode": true
}

The usa_mode flag says that our data was processed in Unspliced-Spliced-Ambiguous mode. This happened since the transcript-to-gene mapping we provided, t2g_3col.tsv, had 3 columns instead of 2, to allow specifying the splicing state of each transcript.

Within this directory, there is a subdirectory named alevin with 3 files: quants_mat.mtx, quants_mat_cols.txt, and quants_mat_rows.txt which correspond, respectively, to the count matrix, the gene names for each column of this matrix, and the corrected, filtered cell barcodes for each row of this matrix. Importantly, the dimensions of this matrix will be num_cells x 3num_genes — the factor of 3 arises because this quantification mode separately attributes UMIs to _spliced, or unspliced (in this case, intronic) gene sequence, or as ambiguous (not able to be confidently assigned to the spliced or unspliced category). All of these counts are packed into the same output matrix. The first num_genes columns correspond to the spliced counts, the next num_genes columns correspond to the unspliced counts, and the final num_genes columns correspond to the ambiguous counts.

RNA velocity estimation

Subsequent velocity estimation is done using scVelo. We will also show that, when using same set of genes and cells as in the scVelo tutorial, the velocity graph generated from the alevin-fry count matrix (below we call it as the fry dataset) is very similar to the scVelo tutorial output, even though these two datasets are generated by completely different toolchains.

First, we’ll load our favorite Python environment; you can use IPython, or a Jupyter notebook.

processing the count matrix

We load the packages we will use in the following procedure.

import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import scvelo as scv
import matplotlib
import scipy
import json
import os

We then process the count matrix generated by alevin-fry:

frydir = "pancreas_quant_res"
e2n_path = "data/geneid_to_name.txt"
fpath = os.path.sep.join([frydir, "quant.json"])
if !os.path.exists(fpath):
  fpath = os.path.sep.join([frydir, "meta_info.json"])

meta_info = json.load(open(fpath))
ng = meta_info['num_genes']
usa_mode = meta_info['usa_mode']

if usa_mode:
    print("processing input in USA mode, will return A+S as the spliced count, and U as the unspliced count")
else:
    print("please follow previous steps to generate the ount matrix in the USA mode")
    assert(False)

af_raw = sc.read_mtx(os.path.sep.join([frydir, "alevin", "quants_mat.mtx"]))
ng = int(ng/3)
e2n = dict([ l.rstrip().split() for l in open(e2n_path).readlines()])
var_names = [ l.rstrip() for l in open(os.path.sep.join([frydir, "alevin", "quants_mat_cols.txt"])).readlines()][:ng]
var_names = [e2n[e] for e in var_names]

obs_names = [ l.rstrip() for l in open(os.path.sep.join([frydir, "alevin", "quants_mat_rows.txt"])).readlines() ]

x = af_raw.X
spliced = x[:,range(0,ng)] + x[:,range(2*ng,3*ng)]
unspliced = x[:,range(ng, 2*ng)]

Here we use S+A as the spliced count. This strategy is widely adopted by other tools and makes the assumption that ambiguous UMIs are most likely to derive from spliced transcripts. In this data, ambiguous counts make up only 4% of the total deduplicated UMIs. We tested many strategies to assign ambiguous count, and we found that as long as the rule of assigning ambiguous count relates to the proportion of S to U, the velocity graph undergoes only small changes.

Run scVelo

With the spliced and unspliced count matrices in hand, we can run scVelo to generate the velocity graph. scVelo provides multiple modes when calculating velocity, here we use the dynamical mode as an example.

# create AnnData using spliced and unspliced count matrix
adata = anndata.AnnData(X = spliced,
                        layers = dict(spliced = spliced,
                                    unspliced = unspliced))
adata.obs_names = obs_names
adata.var_names = var_names
adata.var_names_make_unique()

# get embeddings
sc.tl.pca(adata)
sc.pp.neighbors(adata)
sc.tl.tsne(adata)
sc.tl.umap(adata, n_components = 2)

# housekeeping
matplotlib.use('AGG')
scv.settings.set_figure_params('scvelo')

# get the proportion of spliced and unspliced count
scv.utils.show_proportions(adata)

# filter cells and genes, then normalize expression values
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000,enforce=True)

# scVelo pipeline
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
scv.tl.recover_dynamics(adata, n_jobs = 11)
scv.tl.velocity(adata, mode = 'dynamical')
scv.tl.velocity_graph(adata)
adata.write('pancreas_full_dim_scvelo.h5ad', compression='gzip')
scv.pl.velocity_embedding_stream(adata, basis='X_umap', save="pancreas_full_dim.png")

The velocity graph is under the figure folder of the working directory.

Fig 1. Velocity graph of the alevin-fry dataset.

Utilizing the cell type annotation provided by scVelo

The scvelo dataset contains cell type annotation information, which is very nice to show in the velocity graph. However, the scVelo dataset has fewer genes and cells than the fry dataset, probably becauase we used a slightly different version of reference transcriptome. So, we do an intersection of the two dataset to implant the annotation information from the scVelo dataset to our fry dataset.

# create adata
adata = anndata.AnnData(X = spliced, 
                        layers = dict(spliced = spliced, 
                                    unspliced = unspliced))

# assign obs_names(cell IDs) and var_names(gene names)
adata.obs_names = obs_names
adata.var_names = var_names
adata.var_names_make_unique()

example_adata = scv.datasets.pancreas()
subset_adata = adata[example_adata.obs_names, np.intersect1d(example_adata.var_names, adata.var_names)]
example_adata = example_adata[subset_adata.obs_names, subset_adata.var_names]
subset_adata.obs = example_adata.obs
subset_adata.obsm["X_umap"] = example_adata.obsm["X_umap"]

# housekeeping
matplotlib.use('AGG')
scv.settings.set_figure_params('scvelo')

# get the proportion of spliced and unspliced count
scv.utils.show_proportions(subset_adata)

# filter cells and genes, then normalize expression values
scv.pp.filter_and_normalize(subset_adata, min_shared_counts=20, n_top_genes=2000,enforce=True)

# scVelo pipeline
scv.pp.moments(subset_adata, n_pcs=30, n_neighbors=30)
scv.tl.recover_dynamics(subset_adata, n_jobs = 11)
scv.tl.velocity(subset_adata, mode = 'dynamical')
scv.tl.velocity_graph(subset_adata)
subset_adata.write('pancreas_trimmed_scvelo.h5ad', compression='gzip')
scv.pl.velocity_embedding_stream(subset_adata, basis='X_umap', save="pancreas_trimmed.png")

By doing so, the velocity grpah generated is colored by the cell type annotation. As we can see, using the same cells and genes and UMAP embedding, this velocity graph very similar to the one shown in scVelo’s tutorial.

Fig 2. Velocity graph of alevin-fry dataset using scVelo provided cell type annotation and UMAP embedding.

Summary

In this tutorial, we’ve demonstrate an end-to-end pipeline using the USA (Unspliced/Spliced/Ambiguous) mode of alevin-fry for performing RNA-velocity analysis. In this mode, alevin-fry genearate three counts for each gene within each cell, which represent the confidently unspliced count, the confidently spliced count and the ambiguous count. The scVelo package desired AnnData object can be easily generated using these three count matrices. We further show that, if we only include the cells and genes that are used in the scVelo dataset, which comes from the same raw data as the fry dataset, the velocity graph generated by the scVelo dataset and the fry dataset are very similar. This shows that alevin-fry’s USA mode — initially designed to avoid the problem of spuriously attributing UMIs arising from unspliced transcripts to (spliced) gene expression — has the ability to accurately infer the spliced and the unspliced counts for the genes in a given scRNA-seq experiment. Hence, processing your data in USA-mode gives the input necessary for subsequent RNA-velocity analysis “for free”!

Notes

  • A complete version of the scVelo related code can be found at here.
  • This tutorial is primarily the work of Dongze He. We would also like to thank Charlotte Soneson for providing valuable feedback on the proposed pipeline and code.