My Profile Photo


Tutorials for using the alevin-fry single-cell RNA-seq pipeline

Analysing split-seq data using salmon and alevin-fry pipeline

Authors: Gaurav Sharma, Computational Biologist at Ocean Genomics; Jeremy Simon, Assistant Professor at UNC Chapel Hill; Rob Patro, Cofounder at Ocean Genomics
This project has been made possible by the team at Ocean Genomics, and by a grant from the Chan Zuckerberg Initiative.

In this tutorial we will use salmon and alevin-fry based pipeline to analyse SPLiT-seq data. We will also use a new tool splitp that modifies the fastq file to deal with the scenario where paired barcodes are used during the first round of combinatorial barcoding. We start by downloading the data. Post quantification, we will perfom single-cell analysis.


Split-pool ligation-based transcriptome sequencing, or SPLiT-seq, is a single-cell RNA-seq method that uses combinatorial barcoding to label the cell or nucleus source of RNA. This method was introduced in the article “Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding” by Rosenberg et al., Science 2018. As described in the article, different barcodes are ligated during different rounds of barcoding. There are three 8 bp cellbarcodes and a 10 bp UMI. Currently, SPLiT-seq can be performed with two chemistries — v1 and v2. Both v1 and v2 have 24 bp barcode and 10 bp UMI.

In this tutorial, we will use the data from the aforementioned article. The sample used in this tutorial, available for download on GEO, is a mix of cell lines HEK293, HelaS3, and NIH/3T3.

Download the input data

Create a new directory for the analysis

$ mkdir split-seq_tutorial
$ cd split-seq_tutorial

The SRA run corresponding to the data is SRR6750057 and can be downloaded from SRA.

$ mkdir -p data
$ fasterq-dump SRR6750057 -S -O ./data/ -e 16 -m 10GB -b 10MB

The other necessary files can be downloaded from here and saved in the data directory. Running fastp to trim and remove bad quality reads is helpful.

Paired barcoding case

In some cases, barcodes with known pairing are used during first round of combinatorial barcoding. This results in the same cell getting different barcode during the first round but the same barcode during later rounds. For example, if GATAGACA is paired with AACGTGAT, then the barcodes
denote the same cell. The sequencing order is opposite of the order of ligation. For the first round, a transcript may get amplified by either an oligo-dT primer or a random hexamer, the pairing between the barcodes corresponds to the primer of amplification.

To deal with this case, splitp can be used. Splitp is a new tool that should eventually support a variety of complex read pre-processing steps necessary to process reads from various single-cell protocols, though for now the focus is only on the pre-processing necessary to process the SPLiT-seq protocol described here. The mapping file between oligo-dT primers and random hexamers is available at the link provided earlier. The command would be:

$ splitp -r SRR6750057_2.fastq -b ./data/oligo_hex_bc_mapping.txt -s 87 -e 94 -o > SRR6750057_corrected_2.fastq

splitp usage:

    splitp [OPTIONS] --read-file <READ_FILE> --bc-map <BC_MAP> --start <START> --end <END>

    -b, --bc-map <BC_MAP>          the map of oligo-dT to random hexamers
    -e, --end <END>                end position of the random barcode
    -h, --help                     Print help information
    -o, --one-hamming              consider 1-hamming distance neighbors of random hexamers
    -r, --read-file <READ_FILE>    the input R2 file
    -s, --start <START>            start position of the random barcode
    -V, --version                  Print version information

The bc-map, which maps barcodes corresponding to oligo-dT amplification to those corresponding to random hexamers, should be a tab separated text file with oligo-dT primer barcodes in the first column. The file must have a header comment line.

Creating the index

For this tutorial, we will need salmon(>=v1.7.0) and alevin-fry(>=v0.4.2). For the index, we will use a splici (spliced + intron) index, that contains both spliced mRNA transcripts and intronic sequences. This allows us to use the unspliced, spliced, ambiguous (USA) quantification mode of alevin-fry and generates these three types of counts for each gene. It can be used to compute RNA-velocity as well.

Since the fastq is contains a mix of human and mouse cell lines, first we will create a combined index of both species. For details on how to create splici index, please refer to the splici tutorial. For human, we will use GENCODE v31 and for mouse GENCODE vM25. The files needed to generate the splici index can be downloaded from the zenodo link.

Following sequence of commands was used to generate the index. First combine the references, then generate combined splici reference, and finally create the splici index.

Commands to combine the references

$ sed 's/chr/Mchr/g' mouse.gencode.vM25.genome.fa > mouse_gencode_vM25_genome.fa
$ sed 's/chr/Mchr/g' mouse.gencode.vM25.gene_annotations.gtf > mouse_gencode_vM25_gene_annotations.gtf
$ tail -n+6 mouse_gencode_vM25_gene_annotations.gtf > mgg
$ mv mgg mouse_gencode_vM25_gene_annotations.gtf
$ cat human.gencode.v35.genome.fa mouse_gencode_vM25_genome.fa > combined_genome.fa
$ cat human.gencode.v35.gene_annotations.gtf mouse_gencode_vM25_gene_annotations.gtf > combined_gene_annotations.gtf

The following R code was used to generate the reference for indexing (using the roe package; one could also use pyroe instead):


genome_path <- "combined_genome.fa"
gtf_path <- "combined_gene_annotations.gtf"
filename_prefix = "transcriptome_splici"
output_dir = "splici_index_reference"
read_length = 66
flank_trim_length = 5

  genome_path = genome_path,
  gtf_path = gtf_path,
  read_length = read_length,
  output_dir = output_dir,
  flank_trim_length = flank_trim_length,
  filename_prefix = filename_prefix,
  dedup_seqs = FALSE,
  no_flanking_merge = FALSE

Finally, salmon index is used to generate the index

salmon index -t splici_index_reference/transcriptome_splici_fl61.fa -i human_mouse_index -p 32


The quantification process is as follows:

  • salmon alevin peforms mapping of the fragments to the index to generate a RAD file.
  • alevin-fry uses that output to generate-permit list, collate the RAD file and perform quantification of the collated RAD file.

To read more about these steps please refer to alevin-fry documentation.

For the first step, we use salmon alevin to generate a RAD (Reduced Alignment Data) file. To use SPLiT-seq protocol, we can use either --splitseqV1 or --splitseqV2 flags for v1 and v2 chemistries, respectively.

The chemistry used by Rosenberg et al. is v1, whereas v2 is used for the commercial data from Parse biosciences. Both are similar, with the only difference being in the position of the third barcode. In v1 it occurs at position 78 and in v2 it occurs at position 86 of the sequenced read. The salmon alevin implementation assumes that, like the files from Rosenberg et al., the biological reads are in R1 fastq file and the barcodes and umi are present in R2. If your data has the opposite order, i.e., barcode and umi in R1, then in the command below use -1 for R2 and -2 for R1.

Here we use the --sketch flag to perform pseudoalignment with structural constraints, though selective-alignment could also be used (by passing --rad instead of --sketch). For viewing a detailed list of help options please use salmon alevin -h.

$ salmon alevin -i ./data/splici_index_human.v31_mouse.vM25 -l A -1 ./data/SRR6750057_1.fastq.gz -2 ./data/SRR6750057_corrected_2.fastq.gz -p 32 --splitseqV1 -o ./SRR6750057_run --sketch

To correct and associate sequenced barcodes to the most likely “corrected” barcodes, alevin-fry generate-permit-list is used. Here we use the flag -k, which attempts to estimate the number of high quality barcodes to include based on the “knee” of the cumulative frequency distribution of the reads associated with barcodes. Then the RAD file is collated using alevin-fry collate and quantification is performed using alevin-fry quant. The counts are generated as mtx files with cells as rows and genes as columns.

$ alevin-fry generate-permit-list -d both -i ./SRR6750057_run --output-dir ./SRR6750057_out_permit_knee -k
$ alevin-fry collate -r ./SRR6750057_run -t 16 -i ./SRR6750057_out_permit_knee
$ alevin-fry quant -m splici_index_reference/transcriptome_splici_fl61_t2g_3col.tsv -i ./SRR6750057_out_permit_knee -o ./SRR6750057_counts -t 16 -r cr-like-em --use-mtx

Post-quantification analysis


Since SPLiT-seq/ParseBio amplifies transcripts based on both oligo-dT and random hexamer priming, we recommend using outputFormat=“snRNA” to maximally capture signal occurring anywhere in the body of the mature or immature transcript. More quantitative details and benefits of this to come in a future manuscript.

# Read in entire output directory from alevin-fry using fishpond, will create SingleCellExperiment object
sce <- loadFry(fryDir = "SRR6750057_counts/", outputFormat = "snRNA")
## locating quant file
## Reading meta data
## USA mode: TRUE
## Processing 116057 genes and 301 barcodes
## Using pre-defined output format: snRNA
## Building the 'counts' assay, which contains U S A
## Constructing output SingleCellExperiment object
## Done
# Convert gene ids to gene symbols
gene_annotation_table <- fread("./data/combined_geneid_genesymbols.txt")
geneNames <- gene_annotation_table$GeneSymbol[match(rownames(sce),gene_annotation_table$Geneid)]
rownames(sce) <- geneNames
# Some gene names are duplicated such as the same gene in chromosomes X and Y. Those are merged here. 
exp.gene.grp <- t(sparse.model.matrix(~ 0 + geneNames))
exp.summarized <- exp.gene.grp %*% counts(sce)
rownames(exp.summarized) <- rownames(exp.summarized) %>% str_replace_all("geneNames","")
# Create Seurat object
rosenberg300ubc.seurat  <- CreateSeuratObject(counts = exp.summarized)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
# Compute mitochondrial contamination and filter out low quality cells using miQC
rosenberg300ubc.seurat <- subset(rosenberg300ubc.seurat, subset = nCount_RNA > 750 & nFeature_RNA > 375)
rosenberg300ubc.seurat <- PercentageFeatureSet(object = rosenberg300ubc.seurat, pattern = "^MT|^Mouse-mt", = "")
rosenberg300ubc.seurat <- RunMiQC(rosenberg300ubc.seurat, = "", nFeature_RNA = "nFeature_RNA", posterior.cutoff = 0.75, model.slot = "flexmix_model")
## Warning in RunMiQC(rosenberg300ubc.seurat, = "",
## nFeature_RNA = "nFeature_RNA", : flexmix returned only 1 cluster
## defaulting to backup.percentile for filtering
## Warning: Adding a command log without an assay associated with it
rosenberg300ubc.seurat.filtered <- subset(rosenberg300ubc.seurat, miQC.keep == "keep")

## [1] 114900    297
# Normalize and scale data
rosenberg300ubc.seurat.filtered <- NormalizeData(rosenberg300ubc.seurat.filtered)
rosenberg300ubc.seurat.filtered <- FindVariableFeatures(rosenberg300ubc.seurat.filtered, nfeatures = 5000)
all.features <- rownames(rosenberg300ubc.seurat.filtered@assays$RNA@counts)
rosenberg300ubc.seurat.filtered <- ScaleData(rosenberg300ubc.seurat.filtered, features = all.features)
## Centering and scaling data matrix
# Run PCA, then determine how many PCs are informative
rosenberg300ubc.seurat.filtered <- RunPCA(rosenberg300ubc.seurat.filtered, verbose = FALSE, npcs = 100)
ElbowPlot(rosenberg300ubc.seurat.filtered, ndims = 30, reduction = "pca")

# Run UMAP, and identify cell clusters
rosenberg300ubc.seurat.filtered <- RunUMAP(rosenberg300ubc.seurat.filtered, dims = 1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 12:27:11 UMAP embedding parameters a = 0.9922 b = 1.112
## 12:27:11 Read 297 rows and found 10 numeric columns
## 12:27:11 Using Annoy for neighbor search, n_neighbors = 30
## 12:27:11 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 12:27:11 Writing NN index file to temp file /tmp/RtmpnqRw7m/file13908e7ceec965
## 12:27:11 Searching Annoy index using 1 thread, search_k = 3000
## 12:27:11 Annoy recall = 100%
## 12:27:12 Commencing smooth kNN distance calibration using 1 thread
## 12:27:13 Initializing from normalized Laplacian + noise
## 12:27:13 Commencing optimization for 500 epochs, with 11226 positive edges
## 12:27:15 Optimization finished
rosenberg300ubc.seurat.filtered <- FindNeighbors(rosenberg300ubc.seurat.filtered, dims = 1:10, verbose = FALSE)
rosenberg300ubc.seurat.filtered <- FindClusters(rosenberg300ubc.seurat.filtered, verbose = FALSE, resolution = 0.1, algorithm=2)

# Plot UMAP labeled by clusters
DimPlot(rosenberg300ubc.seurat.filtered,reduction = "umap", label = TRUE)

# Find markers of the 3 clusters
markers = FindAllMarkers(rosenberg300ubc.seurat.filtered, assay="RNA", slot="", only.pos=T)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
# Cross-referenced top markers with Human Protein Atlas RNA expression in cell lines
HEK293.markers = c("SLIT2", "ZNF829", "KPNA5", "BRD2", "SPEN")
HELA.markers = c("NPR3", "PDE2A", "COL7A1", "FOLR1")
NIH3T3.markers = c("Mouse-Srrm2", "Mouse-Hspa5", "Mouse-Scd2", "Mouse-Huwe1") # These are the only mouse cells in the mix, so any mouse gene symbols will do
goi = c(HEK293.markers, HELA.markers, NIH3T3.markers)

# Create violin plot
VlnPlot(rosenberg300ubc.seurat.filtered, features = goi, stack=T, flip=T, sort=T, slot="")