My Profile Photo

alevin-fry-tutorials


Tutorials for using the alevin-fry single-cell RNA-seq pipeline


Analysing split-seq data using salmon and alevin-fry pipeline

Authors: Gaurav Sharma, Computational Biologist at Ocean Genomics; Jeremy Simon, Assistant Professor at UNC Chapel Hill; Rob Patro, Cofounder at Ocean Genomics
This project has been made possible by the team at Ocean Genomics, and by a grant from the Chan Zuckerberg Initiative.

In this tutorial we will use salmon and alevin-fry based pipeline to analyse SPLiT-seq data. We will also use a new tool splitp that modifies the fastq file to deal with the scenario where paired barcodes are used during the first round of combinatorial barcoding. We start by downloading the data. Post quantification, we will perfom single-cell analysis.

Introduction

Split-pool ligation-based transcriptome sequencing, or SPLiT-seq, is a single-cell RNA-seq method that uses combinatorial barcoding to label the cell or nucleus source of RNA. This method was introduced in the article “Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding” by Rosenberg et al., Science 2018. As described in the article, different barcodes are ligated during different rounds of barcoding. There are three 8 bp cellbarcodes and a 10 bp UMI. Currently, SPLiT-seq can be performed with two chemistries — v1 and v2. Both v1 and v2 have 24 bp barcode and 10 bp UMI.

In this tutorial, we will use the data from the aforementioned article. The sample used in this tutorial, available for download on GEO, is a mix of cell lines HEK293, HelaS3, and NIH/3T3.

Download the input data

Create a new directory for the analysis

$ mkdir split-seq_tutorial
$ cd split-seq_tutorial

The SRA run corresponding to the data is SRR6750057 and can be downloaded from SRA.

$ mkdir -p data
$ fasterq-dump SRR6750057 -S -O ./data/ -e 16 -m 10GB -b 10MB

The other necessary files can be downloaded from here and saved in the data directory. Running fastp to trim and remove bad quality reads is helpful.

Paired barcoding case

In some cases, barcodes with known pairing are used during first round of combinatorial barcoding. This results in the same cell getting different barcode during the first round but the same barcode during later rounds. For example, if GATAGACA is paired with AACGTGAT, then the barcodes
CTGTAGCC-ACACAGAA-AACGTGAT, and
CTGTAGCC-ACACAGAA-GATAGACA
denote the same cell. The sequencing order is opposite of the order of ligation. For the first round, a transcript may get amplified by either an oligo-dT primer or a random hexamer, the pairing between the barcodes corresponds to the primer of amplification.

To deal with this case, splitp can be used. Splitp is a new tool that should eventually support a variety of complex read pre-processing steps necessary to process reads from various single-cell protocols, though for now the focus is only on the pre-processing necessary to process the SPLiT-seq protocol described here. The mapping file between oligo-dT primers and random hexamers is available at the link provided earlier. The command would be:

$ splitp -r SRR6750057_2.fastq -b ./data/oligo_hex_bc_mapping.txt -s 87 -e 94 -o > SRR6750057_corrected_2.fastq

splitp usage:

USAGE:
    splitp [OPTIONS] --read-file <READ_FILE> --bc-map <BC_MAP> --start <START> --end <END>

OPTIONS:
    -b, --bc-map <BC_MAP>          the map of oligo-dT to random hexamers
    -e, --end <END>                end position of the random barcode
    -h, --help                     Print help information
    -o, --one-hamming              consider 1-hamming distance neighbors of random hexamers
    -r, --read-file <READ_FILE>    the input R2 file
    -s, --start <START>            start position of the random barcode
    -V, --version                  Print version information

The bc-map, which maps barcodes corresponding to oligo-dT amplification to those corresponding to random hexamers, should be a tab separated text file with oligo-dT primer barcodes in the first column. The file must have a header comment line.

Quantification

The quantification process is as follows:

  • salmon alevin peforms selective alignment to generate a RAD file; though one could use pseudoalignment with structural constraints if preferred.
  • alevin-fry uses that output to generate-permit list, collate the RAD file and perform quantification of the collated RAD file.

To read more about these steps please refer to alevin-fry documentation.

Since the fastq is contains a mix of human and mouse cell lines, first we will create a combined index of both species. For human, we will use GENCODE v31 and for mouse GENCODE vM25. The generated index can be downloaded from the zenodo link. The index would need to be extracted before use.

$ zcat human.gencode.v31.transcripts.fa.gz mouse.gencode.vM25.transcripts.fa.gz > humanv31_mouseM25_transcripts.fa.gz
$ salmon index -t humanv31_mouseM25_transcripts.fa.gz -p 32 -i salmon_index_human.v31_mouse.vM25_wo_decoy --gencode

For the first step, we use salmon alevin to generate a RAD (Reduced Alignment Data) file. To use SPLiT-seq protocol, we can use either --splitseqV1 or --splitseqV2 flags for v1 and v2 chemistries, respectively. Here we use --rad flag to perform selective-alignment. For viewing a detailed list of help options please use salmon alevin -h.

$ salmon alevin -i ./data/salmon_index_human.v31_mouse.vM25_wo_decoy -l A -1 ./data/SRR6750057_1.fastq.gz -2 ./data/SRR6750057_corrected_2.fastq.gz -p 32 --splitseqV1 -o ./SRR6750057_run --rad

To correct and associate barcodes to “true” barcodes, alevin-fry generate-permit-list is used. Here we use the flag -k that attempts to estimate the number of high quality barcodes to include based on the “knee” of the cumulative frequency distribution of the reads associated with barcodes. Then the RAD file is collated using alevin-fry collate and quantification is performed using alevin-fry quant. The counts are generated as mtx files with cells as rows and genes as columns.

$ alevin-fry generate-permit-list -d both -i ./SRR6750057_run --output-dir ./SRR6750057_out_permit_knee -k
$ alevin-fry collate -r ./SRR6750057_run -t 16 -i ./SRR6750057_out_permit_knee
$ alevin-fry quant -m data/human.v31_mouse.vM25_txp2gene.tsv -i ./SRR6750057_out_permit_knee -o ./SRR6750057_counts -t 16 -r full --use-mtx

Post-quantification analysis


library(Seurat)
## Warning in register(): Can't find generic `scale_type` in package ggplot2 to
## register S3 method.
## Registered S3 method overwritten by 'spatstat.geom':
##   method     from
##   print.boxx cli
## Attaching SeuratObject
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(Matrix)
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(glmGamPoi)
library(miQC)
library(SeuratWrappers)
library(flexmix)
## Loading required package: lattice
# Read in matrix market output from alevin-fry
expmat <- ReadMtx(mtx = "./SRR6750057_counts/alevin/quants_mat.mtx",cells = "./SRR6750057_counts/alevin/quants_mat_rows.txt", features = "./SRR6750057_counts/alevin/quants_mat_cols.txt", feature.column = 1, mtx.transpose = T)

# Collapse transcripts (gencode.v31 ENSG) to gene symbols
tx2gene <- read.table("./data/gencode.v31.vM25.tx2gene.tsv",header=F,sep="\t",col.names=c("tx","gene"))
exp.txId <- rownames(expmat)
exp.geneId <- as.vector(tx2gene$gene[match(exp.txId, tx2gene$tx)])
exp.tx.grp <- t(sparse.model.matrix(~ 0 + exp.geneId))
exp.summarized <- exp.tx.grp %*% expmat
rownames(exp.summarized) <- rownames(exp.summarized) %>% str_replace_all(".+.geneId","")

# Convert gene-summarized matrix to Seurat object
rosenberg300ubc.seurat <- CreateSeuratObject(counts = exp.summarized)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
# Compute mitochondrial contamination and filter out low quality cells using miQC
rosenberg300ubc.seurat <- subset(rosenberg300ubc.seurat, subset = nCount_RNA > 750 & nFeature_RNA > 375)
rosenberg300ubc.seurat <- PercentageFeatureSet(object = rosenberg300ubc.seurat, pattern = "^mt-|^HUMAN-MT-", col.name = "percent.mt")
rosenberg300ubc.seurat <- RunMiQC(rosenberg300ubc.seurat, percent.mt = "percent.mt", nFeature_RNA = "nFeature_RNA", posterior.cutoff = 0.75, model.slot = "flexmix_model")
## Warning in RunMiQC(rosenberg300ubc.seurat, percent.mt = "percent.mt",
## nFeature_RNA = "nFeature_RNA", : flexmix returned only 1 cluster
## defaulting to backup.percentile for filtering
## Warning: Adding a command log without an assay associated with it
rosenberg300ubc.seurat.filtered <- subset(rosenberg300ubc.seurat, miQC.keep == "keep")

dim(rosenberg300ubc.seurat.filtered)
## [1] 114341    290
# Normalize and scale data
rosenberg300ubc.seurat.filtered <- NormalizeData(rosenberg300ubc.seurat.filtered)
rosenberg300ubc.seurat.filtered <- FindVariableFeatures(rosenberg300ubc.seurat.filtered, nfeatures = 5000)
all.features <- rownames(rosenberg300ubc.seurat.filtered@assays$RNA@counts)
rosenberg300ubc.seurat.filtered <- ScaleData(rosenberg300ubc.seurat.filtered, features = all.features)
## Centering and scaling data matrix
# Run PCA, then determine how many PCs are informative
rosenberg300ubc.seurat.filtered <- RunPCA(rosenberg300ubc.seurat.filtered, verbose = FALSE, npcs = 100)
ElbowPlot(rosenberg300ubc.seurat.filtered, ndims = 30, reduction = "pca")

# Run UMAP, and identify cell clusters
rosenberg300ubc.seurat.filtered <- RunUMAP(rosenberg300ubc.seurat.filtered, dims = 1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 23:52:48 UMAP embedding parameters a = 0.9922 b = 1.112
## 23:52:48 Read 290 rows and found 10 numeric columns
## 23:52:48 Using Annoy for neighbor search, n_neighbors = 30
## 23:52:48 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 23:52:48 Writing NN index file to temp file /tmp/RtmpaoQU9f/file36872e2ec0417c
## 23:52:48 Searching Annoy index using 1 thread, search_k = 3000
## 23:52:48 Annoy recall = 100%
## 23:52:48 Commencing smooth kNN distance calibration using 1 thread
## 23:52:49 Initializing from normalized Laplacian + noise
## 23:52:49 Commencing optimization for 500 epochs, with 10794 positive edges
## 23:52:50 Optimization finished
rosenberg300ubc.seurat.filtered <- FindNeighbors(rosenberg300ubc.seurat.filtered, dims = 1:10, verbose = FALSE)
rosenberg300ubc.seurat.filtered <- FindClusters(rosenberg300ubc.seurat.filtered, verbose = FALSE, resolution = 0.2, algorithm=2)

# Plot UMAP labeled by clusters
DimPlot(rosenberg300ubc.seurat.filtered,reduction = "umap", label = TRUE)

# Find markers of the 3 clusters
markers = FindAllMarkers(rosenberg300ubc.seurat.filtered, assay="RNA", slot="scale.data", only.pos=T)
## Calculating cluster 0
## For a more efficient implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the limma package
## --------------------------------------------
## install.packages('BiocManager')
## BiocManager::install('limma')
## --------------------------------------------
## After installation of limma, Seurat will automatically use the more 
## efficient implementation (no further action necessary).
## This message will be shown once per session
## Calculating cluster 1
## Calculating cluster 2
# Cross-referenced top markers with Human Protein Atlas RNA expression in cell lines
HEK293.markers = c("HUMAN-SLIT2", "HUMAN-ZNF829", "HUMAN-KPNA5", "HUMAN-BRD2", "HUMAN-SPEN")
HELA.markers = c("HUMAN-NPR3", "HUMAN-PDE2A", "HUMAN-COL7A1", "HUMAN-FOLR1")
NIH3T3.markers = c("Srrm2", "Hspa5", "Scd2", "Huwe1") # These are the only mouse cells in the mix, so any mouse gene symbols will do
goi = c(HEK293.markers, HELA.markers, NIH3T3.markers)

# Create violin plot
VlnPlot(rosenberg300ubc.seurat.filtered, features = goi, stack=T, flip=T, sort=T, slot="scale.data")