8 Nov 2021 • on alevin-fry

Analysing sci-RNA-seq3 data using salmon and alevin-fry pipeline

Author: Gaurav Sharma, Computational Biologist at Ocean Genomics
This project has been made possible by the team at Ocean Genomics, and by a grant from the Chan Zuckerberg Initiative.

In this tutorial we will look at how to process sci-RNA-seq3 reads using a salmon and alevin-fry based pipeline. We start by downloading the input data. We then generate the splici index for the organism and quantify the reads using salmon and alevin-fry. With the gene quants, we perform preprocessing and clustering. Finally, we identify marker genes for clusters which can be further used to assign cell types.

Introduction

3-level Single-cell combinatorial indexing RNA sequencing (sci-RNA-seq3) can sequence more than 2 million single cells/nuclei per run and thus is an ultra high-throuput sequencing method. To uniquely label single nuclei/cells, it uses split-pool barcoding. It in an upgrade to the previous version sci-RNA-seq.

In this tutorial, we use the data from the article “The single-cell transcriptional landscape of mammalian organogenesis” by Cao et al., Nature 2019. They profiled more than 2 million cells that amounts to more than 1 TB of sequencing data. To demonstrate the application of salmon and alevin-fry, we will quantify only a handful of the fastq files for this tutorial.

Download the input data

Create a new directory for the analysis

$ mkdir sci-rna-seq3_tutorial
$ cd sci-rna-seq3_tutorial

The SRA runs used as input are in sra_runs.txt and can be downloaded from SRA. Save the fastq files in ./data directory. The other necessary files including the splici index can be downloaded from here.

Generate a splici reference and index

Salmon aligns the reads against a reference index. Since a typical single-cell/nucelus experiment generates reads from spliced transcriptome and intronic sequences, we generate a splici(spliced + intronic) reference. This provides better resolution of reads while mapping.

For this tutorial we build index using Mouse GENCODE release M25 (GRCm25) as reference. Download the gene annotations and genome sequence.

$ wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.annotation.gtf.gz
$ wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/GRCm38.primary_assembly.genome.fa.gz

To know more about the splici index and generate it please refer to the splici tutorial or the quick guide.

The following parameters were used to generate the index

read_length = 56
flank_trim_length = 4

The generated splici index is

transcriptome_splici/splici_idx_mouse_gencodeM25/

Quantification

The quantification process is as follows:

salmon alevin peforms selective alignment to generate a RAD file
alevin-fry uses that output to generate-permit list, collate RAD file and perfrom quantification of the collated RAD file.

To read more about these steps please refer to alevin-fry documentation.

The following is a bash script that performs sample gene quantification. The data/sra_runs file contains the sra runs that make the input data. For each run, we generate RAD (Reduced Alignment Data) file using salmon alevin. To specify using sci-RNA-seq3 protocol we pass --sciseq3 flag. Here we use --justAlign flag to perform selective-alignment. For viewing a detailed list of help options please use salmon alevin -h.

To correct and associate barcodes to “true” barcodes, alevin-fry generate-permit-list is used. The list of allowed barcodes is passed using the -u flag. Then the rad file is collated using alevin-fry collate and quantification is performed using alevin-fry quant. The umi-resolution strategy cr-like is used to generate spliced aware quants which supports the unspliced, spliced and ambiguous (USA) quantification mode. The USA mode produces counts for all splicing statuses for each gene within a cell. The counts are generated as mtx files with cells as rows and genes as columns.

srr=( `cat sra_runs.txt` )
for file in ${srr[@]}
do    
    # salmon 
    salmon alevin -i transcriptome_splici/splici_idx_mouse_gencodeM25/ -l ISR  \
    -1 ./data/${file}_1.fastq.gz -2 ./data/fastq/${file}_2.fastq.gz  -o runs/$file --tgMap transcriptome_splici/transcriptome_splici_t2g.tsv -p 16 --sciseq3 --justAlign
    # alevin-fry
    alevin-fry generate-permit-list -d both -i runs/$file --output-dir runs/quant_$file -u cell_barcodes.txt
    alevin-fry collate -r runs/$file -t 16 -i runs/quant_$file
    alevin-fry quant -m transcriptome_splici/transcriptome_splici_t2g_3col.tsv -i runs/quant_$file -o runs/res_$file -t 16 -r cr-like --use-mtx
done

Preprocessing and further analysis in R

After gene quantification the processing and analysis can be peformed in R. This tutorial uses Seurat for this purpose. To see detailed instructions on single cell analysis, please refer to seurat vignettes.

Load the required libraries

Read the mtx counts file, cellbarcode and gene files.

In this tutorial we use 10 sample files from the article

nfiles <- 10
files <- fread("sra_runs.txt",header = F, col.names = "run")
  filePaths <- sapply(files$run[1:nfiles], function(run){ file.path("runs",paste0("res_",run),"alevin")
  })
quantFiles <- lapply(filePaths,function(fp){
  readMM(file.path(fp,"quants_mat.mtx"))
})
cellFiles <- lapply(filePaths,function(fp){
  fread(file.path(fp,"quants_mat_rows.txt"),header = F, col.names = "cellbarcode")
})
genes <- fread(file.path(filePaths[1],"quants_mat_cols.txt"),header = F, col.names = "geneId")

Prepare the counts and cell annotations files.

The cellbarcode in sci-RNA-seq3 can be 19 or 20bp long. In case it is 19bp, salmon adds an “AC” and if it is 20bp it adds “A” to make barcodes of same length while avoid accidentally matching separate barcodes. The files/cell_annotations.txt is the file from the article and contains the cell annotations. Only quantified runs are selected. Similar to salmon, “A” nucleotide is added to the annotation file barcode if it’s

cell_anno <- fread("files/cell_annotations.txt") # reorder and improve
cell_anno <- cell_anno[run %in% names(quantFiles)]
cellb <- sapply(cell_anno$cb,function(cb){
  if(nchar(cb)==20){
    return(paste0(cb,"A"))
  } else {
    return(paste0(cb,"AC"))
  }
})
cell_anno[,cb:=cellb]

# subset the quantFiles to include cells used in the study
for (i in 1:nfiles){
  rownames(quantFiles[[i]]) <- cellFiles[[i]]$cellbarcode
  colnames(quantFiles[[i]]) <- genes$geneId
  cbi <- cell_anno[run == names(quantFiles)[i], cb]
  rws <- cellFiles[[i]]$cellbarcode %in% cbi
  quantFiles[[i]] <- quantFiles[[i]][rws,]
}

# get matching annotations for each quantFile
cell_annotations <- lapply(1:nfiles, function(i){
  c_anno <- cell_anno[run == names(quantFiles)[i],]
  cb_matches <- match(rownames(quantFiles[[i]]), c_anno$cb)
  cb_matches <- cb_matches[!is.na(cb_matches)]
  return(c_anno[cb_matches])
  })

gene_annotations <- fread("files/gene_annotation_table.txt")
g_matches <- match(colnames(quantFiles[[1]]),gene_annotations$Geneid)
gene_annotations <- gene_annotations[g_matches]

Compute the quants as sum of spliced, unspliced and ambiguous.

To be consistent with counts produced by the article. This is a slow step, so the quantFiles object is saved as RDS format. If you want to save time, you can skip this step and read the RDS file (provided in the download package) directly.

for(i in 1:nfiles){
  print(i)
  quantFiles[[i]][,1:55401] <- quantFiles[[i]][,1:55401]+quantFiles[[i]][,55402:110802]+quantFiles[[i]][,110803:166203]
  quantFiles[[i]] <- quantFiles[[i]][,1:55401]
}
saveRDS(quantFiles,"files/quantFiles.rds")

Concatenate the quant and cell annotation files.

All the quant and cell annotation files are combined.

quantFiles <- readRDS("files/quantFiles.rds")
quants <- do.call(rbind,quantFiles)
cell_annotations <- do.call(rbind,cell_annotations)

With the final quant and cell annotation files created, now create the seurat object.

Select cells with counts of at least 200 genes and genes present in at least 3 cells.

library(Seurat)

## Attaching SeuratObject

# prepare cell and gene annotations
pd <- as.data.frame(cell_annotations)
rownames(pd) <- cell_annotations$sample
rownames(quants) <- cell_annotations$sample
colnames(quants) <- gene_annotations$gene_short_name[1:55401]
# filter cells with less than 200 features and features present in less than 3 cells
sobj <- CreateSeuratObject(counts = t(as.matrix(quants)), project = "demo", meta.data = pd, min.features = 200, min.cells = 3)

## Warning: Non-unique features (rownames) present in the input matrix, making
## unique

## Warning: The following arguments are not used: row.names

sobj

## An object of class Seurat 
## 24595 features across 24500 samples within 1 assay 
## Active assay: RNA (24595 features, 0 variable features)

Normalize the data and find 2000 most variable genes.

Using highly variable genes helps strengthen the biological signal in downstream analysis.

sobj <- NormalizeData(sobj)
sobj <- FindVariableFeatures(sobj, selection.method = "vst", nfeatures = 2000)

Scale and center features, then perform PCA

This linear transformation is a standard precursor to PCA

all.genes <- rownames(sobj)
sobj <- ScaleData(sobj, features = all.genes)

## Centering and scaling data matrix

sobj <- RunPCA(sobj, features = VariableFeatures(object = sobj))

## PC_ 1 
## Positive:  Nrxn1, Ctnna2, Fam155a, Myt1l, Cadps, Nrg3, Cntnap2, Celf4, Akap6, Rbfox1 
##     Map2, Anks1b, Syt1, Nav3, Ccser1, Nrxn3, Phactr1, Hecw1, Xkr4, Dscam 
##     Snhg11, Gm26871, Stmn2, Gabrb3, Rab3c, Ppfia2, Rtn1, Erc2, Tenm2, Cadm2 
## Negative:  Gpc3, Cped1, Pdzrn3, Ror1, Tmem132c, Mecom, Hmcn1, Col23a1, Prrx1, Col26a1 
##     Trps1, Fli1, Dlc1, Col3a1, Sema3a, Pdgfc, Prtg, Bmpr1b, Colec12, Prex2 
##     Col1a2, Fn1, Nhsl2, Svil, Palld, Ghr, Bnc2, Ptn, Etl4, Ltbp1 
## PC_ 2 
## Positive:  Adgrv1, Rfx4, Dcc, Sox2ot, Glra1, Nckap5, Npas3, Mir9-3hg, C130071C03Rik, Nav2 
##     Sema5b, Ildr2, Gm3764, Gm32061, Pax3, Gm44593, Slc35f1, Srrm4, Pantr1, Zfp536 
##     Dach1, Ednrb, Igdcc3, Ptprz1, Ptprn2, Erbb4, Kcnk10, Zswim5, Rhbdl3, Frmd4b 
## Negative:  Gpc3, Meg3, Bnc2, Prkg1, Dlc1, Col1a2, Col3a1, Ror1, Slit3, Tshz2 
##     Sema3a, Rian, Kif26b, Alcam, Cped1, Ghr, Col26a1, Ldb2, Ptprd, Hmcn1 
##     Prickle1, Colec12, Pdzrn3, Sgcd, Sdk1, Col5a2, Prrx1, Sema5a, Pcdh7, Flrt2 
## PC_ 3 
## Positive:  Slc4a1, Slc25a21, Sptb, Ank1, Kel, Rhag, Ermap, Ikzf1, Spta1, Pkhd1l1 
##     Snca, Marchf3, Hbb-bh1, Slc39a8, Ell2, Hba-x, Cpox, Gypa, Hbb-y, Tspan33 
##     Hba-a2, Blvrb, Car2, Slc25a37, 5430401H09Rik, Hba-a1, Slc16a10, Rhd, Gm20236, Hemgn 
## Negative:  Npas3, Adgrv1, Rfx4, Nckap5, Sox2ot, Sema5b, Dach1, Pax3, Slc35f1, Erbb4 
##     Ildr2, Itgb8, Grid2, Zfp536, Igdcc3, Ptprz1, Shroom3, Adgrb3, Mir9-3hg, Epha4 
##     Ntn1, Bmpr1b, Ptn, Dcc, C130071C03Rik, Trps1, Gm3764, Plch1, Foxp2, Prtg 
## PC_ 4 
## Positive:  Flt1, Emcn, Kdr, Cyyr1, Ptprb, Rasgrp3, Adgrf5, Prkch, Ptprm, Plxnd1 
##     Pecam1, Thsd1, Tek, Lcp1, Shank3, Eng, Elmo1, Adgrl4, Arap3, Dysf 
##     Arhgap18, Cdh5, Cd93, Tspan18, Rapgef5, Calcrl, Itga2, Tie1, Zfp366, Stab1 
## Negative:  Slc25a21, Slc4a1, Gpc3, Prrx1, Sptb, Ank1, Bnc2, Kel, Ermap, Rhag 
##     Col1a2, Sema5a, Pkhd1l1, Kif26b, Hpse2, Sema3a, Snca, Ror1, Spta1, Cpox 
##     Robo2, Col26a1, Aff2, Col3a1, Gypa, Pdgfra, Tmem132c, Blvrb, Tspan33, Marchf3 
## PC_ 5 
## Positive:  Npas3, Adgrv1, Slc25a21, Fgf14, Slc4a1, Grid2, Erbb4, Ank1, Adgrb3, Sptb 
##     Sox6, Kel, Rfx4, Nckap5, Slc35f1, Ptprz1, Marchf3, Pkhd1l1, Hbb-bh1, Ednrb 
##     Hbb-y, Ano1, Shroom3, Ildr2, Sox2ot, Pip5k1b, Rhag, Spta1, Atp8a1, Ermap 
## Negative:  D130009I18Rik, Ebf1, Cntn2, Thsd7b, Ebf2, Kcnmb2, D130079A08Rik, Elavl4, St18, Cacna2d1 
##     Ppp1r1c, Ebf3, Clvs1, Srrm4, Thsd7a, Chst8, Tmem163, Robo3, Srrm3, Dcx 
##     Stmn2, Pex5l, Ptprr, Pou4f1, Eya2, Isl1, Gm26871, Gap43, Neurod1, Slc17a6

Determine the data dimensionality

This helps determine how many PCs are required to capture sufficient variation in the data

ElbowPlot(sobj)

It is advisable to err on the higher side of dimensionality parameter, so here we proceed with top 15 PCs.

Find neighbors and clusters

FindNeigbors constructs a euclidean distance based KNN graph of cells in the PCA space. The edge weights are based on Jaccard similarity which represent shared overlap in their local neighborhoods. FindClusters uses Louvain algorithm to perform clustering with the resolution parameter controlling the granularity.

sobj <- FindNeighbors(sobj, dims = 1:15)

## Computing nearest neighbor graph

## Computing SNN

sobj <- FindClusters(sobj, resolution = 0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 24500
## Number of edges: 787163
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8936
## Number of communities: 14
## Elapsed time: 4 seconds

Run UMAP

Use non-linear dimensionality reduction method UMAP to visualize and explore the dataset

sobj <- RunUMAP(sobj, dims = 1:15)

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 13:31:22 UMAP embedding parameters a = 0.9922 b = 1.112

## 13:31:22 Read 24500 rows and found 15 numeric columns

## 13:31:22 Using Annoy for neighbor search, n_neighbors = 30

## 13:31:22 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 13:31:25 Writing NN index file to temp file /tmp/RtmpTu50ER/file33ce1162d9f439
## 13:31:25 Searching Annoy index using 1 thread, search_k = 3000
## 13:31:35 Annoy recall = 100%
## 13:31:37 Commencing smooth kNN distance calibration using 1 thread
## 13:31:38 Initializing from normalized Laplacian + noise
## 13:31:39 Commencing optimization for 200 epochs, with 1073226 positive edges
## 13:31:51 Optimization finished

DimPlot(sobj, reduction = "umap")

Find marker genes for clusters

This will identify only positive marker genes for each cluster compared to all remaining cells with at least 0.25 log fold change and detected in at least 0.25 of the cells in comparion groups.

sobj.markers <- FindAllMarkers(sobj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

## Calculating cluster 0

## Calculating cluster 1

## Calculating cluster 2

## Calculating cluster 3

## Calculating cluster 4

## Calculating cluster 5

## Calculating cluster 6

## Calculating cluster 7

## Calculating cluster 8

## Calculating cluster 9

## Calculating cluster 10

## Calculating cluster 11

## Calculating cluster 12

## Calculating cluster 13

mtable <- sobj.markers %>%
    group_by(cluster) %>%
    top_n(n = 2, wt = avg_log2FC)

head(mtable)

## Registered S3 method overwritten by 'cli':
##   method     from         
##   print.boxx spatstat.geom

## # A tibble: 6 x 7
## # Groups:   cluster [3]
##   p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene  
##   <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr> 
## 1     0       3.20 0.564 0.072         0 0       Adgrv1
## 2     0       2.68 0.32  0.022         0 0       Rfx4  
## 3     0       2.23 0.404 0.092         0 1       Cped1 
## 4     0       2.00 0.334 0.081         0 1       Prrx1 
## 5     0       1.94 0.407 0.112         0 2       Col1a2
## 6     0       1.88 0.362 0.093         0 2       Col3a1

Session info

sessionInfo()

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.1 LTS
## 
## Matrix products: default
## BLAS:   /opt/R-4.0.3/lib/R/lib/libRblas.so
## LAPACK: /opt/R-4.0.3/lib/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] SeuratObject_4.0.2 Seurat_4.0.4       dplyr_1.0.3        data.table_1.13.2 
## [5] Matrix_1.3-4      
## 
## loaded via a namespace (and not attached):
##   [1] nlme_3.1-149          spatstat.sparse_2.0-0 matrixStats_0.57.0   
##   [4] RcppAnnoy_0.0.19      RColorBrewer_1.1-2    httr_1.4.2           
##   [7] sctransform_0.3.2     tools_4.0.3           utf8_1.1.4           
##  [10] R6_2.5.0              irlba_2.3.3           rpart_4.1-15         
##  [13] KernSmooth_2.23-17    uwot_0.1.10           mgcv_1.8-33          
##  [16] DBI_1.1.1             lazyeval_0.2.2        colorspace_2.0-0     
##  [19] tidyselect_1.1.0      gridExtra_2.3         compiler_4.0.3       
##  [22] cli_2.2.0             plotly_4.9.3          labeling_0.4.2       
##  [25] scales_1.1.1          lmtest_0.9-38         spatstat.data_2.1-0  
##  [28] ggridges_0.5.3        pbapply_1.4-3         goftest_1.2-2        
##  [31] stringr_1.4.0         digest_0.6.27         spatstat.utils_2.2-0 
##  [34] rmarkdown_2.5         pkgconfig_2.0.3       htmltools_0.5.1.1    
##  [37] parallelly_1.27.0     limma_3.46.0          fastmap_1.0.1        
##  [40] htmlwidgets_1.5.3     rlang_0.4.10          shiny_1.5.0          
##  [43] farver_2.0.3          generics_0.1.0        zoo_1.8-9            
##  [46] jsonlite_1.7.2        ica_1.0-2             magrittr_2.0.1       
##  [49] patchwork_1.1.1       fansi_0.4.1           Rcpp_1.0.7           
##  [52] munsell_0.5.0         abind_1.4-5           reticulate_1.20      
##  [55] lifecycle_1.0.0       stringi_1.5.3         yaml_2.2.1           
##  [58] MASS_7.3-53           Rtsne_0.15            plyr_1.8.6           
##  [61] grid_4.0.3            parallel_4.0.3        listenv_0.8.0        
##  [64] promises_1.1.1        ggrepel_0.9.1         crayon_1.3.4         
##  [67] deldir_0.2-10         miniUI_0.1.1.1        lattice_0.20-41      
##  [70] cowplot_1.1.1         splines_4.0.3         tensor_1.5           
##  [73] knitr_1.30            pillar_1.4.7          igraph_1.2.6         
##  [76] spatstat.geom_2.2-2   future.apply_1.8.1    reshape2_1.4.4       
##  [79] codetools_0.2-16      leiden_0.3.9          glue_1.4.2           
##  [82] evaluate_0.14         png_0.1-7             vctrs_0.3.5          
##  [85] httpuv_1.5.5          polyclip_1.10-0       gtable_0.3.0         
##  [88] RANN_2.6.1            purrr_0.3.4           spatstat.core_2.3-0  
##  [91] tidyr_1.1.2           scattermore_0.7       future_1.22.1        
##  [94] assertthat_0.2.1      ggplot2_3.3.3         xfun_0.19            
##  [97] mime_0.9              xtable_1.8-4          RSpectra_0.16-0      
## [100] later_1.1.0.1         survival_3.2-7        viridisLite_0.3.0    
## [103] tibble_3.0.4          cluster_2.1.0         globals_0.14.0       
## [106] fitdistrplus_1.1-5    ellipsis_0.3.1        ROCR_1.0-11