My Profile Photo

alevin-fry-tutorials


Tutorials for using the alevin-fry single-cell RNA-seq pipeline


Spatial single-cell quantification with alevin-fry

In this tutorial we will look at how to process a spatial gene expression (Visium) experiment using an alevin-fry based pipeline. Note : This tutorial is meant to mimic the original tutorial for spatial expression analysis with alevin written by Avi Srivastava. Thus, most of the descriptive text and commands are taken directly from that tutorial. However, here we will be analyzing the data using the alevin-fry pipeline instead of alevin.

af_spatial.utf8

Introduction

Slide-seq V1-V2 (Rodriques, Samuel G., et al., Stickels, Robert R., et al., 2020) and 10x Genomics’ Visium Gene Expression has enabled the transcriptome-wide measurements of the molecular signals in a tissue with spatial localization at single-cell level. From the perspective of quantification, these spatial single-cell technologies require the generation of a gene-count matrix for each spot on the tissue, which are identified by spatially marked cellular barcodes. Based on the size of a spot & other technological limitations, each spot potentially contains multiple cells and, unlike single-cell sequencing, a single barcode can represent multiple cells. In this tutorial we generate the spatially-resolved gene-count matrix for each spot using alevin-fry and we visualize it using Seurat.

In this tutorial, we describe how to preprocess this data using alevin-fry, and how to load the data into R using Seurat and perform some basic analyses. Alevin-fry makes it easy to get from the raw FASTQ files to an accurate count matrix fast. For example, processing the data for this tutorial took less than 15 minutes (using 16 threads, and excluding index construction). The sections below contain code chunks executed in different languages, indicated with different background colors. R code is represented by grey boxes and shell commands by blue boxes.

Index the reference sequences

In order to quantify the abundances with alevin-fry, we need to generate the index of the reference sequences. We already saw how to do this for the reference transcriptome in earlier tutorials, and here we will assume that you’ve built the splici transcriptome reference for mouse. Specifically, you can obtain the ref-gex-mm10-2020-A reference from the 10x website, and use the make_splici_txome function from usefulaf utilities to generate the transcriptome and 2 and 3 column transcript-to-gene mappings. Then, index the resulting transcriptome_splici.fa with salmon to obtain the index we’ll use for mapping. In subsequent steps, we’ll assume the index resides in refdata-gex-mm10-2020-A/sal_index, but just change the index path below if you have it somewhere different.

Download the raw spatial sequencing data

We download the FASTQ files for the Mouse Brain Serial Section 1 Visium experiment. The raw reads files are available on the website with the hyperlinked FASTQs. However, since there is a registration requirements from the source, we are not providing the direct download link. Once downloaded the reads would look like the following:

tar -xvf V1_Mouse_Brain_Sagittal_Anterior_fastqs.tar

We also need the spatial imaging data which contains the spatial coordinates for each cellular barcode on the tissue slide (similar data can be found in the Slides-seq V1-V2). The data can be found on the same web link under the name Spatial imaging data. Once downloaded the content of the file looks as follows:

tar -xvzf V1_Mouse_Brain_Sagittal_Anterior_spatial.tar.gz

Map the reads with alevin

Next we’ll map the reads to the splici mouse index using alevin. Note: Visium uses 16 length cellular barcode (cb) and 12 length UMI, so we can just use the --chromiumV3 flag. The mapping command we will use looks like this:

salmon alevin -l ISR -i refdata-gex-mm10-2020-A/sal_index \
  -1 V1_Mouse_Brain_Sagittal_Anterior_Section_1_fastqs/V1_Mouse_Brain_Sagittal_Anterior_Section_1_S7_L001_R1_001.fastq.gz \
  V1_Mouse_Brain_Sagittal_Anterior_Section_1_fastqs/V1_Mouse_Brain_Sagittal_Anterior_Section_1_S7_L002_R1_001.fastq.gz \
  -2 V1_Mouse_Brain_Sagittal_Anterior_Section_1_fastqs/V1_Mouse_Brain_Sagittal_Anterior_Section_1_S7_L001_R2_001.fastq.gz \
  V1_Mouse_Brain_Sagittal_Anterior_Section_1_fastqs/V1_Mouse_Brain_Sagittal_Anterior_Section_1_S7_L002_R2_001.fastq.gz \
  --chromiumV3 \
  -o alevin_out -p 16 --sketch

Once this is completed, the mapping information that we need will exist in the directory alevin_out.

Quantify with alevin-fry

Next, we’ll follow the familiar steps to quantify the mapped reads with alevin-fry. We’ll use the knee method for generating the permit list and, since we are using USA mode quantifications to appropriately tag reads of intronic origin, we’ll use the cr-like UMI resolution scheme.

alevin-fry generate-permit-list -k -d fw -i alevin_out -o alevin_out/permitlist_knee/ 

alevin-fry collate -i alevin_out/permitlist_knee/ -r alevin_out -t 16 

alevin-fry quant -r cr-like --use-mtx -m mouse_ref/t2g_3col.tsv -i alevin_out/permitlist_knee/ -o alevin_out/quant_knee_cr-like.usa -t 16 

where, above refdata-gex-mm10-2020-A is the path to the directory where you generated the splici reference transcriptome, including the 2 and 3-column transcript to gene mappings. Note : Since we are using the splici transcriptome, we make use here of the 3-column format transcript to gene tsv file (passed via the -m option to quant) to allow us to differentiate between UMIs that likely originate from spliced or unspliced molecules.

At this point, we have mapped and quantified the data. The quantification output will live in alevin_out/quant_knee_cr-like.usa.

Loading data into R

We start by loading the required R packages.

suppressPackageStartupMessages({
    library(devtools)
    library(ggplot2)
    library(SingleCellExperiment)
    library(Seurat)
    library(patchwork)
})

# set the seed 
set.seed(271828)

This is the function we’ll use to load our quantification matrices into R.

#' Read alevin-fry quantifications into a SingleCellExperiment object
load_fry <- function(frydir, which_counts = c('S', 'A'), verbose = FALSE) {
  suppressPackageStartupMessages({
    library(rjson)
    library(Matrix)
    library(SingleCellExperiment)
  })
  
  # read in metadata
  qfile <- file.path(frydir, "quant.json")
  if (!file.exists(qfile)) {
    qfile <- file.path(frydir, "meta_info.json")
  }
  meta_info <- fromJSON(file = qfile)
  ng <- meta_info$num_genes
  usa_mode <- meta_info$usa_mode
  
  if (usa_mode) {
    if (length(which_counts) == 0) {
      stop("Please at least provide one status in 'U' 'S' 'A' ")
    }
    if (verbose) {
      message("processing input in USA mode, will return ", paste(which_counts, collapse = '+'))
    }
  } else if (verbose) {
    message("processing input in standard mode, will return spliced count")
  }

  # read in count matrix
  af_raw <- readMM(file = file.path(frydir, "alevin", "quants_mat.mtx"))
  # if usa mode, each gene gets 3 rows, so the actual number of genes is ng/3
  if (usa_mode) {
    if (ng %% 3 != 0) {
      stop("The number of quantified targets is not a multiple of 3")
    }
    ng <- as.integer(ng/3)
  }
  
  # read in gene name file and cell barcode file
  afg <- read.csv(file.path(frydir, "alevin", "quants_mat_cols.txt"), 
                  strip.white = TRUE, header = FALSE, nrows = ng, 
                  col.names = c("gene_ids"), row.names = 1)
  afc <- read.csv(file.path(frydir, "alevin", "quants_mat_rows.txt"), 
                  strip.white = TRUE, header = FALSE,
                  col.names = c("barcodes"), row.names = 1)

  # if in usa_mode, sum up counts in different status according to which_counts
  if (usa_mode) {
    rd <- list("S" = seq(1, ng), "U" =  seq(ng + 1, 2 * ng),
               "A" =  seq(2 * ng + 1, 3 * ng))
    o <- af_raw[, rd[[which_counts[1]]], drop = FALSE]
    for (wc in which_counts[-1]) {
      o <- o + af_raw[, rd[[wc]], drop = FALSE]
    }
  } else {
    o <- af_raw
  }
  
  # create SingleCellExperiment object
  sce <- SingleCellExperiment(list(counts = t(o)),
                              colData = afc,
                              rowData = afg
  )
  sce
}

Load the quantifications

Next, we’ll load the quantifications into a SingleCellExperiemnt object. The quantifications will be loaded from the USA-mode matrix, and we’ll keep the S (confidently-spliced) and A (ambiguous) UMIs for each gene.

quant_sce <- load_fry('alevin_out/quant_knee_cr-like.usa', verbose=TRUE)
## processing input in USA mode, will return S+A

Because we’ve named each gene with its Entrez ID, but for later analysis we’ll want the gene names, we perform the mapping here (before we put the data into a Seurat object). The gene id to name mapping was obtained directly from our GTF file, but you can find the tsv file that results here.

gid_to_gname <- read.table('mouse_gid_to_gname.tsv')
rownames(quant_sce) <- gid_to_gname$V2[match(rownames(quant_sce), gid_to_gname$V1)]

Now that everything is prepared as we need, we create the Seurat object.

assay <- "Spatial"
brain <- CreateSeuratObject(counts = counts(quant_sce), project = "SPATIAL", assay = assay)
## Warning: Non-unique features (rownames) present in the input matrix, making
## unique
## Warning: The following arguments are not used: row.names
brain
## An object of class Seurat 
## 32085 features across 2907 samples within 1 assay 
## Active assay: Spatial (32085 features, 0 variable features)

Next, we load the spatial image data to extract the spatial 2D coordinates of each cellular barcode, and add the metadata to the Seurat object.

image.data <- Read10X_Image("spatial/")
# Since the names of alevin cb is different from 10x
# we rename the cells and filter the image data
# to have the metadata for only quantified cells
rownames(image.data@coordinates) <- gsub("-1", "", rownames(image.data@coordinates))
common.cells <- intersect(Cells(x = brain), rownames(image.data@coordinates))

image.data <- image.data[common.cells]
brain <- subset(brain, cells = common.cells)

# adding image data to Seurat object
DefaultAssay(object = image.data) <- "Spatial"
brain[['slice']] <- image.data

Visualization

Once we have the Seurat object we can use the exciting features made available through the Seurat package. We follow the workflow of the spatial vignette provided by the Seurat package for visualizing the Spatial data.

plot1 <- VlnPlot(brain, features = "nCount_Spatial", pt.size = 0.1) + NoLegend()
plot2 <- SpatialFeaturePlot(brain, features = "nCount_Spatial") + theme(legend.position = "right")
wrap_plots(plot1, plot2)

We can also perform the gene expression visualization after normalizing the data through scTransform.

brain <- SCTransform(brain, assay = "Spatial", verbose = FALSE)
SpatialFeaturePlot(brain, features = c("Hpca", "Ttr"))

Next, we run dimensionality reduction, clustering, and UMAP visualization.

brain <- RunPCA(brain, assay = "SCT", verbose = FALSE)
brain <- FindNeighbors(brain, reduction = "pca", dims = 1:30)
## Computing nearest neighbor graph
## Computing SNN
brain <- FindClusters(brain, resolution = 0.45, verbose = FALSE)
brain <- RunUMAP(brain, reduction = "pca", dims = 1:30)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 23:01:57 UMAP embedding parameters a = 0.9922 b = 1.112
## 23:01:57 Read 2592 rows and found 30 numeric columns
## 23:01:57 Using Annoy for neighbor search, n_neighbors = 30
## 23:01:57 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 23:01:57 Writing NN index file to temp file /var/folders/pf/9gdjl11d08x8n75s5c6j0jgr0000gn/T//RtmpXJIBLh/filea99244c1059
## 23:01:57 Searching Annoy index using 1 thread, search_k = 3000
## 23:01:58 Annoy recall = 100%
## 23:01:58 Commencing smooth kNN distance calibration using 1 thread
## 23:01:59 Initializing from normalized Laplacian + noise
## 23:01:59 Commencing optimization for 500 epochs, with 100792 positive edges
## 23:02:04 Optimization finished
p1 <- DimPlot(brain, reduction = "umap", label = TRUE)
p2 <- SpatialDimPlot(brain, label = TRUE, label.size = 3)
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
p1 + p2

One of the most exciting parts of this type of spatial data is the visualization of the spatial localization of the individual clusters. We can look at how a few of the discovered clusters are distributed spatially using the following commands.

SpatialDimPlot(brain, cells.highlight = CellsByIdentities(object = brain, 
                                                          idents = c(2, 3, 5, 0, 6, 7)), 
               facet.highlight = TRUE, ncol = 3)

Thanks for reading this far! At this point, the possibilities of how to slice, dice and analyze the spatial gene expression data are up to you.

Session info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] Matrix_1.3-3                rjson_0.2.20               
##  [3] patchwork_1.1.1             SeuratObject_4.0.1         
##  [5] Seurat_4.0.1                SingleCellExperiment_1.12.0
##  [7] SummarizedExperiment_1.20.0 Biobase_2.50.0             
##  [9] GenomicRanges_1.42.0        GenomeInfoDb_1.26.2        
## [11] IRanges_2.24.1              S4Vectors_0.28.1           
## [13] BiocGenerics_0.36.0         MatrixGenerics_1.2.0       
## [15] matrixStats_0.57.0          ggplot2_3.3.2              
## [17] devtools_2.3.2              usethis_2.0.0              
## 
## loaded via a namespace (and not attached):
##   [1] plyr_1.8.6             igraph_1.2.6           lazyeval_0.2.2        
##   [4] splines_4.0.3          listenv_0.8.0          scattermore_0.7       
##   [7] digest_0.6.27          htmltools_0.5.1.1      fansi_0.4.1           
##  [10] magrittr_2.0.1         memoise_1.1.0          tensor_1.5            
##  [13] cluster_2.1.0          ROCR_1.0-11            remotes_2.2.0         
##  [16] globals_0.14.0         spatstat.sparse_2.0-0  prettyunits_1.1.1     
##  [19] colorspace_2.0-0       ggrepel_0.9.1          xfun_0.19             
##  [22] dplyr_1.0.2            callr_3.5.1            crayon_1.3.4          
##  [25] RCurl_1.98-1.2         jsonlite_1.7.2         spatstat.data_2.1-0   
##  [28] survival_3.2-7         zoo_1.8-9              glue_1.4.2            
##  [31] polyclip_1.10-0        gtable_0.3.0           zlibbioc_1.36.0       
##  [34] XVector_0.30.0         leiden_0.3.7           DelayedArray_0.16.0   
##  [37] pkgbuild_1.1.0         future.apply_1.7.0     abind_1.4-5           
##  [40] scales_1.1.1           DBI_1.1.0              miniUI_0.1.1.1        
##  [43] Rcpp_1.0.5             viridisLite_0.4.0      xtable_1.8-4          
##  [46] reticulate_1.20        spatstat.core_2.1-2    htmlwidgets_1.5.3     
##  [49] httr_1.4.2             RColorBrewer_1.1-2     ellipsis_0.3.1        
##  [52] ica_1.0-2              farver_2.0.3           pkgconfig_2.0.3       
##  [55] uwot_0.1.10            deldir_0.2-10          labeling_0.4.2        
##  [58] tidyselect_1.1.0       rlang_0.4.11           reshape2_1.4.4        
##  [61] later_1.1.0.1          munsell_0.5.0          tools_4.0.3           
##  [64] cli_2.2.0              generics_0.1.0         ggridges_0.5.3        
##  [67] evaluate_0.14          stringr_1.4.0          fastmap_1.1.0         
##  [70] yaml_2.2.1             goftest_1.2-2          processx_3.4.5        
##  [73] knitr_1.30             fs_1.5.0               fitdistrplus_1.1-3    
##  [76] purrr_0.3.4            RANN_2.6.1             pbapply_1.4-3         
##  [79] future_1.21.0          nlme_3.1-151           mime_0.9              
##  [82] compiler_4.0.3         plotly_4.9.3           png_0.1-7             
##  [85] testthat_3.0.0         spatstat.utils_2.1-0   tibble_3.0.4          
##  [88] stringi_1.5.3          ps_1.5.0               RSpectra_0.16-0       
##  [91] desc_1.2.0             lattice_0.20-41        vctrs_0.3.5           
##  [94] pillar_1.4.7           lifecycle_0.2.0        spatstat.geom_2.1-0   
##  [97] lmtest_0.9-38          RcppAnnoy_0.0.18       data.table_1.14.0     
## [100] cowplot_1.1.1          bitops_1.0-6           irlba_2.3.3           
## [103] httpuv_1.6.1           R6_2.5.0               promises_1.1.1        
## [106] KernSmooth_2.23-18     gridExtra_2.3          parallelly_1.22.0     
## [109] sessioninfo_1.1.1      codetools_0.2-18       MASS_7.3-53           
## [112] assertthat_0.2.1       pkgload_1.1.0          rprojroot_2.0.2       
## [115] withr_2.3.0            sctransform_0.3.2      GenomeInfoDbData_1.2.4
## [118] mgcv_1.8-33            grid_4.0.3             rpart_4.1-15          
## [121] tidyr_1.1.2            rmarkdown_2.6          Rtsne_0.15            
## [124] shiny_1.6.0