6 Apr 2020 • on alevin

Spatial Alevin

Spatial Single-Cell Quantification with alevin

Quantify Spatial Single-Cell Data with alevin

Slide-seq V1-V2 (Rodriques, Samuel G., et al., 2019, Stickels, Robert R., et al., 2020) and 10x Genomics’ Visium Gene Expression has enable the transcriptome-wide measurements of the molecular signals in a tissue with spatial localization at single-cell level. From a quantification point of view, these spatial single-cell technologies requires the generation of a gene-count matrix for each spot on the tissue, which are identified by spatially marked cellular barcodes. Based on the size of a spot & other technological limitations,each spot potentially contains multiple cells and unlike single-cell sequencing a single barcode can contains multiple cells. In this tutorial we generate the spatially-resolved gene-count matrix for each spot using alevin and we visualize it using Seurat.

The sections below contain code chunks executed in different languages, indicated with different background colors. R code is represented by grey boxes and shell commands by blue boxes.

Step 1. Index the reference sequences

In order to quantify the abundances with alevin, we need to generate the index of the reference sequences. Here, we will use prebuild indices from refgenie but in case the index of a reference species is not available, follow the Setting Up Resources tutorial to generate the index.

Please note, based on the availability of the computational resources indexing of the reference sequences can be done at multiple level of accuracy, for more information please check our preprint. In this tutorial we are using sagital mouse brain data from 10X Genomics Visium technology and the selective alignment (SA) index of salmon for mouse reference. The following other options are also available for salmon mouse reference index, these are in the increasing order of their size but also in the improvement of the accuracy:

1.) cDNA index: 455.5MB.
2.) SA index: 806.1MB.
3.) SAF index: 12.1GB.

We start by downloading the SA salmon index from the refgenie website and unpacking the index.

wget --content-disposition  -nv http://refgenomes.databio.org/v2/asset/mm10/salmon_partial_sa_index/archive?tag=default

tar -xvzf salmon_partial_sa_index__default.tgz

## 2020-04-06 13:16:35 URL:http://refgenomes.databio.org/v2/asset/mm10/salmon_partial_sa_index/archive?tag=default [845304904/845304904] -> "salmon_partial_sa_index__default.tgz.1" [1]
## salmon_partial_sa_index/
## salmon_partial_sa_index/complete_ref_lens.bin
## salmon_partial_sa_index/mphf.bin
## salmon_partial_sa_index/rank.bin
## salmon_partial_sa_index/duplicate_clusters.tsv
## salmon_partial_sa_index/gentrome.fa
## salmon_partial_sa_index/mm10.gtf
## salmon_partial_sa_index/pre_indexing.log
## salmon_partial_sa_index/reflengths.bin
## salmon_partial_sa_index/seq.bin
## salmon_partial_sa_index/refAccumLengths.bin
## salmon_partial_sa_index/ctg_offsets.bin
## salmon_partial_sa_index/pos.bin
## salmon_partial_sa_index/decoys.txt
## salmon_partial_sa_index/eqtable.bin
## salmon_partial_sa_index/versionInfo.json
## salmon_partial_sa_index/refseq.bin
## salmon_partial_sa_index/info.json
## salmon_partial_sa_index/ctable.bin
## salmon_partial_sa_index/ref_indexing.log

We also need a tsv file that maps the transcript names to the gene names. The refgenie SA index already contains the reference file used for indexing and we can generate the map file from that as follows:

grep "^>" salmon_partial_sa_index/gentrome.fa | cut -d " " -f 1,7 --output-delimiter=$'\t' - | sed 's/[>"gene_symbol:"]//g' > txp2gene.tsv

Step 2. Download the raw spatial sequencing data

We download the fastq files for the Mouse Brain Serial Section 1 Visium experiment. The raw reads files are available on the website with the hyperlink FASTQ. However, since there is a registration requirements from the source, we are not providing the direct download link. Once downloaded the reads would look like the following:

tar -xvf V1_Mouse_Brain_Sagittal_Anterior_fastqs.tar

## V1_Mouse_Brain_Sagittal_Anterior_Section_1_fastqs/
## V1_Mouse_Brain_Sagittal_Anterior_Section_1_fastqs/V1_Mouse_Brain_Sagittal_Anterior_Section_1_S7_L002_R1_001.fastq.gz
## V1_Mouse_Brain_Sagittal_Anterior_Section_1_fastqs/V1_Mouse_Brain_Sagittal_Anterior_Section_1_S7_L001_I2_001.fastq.gz
## V1_Mouse_Brain_Sagittal_Anterior_Section_1_fastqs/V1_Mouse_Brain_Sagittal_Anterior_Section_1_S7_L002_R2_001.fastq.gz
## V1_Mouse_Brain_Sagittal_Anterior_Section_1_fastqs/V1_Mouse_Brain_Sagittal_Anterior_Section_1_S7_L001_I1_001.fastq.gz
## V1_Mouse_Brain_Sagittal_Anterior_Section_1_fastqs/V1_Mouse_Brain_Sagittal_Anterior_Section_1_S7_L001_R2_001.fastq.gz
## V1_Mouse_Brain_Sagittal_Anterior_Section_1_fastqs/V1_Mouse_Brain_Sagittal_Anterior_Section_1_S7_L002_I1_001.fastq.gz
## V1_Mouse_Brain_Sagittal_Anterior_Section_1_fastqs/V1_Mouse_Brain_Sagittal_Anterior_Section_1_S7_L001_R1_001.fastq.gz
## V1_Mouse_Brain_Sagittal_Anterior_Section_1_fastqs/V1_Mouse_Brain_Sagittal_Anterior_Section_1_S7_L002_I2_001.fastq.gz

We also need the spatial imaging data which contains the spatial coordinates for each cellular barcode on the tissue slide (similar data can be found in the Slides-seq V1-2). The data can be found on the same web link under the name Spatial imaging data. Once downloaded the content of the file looks as follows:

tar -xvzf V1_Mouse_Brain_Sagittal_Anterior_spatial.tar.gz

## spatial/
## spatial/tissue_lowres_image.png
## spatial/tissue_positions_list.csv
## spatial/tissue_hires_image.png
## spatial/scalefactors_json.json
## spatial/detected_tissue_image.jpg
## spatial/aligned_fiducials.jpg

Step 3. Quantify with alevin

we run alevin to quantify the spatial gene abundances based on the index above.
Note: Visium uses 16 length cellular barcode (cb) and 12 length UMI which requires --chorimiumV3 flag, however this should be swapped with either --chromium for 16 length cb, 10 length UMI or --end 5 --barcodeLength X --umiLength Y in case the technology uses different legnths for cb,UMI; where X is the cb length and Y is the UMI length

module load Salmon/1.1.0 && 
salmon alevin -l ISR -i salmon_partial_sa_index \
-1 V1_Mouse_Brain_Sagittal_Anterior_Section_1_S7_L001_R1_001.fastq.gz \
V1_Mouse_Brain_Sagittal_Anterior_Section_1_S7_L002_R1_001.fastq.gz \
-2 V1_Mouse_Brain_Sagittal_Anterior_Section_1_S7_L001_R2_001.fastq.gz \
V1_Mouse_Brain_Sagittal_Anterior_Section_1_S7_L002_R2_001.fastq.gz \
-o alevin_out -p 16 --tgMap txp2gene.tsv \
--chromiumV3 --dumpFeatures

Step 4. Loading data into R

We start by loading the required R packages.
Note: We need the spatial branch of Seurat package to support the spatial data processing.

suppressPackageStartupMessages({
    library(fishpond)
    library(tximport)
    library(devtools)
    library(ggplot2)
    library(patchwork)

    # use the below command to install spatial Seurat
    # devtools::install_github("satijalab/seurat", ref = "spatial")
    library(Seurat)
})

We load the alevin generated spot-gene count matrix using tximport package and create a Seurat object.

files <- file.path("~/alevin_out/alevin/quants_mat.gz")
file.exists(files)

## [1] TRUE

# tximport loads the alevin data into R
txi <- tximport(files = files, type = "alevin")

## reading in alevin gene-level counts across cells with fishpond

# Creating a Seurat object with spatial assay
assay <- "Spatial"
brain <- CreateSeuratObject(counts = txi$counts, project = "SPATIAL", assay = assay)
brain

## An object of class Seurat 
## 31460 features across 3728 samples within 1 assay 
## Active assay: Spatial (31460 features, 0 variable features)

We load the spatial image data to extract the spatial 2D coordinates of each cellular barcodes and add the metadata to the Seurat object.

# loading the 10x image data
image.data <- Read10X_Image("~/spatial/")

# Since the names of alevin cb is different from 10x
# we rename the cells and filter the image data
# to have the metadata for only quantified cells
rownames(image.data@coordinates) <- gsub("-1", "", rownames(image.data@coordinates))
common.cells <- intersect(Cells(x = brain), rownames(image.data@coordinates))

image.data <- image.data[common.cells]
brain <- subset(brain, cells = common.cells)

# adding image data to Seurat object
DefaultAssay(object = image.data) <- "Spatial"
brain[['slice']] <- image.data

Step 5. Visualization

Once we have the Seurat object we can use all the exciting features made available through the Seurat package. We use the spatial vignette provided by the Seurat package for visualizing the Spatial data.

plot1 <- VlnPlot(brain, features = "nCount_Spatial", pt.size = 0.1) + NoLegend()
plot2 <- SpatialFeaturePlot(brain, features = "nCount_Spatial") + theme(legend.position = "right")
wrap_plots(plot1, plot2)

We can also perform the gene expression visualization after normalizing the data through scTransform.

brain <- SCTransform(brain, assay = "Spatial", verbose = FALSE)
SpatialFeaturePlot(brain, features = c("Hpca", "Ttr"))

We run dimensionality reduction, clustering, and umap visualization

brain <- RunPCA(brain, assay = "SCT", verbose = FALSE)
brain <- FindNeighbors(brain, reduction = "pca", dims = 1:30)

## Computing nearest neighbor graph

## Computing SNN

brain <- FindClusters(brain, verbose = FALSE)
brain <- RunUMAP(brain, reduction = "pca", dims = 1:30)

## 13:17:59 UMAP embedding parameters a = 0.9922 b = 1.112

## 13:17:59 Read 2685 rows and found 30 numeric columns

## 13:17:59 Using Annoy for neighbor search, n_neighbors = 30

## 13:17:59 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 13:17:59 Writing NN index file to temp file /tmp/Rtmp9yB2lD/file46005565cc64
## 13:17:59 Searching Annoy index using 1 thread, search_k = 3000
## 13:18:00 Annoy recall = 100%
## 13:18:00 Commencing smooth kNN distance calibration using 1 thread
## 13:18:01 Initializing from normalized Laplacian + noise
## 13:18:01 Commencing optimization for 500 epochs, with 109368 positive edges
## 13:18:06 Optimization finished

p1 <- DimPlot(brain, reduction = "umap", label = TRUE)
p2 <- SpatialDimPlot(brain, label = TRUE, label.size = 3)
p1 + p2

The most exciting part of the spatial data is the visualization of the spatial localization of the individual clusters

SpatialDimPlot(brain, cells.highlight = CellsByIdentities(object = brain, 
                                                          idents = c(1, 2, 5, 3, 4, 8)), 
               facet.highlight = TRUE, ncol = 3)

## Scale for 'fill' is already present. Adding another scale for
## 'fill', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for
## 'fill', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for
## 'fill', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for
## 'fill', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for
## 'fill', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for
## 'fill', which will replace the existing scale.

Session info

sessionInfo()

## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods  
## [7] base     
## 
## other attached packages:
## [1] Seurat_3.1.4.9903    patchwork_1.0.0.9000 ggplot2_3.2.1       
## [4] devtools_2.2.2       usethis_1.5.1        tximport_1.15.8     
## [7] fishpond_1.3.10     
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.5             plyr_1.8.5                 
##   [3] igraph_1.2.4.2              lazyeval_0.2.2             
##   [5] splines_3.6.2               BiocParallel_1.20.1        
##   [7] listenv_0.8.0               GenomeInfoDb_1.22.0        
##   [9] digest_0.6.24               htmltools_0.4.0            
##  [11] gdata_2.18.0                fansi_0.4.1                
##  [13] magrittr_1.5                memoise_1.1.0              
##  [15] tensor_1.5                  cluster_2.1.0              
##  [17] ROCR_1.0-7                  remotes_2.1.1              
##  [19] globals_0.12.5              RcppParallel_4.4.4         
##  [21] matrixStats_0.56.0          prettyunits_1.1.1          
##  [23] colorspace_1.4-1            rappdirs_0.3.1             
##  [25] ggrepel_0.8.1               xfun_0.12                  
##  [27] dplyr_0.8.4                 callr_3.4.2                
##  [29] crayon_1.3.4                RCurl_1.98-1.1             
##  [31] jsonlite_1.6.1              spatstat_1.63-3            
##  [33] spatstat.data_1.4-3         survival_3.1-8             
##  [35] zoo_1.8-7                   ape_5.3                    
##  [37] glue_1.3.1                  polyclip_1.10-0            
##  [39] gtable_0.3.0                zlibbioc_1.32.0            
##  [41] XVector_0.26.0              leiden_0.3.3               
##  [43] DelayedArray_0.12.2         pkgbuild_1.0.6             
##  [45] future.apply_1.4.0          BiocGenerics_0.32.0        
##  [47] abind_1.4-5                 scales_1.1.0               
##  [49] miniUI_0.1.1.1              Rcpp_1.0.3                 
##  [51] viridisLite_0.3.0           xtable_1.8-4               
##  [53] reticulate_1.14             rsvd_1.0.3                 
##  [55] stats4_3.6.2                tsne_0.1-3                 
##  [57] htmlwidgets_1.5.1           httr_1.4.1                 
##  [59] gplots_3.0.1.2              RColorBrewer_1.1-2         
##  [61] ellipsis_0.3.0              ica_1.0-2                  
##  [63] farver_2.0.3                pkgconfig_2.0.3            
##  [65] uwot_0.1.5                  deldir_0.1-25              
##  [67] labeling_0.3                tidyselect_1.0.0           
##  [69] rlang_0.4.4                 reshape2_1.4.3             
##  [71] later_1.0.0                 munsell_0.5.0              
##  [73] tools_3.6.2                 cli_2.0.1                  
##  [75] ggridges_0.5.2              evaluate_0.14              
##  [77] stringr_1.4.0               fastmap_1.0.1              
##  [79] yaml_2.2.1                  goftest_1.2-2              
##  [81] npsurv_0.4-0                knitr_1.28                 
##  [83] processx_3.4.2              fs_1.3.1                   
##  [85] fitdistrplus_1.0-14         caTools_1.18.0             
##  [87] purrr_0.3.3                 RANN_2.6.1                 
##  [89] pbapply_1.4-2               future_1.16.0              
##  [91] nlme_3.1-144                mime_0.9                   
##  [93] compiler_3.6.2              rstudioapi_0.11            
##  [95] curl_4.3                    plotly_4.9.2               
##  [97] png_0.1-7                   testthat_2.3.1             
##  [99] lsei_1.2-0                  spatstat.utils_1.17-0      
## [101] tibble_2.1.3                stringi_1.4.6              
## [103] ps_1.3.2                    RSpectra_0.16-0            
## [105] desc_1.2.0                  lattice_0.20-38            
## [107] Matrix_1.2-18               vctrs_0.2.2                
## [109] pillar_1.4.3                lifecycle_0.1.0            
## [111] lmtest_0.9-37               RcppAnnoy_0.0.14           
## [113] data.table_1.12.8           cowplot_1.0.0              
## [115] bitops_1.0-6                irlba_2.3.3                
## [117] httpuv_1.5.2                GenomicRanges_1.38.0       
## [119] R6_2.4.1                    promises_1.1.0             
## [121] KernSmooth_2.23-16          gridExtra_2.3              
## [123] IRanges_2.20.2              sessioninfo_1.1.1          
## [125] codetools_0.2-16            MASS_7.3-51.5              
## [127] gtools_3.8.1                assertthat_0.2.1           
## [129] pkgload_1.0.2               SummarizedExperiment_1.16.1
## [131] rprojroot_1.3-2             withr_2.1.2                
## [133] sctransform_0.2.1           S4Vectors_0.24.3           
## [135] GenomeInfoDbData_1.2.2      mgcv_1.8-31                
## [137] parallel_3.6.2              grid_3.6.2                 
## [139] rpart_4.1-15                tidyr_1.0.2                
## [141] rmarkdown_2.1               Rtsne_0.15                 
## [143] Biobase_2.46.0              shiny_1.4.0

Alevin-Tutorial