Feature Barcoding based Single-Cell Quantification with alevin
Avi Srivastava, Yuhan Hao
2020-04-20
CITE-seq (Stoeckius, Marlon, et al., 2017, Mimitou, Eleni P., et al., 2019) and 10x Genomics’ Feature Barcoding technology have enabled the sequencing of cell surface proteins and gene expression from the same cell. From a quantification point of view, these single-cell technologies require the generation of a count matrix for the features, which can be antibody barcodes for proteins, hashing barcodes for sample multiplexing, guide-RNAs for CRISPR screens and genes for RNA-based experiments. In this tutorial we generate the gene-count (RNA), ADT-count (protein) and HTO-count (sample hashing) matrices for each cell using alevin, and we harmonize & visualize them using Seurat (Stuart, Tim, & Butler, Andrew et al., 2019).
The sections below contain code chunks executed in different languages, indicated with different background colors. R code is represented by grey boxes and shell commands by blue boxes.
Step 1. Index the reference sequences
In order to quantify the abundances with alevin, we need to generate an index of the reference sequences. Here, we will use prebuilt indices from refgenie, but in case the index of a reference species is not available, follow the Setting Up Resources tutorial to generate the index.
Please note that, depending on the available computational resources, the reference sequences can be indexed at multiple levels of accuracy; for more information, please check our preprint. In this tutorial we use Bone Marrow Mononuclear Cells data and the selective alignment (SA) index of salmon for the human reference. The following options are available for the salmon human reference index, listed in increasing order of size and, correspondingly, accuracy:
1.) cDNA index: 531.5MB.
2.) SA index: 951.3MB.
3.) SAF index: 14.2GB.
We start by downloading the SA salmon index from the refgenie website and unpacking the index.
wget --content-disposition -nv http://refgenomes.databio.org/v2/asset/hg38/salmon_partial_sa_index/archive?tag=default
tar -xvzf salmon_partial_sa_index__default.tgz
We also need a tsv file that maps the transcript names to the gene names. The refgenie SA index already contains the reference file used for indexing, and we can generate the map file from it as follows:
grep "^>" salmon_partial_sa_index/gentrome.fa | cut -d " " -f 1,7 --output-delimiter=$'\t' - | sed 's/[>"gene_symbol:"]//g' > txp2gene.tsv
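As a quick, optional sanity check, the map file should contain two tab-separated columns, transcript ID and gene name, with no header; a minimal R sketch, assuming txp2gene.tsv was written to the current working directory:
# Optional sanity check of the transcript-to-gene map generated above
t2g <- read.delim("txp2gene.tsv", header = FALSE,
                  col.names = c("transcript", "gene"),
                  stringsAsFactors = FALSE)
head(t2g)       # first few transcript -> gene mappings
ncol(t2g) == 2  # should be TRUE: exactly two columns per line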
Step 2. Index the antibody sequences
In order to quantify the antibodies and the sample-multiplexing cell hashing, we need to index the feature barcodes. We start by downloading the barcode sequences of the antibody derived tags (ADT) and the hashtag oligos (HTO).
wget --content-disposition -nv https://ftp.ncbi.nlm.nih.gov/geo/series/GSE128nnn/GSE128639/suppl/GSE128639_MNC_ADT_Barcodes.csv.gz &&
wget --content-disposition -nv https://ftp.ncbi.nlm.nih.gov/geo/series/GSE128nnn/GSE128639/suppl/GSE128639_MNC_HTO_Barcodes.csv.gz &&
gunzip -c GSE128639_MNC_ADT_Barcodes.csv.gz | awk -F "," '{print $1"\t"$4}' | tail -n +2 > adt.tsv &&
gunzip -c GSE128639_MNC_HTO_Barcodes.csv.gz | awk -F "," '{print $1"\t"$4}' | sed 's/Hashtag /Hashtag_/g' | tail -n +2 > hto.tsv
In the latest release of salmon (>= 1.2.0), we added the --features command line flag to the index sub-command of salmon, which performs the indexing based on a tab-separated file. Specifically, the flag expects an id of the reference sequence, a tab, and then the nucleotide sequence, one reference per line. We have already generated the two reference files, one for ADT and another for HTO, with the commands above.
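For illustration, here is a minimal R sketch that writes a file in this layout using two made-up feature barcodes (the real adt.tsv and hto.tsv were already produced by the awk commands above):
# Illustration only: made-up feature barcodes in the layout expected by
# salmon index --features (id, tab, nucleotide sequence, one reference per line, no header)
features <- data.frame(id  = c("CD3", "CD4"),
                       seq = c("AACAAGACCCTTGAG", "TACCCGTAATAGCGT"),
                       stringsAsFactors = FALSE)
write.table(features, file = "example_features.tsv", sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)
Next, we index the features.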
# We assume the salmon binary is available in the environment
# module load Salmon/1.2.1 &&
salmon index -t adt.tsv -i adt_index --features -k7 &&
salmon index -t hto.tsv -i hto_index --features -k7
Step 3. Download the raw RNA & antibody sequencing data
We download the FASTQ files for a CITE-seq experiment on Bone Marrow Mononuclear Cells. CITE-seq generates the raw FASTQ files separately for the RNA, ADT and HTO libraries.
# RNA experiment
wget --content-disposition -nv ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/003/SRR8758323/SRR8758323_2.fastq.gz &&
wget --content-disposition -nv ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/003/SRR8758323/SRR8758323_1.fastq.gz &&
# ADT experiment
wget --content-disposition -nv ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/005/SRR8758325/SRR8758325_1.fastq.gz &&
wget --content-disposition -nv ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/005/SRR8758325/SRR8758325_2.fastq.gz &&
# HTO experiment
wget --content-disposition -nv ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/007/SRR8758327/SRR8758327_1.fastq.gz &&
wget --content-disposition -nv ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/007/SRR8758327/SRR8758327_2.fastq.gz
Step 4. Quantify with alevin
We first run alevin to quantify the gene abundances based on the relevant 3-prime sequencing technology (here it’s 10x Chromium V2). For more information on using alevin with different 3-prime sequencing technologies, check the How to Run Alevin page.
salmon alevin -l ISR -i salmon_partial_sa_index \
-1 SRR8758323_1.fastq.gz -2 SRR8758323_2.fastq.gz \
-o alevin_rna -p 16 --tgMap txp2gene.tsv \
--chromium --dumpFeatures --expectCells 35000
Next, we quantify the ADT library using the previously generated adt_index. This stage generally requires two new flags: featureStart (0 for CITE-seq and 10 for 10x based feature barcoding), the start index of the feature barcode on the R2 read, and featureLength (15 for both CITE-seq and 10x based feature barcoding), the length of the feature barcode.
salmon alevin -l ISR -i adt_index \
-1 SRR8758325_1.fastq.gz -2 SRR8758325_2.fastq.gz \
-o alevin_adt -p 16 --citeseq --featureStart 0 \
--featureLength 15
*Note*: a custom geometry can also be provided by using --bc-geometry 1[1-16] --umi-geometry 1[17-26] --read-geometry 2[1-15] --tgMap adt_map.tsv --keepCBFraction 1.0 in place of --citeseq --featureStart 0 --featureLength 15.
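To make the geometry concrete, here is a toy R sketch (with made-up reads) showing how the cell barcode, UMI and feature barcode map onto the read pair that the flags above describe:
# Toy example (made-up sequences) of the CITE-seq / 10x V2 read layout
r1 <- paste0("ACGTACGTACGTACGT", strrep("T", 10))  # R1: 16 bp cell barcode + 10 bp UMI
r2 <- paste0("AACAAGACCCTTGAG", strrep("N", 15))   # R2: 15 bp feature barcode, then downstream bases
cell_barcode    <- substr(r1, 1, 16)   # --bc-geometry 1[1-16]
umi             <- substr(r1, 17, 26)  # --umi-geometry 1[17-26]
feature_barcode <- substr(r2, 1, 15)   # --read-geometry 2[1-15], i.e. featureStart 0, featureLength 15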
Lastly, we quantify the HTO library using the previously generated hto_index. NOTE: Generally, the number of features in an HTO experiment is significantly lower than in the mRNA experiment, and the ambiguity information in the equivalence classes becomes trivially simple (mostly unique) to resolve. Hence, we recommend using the --naiveEqclass flag for the UMI deduplication of the HTO data.
salmon alevin -l ISR -i hto_index \
-1 SRR8758327_1.fastq.gz -2 SRR8758327_2.fastq.gz \
-o alevin_hto -p 16 --citeseq --featureStart 0 \
--featureLength 15 --naiveEqclass
Step 5. Loading data into R
We start by loading the required R packages.
suppressPackageStartupMessages({
library(fishpond)
library(tximport)
library(devtools)
library(ggplot2)
library(Seurat)
})
We load the alevin-generated count matrices using the tximport package.
rna.files <- file.path("alevin_rna/alevin/quants_mat.gz")
adt.files <- file.path("alevin_adt/alevin/quants_mat.gz")
hto.files <- file.path("alevin_hto/alevin/quants_mat.gz")
file.exists(c(rna.files, adt.files, hto.files))
## [1] TRUE TRUE TRUE
# tximport loads the alevin data into R
rna.txi <- tximport(files = rna.files, type = "alevin")
## reading in alevin gene-level counts across cells with fishpond
adt.txi <- tximport(files = adt.files, type = "alevin")
## reading in alevin gene-level counts across cells with fishpond
hto.txi <- tximport(files = hto.files, type = "alevin")
## reading in alevin gene-level counts across cells with fishpond
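As a quick optional check, each loaded object is a list whose counts element is a features-by-cells matrix (the standard tximport return for alevin data):
# Optional: features x cellular-barcodes dimensions of each count matrix
dim(rna.txi$counts)   # genes x cells
dim(adt.txi$counts)   # ADT features x cells
dim(hto.txi$counts)   # HTO features x cells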
We create the Seurat object using the RNA data and add the ADT / HTO data as separate assays to the object for the common cells.
common.cells <- intersect(colnames(rna.txi$counts), colnames(adt.txi$counts))
common.cells <- intersect(common.cells , colnames(hto.txi$counts))
length(common.cells)
## [1] 19172
[NOTE] If the feature barcoding protocol uses TotalSeq B or C, as mentioned by 10x Genomics, the cellular barcodes of the RNA and the feature barcode libraries will not be exactly the same. To match the cellular barcodes across technologies, the cellular barcodes need to be translated based on a mapping file. The code to perform the mapping is at https://github.com/COMBINE-lab/salmon/discussions/576#discussioncomment-235459 and more information from 10x Genomics is at https://kb.10xgenomics.com/hc/en-us/articles/360031133451-Why-is-there-a-discrepancy-in-the-3M-february-2018-txt-barcode-whitelist-
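For orientation only, here is a hypothetical sketch of such a translation, assuming a two-column, tab-separated mapping file named barcode_translation.tsv (the linked discussion contains the actual code and mapping file):
# Hypothetical sketch only -- see the linked discussion for the real mapping code.
# Assumes column 1 holds the RNA-library barcode and column 2 the matching
# feature-library barcode.
bc.map <- read.delim("barcode_translation.tsv", header = FALSE,
                     col.names = c("rna_bc", "feature_bc"),
                     stringsAsFactors = FALSE)
translate <- setNames(bc.map$rna_bc, bc.map$feature_bc)
# rename the ADT / HTO cells into RNA-barcode space before intersecting
colnames(adt.txi$counts) <- translate[colnames(adt.txi$counts)]
colnames(hto.txi$counts) <- translate[colnames(hto.txi$counts)]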
object <- CreateSeuratObject(rna.txi$counts[, common.cells])
object[["ADT"]] <- CreateAssayObject(counts = adt.txi$counts[, common.cells])
object[["HTO"]] <- CreateAssayObject(counts = hto.txi$counts[, common.cells])
Step 6. Visualization
Processing Sample Hashing data (HTOs) to demultiplex samples
We start by performing some basic quality checks on the HTO data and filtering out noisy barcodes. We then apply centered log-ratio (CLR) normalization to the HTO data.
# remove HTO protein aggregation cells
VlnPlot(object, features = c("nCount_HTO"), log = TRUE )
object <- subset(object, subset = nCount_HTO < 2e4)
# HTO Normalization
DefaultAssay(object) <- "HTO"
object <- NormalizeData(object, normalization.method = "CLR", margin = 2, verbose = F)
VariableFeatures(object) <- rownames(object[["HTO"]]@counts)
object <- ScaleData(object, assay = "HTO", verbose = F)
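For intuition, CLR divides each count by a geometric mean and then log-transforms it; the sketch below shows one common variant (illustration only, Seurat's NormalizeData handles the pseudocounts and the margin argument internally):
# Illustration only: a simple CLR variant applied to a toy HTO count vector
clr <- function(x) {
  log1p(x / exp(mean(log1p(x))))  # divide by a pseudocounted geometric mean, then log
}
clr(c(120, 3, 5, 0, 8))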
We use the HTODemux function from Seurat to demultiplex cellular barcodes based on HTO enrichment and assign single cells back to their samples of origin.
object <- HTODemux(object, assay = "HTO", positive.quantile = 0.99, verbose = F)
Idents(object) <- "HTO_classification.global"
VlnPlot(object, features = "nCount_HTO", pt.size = 0.1, log = TRUE)
VlnPlot(object, features = "nCount_RNA", pt.size = 0.1, log = TRUE)
VlnPlot(object, features = "nCount_ADT", pt.size = 0.1, log = TRUE)
[OPTIONAL] The HTODemux function classifies the HTO barcodes into three categories: singlet, doublet and negative. We remove the negatives, visualize the singlets and doublets using UMAP, and observe 10 visually separable clusters identifiable by the 10 different sample hashing barcodes (HTOs).
# subset the data
object <- subset(object, idents = "Negative", invert = TRUE)
# perform PCA and generate UMAP embeddings
object <- RunPCA(object, reduction.name = "hto.pca", reduction.key = "HPC_", verbose = F)
object <- RunUMAP(object, reduction = "hto.pca", dims = 1:9, reduction.name = "hto.umap", reduction.key = "HUMAP_", verbose = F)
DimPlot(object, reduction = "hto.umap", label = F)
DimPlot(object, reduction = "hto.umap", label = T, group.by = "hash.ID" )
Processing Antibody Derived Tag (ADT) protein data for clustering
We subset the object to the cellular barcodes classified as Singlets by HTODemux and remove low-quality cells.
# subsetting the data to Singlets
object <- subset(object, idents = "Singlet")
# remove ADT protein aggregation cells
VlnPlot(object, features = c("nCount_ADT"), pt.size = 0.1)
object <- subset(object, subset = nCount_ADT < 1e4)
# remove low quality cells
Idents(object) <- "hash.ID"
VlnPlot(object, features = c("nFeature_RNA", "nFeature_ADT"), pt.size = 0.1)
object <- subset(object, subset = nFeature_RNA > 200 & nFeature_RNA < 3000 & nFeature_ADT > 18)
# remove high mito cells
DefaultAssay(object) <- "RNA"
object[["percent.mt"]] <- PercentageFeatureSet(object, pattern = "^MT-")
VlnPlot(object, "percent.mt", pt.size = 0.1)
object <- subset(object, subset = percent.mt < 10)
We normalize the ADT data, perform PCA and generate UMAP embeddings for visualization.
# ADT Normalization
DefaultAssay(object) <- "ADT"
object <- NormalizeData(object, normalization.method = "CLR", margin = 2, verbose = F)
VariableFeatures(object) <- rownames(object[["ADT"]]@counts)
object <- ScaleData(object, assay = "ADT", verbose = F, do.scale = F)
# Perform PCA
object <- RunPCA(object, reduction.name = "apca", reduction.key = "apca_", verbose = F)
# generate UMAP
object <- RunUMAP(object, reduction.name = "aumap", reduction.key = "aumap_", dims = 1:18, reduction = "apca", verbose = F)
Processing RNA data for clustering and identifying marker genes
We start by normalizing the RNA data using the standard log-normalization workflow (SCTransform (Hafemeister, Christoph, and Rahul Satija, 2019) is an alternative).
# RNA Normalization
DefaultAssay(object) <- "RNA"
object <- NormalizeData(object, verbose = F)
object <- FindVariableFeatures(object, verbose = F)
object <- ScaleData(object, verbose = F)
# perform PCA
object <- RunPCA(object = object, verbose = F)
# generate UMAP embeddings
object <- RunUMAP(object = object, reduction = "pca", dims = 1:30, verbose = F)
Next, we perform clustering on the PCA embeddings and identify some marker genes.
object <- FindNeighbors(object, reduction = "pca", dims = 1:30, verbose = F)
object <- FindClusters(object, verbose = F)
## T and NK cells markers
FeaturePlot(object, reduction = "umap", features = c("adt_CD3","adt_CD4", "adt_CD8a", "adt_CD56", "adt_CD45RA", "adt_CD69","adt_CD45RO", "adt_CD161"), min.cutoff = "q05", max.cutoff = "q95", ncol = 3)
## Monocytes, DC, B cells and Progenitor cells markers
FeaturePlot(object, reduction = "umap", features = c("adt_CD14","adt_CD16", "adt_CD11c","adt_CD19", "adt_CD34", "rna_AVP", "rna_MPO", "rna_HMBS"), min.cutoff = "q05", max.cutoff = "q95", ncol = 3)
We manually assign cell types to the clusters based on the marker genes and plot the UMAP embeddings based on the RNA and ADT data.
# assign cell-type labels to the clusters
object <- RenameIdents(object, `0` = "CD4 Naive T", `1` = "CD14 Mono",
`2` = "CD8 Naive T", `3` = "CD4 Memory T",
`4` = "Naive B", `5` = "CD14 Mono",
`6` = "Memory B", `7` = "CD8 Effector T",
`8` = "NK", `9` = "Myeloid Progenitor",
`10` = "MAIT", `11` = "CD4 Naive T",
`12` = "HSC", `13` = "CD16 Mono",
`14` = "pDC", `15` = "RBC Progenitor",
`16` = "CD1C DC", `17` = "B Progenitor 1",
`18` = "B Progenitor 2")
# plotting umap embeddings for rna and adt data
DimPlot(object, reduction = "umap", label = T, repel = T) + NoLegend()
DimPlot(object, reduction = "aumap", label = T, repel = T) + NoLegend()
Session info
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods
## [7] base
##
## other attached packages:
## [1] Seurat_3.1.4.9903 ggplot2_3.2.1 devtools_2.2.2
## [4] usethis_1.5.1 tximport_1.15.13 fishpond_1.3.15
## [7] rmarkdown_2.1
##
## loaded via a namespace (and not attached):
## [1] backports_1.1.5 plyr_1.8.5
## [3] igraph_1.2.4.2 lazyeval_0.2.2
## [5] splines_3.6.2 BiocParallel_1.20.1
## [7] listenv_0.8.0 GenomeInfoDb_1.22.0
## [9] digest_0.6.24 htmltools_0.4.0
## [11] gdata_2.18.0 fansi_0.4.1
## [13] magrittr_1.5 memoise_1.1.0
## [15] tensor_1.5 cluster_2.1.0
## [17] ROCR_1.0-7 remotes_2.1.1
## [19] globals_0.12.5 RcppParallel_4.4.4
## [21] matrixStats_0.56.0 prettyunits_1.1.1
## [23] colorspace_1.4-1 rappdirs_0.3.1
## [25] ggrepel_0.8.1 xfun_0.12
## [27] dplyr_0.8.4 callr_3.4.2
## [29] crayon_1.3.4 RCurl_1.98-1.1
## [31] jsonlite_1.6.1 spatstat_1.63-3
## [33] spatstat.data_1.4-3 survival_3.1-8
## [35] zoo_1.8-7 ape_5.3
## [37] glue_1.3.1 polyclip_1.10-0
## [39] gtable_0.3.0 zlibbioc_1.32.0
## [41] XVector_0.26.0 leiden_0.3.3
## [43] DelayedArray_0.12.2 pkgbuild_1.0.6
## [45] future.apply_1.4.0 BiocGenerics_0.32.0
## [47] abind_1.4-5 scales_1.1.0
## [49] miniUI_0.1.1.1 Rcpp_1.0.3
## [51] viridisLite_0.3.0 xtable_1.8-4
## [53] reticulate_1.14 rsvd_1.0.3
## [55] tsne_0.1-3 stats4_3.6.2
## [57] htmlwidgets_1.5.1 httr_1.4.1
## [59] gplots_3.0.1.2 RColorBrewer_1.1-2
## [61] ellipsis_0.3.0 ica_1.0-2
## [63] farver_2.0.3 pkgconfig_2.0.3
## [65] uwot_0.1.5 deldir_0.1-25
## [67] labeling_0.3 tidyselect_1.0.0
## [69] rlang_0.4.4 reshape2_1.4.3
## [71] later_1.0.0 munsell_0.5.0
## [73] tools_3.6.2 cli_2.0.1
## [75] ggridges_0.5.2 evaluate_0.14
## [77] stringr_1.4.0 fastmap_1.0.1
## [79] yaml_2.2.1 goftest_1.2-2
## [81] npsurv_0.4-0 processx_3.4.2
## [83] knitr_1.28 fs_1.3.1
## [85] fitdistrplus_1.0-14 caTools_1.18.0
## [87] purrr_0.3.3 RANN_2.6.1
## [89] pbapply_1.4-2 future_1.16.0
## [91] nlme_3.1-144 mime_0.9
## [93] compiler_3.6.2 plotly_4.9.2
## [95] png_0.1-7 testthat_2.3.1
## [97] lsei_1.2-0 spatstat.utils_1.17-0
## [99] tibble_2.1.3 stringi_1.4.6
## [101] ps_1.3.2 RSpectra_0.16-0
## [103] desc_1.2.0 lattice_0.20-38
## [105] Matrix_1.2-18 vctrs_0.2.2
## [107] pillar_1.4.3 lifecycle_0.1.0
## [109] lmtest_0.9-37 RcppAnnoy_0.0.14
## [111] data.table_1.12.8 cowplot_1.0.0
## [113] bitops_1.0-6 irlba_2.3.3
## [115] httpuv_1.5.2 patchwork_1.0.0.9000
## [117] GenomicRanges_1.38.0 R6_2.4.1
## [119] promises_1.1.0 KernSmooth_2.23-16
## [121] gridExtra_2.3 IRanges_2.20.2
## [123] sessioninfo_1.1.1 codetools_0.2-16
## [125] MASS_7.3-51.5 gtools_3.8.1
## [127] assertthat_0.2.1 pkgload_1.0.2
## [129] SummarizedExperiment_1.16.1 rprojroot_1.3-2
## [131] withr_2.1.2 sctransform_0.2.1
## [133] S4Vectors_0.24.3 GenomeInfoDbData_1.2.2
## [135] mgcv_1.8-31 parallel_3.6.2
## [137] rpart_4.1-15 grid_3.6.2
## [139] tidyr_1.0.2 Rtsne_0.15
## [141] Biobase_2.46.0 shiny_1.4.0