
Alevin-Tutorial


A support website for Alevin-tool (part of Salmon). How Tos and FAQs


Alevin w/ Feature Barcodes

Feature Barcoding based Single-Cell Quantification with alevin

Estimate antibody quantification with alevin

CITE-seq (Stoeckius, Marlon, et al., 2017; Mimitou, Eleni P., et al., 2019) and 10x Genomics’ Feature Barcoding technology have enabled sequencing of cell-surface proteins and gene expression from the same cell. From a quantification point of view, these single-cell technologies require the generation of a count matrix for the features, which can be antibody barcodes for proteins, hashing barcodes for sample multiplexing, guide-RNAs for CRISPR screens, and genes for RNA-based experiments. In this tutorial we generate the gene-count (RNA), adt-count (protein) and hto-count (sample hashing) matrices for each cell using alevin, and we harmonize & visualize them using Seurat (Stuart, Tim, & Butler, Andrew et al., 2019).

The sections below contain code chunks executed in different languages, indicated with different background colors. R code is represented by grey boxes and shell commands by blue boxes.

Step 1. Index the reference sequences

In order to quantify the abundances with alevin, we need to generate an index of the reference sequences. Here, we will use prebuilt indices from refgenie; in case the index for a reference species is not available, follow the Setting Up Resources tutorial to generate the index.

Please note that, depending on the available computational resources, indexing of the reference sequences can be done at multiple levels of accuracy; for more information please check our preprint. In this tutorial we are using Bone Marrow Mononuclear Cells data and the selective alignment (SA) index of salmon for the human reference. The following other options are also available for the salmon human reference index, listed in increasing order of both size and accuracy:

1.) cDNA index: 531.5MB.
2.) SA index: 951.3MB.
3.) SAF index: 14.2GB.

We start by downloading the SA salmon index from the refgenie website and unpacking the index.

wget --content-disposition  -nv http://refgenomes.databio.org/v2/asset/hg38/salmon_partial_sa_index/archive?tag=default

tar -xvzf salmon_partial_sa_index__default.tgz

We also need a tsv file that maps the transcript names to the gene names. The refgenie SA index already contains the reference file used for indexing and we can generate the map file from that as follows:

grep "^>" salmon_partial_sa_index/gentrome.fa | cut -d " " -f 1,7 --output-delimiter=$'\t' - | sed 's/[>"gene_symbol:"]//g' > txp2gene.tsv
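To see what this pipeline does, here is the same cut/sed chain applied to a single mock gentrome.fa header (the transcript id and gene symbol below are illustrative; real headers come from the refgenie index):

```shell
# A mock FASTA header in the gentrome.fa style: field 1 is the transcript id,
# field 7 is the "gene_symbol:" annotation (names here are made up).
header='>ENST00000456328.2 cdna chromosome:GRCh38:1:11869:14409:1 gene:ENSG00000223972.5 gene_biotype:pseudogene transcript_biotype:processed_transcript gene_symbol:DDX11L1'

# Same transformation as above: keep fields 1 and 7, tab separated, then
# strip the '>' prefix and the characters of the "gene_symbol:" label.
echo "$header" | cut -d " " -f 1,7 --output-delimiter=$'\t' - | sed 's/[>"gene_symbol:"]//g'
# -> ENST00000456328.2	DDX11L1
```

Note that the sed expression removes a character class (every character appearing in `>"gene_symbol:"`), which works here because Ensembl transcript ids and human gene symbols are upper case.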

Step 2. Index the antibody sequences

In order to quantify the antibodies and the sample multiplexing based cell-hashing, we need to index the feature barcodes. We start by downloading the barcode sequences of the antibody-derived tags (ADT) and the hashtag oligos (HTO).

wget --content-disposition  -nv https://ftp.ncbi.nlm.nih.gov/geo/series/GSE128nnn/GSE128639/suppl/GSE128639_MNC_ADT_Barcodes.csv.gz &&
wget --content-disposition  -nv https://ftp.ncbi.nlm.nih.gov/geo/series/GSE128nnn/GSE128639/suppl/GSE128639_MNC_HTO_Barcodes.csv.gz &&

gunzip -c GSE128639_MNC_ADT_Barcodes.csv.gz | awk -F "," '{print $1"\t"$4}' | tail -n +2 > adt.tsv &&
gunzip -c GSE128639_MNC_HTO_Barcodes.csv.gz | awk -F "," '{print $1"\t"$4}' | sed 's/Hashtag /Hashtag_/g' | tail -n +2 > hto.tsv

In the latest release of salmon (>= 1.2.0), we added the --features command-line flag to the index sub-command of salmon, which performs indexing based on a tab-separated file. Specifically, the flag expects the id of a reference sequence and its nucleotide sequence, tab separated, one reference per line. We have already generated the two reference files, one for ADT and another for HTO, in the above commands. Next, we index the features.

# We assume Salmon binary is available to the environment
# module load Salmon/1.2.1 && 
salmon index -t adt.tsv -i adt_index --features -k7 && 
salmon index -t hto.tsv -i hto_index --features -k7
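For reference, the tab-separated files indexed above have the following shape (the tag names and 15-nt sequences below are made up for illustration; the real ones come from the GEO CSVs parsed earlier):

```shell
# Create a toy feature-barcode reference: <id><TAB><sequence>, one per line.
# The tag names and sequences are illustrative only.
printf 'CD3\tAACAAGACCCTTGAG\nCD4\tTACCCGTAATAGCGT\n' > toy_features.tsv
cat toy_features.tsv
```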

Step 3. Download the raw RNA & antibody sequencing data

We download the fastq files for a CITE-seq based experiment on Bone Marrow Mononuclear Cells. CITE-seq generates the raw FASTQ files separately for the RNA, ADT and HTO libraries.

# RNA experiment
wget --content-disposition  -nv  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/003/SRR8758323/SRR8758323_2.fastq.gz && 
wget --content-disposition  -nv  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/003/SRR8758323/SRR8758323_1.fastq.gz &&

# ADT experiment
wget --content-disposition  -nv  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/005/SRR8758325/SRR8758325_1.fastq.gz &&
wget --content-disposition  -nv  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/005/SRR8758325/SRR8758325_2.fastq.gz &&

# HTO experiment
wget --content-disposition  -nv  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/007/SRR8758327/SRR8758327_1.fastq.gz &&
wget --content-disposition  -nv  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/007/SRR8758327/SRR8758327_2.fastq.gz

Step 4. Quantify with alevin

We first run alevin to quantify the gene abundances based on the relevant 3-prime sequencing technology (here it’s 10x Chromium V2); for more information on using alevin with different 3-prime sequencing technologies, check the How to Run Alevin page.

salmon alevin -l ISR -i salmon_partial_sa_index \
-1 SRR8758323_1.fastq.gz -2 SRR8758323_2.fastq.gz \
-o alevin_rna -p 16 --tgMap txp2gene.tsv \
--chromium --dumpFeatures --expectCells 35000 

Next, we quantify the ADT library using the previously generated adt_index. This stage generally requires two new flags: --featureStart (0 for CITE-seq and 10 for 10x based feature barcoding), the start position of the feature barcode on the R2 read, and --featureLength (15 for both CITE-seq and 10x based feature barcoding), the length of the feature barcode.

salmon alevin -l ISR -i adt_index \
-1 SRR8758325_1.fastq.gz -2 SRR8758325_2.fastq.gz \
-o alevin_adt -p 16 --citeseq --featureStart 0 \
--featureLength 15
*Note*: a custom geometry can also be provided using the flags
--bc-geometry 1[1-16] --umi-geometry 1[17-26] --read-geometry 2[1-15] \
--tgMap adt_map.tsv --keepCBFraction 1.0
in place of
--citeseq --featureStart 0 --featureLength 15

Lastly, we quantify the HTO library using the previously generated hto_index. NOTE: Generally, the number of features in an HTO experiment is significantly lower than in the mRNA experiment, and the ambiguity information in the equivalence classes becomes trivially simple (mostly unique) to resolve. Hence, we recommend using the --naiveEqclass flag for the UMI deduplication of the HTO data.

salmon alevin -l ISR -i hto_index \
-1 SRR8758327_1.fastq.gz -2 SRR8758327_2.fastq.gz \
-o alevin_hto -p 16 --citeseq --featureStart 0 \
--featureLength 15 --naiveEqclass

Step 5. Loading data into R

We start by loading the required R packages.

suppressPackageStartupMessages({
    library(fishpond)
    library(tximport)
    library(devtools)
    library(ggplot2)
    library(Seurat)
})

We load the alevin-generated count matrices using the tximport package.

rna.files <- file.path("alevin_rna/alevin/quants_mat.gz")
adt.files <- file.path("alevin_adt/alevin/quants_mat.gz")
hto.files <- file.path("alevin_hto/alevin/quants_mat.gz")
file.exists(c(rna.files, adt.files, hto.files))
## [1] TRUE TRUE TRUE
# tximport loads the alevin data into R
rna.txi <- tximport(files = rna.files, type = "alevin")
## reading in alevin gene-level counts across cells with fishpond
adt.txi <- tximport(files = adt.files, type = "alevin")
## reading in alevin gene-level counts across cells with fishpond
hto.txi <- tximport(files = hto.files, type = "alevin")
## reading in alevin gene-level counts across cells with fishpond

We create the Seurat object using the RNA data and add the ADT / HTO data as separate assays to the object for the common cells.

common.cells <- intersect(colnames(rna.txi$counts), colnames(adt.txi$counts))
common.cells <- intersect(common.cells , colnames(hto.txi$counts))
length(common.cells)
## [1] 19172
object <- CreateSeuratObject(rna.txi$counts[, common.cells])
object[["ADT"]] <- CreateAssayObject(counts = adt.txi$counts[, common.cells])
object[["HTO"]] <- CreateAssayObject(counts = hto.txi$counts[, common.cells])

Step 6. Visualization

Processing Sample Hashing data (HTOs) to demultiplex samples

We start by performing some basic quality checks and filtering of noisy barcodes on the HTO data. We then perform centered log-ratio (CLR) normalization of the HTO data.

# remove HTO protein aggregation cells
VlnPlot(object, features = c("nCount_HTO"), log = TRUE )

object <- subset(object, subset = nCount_HTO < 2e4)

# HTO Normalization
DefaultAssay(object) <- "HTO"
object <- NormalizeData(object,   normalization.method = "CLR", margin = 2, verbose = F)
VariableFeatures(object) <- rownames(object[["HTO"]]@counts)
object <- ScaleData(object, assay = "HTO", verbose = F)
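For reference, the standard centered log-ratio transform underlying the normalization above maps each count $x_i$ in a composition $x = (x_1, \dots, x_n)$ (here, the counts of one HTO taken across cells, since margin = 2) to its log count minus the log of the geometric mean:

$$\mathrm{clr}(x)_i \;=\; \ln \frac{x_i}{\left(\prod_{j=1}^{n} x_j\right)^{1/n}} \;=\; \ln x_i \;-\; \frac{1}{n}\sum_{j=1}^{n} \ln x_j$$

Seurat’s implementation uses a log1p-based variant of this formula so that zero counts, which are common in tag data, do not produce undefined logs.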

We use the function HTODemux from Seurat to demultiplex cellular barcodes based on HTO enrichment and to assign single cells back to their samples of origin.

object <- HTODemux(object, assay = "HTO", positive.quantile = 0.99, verbose = F)

Idents(object) <- "HTO_classification.global"
VlnPlot(object, features = "nCount_HTO", pt.size = 0.1, log = TRUE)

VlnPlot(object, features = "nCount_RNA", pt.size = 0.1, log = TRUE)

VlnPlot(object, features = "nCount_ADT", pt.size = 0.1, log = TRUE)