My Profile Photo

alevin-fry-tutorials


Tutorials for using the alevin-fry single-cell RNA-seq pipeline


Processing feature barcoding data with alevin-fry

In this tutorial we will look at how to process a CITE-seq experiment (a type of feature barcoding experiment) using an alevin-fry based pipeline. Note : This tutorial is meant to mimic the original tutorial for feature barcode analysis with alevin written by Avi Srivastava and Yuhan Hao. Thus, most of the descriptive text and commands are taken directly from that tutorial. However, here we will be analyzing the data using the alevin-fry pipeline instead of alevin.

Note: Johannes Köster has been kind enough to turn most of this tutorial into a Snakemake workflow that you can find here.

Introduction

CITE-seq (Stoeckius, Marlon, et al., 2017, Mimitou, Eleni P., et al., 2019) and 10x Genomics’ Feature Barcoding technology has enabled the sequencing of the cell surface protein and gene expression from the same cell. From the perspecitve of quantification, these single-cell technologies requires the generation of a count matrix for the features, which can be antibody barcodes for protein, hashing barcodes for sample multiplexing, guide-RNAs for CRISPR and genes for a RNA-based experiments. In this tutorial we generate the gene-count (RNA), adt-count (protein) and hto-count (sample hashing) matrices for each cell using alevin, and we harmonize & visualize it using Seurat (Stuart, Tim, & Butler, Andrew et al., 2019).

The sections below contain code chunks executed in different languages, indicated with different background colors. R code is represented by grey boxes and shell commands by blue boxes.

[NOTE] If the feature barcoding protocol use TotalSeq B or C, as mentioned by 10x Genomics, the cellular barcodes of RNA and the feature barcode library would not be exactly the same. To match the cellular barcodes across technologies the cellular barcodes needs to be transformed based on a mapping file. The code to perform the mapping is here and more information from 10x Genomic is at https://kb.10xgenomics.com/hc/en-us/articles/360031133451-Why-is-there-a-discrepancy-in-the-3M-february-2018-txt-barcode-whitelist-.

Index the reference sequences

In order to quantify the abundances with alevin-fry, we need to generate the index of the reference sequences. We already saw how to do this for the reference transcriptome in earlier tutorials, and here we will assume that you’ve built the splici transcriptome reference following the previous tutorial.

Here, then, we will focus on constructing the indices for the ADT and HTO features.

Index the antibody sequences

In order to quantify the antibodies and the sample multiplexing based cell-hasing, we need to index the feature barcodes. We start by downloading the barcode sequences of the antibody derived tags (ADT) and the hash antibody tag oligos (HTO).

wget --content-disposition -nv https://ftp.ncbi.nlm.nih.gov/geo/series/GSE128nnn/GSE128639/suppl/GSE128639_MNC_ADT_Barcodes.csv.gz  &&
wget --content-disposition  -nv https://ftp.ncbi.nlm.nih.gov/geo/series/GSE128nnn/GSE128639/suppl/GSE128639_MNC_HTO_Barcodes.csv.gz &&

gunzip -c GSE128639_MNC_ADT_Barcodes.csv.gz | awk -F "," '{print $1"\t"$4}' | tail -n +2 > adt.tsv &&
gunzip -c GSE128639_MNC_HTO_Barcodes.csv.gz | awk -F "," '{print $1"\t"$4}' | sed 's/Hashtag /Hashtag_/g' | tail -n +2 > hto.tsv

We’ll use the --features command line flag to the index sub command of salmon to perform the indexing based on a tab separated file. Specifically, the flag expects an id of the reference sequence tab separated by the nucleotide sequence, one reference per line. We have already generated the two reference files, one for ADT and another for HTO, in the above command. Now, we index the features.

# We assume Salmon binary is available to the environment
salmon index -t adt.tsv -i adt_index --features -k7 && 
salmon index -t hto.tsv -i hto_index --features -k7

Download the raw RNA & antibody sequencing data

Having indexed the reference transcriptome and feature sets, we next download the FASTQ files for a CITE-seq based experiment targeting Bone Marrow Mononuclear Cells. CITE-seq generates the raw FASTQ files separately for each of the RNA, ADT and HTO experiments.

# RNA experiment
mkdir -p data/rna
wget --content-disposition  -nv  -O data/rna/MNC-A_R1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/003/SRR8758323/SRR8758323_1.fastq.gz
wget --content-disposition  -nv  -O data/rna/MNC-A_R2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/003/SRR8758323/SRR8758323_2.fastq.gz

# ADT experiment
mkdir -p data/adt
wget --content-disposition  -nv  -O data/adt/MNC-A-ADT_R1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/005/SRR8758325/SRR8758325_1.fastq.gz
wget --content-disposition  -nv  -O data/adt/MNC-A-ADT_R2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/005/SRR8758325/SRR8758325_2.fastq.gz

# HTO experiment
mkdir -p data/hto
wget --content-disposition  -nv  -O data/hto/MNC-A-HTO_R1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/007/SRR8758327/SRR8758327_1.fastq.gz
wget --content-disposition  -nv  -O data/hto/MNC-A-HTO_R2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/007/SRR8758327/SRR8758327_2.fastq.gz

Map the reads with alevin

We first run alevin to map the reads from the different experiments; the RNA data, the ADT data and the HTO data.
Below, take note of the use of alevin’s flexible specification of read, barcode and UMI geometry to extract the appropriate regions of each read for mapping under each technology. We perform the mapping here in sketch mode. First we map the RNA-seq data:

salmon alevin -l ISR -i grch38_splici_idx -1 data/rna/*_R1.fastq.gz \
  -2 data/rna/*_R2.fastq.gz \
  --read-geometry 2[1-end] --bc-geometry 1[1-16] --umi-geometry 1[17-26] \
  -o rna_mapping -p 16 --sketch

Then the ADT data.

salmon alevin -l ISR -i adt_index -1 data/adt/*_R1.fastq.gz \
  -2 data/adt/*_R2.fastq.gz \
  --read-geometry 2[1-15] --bc-geometry 1[1-16] --umi-geometry 1[17-26] \
  -o adt_mapping -p 16 --sketch

Finally, the HTO data.

salmon alevin -l ISR -i hto_index -1 data/hto/*_R1.fastq.gz \
  -2 data/hto/*_R2.fastq.gz \
  --read-geometry 2[1-15] --bc-geometry 1[1-16] --umi-geometry 1[17-26] \
  -o hto_mapping -p 16 --sketch

Quantifying the features and RNA using alevin-fry

Now that we have generated the RAD files for all of the different data types, we’ll quantify each of them with alevin-fry.

We’ll start with the familiar RNA-seq, following the standard steps of generate-permit-list, collate and quant.

alevin-fry generate-permit-list -d fw -i rna_mapping -o rna_quant -k

alevin-fry collate -r rna_mapping -i rna_quant -t 16

alevin-fry quant -m refdata-gex-GRCh38-2020-A/t2g_3col.tsv -i rna_quant -o rna_quant/crlike -r cr-like -t 16 --use-mtx

where refdata-gex-GRCh38-2020-A is the path to the directory where you generated the splici referece transcriptome. Note : Since we are using the splici transcriptome, we make use here of the 3-column format transcript to gene tsv file (passed via the -m option to quant) to allow us to differentiate between UMIs that likely originate from spliced or unspliced molecules.

Now we can do the same thing for the ADT and HTO data, but first we’ll need to “spoof” the relevant transcript-to-gene mapping files for these data types. Here, there is no meaningful distinction between transcripts and genes; reads map to feature tags and each feature tag is independnet. Hence, the mapping will just map each feature name to itself.

awk '{print $1"\t"$1;}' adt.tsv > t2g_adt.tsv

awk '{print $1"\t"$1;}' hto.tsv > t2g_hto.tsv

Now, we quantify the ADT data.

alevin-fry generate-permit-list -d fw -i adt_mapping -o adt_quant -k

alevin-fry collate -r adt_mapping -i adt_quant -t 16

alevin-fry quant -m t2g_adt.tsv -i adt_quant -o adt_quant/crlike -r cr-like -t 16 --use-mtx

Finally, we quantify the HTO data.

alevin-fry generate-permit-list -d fw -i hto_mapping -o hto_quant -k

alevin-fry collate -r hto_mapping -i hto_quant -t 16

alevin-fry quant -m t2g_hto.tsv -i hto_quant -o hto_quant/crlike -r cr-like -t 16 --use-mtx

At this point, we have mapped and quantified all of the data. Each of the relevant output directories rna_quant/crlike, adt_quant/crlike and hto_quant/crlike have the corresponding count matrices for the different experiments. Next, we’ll load the data into R to perform a basic analysis and visualization.

Loading data into R

First, we load the packages we need, and set the random seed.

suppressPackageStartupMessages({
    library(devtools)
    library(ggplot2)
    library(SingleCellExperiment)
    library(Seurat)
})

# set the seed 
set.seed(271828)

This is the function we’ll use to load our quantification matrices into R.

#' Read alevin-fry quantifications into a SingleCellExperiment object
load_fry <- function(frydir, which_counts = c('S', 'A'), verbose = FALSE) {
  suppressPackageStartupMessages({
    library(rjson)
    library(Matrix)
    library(SingleCellExperiment)
  })
  
  # read in metadata
  qfile <- file.path(frydir, "quant.json")
  if (!file.exists(qfile)) {
    qfile <- file.path(frydir, "meta_info.json")
  }
  meta_info <- fromJSON(file = qfile)
  ng <- meta_info$num_genes
  usa_mode <- meta_info$usa_mode
  
  if (usa_mode) {
    if (length(which_counts) == 0) {
      stop("Please at least provide one status in 'U' 'S' 'A' ")
    }
    if (verbose) {
      message("processing input in USA mode, will return ", paste(which_counts, collapse = '+'))
    }
  } else if (verbose) {
    message("processing input in standard mode, will return spliced count")
  }

  # read in count matrix
  af_raw <- readMM(file = file.path(frydir, "alevin", "quants_mat.mtx"))
  # if usa mode, each gene gets 3 rows, so the actual number of genes is ng/3
  if (usa_mode) {
    if (ng %% 3 != 0) {
      stop("The number of quantified targets is not a multiple of 3")
    }
    ng <- as.integer(ng/3)
  }
  
  # read in gene name file and cell barcode file
  afg <- read.csv(file.path(frydir, "alevin", "quants_mat_cols.txt"), 
                  strip.white = TRUE, header = FALSE, nrows = ng, 
                  col.names = c("gene_ids"), row.names = 1)
  afc <- read.csv(file.path(frydir, "alevin", "quants_mat_rows.txt"), 
                  strip.white = TRUE, header = FALSE,
                  col.names = c("barcodes"), row.names = 1)

  # if in usa_mode, sum up counts in different status according to which_counts
  if (usa_mode) {
    rd <- list("S" = seq(1, ng), "U" =  seq(ng + 1, 2 * ng),
               "A" =  seq(2 * ng + 1, 3 * ng))
    o <- af_raw[, rd[[which_counts[1]]], drop = FALSE]
    for (wc in which_counts[-1]) {
      o <- o + af_raw[, rd[[wc]], drop = FALSE]
    }
  } else {
    o <- af_raw
  }
  
  # create SingleCellExperiment object
  sce <- SingleCellExperiment(list(counts = t(o)),
                              colData = afc,
                              rowData = afg
  )
  sce
}

Load the quantifications

We’ll load each of the experiments into its own SingleCellExperiemnt object. The RNA features will be loaded from the USA-mode matrix, and we’ll keep the S (confidently-spliced) and A (ambiguous) UMIs for each gene.

hto_q <- load_fry('hto_quant/crlike', verbose=TRUE)
## processing input in standard mode, will return spliced count
adt_q <- load_fry('adt_quant/crlike', verbose=TRUE)
## processing input in standard mode, will return spliced count
rna_q <- load_fry('rna_quant/crlike', verbose=TRUE)
## processing input in USA mode, will return S+A

Next, we’ll intersect the different barcodes obtained from the assays to find a common set of cells.

common.cells <- intersect(colnames(rna_q), colnames(adt_q))
common.cells <- intersect(common.cells , colnames(hto_q))
length(common.cells)
## [1] 19634

Because we’ve named each gene with its Entrez ID, but for later analysis we’ll want the gene names, we perform the mapping here (before we put the data into a Seurat object). The gene id to name mapping was obtained directly from our GTF file, but you can find the tsv file that results here.

gid_to_gname <- read.table('gid_to_gname.tsv')
rownames(rna_q) <- gid_to_gname$V2[match(rownames(rna_q), gid_to_gname$V1)]

Now that everything is prepared as we need, we create the Seurat object using the RNA data and, add ADT / HTO data as a separate assay to the object for the common cells.

object <- CreateSeuratObject(counts(rna_q)[, which(colnames(rna_q) %in% common.cells)])
## Warning: Non-unique features (rownames) present in the input matrix, making
## unique
## Warning: The following arguments are not used: row.names
object[["ADT"]] <- CreateAssayObject(counts = counts(adt_q)[, which(colnames(adt_q) %in% common.cells)])
object[["HTO"]] <- CreateAssayObject(counts = counts(hto_q)[, which(colnames(hto_q) %in% common.cells)])
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

Visualization

Processing Sample Hashing data (HTOs) to demultiplex samples

We start by first performing some basic quality checks and filtering of the noisy barcodes on the HTO data. Later, we perform centered log-ratio (CLR) normalization for the HTO data.

# remove HTO protein aggregation cells
VlnPlot(object, features = c("nCount_HTO"), log = TRUE )

object <- subset(object, subset = nCount_HTO < 2e4)

# HTO Normalization
DefaultAssay(object) <- "HTO"
object <- NormalizeData(object,   normalization.method = "CLR", margin = 2, verbose = F)
VariableFeatures(object) <- rownames(object[["HTO"]]@counts)
object <- ScaleData(object, assay = "HTO", verbose = F)

We use the function HTODemux from Seurat to demultiplex cellular barcodes based on the HTO enrichment and assign single cells back to their samples of origin

object <- HTODemux(object, assay = "HTO", positive.quantile = 0.99, verbose = F)

Idents(object) <- "HTO_classification.global"
VlnPlot(object, features = "nCount_HTO", pt.size = 0.1, log = TRUE)

VlnPlot(object, features = "nCount_RNA", pt.size = 0.1, log = TRUE)