24 May 2018 • on alevin

Understanding Output

Output Format:

Traditionally single-cell tools dumps the Cell-v-Gene count matrix in various formats. Typical 10x experiment can range form hundreds to tens of thousand of cells – resulting in huge size of count-matrix. Although, this itself is an open area of research but by default alevin dumps a per-cell level gene-count matrix in a binary-compressed format with the row and column indexes in a separate file.

A typical run of alevin will generate 3 files:

quants_mat.gz – Compressed count matrix.
quants_mat_cols.txt – Column Header (Gene-ids) of the matrix.
quants_mat_rows.txt – Row Index (CB-ids) of the matrix.

Python Import

Alevin’s per-cell level count matrix can be imported directly into a python dataframe. The following python3 dependency is needed and can be installed using pip as follows:

pip3 install vpolo

Alternatively, vpolo can be installed using conda as follows:

conda install -c bioconda vpolo

If you want the python code to understand the schema of the binary output, it can be found here

Reading Binary format

Alevin’s quants_mat.gz file can be easily imported to generate Cell v Gene dataframe using the following piece of python3 code:

from vpolo.alevin import parser
alevin_df = parser.read_quants_bin("<PATH TO ALEVIN output folder>")

Pandas Compatibility:

Alevin can also dump the count-matrix in a human readable – matrix market exchange (MTX) format, if given flag --dumpMtx. The new output file quants_mat.mtx.gz can be easily read-in using the following python-pandas based function:

from scipy.io import mmread
import gzip
with gzip.open(mat_file) as f:
  alevin_df = mmread("<PATH TO ALEVIN quants_mat.mtx.gz file>").toarray()

The above function takes the path of the directory specified in –output directory ( while running the alevin-quantification tool ). Note that if alevin was finished successfully then the above specified directory will contain a directory inside of it with the name alevin . The function returns a python dataframe for the count matrix with Cellular-Barcodes as the index and Gene-id as the header which can be used for the downstream analysis.

Reading Alevin UMI graphs

When run with the command line flag --dumpUmiGraph alevin generates the per cell level Parsimonious Umi Graph (PUGs) into a compressed binary file cell_umi_graphs.gz. The file can be prased and generate a per cell level dot graph file using following python script:

from vpolo.alevin import parser
parser.read_umi_graph("<PATH to alevin output folder>", "<output folder>")

Reading Alevin’s bfh (big freaking hash) file

When run with the command line flag --dumpBfh alevin generates the big hash file used by alevin for performing the deduplication, along with the mapped equivalence classes. The file can be parsed by the following python script:

from vpolo.alevin import parser
parser.read_bfh("<PATH to alevin output folder>", "<PATH to t2g file>")