Introduction

The raw data for many single-cell and single-nucleus RNA-seq experiments are publicly available. Certain datasets, however, are used again and again: to demonstrate data processing in tutorials and as benchmarks for novel methods (e.g. for clustering, dimensionality reduction and cell type identification). In particular, 10x Genomics hosts on its website various publicly available datasets that were generated with its technology and processed with its Cell Ranger software.

We have created a Nextflow-based alevin-fry workflow that quantifies single-cell RNA-sequencing data in a single pipeline. The pipeline can be found here. To test this initial pipeline, we have begun to reprocess the publicly available datasets collected from the 10x website. We have focused the initial effort on standard single-cell and single-nucleus 3’ gene-expression data generated using the Chromium v2 and v3 chemistries, but hope to extend the pipeline soon to more complex protocols (e.g. feature barcoding experiments) and process those data as well. These more complex protocols can already be processed with alevin-fry (see the alevin-fry tutorials); they have simply not yet been incorporated into the automated Nextflow-based workflow linked above.

Additionally, to make interacting with these data as simple as possible in standard R and Python environments, we have added functions to the roe and pyroe packages that programmatically download these data and make them available as a SingleCellExperiment or AnnData object, respectively. See the details of the R interface and Python interface below.

Processed datasets

In this section, we list the datasets from the 10x website that we have reprocessed to date, along with links to the quantification results generated by alevin-fry. The quantification results are stored as tar files that can be obtained from the corresponding Box link.

Dataset ID | Dataset name | Link to reprocessed data
1 | 500 Human PBMCs, 3’ LT v3.1, Chromium Controller | Box link
2 | 500 Human PBMCs, 3’ LT v3.1, Chromium X | Box link
3 | 1k PBMCs from a Healthy Donor (v3 chemistry) | Box link
4 | 10k PBMCs from a Healthy Donor (v3 chemistry) | Box link
5 | 10k Human PBMCs, 3’ v3.1, Chromium X | Box link
6 | 10k Human PBMCs, 3’ v3.1, Chromium Controller | Box link
7 | 10k Peripheral blood mononuclear cells (PBMCs) from a healthy donor, Single Indexed | Box link
8 | 10k Peripheral blood mononuclear cells (PBMCs) from a healthy donor, Dual Indexed | Box link
9 | 20k Human PBMCs, 3’ HT v3.1, Chromium X | Box link
10 | PBMCs from EDTA-Treated Blood Collection Tubes Isolated via SepMate-Ficoll Gradient (3’ v3.1 Chemistry) | Box link
11 | PBMCs from Heparin-Treated Blood Collection Tubes Isolated via SepMate-Ficoll Gradient (3’ v3.1 Chemistry) | Box link
12 | PBMCs from ACD-A Treated Blood Collection Tubes Isolated via SepMate-Ficoll Gradient (3’ v3.1 Chemistry) | Box link
13 | PBMCs from Citrate-Treated Blood Collection Tubes Isolated via SepMate-Ficoll Gradient (3’ v3.1 Chemistry) | Box link
14 | PBMCs from Citrate-Treated Cell Preparation Tubes (3’ v3.1 Chemistry) | Box link
15 | PBMCs from a Healthy Donor: Whole Transcriptome Analysis | Box link
16 | Whole Blood RBC Lysis for PBMCs and Neutrophils, Granulocytes, 3’ | Box link
17 | Peripheral blood mononuclear cells (PBMCs) from a healthy donor - Manual (channel 5) | Box link
18 | Peripheral blood mononuclear cells (PBMCs) from a healthy donor - Manual (channel 1) | Box link
19 | Peripheral blood mononuclear cells (PBMCs) from a healthy donor - Chromium Connect (channel 5) | Box link
20 | Peripheral blood mononuclear cells (PBMCs) from a healthy donor - Chromium Connect (channel 1) | Box link
21 | Hodgkin’s Lymphoma, Dissociated Tumor: Whole Transcriptome Analysis | Box link
22 | 200 Sorted Cells from Human Glioblastoma Multiforme, 3’ LT v3.1 | Box link
23 | 750 Sorted Cells from Human Invasive Ductal Carcinoma, 3’ LT v3.1 | Box link
24 | 2k Sorted Cells from Human Glioblastoma Multiforme, 3’ v3.1 | Box link
25 | 7.5k Sorted Cells from Human Invasive Ductal Carcinoma, 3’ v3.1 | Box link
26 | Human Glioblastoma Multiforme: 3’v3 Whole Transcriptome Analysis | Box link
27 | 1k Brain Cells from an E18 Mouse (v3 chemistry) | Box link
28 | 10k Brain Cells from an E18 Mouse (v3 chemistry) | Box link
29 | 1k Heart Cells from an E18 mouse (v3 chemistry) | Box link
30 | 10k Heart Cells from an E18 mouse (v3 chemistry) | Box link
31 | 10k Mouse E18 Combined Cortex, Hippocampus and Subventricular Zone Cells, Single Indexed | Box link
32 | 10k Mouse E18 Combined Cortex, Hippocampus and Subventricular Zone Cells, Dual Indexed | Box link
33 | 1k PBMCs from a Healthy Donor (v2 chemistry) | Box link
34 | 1k Brain Cells from an E18 Mouse (v2 chemistry) | Box link
35 | 1k Heart Cells from an E18 mouse (v2 chemistry) | Box link
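
For readers who download one of the tar files listed above by hand rather than through roe or pyroe, unpacking it could look like the sketch below. It uses only the Python standard library. The function name extract_quant_tar and any file or directory names passed to it are illustrative assumptions, and nothing here presumes a particular layout inside the archives.

```python
import tarfile
from pathlib import Path


def extract_quant_tar(tar_path, out_dir):
    """Unpack a tar archive into out_dir and return the sorted member names.

    Hypothetical helper for illustration; it is not part of roe or pyroe.
    """
    out_dir = Path(out_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    # Mode "r" lets tarfile detect compression (none, gzip, bz2, xz) itself.
    with tarfile.open(tar_path) as tar:
        members = tar.getmembers()
        # Refuse archives whose members would land outside out_dir.
        for member in members:
            target = (out_dir / member.name).resolve()
            if target != out_dir and out_dir not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(out_dir)
    return sorted(member.name for member in members)
```

Calling extract_quant_tar("downloaded.tar", "quant_results") with a real downloaded file (both names are placeholders) would return the list of extracted paths, which is a quick way to see what an archive contains.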

R and Python interface

R interface

To make it easy to download the quantification results of these processed datasets automatically from within a programmatic environment, we provide an interface in the roe R package that exposes several useful functions. To install roe, please follow this.

  • print_available_datasets() prints out the id and name of the available datasets.
  • get_available_dataset_df() returns the details of the available datasets as a dataframe.
  • fetch_processed_quant(dataset_ids) takes a vector of dataset ids as the required input, and fetches the quantification result of these datasets, according to their id, to a local directory. Other optional parameters can be found here.
  • load_processed_quant(dataset_ids) also takes a vector of dataset ids as the required input, fetches the quantification results of these datasets, and then loads them into R as SingleCellExperiment objects. Other optional parameters can be found here.

The return type of both fetch_processed_quant() and load_processed_quant() is a list of ProcessedQuant class instances defined in the roe package. This class stores the details of a processed dataset, including the 10x chemistry, reference, dataset name, the MD5sum of the fastq.tar file, and the link to the preprocessed quantification result. It also contains the path to the fetched and decompressed quantification result, and the SingleCellExperiment object of the quantification if obtained by running load_processed_quant(dataset_ids). Below we show an example:

library(roe)

# Fetch and decompress the quantification results of datasets #1 and #3,
# then load them into R as SingleCellExperiment objects.
# This returns a list of ProcessedQuant objects, one per fetched dataset.
pq_list <- load_processed_quant(c(1, 3))

# get the ProcessedQuant objects for datasets #1 and #3
pq_ds1 <- pq_list[["1"]]
pq_ds3 <- pq_list[["3"]]

# get the name of dataset #1
pq_ds1@dataset_name

# get the link to the site storing the quantification result
pq_ds1@quant_link

# get the path to the quantification result of dataset #1
pq_ds1@quant_path

# get the SingleCellExperiment object
pq_ds1@sce

# Note that in the objects returned by fetch_processed_quant(),
# the sce slot is empty.

For finer control over the paths where fetched files and decompressed folders are saved, refer to the definition of the ProcessedQuant class and its functions. One can fetch, decompress and load the quantification result of a single dataset by running ProcessedQuant(), fetch_quant(), decompress_quant() and then load_quant() in turn, or by running FDL(), which integrates the four functions into a single function.
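
Each ProcessedQuant record stores the MD5 sum of the fastq.tar file, so a reader who has that file locally could compare digests along the lines of the sketch below. This is a generic use of Python's hashlib module, not code from roe or pyroe. The helper names md5_of_file and matches_md5 are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use bounded for large tar files.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_md5):
    """Compare a file's MD5 digest with an expected hex string, ignoring case."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A mismatch would suggest that the local file differs from the one that was processed, for example because of an incomplete download.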

Python interface

The pyroe Python package provides functions and class methods analogous to those described above for fetching and loading the quantification results of these datasets. It can be installed via pip using the command:

pip install pyroe

or via conda using the command:

conda install -c bioconda pyroe

Thanks to the flexibility of Python, we also offer a command-line interface (CLI), pyroe fetch-quant, for fetching and then decompressing the quantification results of any number of available datasets. The only required input is the id of one or more available datasets. The complete list of dataset names and ids is included in the help message (run pyroe fetch-quant -h). Multiple dataset ids, separated by spaces, can be given to fetch the quantification results of several datasets at once. For example, to fetch and decompress the quantification results of datasets #1, #3 and #6, run the command:

pyroe fetch-quant 1 3 6

Alternatively, start Python and run:

import pyroe

# fetch, decompress and load the quantification results of datasets #1, #3 and #6
pq_dict = pyroe.load_processed_quant([1, 3, 6])

# get the ProcessedQuant objects for datasets #1 and #3
pq_ds1 = pq_dict[1]
pq_ds3 = pq_dict[3]

# get the dataset name
pq_ds1.dataset_name

# get the link to the site storing the quantification result
pq_ds1.quant_link

# get the path to the quantification result
pq_ds1.quant_path

# get the AnnData object
pq_ds1.anndata