My Profile Photo

alevin-fry-tutorials


Tutorials for using the alevin-fry single-cell RNA-seq pipeline


Processing 10x Flex Gene Expression data with simpleaf

Simpleaf is a command line toolkit written in rust that exposes a unified, simplified interface for processing single-cell and single-nucleus sequencing data using the alevin-fry pipeline. In this tutorial, you will learn how to use simpleaf multiplex-quant to process 10x Genomics Flex Gene Expression (Fixed RNA Profiling) data.

What is 10x Flex?

10x Chromium Fixed RNA Profiling (commonly called “Flex”) is a probe-based single-cell gene expression assay designed for FFPE (formalin-fixed, paraffin-embedded) and fixed frozen samples. Unlike standard 3’ or 5’ scRNA-seq, which sequences poly-A captured mRNA fragments against a full transcriptome reference, Flex uses a panel of short (~50bp) probes that hybridize directly to target transcripts.

A key feature of Flex is sample multiplexing: each sample in a multiplexed experiment is labeled with a unique probe barcode (8bp for Flex v1, 10bp for Flex v2), allowing multiple samples to be pooled in a single GEM well and sequenced together. Each read therefore carries two barcodes: a cell barcode (16bp, identifying the cell) and a sample/probe barcode (identifying which sample the cell belongs to). The simpleaf Flex pipeline handles demultiplexing, barcode correction, and quantification in a single command.

Installation

The recommended way to install simpleaf is via conda:

conda create -n simpleaf -y -c bioconda -c conda-forge simpleaf
conda activate simpleaf

This tutorial requires simpleaf version 0.22.0 or later, which includes the multiplex-quant command. (In earlier versions, this command was called flex-quant.)

Configuration

Simpleaf requires the ALEVIN_FRY_HOME environment variable to point to a directory where it stores configuration and cached resources.

# Create a working directory
AF_SAMPLE_DIR=$PWD/flex_tutorial
mkdir -p $AF_SAMPLE_DIR
cd $AF_SAMPLE_DIR

# Set the required environment variable
export ALEVIN_FRY_HOME="$AF_SAMPLE_DIR/af_home"

# Configure simpleaf (discovers piscem, alevin-fry, salmon in your PATH)
simpleaf set-paths

# Increase open file limit for piscem's intermediate files
ulimit -n 2048

Downloading the test data

For this tutorial, we use the 4-plex Human Colorectal Cancer and Normal Adjacent Colon dataset from 10x Genomics. This is a 4-plex Flex v1 experiment containing four samples:

Sample Tissue Prep Probe BC
BC001 Colorectal Manual ACTTTAGG
BC002 Colorectal Octo AACGGGAA
BC003 Kidney Manual AGTAGGCT
BC004 Kidney Octo ATGTTGAC

The compressed FASTQ files are approximately 25GB. Please ensure you have about 50GB of free disk space to complete this tutorial (FASTQs + index + output).

# Download the FASTQ files
FASTQ_DIR="$AF_SAMPLE_DIR/data/fastqs"
mkdir -p $FASTQ_DIR

wget -qO- https://cf.10xgenomics.com/samples/cell-exp/9.0.0/4plex_human_colorectal_kidney_scFFPE_multiplex/4plex_human_colorectal_kidney_scFFPE_multiplex_fastqs.tar \
  | tar xf - --strip-components=1 -C $FASTQ_DIR

Running simpleaf multiplex-quant

The simpleaf multiplex-quant command handles the entire pipeline in a single invocation: building a probe index, mapping reads, generating the permit list with barcode correction, collating, and quantifying.

First, gather the FASTQ file paths:

# Collect R1 and R2 files (comma-separated, sorted)
reads1_pat="_R1_"
reads2_pat="_R2_"

reads1="$(find -L ${FASTQ_DIR} -name "*${reads1_pat}*" -type f | sort | paste -sd, -)"
reads2="$(find -L ${FASTQ_DIR} -name "*${reads2_pat}*" -type f | sort | paste -sd, -)"

Now run the pipeline:

simpleaf multiplex-quant \
  --chemistry 10x-flexv1-gex-3p \
  --organism human \
  --reads1 $reads1 \
  --reads2 $reads2 \
  --output $AF_SAMPLE_DIR/flex_output \
  -t 8

Here is what each argument does:

  • --chemistry 10x-flexv1-gex-3p selects the Flex v1 protocol. This tells simpleaf the read geometry (1{b[16]u[12]x:}2{r[50]x[18]s[8]x:}), and enables automatic fetching of the cell barcode whitelist, sample barcode list (with rotation collapsing), and probe set.
  • --organism human selects the human probe set (Chromium_Human_Transcriptome_Probe_Set_v1.1.0_GRCh38-2024-A). Use mouse for the mouse probe set.
  • --reads1 / --reads2 are comma-separated lists of R1 and R2 FASTQ files.
  • --output is the output directory.
  • -t 8 uses 8 threads.

What happens under the hood

The multiplex-quant command executes the following steps:

  1. Probe set conversion: Downloads the probe set CSV from the 10x registry, converts it to a FASTA reference file and a transcript-to-gene (t2g) mapping.
  2. Index building: Builds a piscem index from the probe FASTA (k=23 by default).
  3. Read mapping: Maps reads with piscem using the Flex geometry, which extracts the cell barcode (R1:1-16), UMI (R1:17-28), biological sequence (R2:1-50), and sample barcode (R2:69-76).
  4. Permit list generation: Loads the 737K cell barcode whitelist, counts cell barcodes per sample (only whitelist matches), applies a minimum read threshold, and rescues non-matching barcodes via 1-edit-distance correction.
  5. Collation: Groups records by (sample, cell barcode) for efficient per-cell processing.
  6. Quantification: Resolves UMIs per gene using the cr-like algorithm, producing a combined count matrix for all samples.

On the 4-plex dataset with 8 threads, expect the pipeline to take approximately 5-10 minutes (depending on hardware), with mapping being the longest step.

Understanding the output

The output directory has the following structure:

flex_output/
  af_map/                  # Mapping output (RAD file)
  af_quant/
    alevin/
      quants_mat.mtx       # Count matrix (Matrix Market format)
      quants_mat_rows.txt  # Barcode labels (SAMPLE_cellbarcode)
      quants_mat_cols.txt  # Gene IDs
    featureDump.txt        # Per-cell QC metrics with sample_name column
    collation_manifest.bin # Sample-to-chunk mapping
    ...
  probe_conversion/        # Probe FASTA, t2g map, metadata
  simpleaf_multiplex_quant_info.json  # Pipeline metadata and timing

The count matrix contains all samples combined. Each barcode in quants_mat_rows.txt is prefixed with the sample name:

BC001_GGAGCGAAGCGAATCG
BC001_GCATGCTTCGGCTAGA
...
BC002_GGGAGTGAGGAAGGTA
...

The featureDump.txt file includes a sample_name column for easy per-sample filtering:

CB              sample_name  CorrectedReads  MappedReads  ...
GGAGCGAAGCGAATCG  BC001       551890          551890      ...

You can quickly check how many cells were found per sample:

awk -F'_' '{print $1}' flex_output/af_quant/alevin/quants_mat_rows.txt | sort | uniq -c | sort -rn

Using a pre-built index

If you have already built a probe index (or want to use a custom probe set or sample barcode list), you can skip the automatic resource fetching:

simpleaf multiplex-quant \
  --chemistry 10x-flexv1-gex-3p \
  --organism human \
  --index /path/to/probe_index/index \
  --probe-set /path/to/probe_set.csv \
  --sample-bc-list /path/to/sample_barcodes.txt \
  --reads1 $reads1 \
  --reads2 $reads2 \
  --output $AF_SAMPLE_DIR/flex_output_custom \
  -t 8

The --index flag points to the piscem index prefix (not the directory). The --probe-set provides a probe set CSV (which is also used to generate the t2g mapping). The --sample-bc-list is a 3-column TSV mapping observed probe barcodes to canonical barcodes and sample names (accounting for barcode rotation).

Splitting into per-sample matrices

The combined matrix is convenient for joint analysis, but many downstream workflows expect one matrix per sample. We provide a helper script that splits the combined output:

# Download the script
wget -qO split_flex_samples.py \
  https://combine-lab.github.io/alevin-fry-tutorials/scripts/split_flex_samples.py

# Split into per-sample directories
python split_flex_samples.py \
  --input flex_output/af_quant/alevin \
  --output per_sample_matrices

This creates one subdirectory per sample, each containing its own quants_mat.mtx, quants_mat_rows.txt (with plain cell barcodes, no sample prefix), and quants_mat_cols.txt.

A note on sample counts: Because alevin-fry processes the raw (unfiltered) permit list, the output includes all cell barcodes passing the minimum read threshold — not just those from the expected multiplexed samples. For this 4-plex dataset, you will see output directories for more than 4 sample barcode groups (up to 15 or so). This is expected: the Flex probe barcode panel contains 16 BC codes and 16 AB codes, and low levels of background or index-hopping signal can produce a small number of barcodes assigned to non-target sample groups.

The four true samples — BC001 (~165K cells), BC002 (~174K cells), BC003 (~112K cells), and BC004 (~154K cells) — dominate overwhelmingly. These correspond to the four samples for which 10x provides CellRanger output. The remaining sample groups typically contain only a handful of cells (often fewer than 100) and can be safely filtered out in downstream analysis. Providing the raw output allows users to apply their own filtering criteria and to inspect cross-sample contamination rates if desired.

Loading results in Python

Here is how to load the combined matrix in scanpy and annotate cells with their sample identity:

import scanpy as sc
import pandas as pd

# Read the combined count matrix
adata = sc.read_mtx("flex_output/af_quant/alevin/quants_mat.mtx").T

# Load gene and barcode names
genes = pd.read_csv("flex_output/af_quant/alevin/quants_mat_cols.txt",
                     header=None)[0].values
barcodes = pd.read_csv("flex_output/af_quant/alevin/quants_mat_rows.txt",
                        header=None)[0].values

adata.var_names = genes
adata.obs_names = barcodes

# Parse sample identity from barcode prefixes
adata.obs["sample"] = [bc.split("_")[0] for bc in adata.obs_names]
adata.obs["cell_barcode"] = [bc.split("_", 1)[1] for bc in adata.obs_names]

# Basic QC filtering
sc.pp.filter_cells(adata, min_counts=50)

print(adata)
print(adata.obs["sample"].value_counts())

If you prefer to work with individual sample matrices, use the split_flex_samples.py script from the previous section, then load each sample separately with sc.read_mtx().

Conclusion

In this tutorial, you learned how to process 10x Flex Gene Expression data using the simpleaf multiplex-quant command. The pipeline handles the complete workflow from FASTQs to a combined, sample-annotated count matrix in a single command, including probe barcode demultiplexing, cell barcode correction against the 10x whitelist, and UMI-level gene quantification.

For more information, see: