alevin-fry-tutorials


Tutorials for using the alevin-fry single-cell RNA-seq pipeline


Developing a custom simpleaf workflow from scratch

Simpleaf is a command-line toolkit written in Rust that exposes a unified and simplified interface for processing scRNA-seq datasets using the alevin-fry ecosystem of tools. Simpleaf version 0.15.0 introduced the redesigned simpleaf workflow sub-program, which executes complex, highly configurable single-cell data processing workflows consisting of Simpleaf commands and shell commands, all described by a simple user-provided Jsonnet program. One can fetch ready-made Simpleaf workflow templates from our protocol library, protocol estuary, using the simpleaf workflow get program, or develop custom workflows to achieve specific tasks. This tutorial discusses how to build a valid Simpleaf workflow template from scratch. If you are interested in running an existing workflow, please check our tutorial about running Simpleaf workflows.

Here, we assume that Simpleaf is available in our operating environment. If it is not, you can follow this tutorial to install and set up Simpleaf. In practice, a Jsonnet executable is also handy for debugging templates as we write them. You can install it by following these instructions or use Jrsonnet, a Rust implementation of Jsonnet.

Preparation

Before we start, let’s set up the environment.

export AF_SAMPLE_DIR="$PWD/simpleaf_workdir"
mkdir -p "$AF_SAMPLE_DIR"
cd "$AF_SAMPLE_DIR"

Basics

First of all, let’s talk about some basic terminology in Simpleaf Workflow:

There are three phases of a Simpleaf Workflow:

  1. Workflow Template: A Jsonnet program that has the potential to generate a valid Simpleaf Workflow but with required fields missing.
  2. Instantiated Workflow Template, or just Instantiated Template: A Workflow Template that contains enough information to generate a valid Workflow Manifest.
  3. Workflow Manifest: A JSON record that contains the definition of each command in the workflow and the information needed to invoke them.

All published workflows in protocol estuary are Workflow Templates. Users must fill in the missing information to turn a template into an Instantiated Template. In a valid Workflow Manifest, all command records must be parameterized completely and correctly.

Now, let’s talk about how Simpleaf Workflow works:

  1. The program simpleaf workflow run takes an Instantiated Template as the input and generates a Workflow Manifest from it. Any Workflow Manifest is also a valid Instantiated Template of itself and can be used as the input of simpleaf workflow run.
  2. Simpleaf requires the command records in the resulting Workflow Manifest, both Simpleaf commands and external shell commands, to follow a specific format. It finds and assembles the actual commands of a workflow by parsing the Workflow Manifest, whether that manifest was evaluated from a template or provided directly.
  3. When parsing an Instantiated Template, Simpleaf automatically passes many external variables and library searching paths to the Jsonnet engine. One important variable is the Simpleaf Workflow utility library, which contains many useful functions for developing a Simpleaf Workflow template. We should take advantage of these variables when developing our own Workflow Templates.
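The relationship between the three phases can be sketched in plain Jsonnet. The snippet is illustrative only: the overlay object and the file name in it are assumptions made for this example, and editing the null field directly in the template file achieves the same result.

```jsonnet
# Workflow Template: a required field is still missing (null).
local template = {
    fast_config : { ref_seq : null },
    workflow : {
        simpleaf_index : {
            program_name : "simpleaf index",
            step : 1,
            "--ref-seq" : $.fast_config.ref_seq,
            "--output" : "./workflow_output/simpleaf_index",
        },
    },
};

# Hypothetical values that fill in the missing field.
local overlay = { fast_config+ : { ref_seq : "toy_ref.fa" } };

# Instantiated Template: evaluating it yields the Workflow Manifest as JSON.
template + overlay
```

Because Jsonnet resolves $.fast_config.ref_seq against the merged object, the --ref-seq field of the evaluated record becomes "toy_ref.fa" once the overlay is applied.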

Developing a Workflow Template

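This section discusses how to develop a Simpleaf Workflow template in stages: first as plain JSON with no Jsonnet features, then with essential Jsonnet features, and finally with the more advanced features offered by the utility library and built-in variables.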

1. Define a Basic Workflow Template

As we discussed, Simpleaf Workflow requires that in the Workflow Manifest, the command records, including Simpleaf commands and external shell commands, must follow a specific format. The format consists of two parts: identity fields and argument field(s). The identity fields of these two types of commands are the same, but the argument fields are not.

There are three identity fields.

  • program_name: This field has to record a valid program name in the user’s operating environment.

    • For Simpleaf programs, this will be the name we used to call the Simpleaf programs.
      • For simpleaf index commands, this field must be program_name: "simpleaf index".
      • For simpleaf quant commands, this field must be program_name: "simpleaf quant".
    • For external shell commands, this field must represent a valid executable. For example, for a record representing an awk command:
      • if the awk program we want to call is not in our PATH variable, this field must be the (quoted) path to the desired awk executable, for example program_name: "/usr/bin/awk".
      • if the awk is in our PATH, this can be the path to the executable, or just program_name: "awk", because in this case, awk is accessible in both ways.
  • step: This field records which step this command constitutes in the workflow. This is the only allowed integer in the Simpleaf Workflow Manifest. Simpleaf will invoke the commands by order of their step.

  • active: This field indicates if the command is active in the workflow. Simpleaf will regard all commands without this field as active commands. This is the only allowed boolean field in the Workflow Manifest. Simpleaf will skip (neither parse nor invoke) all inactive commands (with "active": false).

As for the argument field(s), they are in a different format in Simpleaf command records and external command records:

  • In a Simpleaf command record, each argument-value pair must be provided as a field. The field name represents the argument name, and the value represents the argument value (or true or an empty string "" if the argument doesn’t take a value). For example, {"--threads": 16, "--use-piscem": true, "--ref-type": "splici"}. We recommend using the full names of arguments. For example, we suggest using --threads instead of -t.

  • In an external command record, all arguments must be listed in an ordered array as the arguments field. For example, the argument of the command ls -l -h . has to be provided as {"arguments": ["-l", "-h", "."]}.
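As a compact illustration of the rules above, the following sketch places one record of each kind side by side. The record names, file names, and paths are placeholders chosen for this example.

```jsonnet
{
    # Simpleaf command record: identity fields plus one field per argument.
    simpleaf_index : {
        program_name : "simpleaf index",
        step : 1,
        "--ref-seq" : "toy_ref.fa",
        "--output" : "./workflow_output/simpleaf_index",
        "--threads" : 16,
        "--use-piscem" : true, # a flag that takes no value
    },
    # External command record: identity fields plus an ordered `arguments` array.
    # It is marked inactive, so Simpleaf neither parses nor invokes it.
    list_output : {
        program_name : "ls",
        step : 2,
        active : false,
        arguments : ["-l", "-h", "./workflow_output"],
    },
}
```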

With the format in mind, we can now create a simple workflow template! Here we define a workflow that first creates a toy reference sequence file and then calls simpleaf index to build a salmon index on this reference. Because this template needs no further input, the workflow template, the instantiated template, and the resulting workflow manifest are identical.

In the following code chunk, we write the workflow template to a Jsonnet file and call simpleaf workflow run to evaluate it.

cd $AF_SAMPLE_DIR

# write the workflow template into a file
cat <<EOF > workflow_template.jsonnet
{
    "create_ref" : {
        "program_name": "echo",
        "step": 1,
        "arguments": ["\">a\\nAACCAACACAAAC\\n>b\\nCCACAAACAACACAAC\"", ">", "toy_ref.fa"],
    },
    "simpleaf_index": {
        "program_name": "simpleaf index",
        "step": 2,
        "--ref-seq": "toy_ref.fa",
        "--output": "./workflow_output",
        "--kmer-length": "5"
    },
}
EOF

simpleaf workflow run --template workflow_template.jsonnet -o ./workflow_output

The above workflow template (or instantiated template, or workflow manifest) defines a workflow that consists of two steps:

  1. We create a FASTA file with two sequence records using the echo command. The first record is a, and its sequence is AACCAACACAAAC, and the second record is b, and its sequence is CCACAAACAACACAAC. We use the > symbol to redirect the output of the echo command to a file named toy_ref.fa.

  2. We call simpleaf index to build a Salmon index on the toy_ref.fa file. We set the kmer length as 5 and the output directory as ./workflow_output.

# print toy reference
$ cat $AF_SAMPLE_DIR/toy_ref.fa
>a
AACCAACACAAAC
>b
CCACAAACAACACAAC

# simpleaf index output directory
$ ls $AF_SAMPLE_DIR/workflow_output
index  index_info.json  simpleaf_index_log.json

Although any Jsonnet program is a valid input for Simpleaf Workflow as long as it can generate a valid workflow manifest, we recommend dividing a workflow template into four sections:

  • meta_info: This section contains the meta information of the workflow, such as the name of the workflow, the version of the workflow, etc. It also contains the meta-variables, such as threads, output, and use_piscem, which are used by the utility library to assign values to some arguments in the workflow.
  • fast_config: This section contains the minimum information needed to generate a valid workflow manifest in the standard and recommended way. We call this section fast_config because it is the fastest way to generate a valid workflow manifest. However, the resulting workflow manifest might not be the most appropriate one for the users’ purpose.
  • advanced_config: This section contains the information for generating a valid workflow manifest in a more advanced and comprehensive way. Usually, this section contains the alternative configurations defined in the fast_config and allows the users to fine-tune the workflow.
  • workflow: This section contains the logic for generating the workflow manifest using the information provided in meta_info, fast_config, and advanced_config. Usually, it includes calls to the utility library functions that combine the above three sections, and possibly some external command records for preprocessing the data.

2. A Workflow Template with More Jsonnet Features

Following these best practices, we can improve the above workflow template in several ways:

  1. For the meta-variables that will be used throughout the workflow, such as threads, output, and use_piscem, we can define them in the meta_info section and use the utility library to assign them to the appropriate arguments in the workflow.
  2. To ease the users’ burden, we can include the most essential arguments, such as the path to the reference FASTA file, in the fast_config section and put it at the beginning of the workflow template. By doing this, the users can generate a valid workflow manifest by only completing this section.
  3. For some optional but important arguments, such as the --kmer-length argument, we can put them in the advanced_config section and assign them a default value.
  4. Finally, we can assemble the above three sections in the workflow section to generate the command records of the workflow.

Here, we assume you have some basic knowledge of Jsonnet. Uncertain? Check their excellent tutorial! Let’s see what the redesigned workflow template looks like. Notice that the following example is an uninstantiated workflow template because the required ref_seq field, which provides the value of --ref-seq, is missing (it is null). To instantiate the template, we need to fill in the output field in meta_info and replace the null in fast_config with the path to the toy_ref.fa file generated in the previous step, or to any other valid FASTA file. Notice also that Jsonnet doesn’t require quotes around field names as long as they are valid identifiers.

# local variable definition starts with the `local` keyword
local template = {
    # meta_info
    meta_info : {
        template_name : "example workflow",
        output : null,
    },
    fast_config : {
        ref_seq : null,
    },
    advanced_config : {
        arguments : {
            "--kmer-length" : 5,
        },
    },
    workflow : {
        simpleaf_index : {
            program_name : "simpleaf index",
            step : 2,
            # `$` refers to the outermost object (the template),
            # the . operator accesses a field of an object,
            # and the + operator joins the two strings
            "--output" : $.meta_info.output + "/simpleaf_index",
            # std.get() is a function call
            "--ref-seq" : std.get($.fast_config, "ref_seq"),
        } + $.advanced_config.arguments, # the + operator merges the two objects
        create_ref : {
            program_name : "echo",
            step : 1,
            arguments : ["\">a\\nAACCAACACAAAC\\n>b\\nCCACAAACAACACAAC\"", ">", $.meta_info.output + "/toy_ref.fa"],
        },
    },
}; # local variable definition ends with a semicolon

# We output the template object
template

In this example:

  • We define a local variable template to store the workflow template. The template variable is a Jsonnet object that contains four fields: meta_info, fast_config, advanced_config, and workflow. Each of these fields is also a Jsonnet object.

  • We specify meta-information in the meta_info field, such as the name of the workflow and the output directory. Notice that we use the null value to represent the missing value of the output field. We will fill in this field when instantiating the template.

  • We specify essential parameters in the fast_config field. In this example, we only need to provide the path to the reference FASTA file (other than meta_info.output) to generate a valid workflow manifest.

  • To make the workflow more flexible, we put some optional but important arguments in the advanced_config field. In this example, we assign a default value to the --kmer-length argument.

  • In the workflow section, we assemble the above three sections to generate the command records of the workflow. We used the following features from Jsonnet:

    • We used the dot operator, ., to access the fields of an object, as in Python. For example, meta_info.output accesses the output field of the meta_info object. Inside a nested object, the outermost object of the template is available as $, so the full path is written $.meta_info.output. The same lookup can be done with the get() function from the Jsonnet standard library, std.get($.meta_info, "output"). For details about the Jsonnet std library, please check this page.

    • We used the + operator to combine the base object defined on the LHS and the arguments field in advanced_config. This operator can also be used to concatenate strings and calculate the sum of two numbers.
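The operators mentioned in this list can be tried in isolation. The following standalone sketch uses made-up objects (base, extra, and config are names chosen for this example) and can be evaluated with any Jsonnet executable.

```jsonnet
local base = { program_name : "simpleaf index", step : 1 };
local extra = { "--kmer-length" : 5 };
local config = { ref_seq : "toy_ref.fa" };

{
    merged_record : base + extra,                          # object + object
    joined_path : "./workflow_output" + "/simpleaf_index", # string + string
    sum : 1 + 2,                                           # number + number
    via_dot : config.ref_seq,                              # field access with .
    via_get : std.get(config, "ref_seq"),                  # same lookup with std.get
    with_default : std.get(config, "rlen", 91),            # fallback for a missing field
}
```

When both sides of an object merge define the same field, the value on the right-hand side wins, which is what lets a later object override defaults set earlier.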

There are many more useful features in Jsonnet. To have a systematic understanding of Jsonnet, please check the tutorials provided by Jsonnet.

3. A Workflow Template Using the Simpleaf Workflow Utility Library

To ease the development of a workflow template, the Simpleaf team provides a utility library that contains many useful functions for building Simpleaf’s commands in the manifest at different resolutions. This library is passed to the template evaluation process as an external variable __utils, and can be loaded in any template by adding local utils = std.extVar("__utils"); at the beginning of the template. Then, we can access any function in the library by calling utils.function_name().

In this section, we will show an example of calling some functions from utils to build a simpleaf index command smoothly. Because this example is a little bit long, we will divide it into multiple parts, each representing a section discussed in the previous section.

Firstly, we will define the layout of the template file. We first load the utils library, and then we define a local variable template to store the workflow template. The template variable is a Jsonnet object that contains five fields: meta_info, fast_config, advanced_config, intermediate_steps, and workflow. Each of these fields is also a Jsonnet object. Here we use a section called intermediate_steps to take some intermediate configurations that will be used in the workflow section. We will show the usage of this section later.

local utils = std.extVar("__utils"); # import the utility library

# The actual definition of each section is omitted here
local template = {
    meta_info : {},
    fast_config : {},
    advanced_config : {},
    intermediate_steps : {},
    workflow : {},
};

# output the template object
template

Next, we define the meta_info section. It holds the meta information of the workflow, such as the name and the version of the workflow. It also holds the meta-variables, such as threads, output, and use_piscem, which are used in later sections. From here on, each code chunk shows only the section under discussion; the surrounding lines of the template are omitted.

meta_info : {
    template_name : "example workflow",
    output : null,
    threads : 16,
    use_piscem : false,
}

We then design the fast_config section. In this section, we ask for a genome FASTA file and a gene annotation GTF file, the minimum input for building a spliced+intronic (splici) reference, which contains the spliced transcript sequences and the intronic sequences of each gene. Details about splici can be found in the supplementary section S2 of the alevin-fry paper. We also include an rlen field, which takes the numeric read length of the sequencing data and is an optional argument for building the splici reference.

fast_config : {
    splici : {
        fasta : null,
        gtf : null,
        rlen : 91,
    }
}

Now we turn to the advanced_config section. Here we list the other reference types supported by simpleaf index, along with all optional arguments and flags used for fine-tuning the behavior of simpleaf index. Specifically, in addition to splici, we support three more reference types: spliced+unspliced (spliceu), which contains the spliced transcripts and the gene body (unspliced transcript) of each gene; direct_ref, which builds the index directly from the provided (usually transcriptome) FASTA file; and existing_index, which tells the template to use an existing index instead of building a new one by calling simpleaf index.

advanced_config : {
    simpleaf_index : {
        ref_type : {
            type : "splici", # splici arguments are in the `fast_config` section.
            spliceu : {
                gtf : null,
                fasta : null,
            },
            direct_ref : {
                ref_seq : null,
            },
            existing_index : {
                index : null,
                t2g_map : null,
            },
        },
        arguments : {
            active : true,
            "--spliced" : null,
            "--unspliced" : null,
            "--dedup" : false,
            "--sparse" : false,
            "--keep-duplicates" : false,
            "--gff3-format" : false,
            "--threads" : $.meta_info.threads,
            "--use-piscem" : $.meta_info.use_piscem,
            "--overwrite" : false,
            "--kmer-length" : 31,
            "--minimizer-length" : 7,
            "--decoy-paths" : null,
        },
        output : $.meta_info.output + "/simpleaf_index",
    }
}

Finally, we will define the workflow section containing a simpleaf command record called simpleaf_index.

workflow : {
    simpleaf_index : utils.simpleaf_index(
        1, 
        utils.ref_type($.advanced_config.simpleaf_index.ref_type + $.fast_config),
        $.advanced_config.simpleaf_index.arguments,
        $.advanced_config.simpleaf_index.output,
    ),
}

In this section, we used two utils functions, utils.ref_type and utils.simpleaf_index. The utils.ref_type function takes a reference type object, like the one defined in the advanced_config section, and returns a valid reference type object. The input object must have a type field specifying which reference type will be used, and a field named after the value of the type field that holds the arguments required for building that reference. Here we merged the ref_type object from advanced_config with fast_config because fast_config contains the arguments used for building a splici reference. If we specify splici : {gtf: "genes.gtf", fasta: "genome.fa", rlen: 91}, the input object to utils.ref_type looks like the following after merging:

{
    type : "splici", 
    splici : {
        fasta : "genome.fa",
        gtf : "genes.gtf",
        rlen : 91,
    },
    spliceu : {
        gtf : null, 
        fasta : null,
    },
    direct_ref : {
        ref_seq : null, 
    },
    existing_index : {
        index : null,
        t2g_map : null,
    },
}
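The selection rule described above, in which the type field names the sub-object to use, can be reproduced in a few lines of plain Jsonnet. This is only an illustration of the merge and the lookup under the field names used in this tutorial; it is not the implementation of utils.ref_type.

```jsonnet
local ref_type = {
    type : "splici",
    spliceu : { gtf : null, fasta : null },
    direct_ref : { ref_seq : null },
};
local fast_config = {
    splici : { fasta : "genome.fa", gtf : "genes.gtf", rlen : 91 },
};

# Merge the two objects, then pick the sub-object named by `type`.
local merged = ref_type + fast_config;

{
    selected_type : merged.type,
    selected_arguments : merged[merged.type],
}
```

Here selected_arguments evaluates to the splici object from fast_config, because merged.type is "splici" and the bracket syntax looks up a field by a computed name.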

The output of the utils.ref_type function will be passed to the utils.simpleaf_index function as the second argument. It looks like the following:

{
    type :: "splici",
    arguments :: {gtf: "genes.gtf", fasta: "genome.fa", rlen: 91},
    "--ref-type" : "splici",
    "--fasta" : "genome.fa",
    "--gtf" : "genes.gtf",
    "--rlen" : 91,
}

Notice that the type and arguments fields are defined using double colons. This is the Jsonnet syntax for defining hidden fields. Hidden fields are not included in the output manifest, but they are accessible while the manifest is being evaluated.
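A small standalone sketch makes the behavior of hidden fields concrete. The object below is made up for this example and only mimics the shape of the reference type object.

```jsonnet
local ref = {
    type :: "splici",           # hidden: not printed in the output
    "--ref-type" : self.type,   # visible, computed from the hidden field
};

{
    record : ref,               # only "--ref-type" appears in this record
    seen_type : ref.type,       # hidden fields can still be read
}
```

Evaluating this prints a record that contains only "--ref-type", while seen_type still receives the value "splici".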

Finally, the utils.simpleaf_index function takes (1) a step number, (2) a reference type object returned by utils.ref_type(), (3) optional arguments, and (4) the output path, and returns a valid simpleaf index command record, or an empty record if we use existing_index as the reference type. If we instantiate this template using meta_info.output : "./workflow_output" and the splici parameters shown above, our final manifest will look like the following:

{
    "meta_info" : {
        "template_name" : "example workflow",
        "output" : "./workflow_output",
        "threads" : 16,
        "use_piscem" : false
    },
    "workflow" : {
        "simpleaf_index" : {
            "program_name" : "simpleaf index",
            "step" : 1,
            "active" : true,
            "--ref-type" : "splici",
            "--fasta" : "genome.fa",
            "--gtf" : "genes.gtf",
            "--rlen" : 91,
            "--dedup" : false,
            "--sparse" : false,
            "--keep-duplicates" : false,
            "--gff3-format" : false,
            "--threads" : 16,
            "--use-piscem" : false,
            "--overwrite" : false,
            "--kmer-length" : 31,
            "--minimizer-length" : 7,
            "--output" : "./workflow_output/simpleaf_index"
        }
    }
}

Notice that all flags with false will be ignored by Simpleaf when invoking the commands.
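The manifest above also omits the arguments that were null in advanced_config, such as --spliced and --decoy-paths. How the utility library achieves this is not shown in this tutorial. The sketch below is only one way to get the same effect in plain Jsonnet, and drop_nulls is a hypothetical helper, not a function of the utility library.

```jsonnet
# Hypothetical helper: keep only the fields whose value is not null.
local drop_nulls(args) = {
    [name] : args[name]
    for name in std.objectFields(args)
    if args[name] != null
};

drop_nulls({
    "--spliced" : null,
    "--dedup" : false,
    "--threads" : 16,
    "--decoy-paths" : null,
})
```

The result keeps --dedup and --threads and drops the two null arguments. Flags set to false stay in the record, which matches the manifest shown above.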

4. Utilizing Built-in Variables and Custom Library Search Paths in Custom Templates

When parsing a workflow template, Simpleaf automatically provides useful external variables to it. We can easily utilize these variables in our templates by receiving them using the std.extVar() function from the Jsonnet std library. Currently, Simpleaf provides the following variables and library searching paths when parsing a template, and for variables, we will show the code to receive them in our template directly:

  • std.extVar("__utils"): the __utils variable represents the Simpleaf workflow utility library, which contains useful functions for developing a Simpleaf workflow template. Usually, a template will take this variable at the beginning by adding local utils = std.extVar("__utils");.

  • std.extVar("__output"): The __output variable represents the --output argument provided from the command line. To use this variable, we need to add local output = std.extVar("__output"); at the beginning of the template.

  • std.extVar("__validate"): The __validate variable is set to false in simpleaf workflow get and true in other programs, like simpleaf workflow run. Usually, this variable is used to turn the validation of the template on or off, for example, the check for missing values. To use this variable, we need to add local validate = std.extVar("__validate"); at the beginning of the template.
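A minimal sketch of how these variables might be used together follows. It assumes that __output arrives as a string and __validate as a Jsonnet boolean; the ref_seq variable and the error message are hypothetical and exist only for this example.

```jsonnet
local output = std.extVar("__output");
local validate = std.extVar("__validate");

# Hypothetical required input; null means "not provided yet".
local ref_seq = null;

{
    meta_info : { output : output },
    workflow : {
        simpleaf_index : {
            program_name : "simpleaf index",
            step : 1,
            # Fail only when validation is on and the value is missing.
            "--ref-seq" :
                if validate && ref_seq == null
                then error "ref_seq must be provided before running the workflow"
                else ref_seq,
            "--output" : output + "/simpleaf_index",
        },
    },
}
```

To try such a template outside Simpleaf, the same variables can be supplied on the Jsonnet command line, for example with --ext-str __output=./workflow_output and --ext-code __validate=true.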

Build a Workflow Template for Processing 10X Chromium 3’ v3 Data

In the above example, we discussed how to build a Simpleaf workflow template containing a simpleaf index command. In this section, we will show how to add a simpleaf quant command to the workflow template to process 10X Chromium 3’ v3 data. As we apply the same idea as in the previous section, we will briefly discuss the procedure of building a simpleaf quant command record using the utility library, and then show the final workflow template.

Overall, simpleaf quant requires at least five file/directory paths:

  1. The path to the index directory.
  2. The path to the transcript-to-gene mapping file.
  3. The path to the output directory.
  4. The comma-separated paths to the Reads1 FASTQ files.
  5. The comma-separated paths to the Reads2 FASTQ files.

As we can predict the first two paths using the output of the simpleaf index command and set the output directory according to the meta_info.output variable, we only need to ask for the paths to the FASTQ files in the fast_config section.

1. Fast Configuration Section

In this section, we need to ask for read1 and read2 FASTQ files. Therefore, we add one field in addition to the splici field discussed above:

fast_config: {
    splici: {fasta: null, gtf: null, rlen: 91},
    map_reads: {
        reads1: null, # comma-separated paths to the Reads1 FASTQ files 
        reads2: null, # comma-separated paths to the Reads2 FASTQ files
    }
}

2. Advanced Configuration Section

In this section, we cover two crucial argument groups for running simpleaf quant: the mapping mode (map_type) and the cell filtration strategy (cell_filt_type). map_type offers two choices: providing the paths to the Reads1 and Reads2 FASTQ files, along with the index and the transcript-to-gene mapping file, to map the reads, or simply supplying the path to existing mapping results. Since the read paths are requested in fast_config (map_reads), we only include existing_mappings in the map_type field, along with a type field indicating the chosen mode. Similarly, we define a cell_filt_type field with five possible cell filtration options.

advanced_config : {
    simpleaf_quant : {
        map_type : {
            type : "map_reads",
            existing_mappings : {
                map_dir : null,
                t2g_map : null,
            },
        },
        cell_filt_type : {
            type : "cell_filt",

            unfiltered_pl : true,
            knee : false,
            expect_cells : null,
            forced_cells : null,
            explicit_pl : null,
        },
        arguments : {
            active : true,
            "--min-reads" : 10,
            "--resolution" :  "cr-like",
            "--expected-ori" :  "fw",
            "--threads" :  $.meta_info.threads,
            "--chemistry" :  "10xv3",
            "--use-selective-alignment" : false, 
            "--use-piscem" : $.meta_info.use_piscem,
            "--struct-constraints" : false,
            "--ignore-ambig-hits" : false,
            "--no-poison" : false,
            "--skipping-strategy" : null,
            "--max-ec-card" : null,
            "--max-hit-occ" : null,
            "--max-hit-occ-recover" : null, 
            "--max-read-occ" : null,
        },
        output : $.meta_info.output + "/simpleaf_quant",
    }
}

3. Workflow Section

Finally, with all the information in the above three sections, we can call the utils.simpleaf_quant function, together with the utils.map_type and utils.cell_filt_type functions to generate a valid simpleaf quant command record.

workflow : {
    simpleaf_quant : utils.simpleaf_quant(
        2, 
        utils.map_type($.advanced_config.simpleaf_quant.map_type + $.fast_config, $.workflow.simpleaf_index),
        utils.cell_filt_type($.advanced_config.simpleaf_quant.cell_filt_type),
        $.advanced_config.simpleaf_quant.arguments, 
        $.advanced_config.simpleaf_quant.output,
    ),
}

In this section, we used three utils functions: utils.map_type, utils.cell_filt_type, and utils.simpleaf_quant. The utils.map_type function takes a map type object, like the one defined in the advanced_config section, and returns a valid map type object. The input object must have a type field specifying which map type will be used, and a field named after the value of the type field that holds the arguments required for mapping. Here we merged the map_type object from advanced_config with fast_config because fast_config contains the arguments used for mapping. If we specify map_reads : {reads1: "reads1.fastq.gz", reads2: "reads2.fastq.gz"}, the input object to utils.map_type looks like the following after merging:

{
    type : "map_reads", 
    map_reads : {
        reads1 : "reads1.fastq.gz",
        reads2 : "reads2.fastq.gz",
    },
    existing_mappings : {
        map_dir : null,
        t2g_map : null,
    },
}

The output of the utils.map_type function will be passed to the utils.simpleaf_quant function as the second argument. It looks like the following:

{
    type :: "map_reads",
    arguments :: {reads1: "reads1.fastq.gz", reads2: "reads2.fastq.gz"},
    "--reads1" : "reads1.fastq.gz",
    "--reads2" : "reads2.fastq.gz",
    "--index" : "./workflow_output/simpleaf_index/index",
    "--t2g-map" : "./workflow_output/simpleaf_index/index/t2g_3col.tsv",
}

Notice that the paths to the index directory and the transcript-to-gene mapping file are derived from the simpleaf_index command record, which is the second argument of utils.map_type.
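How these two paths relate to the simpleaf index record can be sketched as follows. The sketch assumes, based only on the paths shown above, that the index lives in the index subdirectory of the --output directory and that the mapping file is named t2g_3col.tsv inside it; derived_paths is a hypothetical helper and not part of the utility library.

```jsonnet
local simpleaf_index = {
    program_name : "simpleaf index",
    step : 1,
    "--output" : "./workflow_output/simpleaf_index",
};

# Hypothetical helper: derive the quant inputs from the index record.
local derived_paths(index_record) = {
    "--index" : index_record["--output"] + "/index",
    "--t2g-map" : index_record["--output"] + "/index/t2g_3col.tsv",
};

derived_paths(simpleaf_index)
```

Deriving the paths from the record, instead of typing them twice, keeps the quant step consistent with the index step when the output directory changes.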

Then, we pass the cell_filt_type object defined in advanced_config to the utils.cell_filt_type function to generate a cell_filt_type object containing the simpleaf quant argument/flag for the selected cell filtering method. We will not show the details here as they are very similar to those of map_type shown above.

Finally, we call the utils.simpleaf_quant function to generate a simpleaf quant command record from the results of the function calls introduced above. After evaluation, the record in the manifest will look like the following:

"simpleaf_quant": {
    "program_name": "simpleaf quant",
    "step": 2,
    "active": true,
    "--chemistry": "10xv3",
    "--expected-ori": "fw",
    "--index": "./workflow_output/simpleaf_index/index",
    "--min-reads": 10,
    "--output": "./workflow_output/simpleaf_quant",
    "--reads1": "reads1.fastq.gz",
    "--reads2": "reads2.fastq.gz",
    "--resolution": "cr-like",
    "--t2g-map": "./workflow_output/simpleaf_index/index/t2g_3col.tsv",
    "--threads": 16,
    "--unfiltered-pl": true,
    "--use-piscem": false
}

Finally, if we assemble the above sections, including those for simpleaf index and simpleaf quant together, we will get the final workflow template for processing 10X Chromium 3’ v3 data as shown in the protocol estuary GitHub repository.

Summary

In conclusion, we can quickly build complex and highly-configurable simpleaf workflow templates to guide simpleaf to run complicated single-cell data processing workflows. We can take advantage of the cool features and the standard library in Jsonnet, as well as the simpleaf workflow utility library when building a simpleaf workflow template. If you have further questions, please do not hesitate to start a GitHub issue!