Simpleaf is a command-line toolkit written in Rust that exposes a unified and simplified interface for processing scRNA-seq datasets using the alevin-fry ecosystem of tools. Simpleaf version 0.15.0 introduced the redesigned simpleaf workflow sub-program, which executes complex and highly configurable single-cell data processing workflows. A workflow consists of Simpleaf commands and shell commands described by a simple user-provided Jsonnet program. One can fetch ready-made Simpleaf workflow templates from our protocol library, protocol estuary, using the simpleaf workflow get program, or develop custom workflows to achieve specific tasks. This tutorial discusses how to build a valid Simpleaf workflow template from scratch. If you are interested in running an existing workflow, please check our tutorial about running Simpleaf workflows.
Here, we assume that Simpleaf is available in our operating environment. If it is not, you can follow this tutorial to install and set up Simpleaf. In practice, a Jsonnet executable is also handy for debugging templates as we write them. You can install it by following these instructions or use Jrsonnet, a Rust implementation of Jsonnet.
Preparation
Before we start, let’s set up the environment.
export AF_SAMPLE_DIR=$PWD/simpleaf_workdir
mkdir $AF_SAMPLE_DIR
cd $AF_SAMPLE_DIR
Basics
First of all, let’s talk about some basic terminology in Simpleaf Workflow:
There are three phases of a Simpleaf Workflow:
- Workflow Template: A Jsonnet program that has the potential to generate a valid Simpleaf Workflow but with required fields missing.
- Instantiated Workflow Template or just Instantiated Template: A Workflow Template that contains enough information to generate a valid Workflow Manifest.
- Workflow Manifest: A JSON record that contains the definition of each command in the workflow and the information needed to invoke them.
All published workflows in protocol estuary are Workflow Templates. Users must fill in the missing required fields to turn a template into an Instantiated Template, from which Simpleaf can generate a valid Workflow Manifest. In a valid Workflow Manifest, all command records must be parameterized completely and correctly.
Now, let’s talk about how Simpleaf Workflow works:
- The program simpleaf workflow run takes an Instantiated Template as the input and generates a Workflow Manifest from it. Any Workflow Manifest is also a valid Instantiated Template of itself and can be used as the input of simpleaf workflow run.
- Simpleaf requires the command records in the resulting Workflow Manifest, including Simpleaf commands and external shell commands, to follow a specific format, because it always parses the Workflow Manifest (either evaluated from the provided Workflow Template or provided directly) to find and assemble the actual commands included in a workflow.
- When parsing an Instantiated Template, Simpleaf automatically passes many external variables and library searching paths to the Jsonnet engine. One important variable is the Simpleaf Workflow utility library, which contains many useful functions for developing a Simpleaf Workflow template. We should take advantage of these variables when developing our own Workflow Templates.
Developing a Workflow Template
This section discusses how to develop a Simpleaf Workflow template, starting with no Jsonnet features and then moving on to essential and advanced ones.
1. Define a Basic Workflow Template
As we discussed, Simpleaf Workflow requires that in the Workflow Manifest, the command records, including Simpleaf commands and external shell commands, must follow a specific format. The format consists of two parts: identity fields and argument field(s). The identity fields of these two types of commands are the same, but the argument fields are not.
There are three identity fields.
- program_name: This field has to record a valid program name in the user’s operating environment.
  - For Simpleaf programs, this will be the name we use to call the Simpleaf program.
    - For simpleaf index commands, this field must be program_name: "simpleaf index".
    - For simpleaf quant commands, this field must be program_name: "simpleaf quant".
  - For external shell commands, this field must represent a valid executable. For example, for a record representing an awk command:
    - if the awk program we want to call is not in our PATH variable, this field must be the (quoted) path to the desired awk executable, for example program_name: "/usr/bin/awk".
    - if awk is in our PATH, this can be the path to the executable, or just program_name: "awk", because in this case awk is accessible in both ways.
- step: This field records which step this command constitutes in the workflow. This is the only allowed integer in the Simpleaf Workflow Manifest. Simpleaf will invoke the commands in the order of their step.
- active: This field indicates if the command is active in the workflow. Simpleaf will regard all commands without this field as active commands. This is the only allowed boolean field in the Workflow Manifest. Simpleaf will skip (neither parse nor invoke) all inactive commands (with "active": false).
As for the argument field(s), they are in a different format in Simpleaf command records and external command records:
- In a Simpleaf command record, each argument-value pair must be provided as a field. The field name represents the argument name, and the value represents the argument value (or true or an empty string "" if the argument doesn’t take a value). For example, {"--threads": 16, "--use-piscem": true, "--ref-type": "splici"}. We recommend using the full names of arguments. For example, we suggest using --threads instead of -t.
- In an external command record, all arguments must be listed in an ordered array as the arguments field. For example, the arguments of the command ls -l -h . have to be provided as {"arguments": ["-l", "-h", "."]}.
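As a minimal sketch of the record format described above, the following Jsonnet object places the identity fields and both styles of argument fields side by side. The record names (list_files, build_index, skipped_step) and all file paths are hypothetical and chosen only for illustration.

```jsonnet
{
  // External command record: all arguments go into one ordered array.
  list_files: {
    program_name: "ls",
    step: 1,
    arguments: ["-l", "-h", "."],
  },
  // Simpleaf command record: every argument is its own field.
  build_index: {
    program_name: "simpleaf index",
    step: 2,
    active: true,  // optional; a record without this field is treated as active
    "--ref-seq": "toy_ref.fa",  // hypothetical input path
    "--output": "./workflow_output",
    "--threads": 16,
  },
  // Inactive record: following the rules above, it is neither parsed nor invoked.
  skipped_step: {
    program_name: "awk",
    step: 3,
    active: false,
    arguments: ["{print $1}", "some_file.tsv"],  // hypothetical file name
  },
}
```

Setting active to false is a convenient way to keep a step in a template while temporarily leaving it out of a run.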
With the format in mind, we can now create a simple workflow template! Here we will define a workflow that first creates a toy reference sequence file and calls simpleaf index to build a salmon index on this reference. The actual workflow template, instantiated template, and the resulting workflow manifest are identical and are shown in the following code chunk.
In the following code chunk, we create a workflow template Jsonnet program file and call simpleaf workflow run to evaluate it.
cd $AF_SAMPLE_DIR
# write the _workflow manifest_ into a file
cat <<EOF > workflow_template.jsonnet
{
    "create_ref" : {
        "program_name": "echo",
        "step": 1,
        "arguments": ["\">a\\nAACCAACACAAAC\\n>b\\nCCACAAACAACACAAC\"", ">", "./workflow_output/toy_ref.fa"],
    },
    "simpleaf_index": {
        "program_name": "simpleaf index",
        "step": 2,
        "--ref-seq": "./workflow_output/toy_ref.fa",
        "--output": "./workflow_output",
        "--kmer-length": "5"
    },
}
EOF
simpleaf workflow run --template workflow_template.jsonnet -o ./workflow_output
The above workflow template (or instantiated template, or workflow manifest) defines a workflow that consists of two steps:
- We create a FASTA file with two sequence records using the echo command. The first record is a, and its sequence is AACCAACACAAAC; the second record is b, and its sequence is CCACAAACAACACAAC. We use the > symbol to redirect the output of the echo command to a file named toy_ref.fa in the ./workflow_output directory.
- We call simpleaf index to build a Salmon index on the toy_ref.fa file. We set the k-mer length to 5 and the output directory to ./workflow_output.
# print toy reference
$ cat $AF_SAMPLE_DIR/workflow_output/toy_ref.fa
>a
AACCAACACAAAC
>b
CCACAAACAACACAAC
# simpleaf index output directory
$ ls $AF_SAMPLE_DIR/workflow_output
index index_info.json simpleaf_index_log.json toy_ref.fa
2. The Recommended Layout in a Simpleaf Workflow Template
Although any Jsonnet program is a valid input for Simpleaf Workflow as long as it can generate a valid workflow manifest, we recommend dividing a workflow template into four sections:
- meta_info: This section contains the meta information of the workflow, such as the name of the workflow, the version of the workflow, etc. It also contains the meta-variables, such as threads, output, and use_piscem, which are used by the utility library to assign values to some arguments in the workflow.
- fast_config: This section contains the minimum information needed to generate a valid workflow manifest in the standard and recommended way. We call this section fast_config because it is the fastest way to generate a valid workflow manifest. However, the resulting workflow manifest might not be the most appropriate one for the users’ purpose.
- advanced_config: This section contains the information for generating a valid workflow manifest in a more advanced and comprehensive way. Usually, this section contains alternatives to the configurations defined in fast_config and allows the users to fine-tune the workflow.
- workflow: This section contains the logic for generating the workflow manifest using the information provided in fast_config and advanced_config. Usually, it includes calls to utility library functions that combine the above three sections, or some external command records for preprocessing the data.
3. A Workflow Template with More Jsonnet Features
Following these best practices, we can improve the above workflow template in several ways:
- For the meta-variables that will be used throughout the workflow, such as threads, output, and use_piscem, we can define them in the meta_info section and use the utility library to assign them to the appropriate arguments in the workflow.
- To ease the users’ burden, we can include the most essential arguments, such as the path to the reference FASTA file, in the fast_config section and put it at the beginning of the workflow template. By doing this, the users can generate a valid workflow manifest by only completing this section.
- For some optional but important arguments, such as the --kmer-length argument, we can put them in the advanced_config section and assign them a default value.
- Finally, we can assemble the above three sections in the workflow section to generate the command records of the workflow.
Here, we assume you have some basic knowledge of Jsonnet. Uncertain? Check their excellent tutorial! Let’s see what the redesigned workflow template looks like. Notice that the following example represents an uninstantiated workflow template because the value of the required argument --ref-seq is missing (set to null). To instantiate the template, we need to fill in the output field in meta_info and replace the null in fast_config with the path to the toy_ref.fa file generated in the previous step, or to any other valid FASTA file. Notice that Jsonnet doesn’t require quotes around field names, as long as the field names are valid identifiers.
# local variable definition starts with the `local` keyword
local template = {
    // meta_info
    meta_info : {
        template_name : "example workflow",
        output : null,
    },
    fast_config : {
        ref_seq : null,
    },
    advanced_config : {
        arguments : {
            "--kmer-length" : 5,
        },
    },
    workflow : {
        simpleaf_index : {
            program_name : "simpleaf index",
            step : 2,
            # `$` refers to the outermost object, the . operator accesses a field of an object,
            # and the + operator concatenates the two strings
            "--output" : $.meta_info.output + "/simpleaf_index",
            "--ref-seq" : std.get($.fast_config, "ref_seq"), # std.get() is a function call
        } + $.advanced_config.arguments # the + operator merges the two objects
        ,
        create_ref : {
            program_name : "echo",
            step : 1,
            arguments : ["\">a\\nAACCAACACAAAC\\n>b\\nCCACAAACAACACAAC\"", ">", $.meta_info.output + "/toy_ref.fa"],
        },
    },
}; # local variable definition ends with a semicolon
# We output the template object
template
In this example:
- We define a local variable template to store the workflow template. The template variable is a Jsonnet object that contains four fields: meta_info, fast_config, advanced_config, and workflow. Each of these fields is also a Jsonnet object.
- We specify meta-information in the meta_info field, such as the name of the workflow and the output directory. Notice that we use the null value to represent the missing value of the output field. We will fill in this field when instantiating the template.
- We specify essential parameters in the fast_config field. In this example, we only need to provide the path to the reference FASTA file (other than meta_info.output) to generate a valid workflow manifest.
- To make the workflow more flexible, we put some optional but important arguments in the advanced_config field. In this example, we assign a default value to the --kmer-length argument.
- In the workflow section, we assemble the above three sections to generate the command records of the workflow. We used the following features from Jsonnet:
  - We used the dot operator, ., to access the fields of an object, as in Python. For example, meta_info.output accesses the output field of the meta_info object. The same can be achieved by calling the get() function from the Jsonnet standard library, std.get(meta_info, "output"). For details about the Jsonnet std library, please check this page.
  - We used the + operator to combine the base object defined on the left-hand side with the arguments field in advanced_config. This operator can also be used to concatenate strings and to add two numbers.
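The operators discussed above can be tried in isolation. The following standalone sketch uses two made-up objects, base and overrides, which are not part of any Simpleaf template, to show how ., std.get(), and + behave.

```jsonnet
local base = {
  program_name: "simpleaf index",
  step: 2,
  "--kmer-length": 31,
};
local overrides = { "--kmer-length": 5, "--threads": 16 };

{
  // Object + object: fields are merged and the right-hand side wins on conflicts,
  // so "--kmer-length" becomes 5 and "--threads" is added.
  merged: base + overrides,
  // String + string: concatenation, giving "./workflow_output/simpleaf_index".
  path: "./workflow_output" + "/simpleaf_index",
  // Number + number: arithmetic sum, giving 3.
  total: 1 + 2,
  // Two equivalent ways to read a field; both give 2.
  dot_access: base.step,
  get_access: std.get(base, "step"),
  // std.get() can also return a default when the field is absent; this gives false.
  with_default: std.get(base, "--sparse", false),
}
```

Because the right-hand side of + takes precedence, placing user-adjustable arguments on the right is one way to let them override defaults defined in a base record.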
There are many more useful features in Jsonnet. To have a systematic understanding of Jsonnet, please check the tutorials provided by Jsonnet.
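One such feature is the +: field syntax, which merges into a nested object instead of replacing it. The sketch below uses it to instantiate a reduced version of the template without editing the template itself. Whether to instantiate a template this way or by editing the null values directly is a matter of preference; the reduced template and the paths here are assumptions made for illustration.

```jsonnet
// A reduced, uninstantiated template (illustration only).
local template = {
  meta_info: { template_name: "example workflow", output: null },
  fast_config: { ref_seq: null },
  workflow: {
    simpleaf_index: {
      program_name: "simpleaf index",
      step: 1,
      // `$` is late-bound: it sees the values of the final, merged object.
      "--output": $.meta_info.output + "/simpleaf_index",
      "--ref-seq": $.fast_config.ref_seq,
    },
  },
};

// Instantiate by overlaying the missing values.
// `+:` keeps the other fields of meta_info (such as template_name) intact.
template + {
  meta_info+: { output: "./workflow_output" },
  fast_config+: { ref_seq: "toy_ref.fa" },
}
```

After evaluation, "--output" resolves to "./workflow_output/simpleaf_index" and "--ref-seq" to "toy_ref.fa", because the references through $ are resolved against the merged object rather than the original template.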
4. A Workflow Template Using the Simpleaf Workflow Utility Library
To ease the development of a workflow template, the Simpleaf team provides a utility library that contains many useful functions for building Simpleaf’s commands in the manifest at different resolutions. This library is passed to the template evaluation process as an external variable __utils, and can be loaded in any template by adding local utils = std.extVar("__utils"); at the beginning of the template. Then, we can access any function in the library by calling utils.function_name().
In this section, we will show an example of calling some functions from utils to build a simpleaf index command smoothly. Because this example is a little bit long, we will divide it into multiple parts, each representing a section discussed in the previous section.
Firstly, we will define the layout of the template file. We first load the utils library, and then we define a local variable template to store the workflow template. The template variable is a Jsonnet object that contains five fields: meta_info, fast_config, advanced_config, intermediate_steps, and workflow. Each of these fields is also a Jsonnet object. Here we use a section called intermediate_steps to take some intermediate configurations that will be used in the workflow section. We will show the usage of this section later.
local utils = std.extVar("__utils"); # import the utility library
# The actual definition of each section is omitted here
local template = {
    meta_info : {},
    fast_config : {},
    advanced_config : {},
    intermediate_steps : {},
    workflow : {},
};
# We output the template object
template
Next, we will define the meta_info section. This section holds the meta information of the workflow, such as the name and the version of the workflow. It also contains the meta-variables, such as threads, output, and use_piscem, which are used in later sections. The following code chunk shows only the meta_info section; the lines before and after it are omitted.
meta_info : {
template_name : "example workflow",
output : null,
threads : 16,
use_piscem : false,
}
We then design the fast_config section. In this section, we ask for a genome FASTA file and a gene annotation GTF file, the minimum input for building a spliced+intronic (splici) reference, which contains the spliced transcript sequences and the intronic sequences of each gene. Details about splici can be found in supplementary section S2 of the alevin-fry paper. We also include an rlen field that takes the numeric read length of the sequencing data, which is an optional argument for building the splici reference.
fast_config : {
splici : {
fasta : null,
gtf : null,
rlen : 91,
}
}
Now we turn to the advanced_config section. Here we list the other reference types supported by simpleaf index, together with all optional arguments and flags used for fine-tuning the behavior of simpleaf index. Specifically, in addition to splici, we also support three options: spliceu, the spliced+unspliced reference, which contains the spliced transcripts and the gene body (unspliced transcript) of each gene; direct_ref, which builds the index directly from the provided (usually transcriptome) FASTA file; and existing_index, which tells the template to use an existing index instead of building a new one by calling simpleaf index.
advanced_config : {
    simpleaf_index : {
        ref_type : {
            type : "splici", # splici arguments are in the `fast_config` section.
            spliceu : {
                gtf : null,
                fasta : null,
            },
            direct_ref : {
                ref_seq : null,
            },
            existing_index : {
                index : null,
                t2g_map : null,
            },
        },
        arguments : {
            active : true,
            "--spliced" : null,
            "--unspliced" : null,
            "--dedup" : false,
            "--sparse" : false,
            "--keep-duplicates" : false,
            "--gff3-format" : false,
            "--threads" : $.meta_info.threads,
            "--use-piscem" : $.meta_info.use_piscem,
            "--overwrite" : $.meta_info.use_piscem,
            "--kmer-length" : 31,
            "--minimizer-length" : 7,
            "--decoy-paths" : null,
        },
        output : $.meta_info.output + "/simpleaf_index",
    }
}
Finally, we will define the workflow section containing a simpleaf command record called simpleaf_index.
workflow : {
simpleaf_index : utils.simpleaf_index(
1,
utils.ref_type($.advanced_config.simpleaf_index.ref_type + $.fast_config),
$.advanced_config.simpleaf_index.arguments,
$.advanced_config.simpleaf_index.output,
),
}
In this section, we utilized two utils functions, utils.ref_type and utils.simpleaf_index. The utils.ref_type function takes a reference type object, like the one defined in the advanced_config section, and returns a valid reference type object. The input object must have a type field specifying which reference type will be used, and a field named by the value of the type field that includes the required arguments for building the reference. Here we merged the ref_type object from advanced_config with fast_config because fast_config contains the arguments used for building a splici reference. If we specify splici : {gtf: "genes.gtf", fasta: "genome.fa", rlen: 91}, after merging, the input object to utils.ref_type looks like the following:
{
type : "splici",
splici : {
fasta : "genome.fa",
gtf : "genes.gtf",
rlen : 91,
},
spliceu : {
gtf : null,
fasta : null,
},
direct_ref : {
ref_seq : null,
},
existing_index : {
index : null,
t2g_map : null,
},
},
The output of the utils.ref_type function will be passed to the utils.simpleaf_index function as the second argument. It looks like the following:
{
type :: "splici",
arguments :: {gtf: "genes.gtf", fasta: "genome.fa", rlen: 91},
"--ref-type" : "splici",
"--fasta" : "genome.fa",
"--gtf" : "genes.gtf",
"--rlen" : 91,
}
Notice that the type and arguments fields are defined using double colons. This is the Jsonnet syntax for defining hidden fields. Hidden fields are not included in the output manifest, but they are accessible while the template is being evaluated.
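To make the role of hidden fields concrete, here is a simplified stand-in for the behavior described above. The helper my_ref_type is hypothetical and is not the implementation of utils.ref_type; it only assumes, as in the example output, that each key of the selected reference type maps to a flag of the same name.

```jsonnet
// Hypothetical helper (NOT the real utils.ref_type): pick the field named by
// `type` and turn each of its keys into a command-line flag.
local my_ref_type(o) =
  local args = o[o.type];
  {
    type:: o.type,       // hidden: usable during evaluation, absent from the output
    arguments:: args,    // hidden
    "--ref-type": o.type,
  } + {
    ["--" + k]: args[k]
    for k in std.objectFields(args)
  };

local merged = {
  type: "splici",
  splici: { fasta: "genome.fa", gtf: "genes.gtf", rlen: 91 },
  spliceu: { gtf: null, fasta: null },
};
local ref = my_ref_type(merged);

{
  // Only the visible fields are manifested:
  // "--fasta", "--gtf", "--ref-type", and "--rlen".
  manifested: ref,
  // Hidden fields can still be read while the template is evaluated: "splici".
  hidden_type: ref.type,
  // std.objectFields() lists visible fields only.
  visible_fields: std.objectFields(ref),
}
```

Hidden fields let a later function, such as the one that builds the final command record, inspect which reference type was chosen without that bookkeeping leaking into the manifest.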
Finally, the utils.simpleaf_index function takes (1) a step number, (2) a reference type object returned by utils.ref_type(), (3) optional arguments, and (4) the output path, and returns a valid simpleaf index command record, or an empty record if we use existing_index as the reference type. If we instantiate this template using meta_info.output : "./workflow_output" and the splici parameters shown above, our final manifest will look like the following:
{
    "meta_info" : {
        "template_name" : "example workflow",
        "output" : "./workflow_output",
        "threads" : 16,
        "use_piscem" : false
    },
    "workflow" : {
        "simpleaf_index" : {
            "program_name" : "simpleaf index",
            "step" : 1,
            "active" : true,
            "--ref-type" : "splici",
            "--fasta" : "genome.fa",
            "--gtf" : "genes.gtf",
            "--rlen" : 91,
            "--dedup" : false,
            "--sparse" : false,
            "--keep-duplicates" : false,
            "--gff3-format" : false,
            "--threads" : 16,
            "--use-piscem" : false,
            "--overwrite" : false,
            "--kmer-length" : 31,
            "--minimizer-length" : 7,
            "--output" : "./workflow_output/simpleaf_index"
        }
    }
}
Notice that all flags with false will be ignored by Simpleaf when invoking the commands.
5. Utilizing Built-in Variables and Custom Library Search Paths in Custom Templates
When parsing a workflow template, Simpleaf automatically provides useful external variables to it. We can receive these variables in our templates using the std.extVar() function from the Jsonnet std library. Currently, Simpleaf provides the following variables and library search paths when parsing a template. For each variable, we show the code that receives it in a template:
- std.extVar("__utils"): The __utils variable represents the Simpleaf workflow utility library, which contains useful functions for developing a Simpleaf workflow template. Usually, a template takes this variable at the beginning by adding local utils = std.extVar("__utils");.
- std.extVar("__output"): The __output variable represents the --output argument provided from the command line. To use this variable, we need to add local output = std.extVar("__output"); at the beginning of the template.
- std.extVar("__validate"): The __validate variable is set to false in simpleaf workflow get and true in other programs, like simpleaf workflow run. Usually, this variable is used to turn validation of the template, for example checks for missing values, on or off. To use this variable, we need to add local validate = std.extVar("__validate"); at the beginning of the template.
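As a hedged sketch of how these variables might be combined, the template below takes the output directory from __output and reports a missing reference only when validation is switched on. It assumes that __output is passed as a string and __validate as a boolean; the ref_seq variable and the error message are hypothetical.

```jsonnet
local output = std.extVar("__output");      // assumed to be a string
local validate = std.extVar("__validate");  // assumed to be a boolean

// Hypothetical required input that the user is expected to fill in.
local ref_seq = null;

{
  meta_info: { output: output },
  workflow: {
    simpleaf_index: {
      program_name: "simpleaf index",
      step: 1,
      "--output": output + "/simpleaf_index",
      // Fail with a readable message only when validation is requested;
      // otherwise leave the value as null so the template can still be evaluated.
      "--ref-seq":
        if validate && ref_seq == null
        then error "ref_seq is required but missing"
        else ref_seq,
    },
  },
}
```

When debugging outside Simpleaf, the same variables can be supplied by hand, for example with the --ext-str and --ext-code options of the jsonnet command-line tool.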
Build a Workflow Template for Processing 10X Chromium 3’ v3 Data
In the above example, we discussed how to build a Simpleaf workflow template containing a simpleaf index command. In this section, we will show how to add a simpleaf quant command to the workflow template to process 10X Chromium 3’ v3 data. As we apply the same idea as in the previous section, we will briefly discuss the procedure of building a simpleaf quant command record using the utility library, and then show the final workflow template.
Overall, simpleaf quant requires at least five file/directory paths:
- The path to the index directory.
- The path to the transcript-to-gene mapping file.
- The path to the output directory.
- The comma-separated paths to the Reads1 FASTQ files.
- The comma-separated paths to the Reads2 FASTQ files.
As we can predict the first two paths using the output of the simpleaf index command and set the output directory according to the meta_info.output variable, we only need to ask for the paths to the FASTQ files in the fast_config section.
1. Fast Configuration Section
In this section, we need to ask for read1 and read2 FASTQ files. Therefore, we add one field in addition to the splici field discussed above:
fast_config: {
splici: {fasta: null, gtf: null, rlen: 91},
map_reads: {
reads1: null, # comma-separated paths to the Reads1 FASTQ files
reads2: null, # comma-separated paths to the Reads2 FASTQ files
}
}
2. Advanced Configuration Section
In this section, we’ll cover two crucial argument groups for running simpleaf quant: mapping mode (map_type) and cell filtration strategy (cell_filt_type). map_type offers two choices: map_reads, which takes the paths to the Reads1 and Reads2 FASTQ files, along with the index and the transcript-to-gene mapping file for read mapping, or existing_mappings, which simply takes the path to existing mapping results. Because map_reads is already defined in fast_config, we include only existing_mappings in the map_type field, along with a type field indicating the chosen mode. Similarly, we define a cell_filt_type field with five possible cell filtration options.
advanced_config : {
simpleaf_quant : {
map_type : {
type : "map_reads",
existing_mappings : {
map_dir : null,
t2g_map : null,
},
},
cell_filt_type : {
type : "unfiltered_pl",
unfiltered_pl : true,
knee : false,
expect_cells : null,
forced_cells : null,
explicit_pl : null,
},
arguments : {
active : true,
"--min-reads" : 10,
"--resolution" : "cr-like",
"--expected-ori" : "fw",
"--threads" : $.meta_info.threads,
"--chemistry" : "10xv3",
"--use-selective-alignment" : false,
"--use-piscem" : $.meta_info.use_piscem,
"--struct-constraints" : false,
"--ignore-ambig-hits" : false,
"--no-poison" : false,
"--skipping-strategy" : null,
"--max-ec-card" : null,
"--max-hit-occ" : null,
"--max-hit-occ-recover" : null,
"--max-read-occ" : null,
},
output : $.meta_info.output + "/simpleaf_quant",
}
}
3. Workflow Section
Finally, with all the information in the above three sections, we can call the utils.simpleaf_quant function, together with the utils.map_type and utils.cell_filt_type functions to generate a valid simpleaf quant command record.
workflow : {
simpleaf_quant : utils.simpleaf_quant(
2,
utils.map_type($.advanced_config.simpleaf_quant.map_type + $.fast_config, $.workflow.simpleaf_index),
utils.cell_filt_type($.advanced_config.simpleaf_quant.cell_filt_type),
$.advanced_config.simpleaf_quant.arguments,
$.advanced_config.simpleaf_quant.output,
),
}
In this section, we utilized three utils functions: utils.map_type, utils.cell_filt_type, and utils.simpleaf_quant. The utils.map_type function takes a map type object, like the one defined in the advanced_config section, and returns a valid map type object. The input object must have a type field specifying which map type will be used, and a field named by the value of the type field that includes the required arguments for mapping. Here we merged the map_type object from advanced_config with fast_config because fast_config contains the map_reads arguments used for mapping. If we specify map_reads : {reads1: "reads1.fastq.gz", reads2: "reads2.fastq.gz"}, after merging, the input object to utils.map_type looks like the following:
{
type : "map_reads",
map_reads : {
reads1 : "reads1.fastq.gz",
reads2 : "reads2.fastq.gz",
},
existing_mappings : {
map_dir : null,
t2g_map : null,
},
},
The output of the utils.map_type function will be passed to the utils.simpleaf_quant function as the second argument. It looks like the following:
{
    type :: "map_reads",
    arguments :: {reads1: "reads1.fastq.gz", reads2: "reads2.fastq.gz"},
    "--reads1" : "reads1.fastq.gz",
    "--reads2" : "reads2.fastq.gz",
    "--index" : "./workflow_output/simpleaf_index/index",
    "--t2g-map" : "./workflow_output/simpleaf_index/index/t2g_3col.tsv",
}
Notice that the paths to the index directory and the transcript-to-gene mapping file are taken from the simpleaf_index command record, which is the second argument of utils.map_type.
Then, we pass the cell_filt_type object defined in advanced_config to the utils.cell_filt_type function to generate a cell_filt_type object containing the simpleaf quant argument/flag for the selected cell filtering method. We will not show the details here, as they are very similar to those of map_type shown above.
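As a rough illustration of the idea only, a simplified stand-in for such a function might select the field named by type and turn it into the corresponding flag. The helper my_cell_filt_type below is hypothetical and is not the implementation of utils.cell_filt_type; it also assumes that each option name maps to its flag by replacing underscores with hyphens.

```jsonnet
// Hypothetical helper (NOT the real utils.cell_filt_type).
local my_cell_filt_type(o) = {
  type:: o.type,          // hidden bookkeeping field
  arguments:: o[o.type],  // hidden bookkeeping field
  // Computed field name: "unfiltered_pl" becomes "--unfiltered-pl".
  ["--" + std.strReplace(o.type, "_", "-")]: o[o.type],
};

my_cell_filt_type({
  type: "unfiltered_pl",
  unfiltered_pl: true,
  knee: false,
  expect_cells: null,
  forced_cells: null,
  explicit_pl: null,
})
// manifests as {"--unfiltered-pl": true}
```

Only the option selected by type reaches the output, so the remaining options can stay in the template as documented alternatives.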
Finally, we call the utils.simpleaf_quant function to generate a simpleaf quant command record using the results of the function calls introduced above. After evaluation, the record in the manifest will look like the following:
"simpleaf_quant": {
"program_name": "simpleaf quant",
"step": 2,
"active": true,
"--chemistry": "10xv3",
"--expected-ori": "fw",
"--index": "./workflow_output/simpleaf_index/index",
"--min-reads": 10,
"--output": "./workflow_output/simpleaf_quant",
"--reads1": "reads1.fastq.gz",
"--reads2": "reads2.fastq.gz",
"--resolution": "cr-like",
"--t2g-map": "./workflow_output/simpleaf_index/index/t2g_3col.tsv",
"--threads": 16,
"--unfiltered-pl": true,
"--use-piscem": false
}
Finally, if we assemble the above sections, including those for simpleaf index and simpleaf quant together, we will get the final workflow template for processing 10X Chromium 3’ v3 data as shown in the protocol estuary GitHub repository.
Summary
In conclusion, we can quickly build complex and highly-configurable simpleaf workflow templates to guide simpleaf to run complicated single-cell data processing workflows. We can take advantage of the cool features and the standard library in Jsonnet, as well as the simpleaf workflow utility library when building a simpleaf workflow template. If you have further questions, please do not hesitate to start a GitHub issue!