Simpleaf is a command-line toolkit written in Rust that exposes a unified and simplified interface for processing scRNA-seq datasets using the alevin-fry ecosystem of tools. Since version 0.15.0, Simpleaf includes the redesigned simpleaf workflow
sub-program, which provides the ability to execute complex and highly-configurable single-cell data processing workflows consisting of Simpleaf commands and shell commands described by a simple user-provided Jsonnet program. One can fetch ready-made Simpleaf workflow templates from our protocol library, protocol estuary, using the simpleaf workflow get
program, or develop custom workflows to achieve specific tasks. This tutorial will discuss how to build a valid Simpleaf workflow template from scratch. If you are interested in running an existing workflow, please check our tutorial about running Simpleaf workflows.
Here, we assume that Simpleaf is available in your operating environment. If it is not, you can follow this tutorial to install and set up Simpleaf. In practice, we may also want a Jsonnet executable for debugging templates in real time. You can install one by following these instructions, or use Jrsonnet, a Rust implementation of Jsonnet.
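For example, one way to obtain a standalone Jsonnet interpreter is to install Jrsonnet through cargo; the exact command below is an assumption and may differ on your system.

# install the Rust implementation of Jsonnet (assumes a Rust toolchain is available)
cargo install jrsonnet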
Preparation
Before we start, let’s set up the environment.
export AF_SAMPLE_DIR=$PWD/simpleaf_workdir
mkdir $AF_SAMPLE_DIR
cd $AF_SAMPLE_DIR
Basics
First of all, let’s talk about some basic terminology in Simpleaf Workflow:
There are three phases of a Simpleaf Workflow:
- Workflow Template: A Jsonnet program that has the potential to generate a valid Simpleaf Workflow but with required fields missing.
- Instantiated Workflow Template, or just Instantiated Template: A Workflow Template that contains enough information to generate a valid Workflow Manifest.
- Workflow Manifest: A JSON record that contains the definition of each command in the workflow and the information needed to invoke them.
We call a Jsonnet program that has the potential to generate a valid Simpleaf Workflow but with required fields missing a Workflow Template. All published workflows in protocol estuary are in this phase. Users must fill in enough information to make the template an Instantiated Template. An Instantiated Template is a Jsonnet program that contains enough information to generate a valid Workflow Manifest. In a valid Workflow Manifest, all command records must be parameterized completely and correctly.
Now, let’s talk about how Simpleaf Workflow works:
- The program simpleaf workflow run takes an Instantiated Template as the input and generates a Workflow Manifest from it. Any Workflow Manifest is also a valid Instantiated Template of itself and can be used as the input of simpleaf workflow run.
- Simpleaf requires the command records in the resulting Workflow Manifest, including Simpleaf commands and external shell commands, to follow a specific format, because it always parses the Workflow Manifest evaluated from the provided Workflow Template (or the Workflow Manifest itself) to find and assemble the actual commands included in a workflow.
- When parsing an Instantiated Template, Simpleaf automatically passes many external variables and library searching paths to the Jsonnet engine. One important variable is the Simpleaf Workflow utility library, which contains many useful functions for developing a Simpleaf Workflow template. We should take advantage of these variables when developing our own Workflow Templates.
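For example, fetching a published template and running it typically looks like the sketch below; the flag names of simpleaf workflow get and the file paths are assumptions here, so please check simpleaf workflow get --help for the exact interface.

# fetch a ready-made template from protocol estuary (flag names and paths are illustrative)
simpleaf workflow get --name 10x-chromium-3p-v3 --output ./templates
# after filling in the required fields of the fetched template, evaluate and run it
simpleaf workflow run --template ./templates/10x-chromium-3p-v3.jsonnet --output ./workflow_output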
Developing a Workflow Template
This section discusses developing a Simpleaf Workflow template in three stages: first using no special Jsonnet features, then using essential features, and finally using advanced features.
1. Define a Basic Workflow Template
As we discussed, Simpleaf Workflow requires that in the Workflow Manifest, the command records, including Simpleaf commands and external shell commands, must follow a specific format. The format consists of two parts: identity fields and argument field(s). The identity fields of these two types of commands are the same, but the argument fields are not.
There are three identity fields.
- program_name: This field must record a valid program name in the user's operating environment.
  - For Simpleaf programs, this is the name we use to call the Simpleaf program. For simpleaf index commands, this field must be program_name: "simpleaf index"; for simpleaf quant commands, it must be program_name: "simpleaf quant".
  - For external shell commands, this field must represent a valid executable. For example, for a record representing an awk command:
    - If the awk program we want to call is not in our PATH, this field must be the (quoted) path to the desired awk executable, for example program_name: "/usr/bin/awk".
    - If awk is in our PATH, this can be either the path to the executable or just program_name: "awk", because in this case awk is accessible both ways.
- step: This field records which step this command constitutes in the workflow. This is the only allowed integer in the Simpleaf Workflow Manifest. Simpleaf invokes the commands in order of their step.
- active: This field indicates whether the command is active in the workflow. Simpleaf regards all commands without this field as active. This is the only allowed boolean field in the Workflow Manifest. Simpleaf skips (neither parses nor invokes) all inactive commands (those with "active": false).
As for the argument field(s), they are in a different format in Simpleaf command records and external command records:
- In a Simpleaf command record, each argument/value pair must be provided as a field. The field name is the argument name, and the value is the argument value (or true, or an empty string "", if the argument doesn't take a value). For example, {"--threads": 16, "--use-piscem": true, "--ref-type": "splici"}. We recommend using the full name of arguments, for example --threads instead of -t.
- In an external command record, all arguments must be listed in an ordered array as the arguments field. For example, the arguments of the command ls -l -h . must be provided as {"arguments": ["-l", "-h", "."]}.
With the format in mind, we can now create a simple workflow template! Here we will define a workflow that first creates a toy reference sequence file and calls simpleaf index
to build a salmon index on this reference. The actual workflow template, instantiated template, and the resulting workflow manifest are identical and are shown in the following code chunk.
In the following code chunk, we create a workflow template Jsonnet program file and call simpleaf workflow run
to evaluate it.
cd $AF_SAMPLE_DIR
# write the _workflow manifest_ into a file
cat <<EOF > workflow_template.jsonnet
{
"create_ref" : {
"program_name": "echo",
"step": 1,
"arguments": ["\">a\\nAACCAACACAAAC\\n>b\\nCCACAAACAACACAAC\"", ">", "./workflow_output/toy_ref.fastq"],
},
"simpleaf_index": {
"program_name": "simpleaf index",
step: 2,
"--ref-seq": "toy_ref.fa",
"--output": "./workflow_output",
"--kmer-length": "5"
},
}
EOF
simpleaf workflow run --template workflow_template.jsonnet -o ./workflow_output
The above workflow template (or instantiated template, or workflow manifest) defines a workflow that consists of two steps:
- We create a FASTA file with two sequence records using the echo command. The first record is a, with sequence AACCAACACAAAC; the second record is b, with sequence CCACAAACAACACAAC. We use the > symbol to redirect the output of the echo command to a file named toy_ref.fa inside the ./workflow_output directory.
- We call simpleaf index to build a Salmon index on the toy_ref.fa file. We set the k-mer length to 5 and the output directory to ./workflow_output.
# print the toy reference
$ cat $AF_SAMPLE_DIR/workflow_output/toy_ref.fa
>a
AACCAACACAAAC
>b
CCACAAACAACACAAC
# list the simpleaf index output directory
$ ls $AF_SAMPLE_DIR/workflow_output
index index_info.json simpleaf_index_log.json toy_ref.fa
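Because this first template does not rely on any Simpleaf-provided external variables, we can also evaluate it with a standalone Jsonnet interpreter while debugging; this prints the resulting manifest JSON without invoking any commands (a sketch, assuming jrsonnet is installed as mentioned earlier; the plain jsonnet binary works the same way).

# print the evaluated workflow manifest without running any commands
jrsonnet workflow_template.jsonnet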
2. The Recommended Layout in a Simpleaf Workflow Template
Although any Jsonnet program is a valid input for Simpleaf Workflow as long as it can generate a valid workflow manifest, we recommend dividing a workflow template into four sections:
- meta_info: This section contains the meta information of the workflow, such as the name and version of the workflow. It also contains the meta-variables, such as threads, output, and use_piscem, which are used by the utility library to assign values to some arguments in the workflow.
- fast_config: This section contains the minimum information needed to generate a valid workflow manifest in the standard and recommended way. We call this section fast_config because it is the fastest way to generate a valid workflow manifest. However, the resulting workflow manifest might not be the most appropriate one for the user's purpose.
- advanced_config: This section contains the information for generating a valid workflow manifest in a more advanced and comprehensive way. Usually, this section contains alternative configurations to those defined in fast_config and allows users to fine-tune the workflow.
- workflow: This section contains the logic for generating the workflow manifest using the information provided in fast_config and advanced_config. Usually, it includes calls to the utility library functions that combine the above three sections, or some external command records for preprocessing the data.
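A minimal sketch of this layout, with the section bodies omitted, looks like the following; the rest of this tutorial fills in each section.

local template = {
  meta_info : {},        # workflow name, version, and meta-variables such as threads and output
  fast_config : {},      # the minimum information users must provide
  advanced_config : {},  # optional arguments with sensible defaults
  workflow : {},         # the command records assembled from the sections above
};

template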
3. A Workflow Template with More Jsonnet Features
Following the best practices, we can improve the above workflow template from multiple aspects:
- For the meta-variables that will be used throughout the workflow, such as threads, output, and use_piscem, we can define them in the meta_info section and use the utility library to assign them to the appropriate arguments in the workflow.
- To ease the user's burden, we can include the most essential arguments, such as the path to the reference FASTA file, in the fast_config section and put it at the beginning of the workflow template. By doing this, users can generate a valid workflow manifest by completing only this section.
- For some optional but important arguments, such as the --kmer-length argument, we can put them in the advanced_config section and assign them a default value.
- Finally, we can assemble the above three sections in the workflow section to generate the command records of the workflow.
Here, we assume you have some basic knowledge of Jsonnet. Uncertain? Check their excellent tutorial! Let's see what the redesigned workflow template looks like. Notice that the following example represents an uninstantiated workflow template because the required field, --ref-seq, is missing (set to null). To instantiate the template, we need to fill in the output field in meta_info and replace the null with the path to the toy_ref.fa file generated in the previous step, or any other valid FASTA file. Notice that Jsonnet doesn't require quotes around field names, as long as the field names are valid identifiers.
# local variable definition starts with the `local` keyword
local template = {
  // meta_info
  meta_info : {
    template_name : "example workflow",
    output : null,
  },
  fast_config : {
    ref_seq : null,
  },
  advanced_config : {
    arguments : {
      "--kmer-length" : 5,
    },
  },
  workflow : {
    simpleaf_index : {
      program_name : "simpleaf index",
      step : 2,
      # `$` refers to the top-level object; the `.` operator accesses a field of an object
      # the `+` operator concatenates the two strings
      "--output" : $.meta_info.output + "/simpleaf_index",
      # std.get() is a call to a standard library function
      "--ref-seq" : std.get($.fast_config, "ref_seq"),
    } + $.advanced_config.arguments, # the `+` operator merges the two objects
    create_ref : {
      program_name : "echo",
      step : 1,
      arguments : ["\">a\\nAACCAACACAAAC\\n>b\\nCCACAAACAACACAAC\"", ">", $.meta_info.output + "/toy_ref.fa"],
    },
  },
}; # local variable definition ends with a semicolon

# We output the template object
template
In this example:
- We define a local variable template to store the workflow template. The template variable is a Jsonnet object that contains four fields: meta_info, fast_config, advanced_config, and workflow. Each of these fields is also a Jsonnet object.
- We specify meta-information in the meta_info field, such as the name of the workflow and the output directory. Notice that we use the null value to represent the missing value of the output field. We will fill in this field when instantiating the template.
- We specify essential parameters in the fast_config field. In this example, we only need to provide the path to the reference FASTA file (besides meta_info.output) to generate a valid workflow manifest.
- To make the workflow more flexible, we put some optional but important arguments in the advanced_config field. In this example, we assign a default value to the --kmer-length argument.
- In the workflow section, we assemble the above three sections to generate the command records of the workflow. We used the following Jsonnet features:
  - We used the dot operator, ., to access the fields of an object, as in Python, together with $, which refers to the top-level object. For example, $.meta_info.output accesses the output field of the meta_info object. This can also be achieved by calling the get() function from the Jsonnet standard library, std.get($.meta_info, "output"). For details about the Jsonnet standard library, please check this page.
  - We used the + operator to combine the base object defined on the left-hand side with the arguments field in advanced_config. The same operator can also be used to concatenate strings and to add two numbers.
- There are many more useful features in Jsonnet. To gain a systematic understanding of Jsonnet, please check the tutorials provided by Jsonnet.
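For example, assuming the template above was saved as redesigned_template.jsonnet (a hypothetical file name, ending with the line that emits the template object), one way to instantiate it without editing the original file is to write a second Jsonnet program that imports it and overrides the missing fields with the +: operator; a minimal sketch:

# instantiated.jsonnet (hypothetical file name)
local template = import "redesigned_template.jsonnet";

# `+:` merges the nested objects instead of replacing them
template + {
  meta_info +: { output : "./workflow_output" },
  fast_config +: { ref_seq : "./workflow_output/toy_ref.fa" },
}

The resulting program can then be passed to simpleaf workflow run via the --template argument, just like any other instantiated template.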
4. A Workflow Template Using the Simpleaf Workflow Utility Library
To ease the development of a workflow template, the Simpleaf team provides a utility library that contains many useful functions for building Simpleaf’s commands in the manifest at different resolutions. This library is passed to the template evaluation process as an external variable __utils
, and can be loaded in any template by adding local utils = std.extVar("__utils");
at the beginning of the template. Then, we can access any function in the library by calling utils.function_name()
.
In this section, we will show an example of calling some functions from utils
to build a simpleaf index
command. Because this example is a bit long, we will divide it into multiple parts, each corresponding to one of the sections described above.
Firstly, we will define the layout of the template file. We first load the utils
library, and then we define a local variable template
to store the workflow template. The template
variable is a Jsonnet object that contains five fields: meta_info
, fast_config
, advanced_config
, intermediate_steps
, and workflow
. Each of these fields is also a Jsonnet object. Here we use a section called intermediate_steps
to take some intermediate configurations that will be used in the workflow
section. We will show the usage of this section later.
local utils = std.extVar("__utils"); # import the utility library
# The actual definition of each section is omitted here
local template = {
meta_info : {},
fast_config : {},
advanced_config : {},
  intermediate_steps : {},
  workflow : {},
};
Next, we will define the meta_info
section. In this section, we will define the meta information of the workflow, such as the name of the workflow, the version of the workflow, etc. It also contains the meta-variables, such as threads
, output
, and use_piscem
, which are used in later sections. The ...
in the following code chunk means we omit the lines before and after the meta_info
section.
...
meta_info : {
  template_name : "example workflow",
  output : null,
  threads : 16,
  use_piscem : false,
},
...
We then design the fast_config
section. In this section, we ask for a genome FASTA file and a gene annotation GTF file, the minimum input for building a spliced+intronic (splici) reference, which contains the spliced transcript sequences and the intronic sequences of each gene. Details about splici can be found in supplementary section S2 of the alevin-fry paper. We also list an rlen field, which takes the numeric read length of the sequencing data as its value; this is an optional argument for building the splici reference.
fast_config : {
splici : {
fasta : null,
gtf : null,
rlen : 91,
}
}
Now we turn to the advanced_config section. Here, we list the other reference types supported by simpleaf index, together with all optional arguments and flags used for fine-tuning the behavior of simpleaf index. Specifically, in addition to splici, we also support the spliced+unspliced (spliceu) reference, which contains the spliced transcripts and the gene body (unspliced transcript) of each gene; direct_ref, which builds the index directly from the provided (usually transcriptome) FASTA file; and the existing_index reference type, which tells the template to use an existing index instead of building a new one by calling simpleaf index.
advanced_config : {
simpleaf_index : {
ref_type : {
type : "splici", # splici arguments are in the `fast_config` section.
spliceu : {
gtf : null,
fasta : null,
},
direct_ref : {
ref_seq : null,
},
existing_index : {
index : null,
t2g_map : null,
},
},
arguments : {
active : true,
"--spliced" : null,
"--unspliced" : null,
"--dedup" : false,
"--sparse" : false,
"--keep-duplicates" : false,
"--gff3-fomrat" : false,
"--threads" : $.meta_info.threads,
"--use-pisem" : $.meta_info.use_piscem,
"--overwrite" : $.meta_info.use_piscem,
"--kmer-length" : 31,
"--minimizer-length" : 7,
"--decoy-paths" : null,
},
output : $.meta_info.output + "/simpleaf_index",
}
}
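As an illustration of how this ref_type block is meant to be filled in when instantiating the template, the sketch below selects the existing_index reference type instead of splici; the paths are hypothetical placeholders.

advanced_config : {
  simpleaf_index : {
    ref_type : {
      type : "existing_index",  # switch from "splici" to an existing index
      existing_index : {
        index : "/path/to/existing/index",   # hypothetical index directory
        t2g_map : "/path/to/t2g_3col.tsv",   # hypothetical transcript-to-gene mapping file
      },
      # the other sub-objects (spliceu, direct_ref) can be left as null
    },
    # arguments and output stay as defined above
  },
},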
Finally, we will define the workflow
section containing a simpleaf command record called simpleaf_index
.
workflow : {
simpleaf_index : utils.simpleaf_index(
1,
    utils.ref_type($.advanced_config.simpleaf_index.ref_type + $.fast_config),
    $.advanced_config.simpleaf_index.arguments,
    $.advanced_config.simpleaf_index.output,
),
}
In this section, we utilized two utils
functions, utils.ref_type
and utils.simpleaf_index
. The utils.ref_type
function takes a reference type object, like the one defined in the advanced_config
section and returns a valid reference type object. This object must have a type field specifying which reference type will be used, and a field, named by the value of the type field, containing the required arguments for building that reference. Here we merged the ref_type object with fast_config because the latter contains the arguments used for building a splici reference. If we specify splici : {gtf: "genes.gtf", fasta: "genome.fa", rlen: 91}, then after merging, the input object to utils.ref_type looks like the following:
{
type : "splici",
splici : {
fasta : "genome.fa",
gtf : "genes.gtf",
rlen : 91,
},
spliceu : {
gtf : null,
fasta : null,
},
direct_ref : {
ref_seq : null,
},
existing_index : {
index : null,
t2g_map : null,
},
},
The output of the utils.ref_type
function will be passed to the utils.simpleaf_index
function as the second argument. It looks like the following:
{
type :: "splici",
arguments :: {gtf: "genes.gtf", fasta: "genome.fa", rlen: 91},
"--ref-type" : "splici",
"--fasta" : "genome.fa",
"--gtf" : "genes.gtf",
"--rlen" : 91,
}
Notice that the type
and arguments
fields are defined using double colons. This is the Jsonnet syntax for defining hidden fields. Hidden fields are not included in the output manifest, but remain accessible while the manifest is being evaluated.
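As a small, standalone illustration of hidden fields (unrelated to the workflow itself), the hidden field below can be referenced through self but disappears from the evaluated JSON.

{
  kmer_default :: 31,                    # hidden field, defined with a double colon
  "--kmer-length" : self.kmer_default,   # visible field that reads the hidden one
}
# evaluating this object yields: { "--kmer-length": 31 }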
Finally, the utils.simpleaf_index function takes (1) a step number, (2) a reference type object returned by utils.ref_type(), (3) optional arguments, and (4) the output path, and returns a valid simpleaf index command record, or an empty record if we use existing_index as the reference type. If we instantiate this template with meta_info.output : "./workflow_output" and the splici parameters shown above, our final manifest will look like the following:
{
"meta_info" : {
"template_name" : "example workflow",
"output" : "./workflow_output",
"threads" : 16,
"use_piscem" : false,
},
"workflow" : {
"simpleaf_index" : {
"program_name" : "simpleaf_index",
"step" : 1,
"active" : true,
"--ref-type" : "splici",
"--fasta" : "genome.fa",
"--gtf" : "genes.gtf",
"--rlen" : 91,
"--dedup" : false,
"--sparse" : false,
"--keep-duplicates" : false,
"--gff3-fomrat" : false,
"--threads" : 16,
"--use-pisem" : false,
"--overwrite" : false,
"--kmer-length" : 31,
"--minimizer-length" : 7
}
}
}
Notice that all flags with false
will be ignored by Simpleaf when invoking the commands.
5. Utilizing Built-in Variables and Custom Library Search Paths in Custom Templates
When parsing a workflow template, Simpleaf automatically provides useful external variables to it. We can utilize these variables in our templates by receiving them with the std.extVar() function from the Jsonnet standard library. Currently, Simpleaf provides the following variables and library search paths when parsing a template; for each variable, we show the code used to receive it in a template:
- std.extVar("__utils"): The __utils variable represents the Simpleaf workflow utility library, which contains useful functions for developing a Simpleaf workflow template. Usually, a template takes this variable at the beginning by adding local utils = std.extVar("__utils");.
- std.extVar("__output"): The __output variable represents the --output argument provided on the command line. To use this variable, we add local output = std.extVar("__output"); at the beginning of the template.
- std.extVar("__validate"): The __validate variable is set to false in simpleaf workflow get and true in other programs, like simpleaf workflow run. Usually, this variable is used to turn on/off validation of the template, for example, checking for missing values. To use this variable, we add local validate = std.extVar("__validate"); at the beginning of the template.
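Putting these together, a template header that receives all three variables might look like the sketch below; the missing-value check at the end is only an illustrative pattern using Jsonnet's error expression (and assumes __validate is passed as a boolean), not a convention required by Simpleaf.

local utils = std.extVar("__utils");        # the Simpleaf workflow utility library
local output = std.extVar("__output");      # the --output argument from the command line
local validate = std.extVar("__validate");  # false in simpleaf workflow get, true otherwise

local template = {
  meta_info : { template_name : "example workflow", output : output },
  fast_config : { ref_seq : null },
  # ... remaining sections omitted ...
};

# complain about a missing required value only when validation is on
if validate && template.fast_config.ref_seq == null
then error "fast_config.ref_seq must be provided"
else template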
Build a Workflow Template for Processing 10X Chromium 3’ v3 Data
In the above example, we discussed how to build a Simpleaf workflow template containing a simpleaf index
command. In this section, we will show how to add a simpleaf quant
command to the workflow template to process 10X Chromium 3’ v3 data. As we apply the same idea as in the previous section, we will briefly discuss the procedure of building a simpleaf quant
command record using the utility library, and then show the final workflow template.
Overall, simpleaf quant
requires at least five file/directory paths:
- The path to the index directory.
- The path to the transcript-to-gene mapping file.
- The path to the output directory.
- The comma-separated paths to the Reads1 FASTQ files.
- The comma-separated paths to the Reads2 FASTQ files.
As we can predict the first two paths using the output of the simpleaf index
command and set the output directory according to the meta_info.output
variable, we only need to ask for the paths to the FASTQ files in the fast_config
section.
1. Fast Configuration Section
In this section, we need to ask for read1 and read2 FASTQ files. Therefore, we add one field in addition to the splici
field discussed above:
fast_config: {
splici: {fasta: null, gtf: null, rlen: 91},
map_reads: {
reads1: null, # comma-separated paths to the Reads1 FASTQ files
reads2: null, # comma-separated paths to the Reads2 FASTQ files
}
}
2. Advanced Configuration Section
In this section, we’ll cover two crucial argument groups for running simpleaf quant
: mapping mode (map_type
) and cell filtration strategy (cell_filt_type
). map_type
offers two choices: providing paths to Reads1 and Reads2 FASTQ files, along with the index and the transcript-to-gene mapping file, for read mapping; or simply supplying the path to existing mapping results. We include existing_mappings
in the map_type
field, along with a type
field indicating the chosen mode. Similarly, we define a cell_filt_type
field with five possible cell filtration options.
advanced_config : {
simpleaf_quant : {
map_type : {
type : "map_reads",
existing_mappings : {
map_dir : null,
t2g_map : null,
},
},
cell_filt_type : {
type : "cell_filt",
unfiltered_pl : true,
knee : false,
expect_cells : null,
forced_cells : null,
explicit_pl : null,
},
    arguments : {
active : true,
"--min-reads" : 10,
"--resolution" : "cr-like",
"--expected-ori" : "fw",
"--threads" : $.meta_info.threads,
"--chemistry" : "10xv3",
"--use-selective-alignment" : false,
"--use-piscem" : $.meta_info.use_piscem,
"--struct-constraints" : false,
"--ignore-ambig-hits" : false,
"--no-poison" : false,
"--skipping-strategy" : null,
"--max-ec-card" : null,
"--max-hit-occ" : null,
"--max-hit-occ-recover" : null,
"--max-read-occ" : null,
},
output : $.meta_info.output + "/simpleaf_quant",
}
}
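For example, if the reads have already been mapped, an instantiated template could switch the map type from map_reads to existing_mappings; the paths below are hypothetical placeholders.

map_type : {
  type : "existing_mappings",  # use existing mapping results instead of mapping reads
  existing_mappings : {
    map_dir : "/path/to/af_map",         # hypothetical mapping output directory
    t2g_map : "/path/to/t2g_3col.tsv",   # hypothetical transcript-to-gene mapping file
  },
},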
3. Workflow Section
Finally, with all the information in the above three sections, we can call the utils.simpleaf_quant
function, together with the utils.map_type
and utils.cell_filt_type
functions to generate a valid simpleaf quant
command record.
workflow : {
simpleaf_quant : utils.simpleaf_quant(
2,
utils.map_type($.advanced_config.simpleaf_quant.map_type + $.fast_config, $.workflow.simpleaf_index),
utils.cell_filt_type($.advanced_config.simpleaf_quant.cell_filt_type),
$.advanced_config.simpleaf_quant.arguments,
$.advanced_config.simpleaf_quant.output,
),
}
In this section, we utilized three utils
functions, utils.map_type
, utils.cell_filt_type
, and utils.simpleaf_quant
. The utils.map_type
function takes a map type object, like the one defined in the advanced_config
section and returns a valid map type object. This object must have a type field specifying which map type will be used, and a field, named by the value of the type field, containing the required arguments for mapping. Here we merged the map_type object with fast_config because the latter contains the arguments used for mapping. If we specify map_reads : {reads1: "reads1.fastq.gz", reads2: "reads2.fastq.gz"}
, after merging, the input object to utils.map_type
looks like the following:
{
type : "map_reads",
map_reads : {
reads1 : "reads1.fastq.gz",
reads2 : "reads2.fastq.gz",
},
existing_mappings : {
map_dir : null,
t2g_map : null,
},
},
The output of the utils.map_type
function will be passed to the utils.simpleaf_quant
function as the second argument. It looks like the following:
{
type :: "map_reads",
arguments :: {reads1: "reads1.fastq.gz", reads2: "reads2.fastq.gz"},
"--reads1" : "reads1.fastq.gz",
"--reads2" : "reads2.fastq.gz",
"--index" : "./workflow_output/simpleaf_index/index",
"--t2g-map": ./workflow_output/simpleaf_index/index/t2g_3col.tsv",
}
Notice that the path to the index directory and the path to the transcript-to-gene mapping file are derived from the simpleaf_index command record, which is passed as the second argument of utils.map_type.
Then, we call the utils.cell_filt_type function with the cell_filt_type object defined in advanced_config to generate an object containing the simpleaf quant argument/flag for the selected cell filtering method. We will not show the details here, as the process is very similar to that of map_type shown above.
Finally, we call the utils.simpleaf_quant
function to generate a simpleaf quant
command record using the results from the function calls introduced above. After evaluation, the manifest will look like the following:
"simpleaf_quant": {
"program_name": "simpleaf quant",
"step": 2,
"active": true,
"--chemistry": "10xv3",
"--expected-ori": "fw",
"--index": "./workflow_output/simpleaf_index/index",
"--min-reads": 10,
"--output": "./workflow_output/simpleaf_quant",
"--reads1": "reads1.fastq.gz",
"--reads2": "reads2.fastq.gz",
"--resolution": "cr-like",
"--t2g-map": "./workflow_output/simpleaf_index/index/t2g_3col.tsv",
"--threads": 16,
"--unfiltered-pl": true,
"--use-piscem": false
}
Finally, if we assemble the above sections, including those for simpleaf index
and simpleaf quant
together, we will get the final workflow template for processing 10X Chromium 3’ v3 data as shown in the protocol estuary GitHub repository.
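As a rough, simplified sketch of that final assembly (the published template adds more bookkeeping), the combined workflow section contains both command records built from the configuration sections above.

workflow : {
  simpleaf_index : utils.simpleaf_index(
    1,
    utils.ref_type($.advanced_config.simpleaf_index.ref_type + $.fast_config),
    $.advanced_config.simpleaf_index.arguments,
    $.advanced_config.simpleaf_index.output,
  ),
  simpleaf_quant : utils.simpleaf_quant(
    2,
    utils.map_type($.advanced_config.simpleaf_quant.map_type + $.fast_config, $.workflow.simpleaf_index),
    utils.cell_filt_type($.advanced_config.simpleaf_quant.cell_filt_type),
    $.advanced_config.simpleaf_quant.arguments,
    $.advanced_config.simpleaf_quant.output,
  ),
},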
Summary
In conclusion, we can quickly build complex and highly configurable simpleaf workflow templates that guide simpleaf through complicated single-cell data processing workflows. We can take advantage of the powerful features and the standard library of Jsonnet, as well as the simpleaf workflow utility library, when building a simpleaf workflow template. If you have further questions, please do not hesitate to open a GitHub issue!