Workflow Format#
The .oxoflow file format is oxo-flow's TOML-based workflow definition language. This page provides the complete specification, design philosophy, and syntax rules.
Design Principles#
The .oxoflow format is built on four core principles:
- Declarative over Imperative: Define what should happen (inputs, outputs, tools), not how to orchestrate it. The engine handles the execution logic.
- Explicit is better than Implicit: Every dependency and environment should be clearly visible. No hidden global state.
- Composition over Inheritance: Reuse logic through modular include directives and rule templates rather than complex inheritance hierarchies.
- Traceability by Default: The format structure directly supports generating clinical-grade provenance and audit trails.
TOML Primer#
oxo-flow uses the TOML (Tom's Obvious, Minimal Language) format. If you are new to TOML, here are the three essential concepts used in .oxoflow files:
- Key-Value Pairs: key = "value". Strings must be in quotes.
- Tables: [name] defines a section (an object/map).
- Arrays of Tables: [[name]] defines a list of sections. In oxo-flow, rules are defined using double brackets because a workflow contains multiple rules.
For more details, see the Official TOML Specification.
File Extension#
Workflow files must use the .oxoflow extension (e.g., qc_pipeline.oxoflow).
Top-level Structure#
[workflow] # Required: metadata
[config] # Optional: user variables
[defaults] # Optional: rule defaults
[report] # Optional: report configuration
[[include]] # Optional: include external workflow files
[[rules]] # Required: one or more rules
[[pairs]] # Optional: experiment-control pairs (WC-01)
[[sample_groups]] # Optional: multi-sample groups (WC-02)
[resource_budget] # Optional: resource limits
[env_groups] # Optional: named reusable environment specs
[resource_groups] # Optional: shared resource pools (API limits, DB connections)
[wildcard_constraints] # Optional: regex patterns to constrain wildcard values
[[execution_group]] # Optional: explicit sequential/parallel rule ordering
[cluster] # Optional: HPC cluster profile (SLURM, PBS, SGE, LSF)
[[reference_db]] # Optional: tracked reference database versions
[citation] # Optional: citation metadata (DOI, authors, etc.)
[plugins] # Optional: plugin configuration
[[include]] — Modular Workflow Composition#
Include external workflow files to enable modular, reusable workflow design:
| Field | Type | Required | Description |
|---|---|---|---|
| path | String | Yes | Path to the included .oxoflow file |
| namespace | String | No | Optional namespace prefix for included rule names |
Namespace Behavior#
When a namespace is specified:
- All rule names from the included file are prefixed: namespace::rule_name
- Internal depends_on references within the included file are automatically prefixed
- External depends_on references (to rules outside the included file) remain unchanged
Example:
# qc.oxoflow
[[rules]]
name = "fastqc"
input = ["{sample}.fastq.gz"]
output = ["qc/{sample}_fastqc.html"]
shell = "fastqc {input}"
[[rules]]
name = "trim"
input = ["{sample}.fastq.gz"]
output = ["trimmed/{sample}.fastq.gz"]
depends_on = ["fastqc"] # Internal reference - will become "qc::fastqc"
shell = "fastp {input} -o {output}"
# main.oxoflow
[[include]]
path = "qc.oxoflow"
namespace = "qc"
[[rules]]
name = "align"
input = ["trimmed/{sample}.fastq.gz"]
depends_on = ["qc::trim"] # Reference to included rule with namespace
shell = "bwa mem ref.fa {input} > aligned/{sample}.bam"
Resulting rules: qc::fastqc, qc::trim, align
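The prefixing behaviour can be modelled in a few lines of Python. This is a minimal sketch of the rules described above, not oxo-flow's actual implementation; the function name apply_namespace, the dict-based rule representation, and the external rule "download" are assumptions made for illustration.

```python
def apply_namespace(rules, namespace):
    """Prefix rule names and internal depends_on entries with 'namespace::'."""
    local_names = {rule["name"] for rule in rules}
    prefixed = []
    for rule in rules:
        new_rule = dict(rule)
        new_rule["name"] = f"{namespace}::{rule['name']}"
        new_rule["depends_on"] = [
            # Internal references get the prefix; external ones stay unchanged.
            f"{namespace}::{dep}" if dep in local_names else dep
            for dep in rule.get("depends_on", [])
        ]
        prefixed.append(new_rule)
    return prefixed


qc_rules = [
    {"name": "fastqc"},
    # "download" is a hypothetical rule defined outside the included file.
    {"name": "trim", "depends_on": ["fastqc", "download"]},
]
included = apply_namespace(qc_rules, "qc")
print([rule["name"] for rule in included])
print(included[1]["depends_on"])
```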
[workflow] — Metadata#
[workflow]
name = "my-pipeline"
version = "1.0.0"
description = "A short description"
author = "Your Name"
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | String | Yes | - | Pipeline name |
| version | String | No | "0.1.0" | Semantic version |
| description | String | No | - | Human-readable description |
| author | String | No | - | Author name or email |
| interpreter_map | Table | No | {} | Custom interpreter mapping for script extensions |
| genome_build | String | No | - | Genome reference build identifier (e.g., "GRCh38", "hg38") |
| min_version | String | No | - | Minimum oxo-flow version required to run this workflow |
| format_version | String | No | - | .oxoflow format specification version for compatibility checking |
| pairs_file | String | No | - | External TSV/CSV/JSON file defining experiment-control pairs |
| sample_groups_file | String | No | - | External TSV/JSON file defining sample groups |
| pairs_pattern | String | No | - | File glob pattern for auto-discovering pairs (e.g., "aligned/{pair_id}/{experiment}_vs_{control}.bam") |
| sample_pattern | String | No | - | File glob pattern for auto-discovering samples (e.g., "raw/{sample}_R1.fastq.gz") |
Custom Interpreters (interpreter_map)#
By default, oxo-flow auto-detects interpreters based on file extensions:
- .py → python3
- .R, .r → Rscript
- .sh → sh
- .jl → julia
You can override or extend this mapping in the [workflow] section:
[workflow]
name = "custom-interpreters"
[workflow.interpreter_map]
".m" = "octave"
".sas" = "sas"
".py" = "/opt/conda/bin/python" # Override default
This mapping applies only to the script field.
[config] — Configuration Variables#
User-defined key-value pairs accessible in rules as {config.<key>}:
[config]
reference = "/data/ref/hg38.fa"
samples_dir = "raw_data"
results_dir = "results"
min_quality = "30"
Values are TOML strings, integers, booleans, or arrays. String interpolation in rules uses {config.key} syntax.
[defaults] — Default Settings#
Applied to all rules unless explicitly overridden:
| Field | Type | Description |
|---|---|---|
| threads | Integer | Default CPU thread count |
| memory | String | Default memory allocation |
| environment | Table | Default environment specification |
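The "applied unless explicitly overridden" behaviour amounts to a dictionary merge in which the rule's own values win. The sketch below illustrates that idea only; apply_defaults is a hypothetical helper, and how oxo-flow merges nested tables such as environment is not covered here.

```python
def apply_defaults(rule, defaults):
    """Return a copy of the rule with missing fields filled from defaults."""
    merged = dict(defaults)
    merged.update(rule)  # values set on the rule take precedence
    return merged


defaults = {"threads": 4, "memory": "8G"}
rule = {"name": "align", "threads": 16}

resolved = apply_defaults(rule, defaults)
print(resolved)
```

Here the rule keeps its own threads value and picks up memory from the defaults.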
[report] — Report Configuration#
[report]
template = "clinical"
format = ["html", "json"]
sections = ["summary", "variants", "quality"]
| Field | Type | Description |
|---|---|---|
| template | String | Report template name |
| format | Array | Output formats to generate |
| sections | Array | Report sections to include |
[[rules]] — Rule Definitions#
Each [[rules]] entry defines a pipeline step. The double brackets indicate a TOML array of tables.
Basic example#
[[rules]]
name = "align"
input = ["{sample}_R1.fastq.gz", "{sample}_R2.fastq.gz"]
output = ["aligned/{sample}.bam"]
threads = 16
memory = "32G"
environment = { conda = "envs/alignment.yaml" }
shell = "bwa mem -t {threads} {config.reference} {input} | samtools sort -o {output}"
All fields#
| Field | Type | Required | Description |
|---|---|---|---|
| name | String | Yes | Unique rule identifier |
| input | Array of strings | Yes | Input file paths |
| output | Array of strings | Yes | Output file paths |
| shell | String | No | Shell command to execute |
| script | String | No | Script file path (auto-detects interpreter) |
| description | String | No | Human-readable description of what this rule does |
| threads | Integer | No | (Deprecated) CPU threads; use resources.threads instead |
| memory | String | No | (Deprecated) Memory allocation; use resources.memory instead |
| resources | Table | No | Full resource specification (threads, memory, gpu, disk, time_limit, partition, groups) |
| environment | Table | No | Environment specification |
| transform | Table | No | Unified scatter-gather operator (split → map → combine) |
| when | String | No | Conditional expression (rule is skipped when false) |
| envvars | Table | No | Dictionary of environment variables to inject |
| params | Table | No | User-defined parameters for shell templates |
| pre_exec | String | No | Command to run before the main shell command |
| on_success | String | No | Command to run after rule succeeds |
| on_failure | String | No | Command to run after rule fails (all retries exhausted) |
| retries | Integer | No | Number of retry attempts on failure (default: 0) |
| interpreter | String | No | Explicit interpreter for script execution |
| checkpoint | Boolean | No | Rebuild DAG after this rule completes |
| scatter | Table | No | Fan-out parallel execution over a variable with optional gather |
| expand_inputs | Table | No | Cartesian product expansion of input patterns |
| priority | Integer | No | Execution priority (higher = runs first; default: 0) |
| target | Boolean | No | Mark as default target (built when no explicit -t given) |
| required | Boolean | No | Pipeline fails if this rule fails, even without downstream deps |
| optional | Boolean | No | Rule is skipped if its inputs don't exist (no error) |
| benchmark | String | No | Benchmark output path for performance data |
| log | String | No | Log file path for rule execution output |
| group | String | No | Job group label for cluster submission grouping |
| cache_key | String | No | Content-based cache key for output reuse |
| input_function | String | No | Dynamic input resolver function name |
| rule_metadata | Table | No | Arbitrary domain-specific metadata (assay, organism, etc.) |
| env_group | String | No | Reference to a named environment in [env_groups] |
| depends_on | Array | No | Explicit rule-level dependencies (by rule name) |
| extends | String | No | Inherit settings from a base rule |
| retry_delay | String | No | Delay between retries (e.g., "5s", "30s", "2m") |
| workdir | String | No | Per-rule working directory override |
| temp_output | Array | No | Temporary outputs cleaned up after downstream rules complete |
| protected_output | Array | No | Outputs that must never be overwritten or deleted |
| tags | Array | No | Categorization tags (e.g., ["qc", "alignment"]) |
| shadow | String | No | Shadow directory mode: "minimal", "shallow", or "full" |
| ancient | Array | No | Inputs that never trigger re-execution (e.g., reference files) |
| localrule | Boolean | No | Always run locally; never submit to a cluster scheduler |
| format_hint | Array | No | File format hints for I/O optimization ("bam", "vcf", "fastq.gz") |
| pipe | Boolean | No | Enable FIFO streaming mode for input/output |
| checksum | String | No | Output integrity verification ("md5" or "sha256") |
| resource_hint | Table | No | Resource estimation hints for dynamic scheduling |
Note: At least one of shell or script must be provided. If both are defined, they execute sequentially: shell first, then script.
Environment specification#
# Conda
environment = { conda = "envs/tools.yaml" }
# Pixi
environment = { pixi = "envs/pixi.toml" }
# Docker
environment = { docker = "biocontainers/bwa:0.7.17" }
# Singularity
environment = { singularity = "docker://biocontainers/bwa:0.7.17" }
# Python venv
environment = { venv = "envs/requirements.txt" }
# HPC modules
environment = { modules = ["gcc/11.2.0", "openmpi/4.1.1"] }
# Conda with custom prefix
environment = { conda = "envs/qc.yaml", conda_prefix = ".oxo-flow/envs" }
# venv with custom requirements file
environment = { venv = ".venv/", venv_requirements = "envs/dev-requirements.txt" }
# Reference a named environment group (defined in [env_groups])
env_group = "qc_env"
Named Environment Groups ([env_groups])#
Instead of repeating the same environment spec across multiple rules, define
named groups once in [env_groups] and reference them via env_group:
[env_groups.qc_env]
conda = "envs/qc.yaml"
[env_groups.align_env]
conda = "envs/alignment.yaml"
docker = "biocontainers/bwa:0.7.17" # fallback
[[rules]]
name = "fastqc"
env_group = "qc_env"
input = ["raw/{sample}.fastq.gz"]
output = ["qc/{sample}_fastqc.html"]
shell = "fastqc {input} -o qc/"
Rules using env_group inherit the full environment specification from the
named group. If a rule also defines an inline [rules.environment], the
inline spec takes precedence.
Environment Variables (envvars)#
Inject rule-specific environment variables directly into the execution context:
[[rules]]
name = "deep_learning"
shell = "python train.py"
[rules.envvars]
CUDA_VISIBLE_DEVICES = "0"
PYTHONPATH = "./src"
Variables defined here are available to the main shell command as well as all lifecycle hooks (pre_exec, etc.).
Parameters (params)#
Define custom variables for use in shell templates. Unlike [config], which is global, params are specific to a single rule and take precedence during interpolation:
[[rules]]
name = "count_reads"
shell = "samtools view -c -q {params.min_qual} {input} > {output}"
[rules.params]
min_qual = 20
Script Execution (script)#
The script field allows you to execute external script files (Python, R, etc.) with automatic interpreter detection.
[[rules]]
name = "analyze"
script = "scripts/analysis.py --min-quality {params.q}"
interpreter = "python3" # Optional: overrides auto-detection
Interpreter Detection Order:
1. Explicit interpreter field on the rule.
2. Custom [workflow.interpreter_map] in the metadata.
3. Built-in defaults based on file extension.
4. Shebang line (if file is executable).
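The four-step detection order can be sketched as a simple chain of lookups. This is an illustration under stated assumptions, not the engine's code: resolve_interpreter is a hypothetical name, the built-in map is the short default list from the interpreter_map section, and the shebang is passed in as a string instead of being read from the file.

```python
import os

# Built-in defaults as listed in the interpreter_map section of this page.
BUILTIN = {".py": "python3", ".R": "Rscript", ".r": "Rscript", ".sh": "sh", ".jl": "julia"}


def resolve_interpreter(script, rule_interpreter=None, interpreter_map=None, shebang=None):
    """Pick an interpreter following the documented detection order."""
    if rule_interpreter:  # 1. explicit interpreter field on the rule
        return rule_interpreter
    # The script field may carry arguments, so only the first token is the path.
    ext = os.path.splitext(script.split()[0])[1]
    if interpreter_map and ext in interpreter_map:  # 2. [workflow.interpreter_map]
        return interpreter_map[ext]
    if ext in BUILTIN:  # 3. built-in defaults by extension
        return BUILTIN[ext]
    return shebang  # 4. shebang line (None if there is none)


custom_map = {".m": "octave", ".py": "/opt/conda/bin/python"}
print(resolve_interpreter("scripts/analysis.py --min-quality 20"))
print(resolve_interpreter("scripts/analysis.py", interpreter_map=custom_map))
print(resolve_interpreter("scripts/analysis.py", rule_interpreter="python3", interpreter_map=custom_map))
```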
Lifecycle Hooks#
Hooks allow you to run auxiliary logic at different stages of a rule's life:
[[rules]]
name = "process_data"
shell = "python process.py"
pre_exec = "mkdir -p tmp_workspace"
on_success = "echo 'Success!' | slack-notify"
on_failure = "rm -rf tmp_workspace && echo 'Cleanup done'"
retries = 3
- pre_exec: Runs before the main command. If it fails, the rule is aborted.
- on_success: Runs only after the main command completes with exit code 0.
- on_failure: Runs if the main command fails, after all retries have been exhausted.
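The interaction between hooks and retries can be sketched as a small control-flow model. In this sketch, Python callables that return an exit code stand in for shell commands. Two details are assumptions not stated on this page: pre_exec runs once (not before every retry), and retries = N allows up to N + 1 attempts in total.

```python
def run_rule(main, pre_exec=None, on_success=None, on_failure=None, retries=0):
    """Model the hook order; each callable returns an exit code (0 = success)."""
    if pre_exec is not None and pre_exec() != 0:
        return "aborted"  # a failing pre_exec aborts the rule
    for _attempt in range(retries + 1):  # first attempt plus `retries` retries
        if main() == 0:
            if on_success is not None:
                on_success()
            return "succeeded"
    if on_failure is not None:  # reached only after all retries are exhausted
        on_failure()
    return "failed"


calls = []
exit_codes = iter([1, 1, 0])  # main fails twice, then succeeds
status = run_rule(
    main=lambda: next(exit_codes),
    pre_exec=lambda: 0,
    on_success=lambda: calls.append("on_success"),
    on_failure=lambda: calls.append("on_failure"),
    retries=3,
)
print(status, calls)
```

In this run the third attempt succeeds, so only on_success fires and on_failure is never called.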
Resources (extended)#
For rules needing GPU, disk, or time limits, use the resources sub-table:
[[rules]]
name = "gpu_task"
input = ["data.h5"]
output = ["model.pt"]
threads = 8
memory = "64G"
shell = "python train.py"
[rules.resources]
gpu = 1
disk = "200G"
time_limit = "48h"
| Field | Type | Example | Description |
|---|---|---|---|
| threads | Integer | 8 | Number of CPU threads |
| memory | String | "16G" | Memory allocation |
| gpu | Integer | 1 | Number of GPUs |
| disk | String | "200G" | Local disk space |
| time_limit | String | "48h" | Wall-time limit |
| partition | String | "gpu" | HPC partition/queue to submit to |
| groups | Table | {db_conn = 1} | Resource group consumption tracking |
Resource Management#
Declaration vs Enforcement#
oxo-flow tracks declared resources for scheduling but does not strictly enforce them in local execution. On HPC clusters, resources are enforced by the scheduler.
Local execution:
- Resources are tracked to prevent over-allocation
- Warnings are emitted when declared resources exceed system capacity
- Jobs may oversubscribe if the user intentionally requests more than is available

HPC clusters:
- Resources are translated to scheduler directives (SLURM, PBS, SGE, LSF)
- The scheduler enforces limits: jobs requesting more than allocated will fail
Platform Detection#
| Platform | Thread Detection | Memory Detection |
|---|---|---|
| Linux | num_cpus crate | sysinfo crate |
| macOS | num_cpus crate | sysinfo crate |
Validation Warnings#
When a rule declares resources exceeding system capacity, oxo-flow emits warnings during validation but does not block execution:
⚠️ rule 'bwa_align' requests 128 threads but system has 64 (will oversubscribe)
⚠️ rule 'big_sort' requests 128GB but system has 32GB (may OOM)
This allows intentional oversubscription, for testing or when the user knows better than the capacity check.
Cleanup Behavior#
oxo-flow automatically cleans up temporary outputs:
| Scenario | Cleanup |
|---|---|
| Success + temp_output | Cleaned after successful completion |
| Failure + temp_output | Cleaned to prevent stale partial files |
| Transform with cleanup=true | Chunk files cleaned after combine succeeds |
Timeout Enforcement#
On Unix systems (Linux, macOS), timeout kills the entire process group, ensuring child processes don't survive:
GPU Specification#
For detailed GPU requirements:
[rules.resources.gpu_spec]
count = 2
model = "A100" # SLURM: --gres=gpu:a100:2
memory_gb = 40 # SLURM: --mem-per-gpu=40G
compute_capability = "8.0" # For filtering (not scheduler directive)
Note: PBS/SGE GPU syntax varies by site. Use extra_args for site-specific flags.
Resource Hints#
When exact requirements are unknown, provide hints for estimation:
[rules.resource_hint]
input_size = "medium" # small (~1GB), medium (~10GB), large (~100GB), xlarge (~500GB)
memory_scale = 2.0 # Estimated memory = input_size × scale
runtime = "slow" # fast (<10min), medium (10min-1h), slow (>1h)
io_bound = true # true = I/O bound, false = CPU bound
Memory estimation formula: estimated_mb = input_size_mb × memory_scale
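As a worked example of the formula, the sketch below plugs in the approximate sizes from the input_size comment. The lookup table and the function name estimate_memory_mb are assumptions for illustration, and 1 GB is taken as 1024 MB.

```python
# Approximate sizes from the input_size comment, in MB (assuming 1 GB = 1024 MB).
INPUT_SIZE_MB = {
    "small": 1 * 1024,
    "medium": 10 * 1024,
    "large": 100 * 1024,
    "xlarge": 500 * 1024,
}


def estimate_memory_mb(input_size, memory_scale):
    """estimated_mb = input_size_mb * memory_scale"""
    return INPUT_SIZE_MB[input_size] * memory_scale


# input_size = "medium", memory_scale = 2.0, as in the example above
print(estimate_memory_mb("medium", 2.0))  # 20480.0
```

With these assumptions a medium input with a scale of 2.0 is estimated at about 20 GB.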
Script Execution#
Script Field#
Execute a script file instead of (or in addition to) a shell command:
[[rules]]
name = "analysis"
input = ["data.csv"]
output = ["results.json"]
script = "scripts/analyze.py" # Auto-detects interpreter from extension
When both shell and script are defined, they execute sequentially: shell first, then script.
[[rules]]
name = "qc_and_report"
shell = "fastqc {input} -o qc/"
script = "reports/qc_report.qmd" # Runs after shell completes
Interpreter Detection#
oxo-flow automatically detects the interpreter from script file extension:
| Extension | Interpreter | Notes |
|---|---|---|
| .py | python | Python script |
| .R / .r | Rscript | R script |
| .jl | julia | Julia script |
| .sh / .bash | bash | Shell script |
| .pl | perl | Perl script |
| .rb | ruby | Ruby script |
| .qmd | quarto render | Quarto document |
| .Rmd / .rmd | quarto render | R Markdown |
| .ipynb | jupyter nbconvert --to notebook --execute | Jupyter notebook |
| .smk | snakemake | Snakemake workflow |
| .nextflow | nextflow run | Nextflow script |
| .wdl | miniwdl run | WDL workflow |
Explicit Interpreter Override#
Override auto-detection with the interpreter field:
[[rules]]
name = "custom_python"
script = "analyze.py3"
interpreter = "python3.11" # Override default python
Custom Interpreter Map#
Configure custom interpreter mappings at workflow level:
[workflow]
name = "pipeline"
[workflow.interpreter_map]
".m" = "octave" # MATLAB/Octave
".sas" = "sas" # SAS
".do" = "stata-mp" # Stata
".stan" = "cmdstan" # Stan
Additional Rule Fields#
Output Management#
| Field | Type | Description |
|---|---|---|
| temp_output | Array | Temporary outputs cleaned after downstream rules complete |
| protected_output | Array | Protected outputs never overwritten or deleted |
[[rules]]
name = "align"
output = ["aligned/{sample}.bam", "aligned/{sample}.bam.bai"]
temp_output = ["aligned/{sample}.tmp.bam"] # Cleaned after downstream use
Execution Control#
| Field | Type | Default | Description |
|---|---|---|---|
| depends_on | Array | - | Explicit rule dependencies (not inferred from files) |
| localrule | Boolean | false | Always run locally, never submit to cluster |
| workdir | String | - | Per-rule working directory override |
| shadow | String | - | Atomic execution mode: "minimal", "shallow", "full" |
| checkpoint | Boolean | false | Enable dynamic DAG modification |
[[rules]]
name = "setup"
shell = "mkdir -p results"
depends_on = [] # Run first, before file-based dependencies
[[rules]]
name = "local_only"
shell = "echo 'local task'"
localrule = true # Never submitted to HPC cluster
Retry Configuration#
| Field | Type | Default | Description |
|---|---|---|---|
| retries | Integer | 0 | Number of automatic retry attempts |
| retry_delay | String | - | Delay between retries ("5s", "30s", "2m") |
[[rules]]
name = "network_task"
shell = "curl https://api.example.com/data"
retries = 3
retry_delay = "30s"
Input/Output Hints#
| Field | Type | Description |
|---|---|---|
| ancient | Array | Inputs that never trigger re-execution (reference files) |
| format_hint | Array | File format hints for I/O optimization ("bam", "vcf") |
| pipe | Boolean | Enable FIFO streaming mode for inputs |
| checksum | String | Output checksum algorithm ("md5", "sha256") |
[[rules]]
name = "align"
input = ["reads/{sample}.fastq.gz", "ref/hg38.fa"]
ancient = ["ref/hg38.fa"] # Reference never triggers rebuild
format_hint = ["bam"]
checksum = "sha256"
Organization#
| Field | Type | Description |
|---|---|---|
| tags | Array | Categorization tags (["qc", "alignment"]) |
| extends | String | Base rule to inherit settings from |
[[rules]]
name = "align_default"
threads = 8
memory = "32G"
tags = ["alignment", "production"]
[[rules]]
name = "align_fast"
extends = "align_default" # Inherits threads, memory, tags
threads = 16 # Override inherited value
Priority and Targeting#
| Field | Type | Description |
|---|---|---|
| priority | Integer | Execution priority (higher runs first; default: 0) |
| target | Boolean | Mark as default target (built when no explicit -t given) |
[[rules]]
name = "critical_step"
priority = 10 # Runs ahead of lower-priority rules
target = true # Included when running without -t
Optional and Required Rules#
| Field | Type | Description |
|---|---|---|
| optional | Boolean | If true, missing inputs become warnings instead of errors |
| required | Boolean | If true, pipeline fails if this rule fails even without dependents |
[[rules]]
name = "experimental"
optional = true # Skip if input data is absent
required = true # But if it runs, failure stops the pipeline
Logging and Benchmarking#
| Field | Type | Description |
|---|---|---|
| log | String | File path for capturing rule stdout/stderr |
| benchmark | String | File path for performance metrics (wall-time, memory, CPU) |
[[rules]]
name = "align"
log = "logs/align_{sample}.log"
benchmark = "benchmarks/align_{sample}.tsv"
Job Grouping and Caching#
| Field | Type | Description |
|---|---|---|
| group | String | Job group label for cluster submission grouping |
| cache_key | String | Content-based cache key for reusing previous outputs |
[[rules]]
name = "variant_call"
group = "variant_calling" # Submit as a group on cluster
cache_key = "vc_v2.0" # Cache key for output reuse
Dynamic Input Resolution#
| Field | Type | Description |
|---|---|---|
| input_function | String | Name of a dynamic input resolver function called at runtime |
Arbitrary Metadata#
| Field | Type | Description |
|---|---|---|
| rule_metadata | Table | Domain-specific metadata (assay type, organism, protocol, etc.) |
[[rules]]
name = "wgs_align"
[rules.rule_metadata]
assay = "WGS"
organism = "Homo sapiens"
protocol = "Illumina_NovaSeq_6000"
Scatter-Gather (Legacy)#
The scatter field provides fan-out parallelism over a variable with optional
gather. For new workflows, prefer the unified transform
operator.
| Field | Type | Description |
|---|---|---|
| scatter.variable | String | Variable to scatter over (e.g., "chr") |
| scatter.values | Array | Values to scatter across |
| scatter.values_from | String | Config variable reference for values |
| scatter.gather | String | Name of the gather rule |
Expand Inputs#
The expand_inputs field generates additional input combinations via Cartesian
product expansion.
| Field | Type | Description |
|---|---|---|
| expand_inputs[].pattern | String | Input pattern with variables |
| expand_inputs[].variables | Table | Variable name → list of values or config reference |
[[rules]]
name = "multi_ref_align"
expand_inputs = [
{ pattern = "refs/{ref_genome}.fa", variables = { ref_genome = ["hg38", "t2t"] } }
]
Wildcards#
Wildcards enable dynamic, pattern-based pipeline definitions. For a detailed guide on how they are discovered, expanded, and constrained, see the Wildcards Reference.
Basic Syntax#
Use {name} in file paths for dynamic expansion:
Built-in Placeholders#
Built-in placeholders use the same syntax but have reserved meanings:
| Placeholder | Expands to |
|---|---|
| {input} | Space-separated list of all input files |
| {input[N]} | The Nth input file (0-indexed) |
| {input.name} | The input file named name from named_input |
| {output} | Space-separated list of all output files |
| {output[N]} | The Nth output file (0-indexed) |
| {output.name} | The output file named name from named_output |
| {threads} | Thread count assigned to this rule |
| {memory} | Memory allocation assigned to this rule |
| {config.*} | Value from the [config] section |
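To make the substitution rules concrete, here is a minimal Python sketch that renders a shell template with a subset of the placeholders ({input}, {input[N]}, {output}, {output[N]}, {threads}, {config.*}). The render function is hypothetical and does not reflect how oxo-flow implements interpolation; named placeholders, {memory}, and wildcards are left untouched by it.

```python
import re


def render(template, inputs, outputs, threads, config):
    """Substitute a subset of the built-in placeholders in a shell template."""
    def replace(match):
        token = match.group(1)
        indexed = re.fullmatch(r"(input|output)\[(\d+)\]", token)
        if indexed:
            files = inputs if indexed.group(1) == "input" else outputs
            return files[int(indexed.group(2))]  # 0-indexed
        if token == "input":
            return " ".join(inputs)
        if token == "output":
            return " ".join(outputs)
        if token == "threads":
            return str(threads)
        if token.startswith("config."):
            return str(config[token[len("config."):]])
        return match.group(0)  # unknown names (e.g. wildcards) stay as written

    return re.sub(r"\{([^{}]+)\}", replace, template)


command = render(
    "bwa mem -t {threads} {config.reference} {input} > {output[0]}",
    inputs=["a_R1.fq.gz", "a_R2.fq.gz"],
    outputs=["a.bam"],
    threads=16,
    config={"reference": "/data/ref/hg38.fa"},
)
print(command)
```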
Named Input & Output#
For complex rules with many files, use named_input and named_output to improve readability:
[[rules]]
name = "align"
# shell must come before the sub-tables; keys after a [table] header belong to that table
shell = "bwa mem {input.reads1} {input.reads2} > {output.bam}"
[rules.named_input]
reads1 = "raw/{sample}_R1.fastq.gz"
reads2 = "raw/{sample}_R2.fastq.gz"
[rules.named_output]
bam = "aligned/{sample}.bam"
Custom Wildcards#
Any {name} pattern not matching a built-in placeholder is treated as a wildcard. oxo-flow expands these based on:
1. File discovery: Scanning for matching files in the input path.
2. Explicit lists: Defined in [[pairs]] or [[sample_groups]].
[[pairs]] — Experiment-Control Pairing (WC-01)#
[[pairs]] defines experiment-control sample pairs for somatic variant calling and other comparative analyses.
[[pairs]]
pair_id = "CASE_001"
experiment = "EXP_01"
control = "CTRL_01"
[[pairs]]
pair_id = "CASE_002"
experiment = "EXP_02"
control = "CTRL_02"
| Field | Type | Required | Description |
|---|---|---|---|
| pair_id | String | Yes | Unique identifier for this pair |
| experiment | String | Yes | Experiment sample name (alias: tumor) |
| control | String | Yes | Matched control sample name (alias: normal) |
| experiment_type | String | No | Optional cohort label (alias: tumor_type) |
| metadata | Table | No | Arbitrary key-value pairs (each key becomes a wildcard) |
Any rule that references {experiment}, {control}, or {pair_id} in its input, output, or shell fields is automatically expanded into one concrete rule instance per pair. Rules that do not reference any pair wildcard are kept as-is.
Expanded rule naming: {rule_name}_{pair_id} (e.g., mutect2_CASE_001).
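The expansion can be pictured as plain string substitution repeated once per pair. The following Python sketch models that behaviour under simplifying assumptions: rules and pairs are plain dicts, only the three core wildcards are handled, and expand_pairs is a hypothetical name that is not part of oxo-flow.

```python
PAIR_WILDCARDS = ("pair_id", "experiment", "control")


def expand_pairs(rule, pairs):
    """Return one concrete rule per pair, or the rule as-is if it uses no pair wildcard."""
    fields = rule.get("input", []) + rule.get("output", []) + [rule.get("shell", "")]
    if not any("{" + w + "}" in text for text in fields for w in PAIR_WILDCARDS):
        return [rule]
    expanded = []
    for pair in pairs:
        def fill(text):
            for w in PAIR_WILDCARDS:
                text = text.replace("{" + w + "}", pair[w])
            return text
        expanded.append({
            "name": f"{rule['name']}_{pair['pair_id']}",
            "input": [fill(p) for p in rule.get("input", [])],
            "output": [fill(p) for p in rule.get("output", [])],
            "shell": fill(rule.get("shell", "")),
        })
    return expanded


pairs = [
    {"pair_id": "CASE_001", "experiment": "EXP_01", "control": "CTRL_01"},
    {"pair_id": "CASE_002", "experiment": "EXP_02", "control": "CTRL_02"},
]
mutect2 = {
    "name": "mutect2",
    "input": ["aligned/{experiment}.bam", "aligned/{control}.bam"],
    "output": ["variants/{pair_id}.vcf.gz"],
    "shell": "gatk Mutect2 -I {input[0]} -I {input[1]} -normal {control} -O {output[0]}",
}
instances = expand_pairs(mutect2, pairs)
print([r["name"] for r in instances])
```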
Loading pairs from external file#
For large cohort studies with hundreds or thousands of pairs, use pairs_file in [workflow]:
TSV format (tab-separated, header required):
pair_id experiment control experiment_type
CASE_001 EXP_01 CTRL_01 lung_adenocarcinoma
CASE_002 EXP_02 CTRL_02 colorectal
CASE_003 EXP_03 CTRL_03 breast_cancer
CSV format (comma-separated):
pair_id,experiment,control,experiment_type
CASE_001,EXP_01,CTRL_01,lung_adenocarcinoma
CASE_002,EXP_02,CTRL_02,colorectal
JSON format:
[
{"pair_id": "CASE_001", "experiment": "EXP_01", "control": "CTRL_01"},
{"pair_id": "CASE_002", "experiment": "EXP_02", "control": "CTRL_02"}
]
Inline [[pairs]] and pairs_file can be used together; entries from both sources are merged.
Auto-discovery from file pattern#
For workflows with existing paired files, use pairs_pattern in [workflow] to auto-discover pairs by scanning the filesystem:
[workflow]
name = "somatic-calling"
pairs_pattern = "aligned/{pair_id}/{experiment}_vs_{control}.bam"
oxo-flow scans files matching this pattern and extracts wildcards from paths. Each matching file creates one pair, for example:
- pair_id = CASE_001
- experiment = EXP_01
- control = CTRL_01
Pattern requirements:
- Must contain {pair_id}, {experiment}, and {control} wildcards
- Optional {experiment_type} wildcard also extracted
- Pattern is converted to glob (*) for filesystem scan
This eliminates the need for manual pair lists or external files when working with pre-organized directory structures.
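One way to picture the discovery step is to derive both a glob and a regular expression from the pattern. The Python sketch below does that; it is an assumption-laden illustration, not oxo-flow's implementation: the helper names are invented, wildcards are assumed not to span directory separators, and the example path is hypothetical.

```python
import re


def pattern_to_glob(pattern):
    """Replace every {wildcard} with * for the filesystem scan."""
    return re.sub(r"\{\w+\}", "*", pattern)


def pattern_to_regex(pattern):
    """Build a regex with one named group per {wildcard}."""
    parts = re.split(r"\{(\w+)\}", pattern)  # literal, name, literal, name, ...
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)  # literal text between wildcards
        else:
            regex += f"(?P<{part}>[^/]+?)"  # assumption: no "/" inside a value
    return re.compile(regex + r"\Z")


pattern = "aligned/{pair_id}/{experiment}_vs_{control}.bam"
print(pattern_to_glob(pattern))

# Hypothetical file path used only to demonstrate the extraction.
match = pattern_to_regex(pattern).match("aligned/CASE_001/EXP_01_vs_CTRL_01.bam")
pair = match.groupdict()
print(pair)
```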
Example#
[[pairs]]
pair_id = "CASE_001"
experiment = "EXP_01"
control = "CTRL_01"
[[rules]]
name = "mutect2"
input = ["aligned/{experiment}.bam", "aligned/{control}.bam"]
output = ["variants/{pair_id}.vcf.gz"]
shell = "gatk Mutect2 -I {input[0]} -I {input[1]} -normal {control} -O {output[0]}"
Produces rule mutect2_CASE_001 with concrete file paths.
See examples/paired_experiment_control_pairs.oxoflow for a full clinical somatic calling pipeline.
[[sample_groups]] — Multi-Sample Cohorts (WC-02)#
[[sample_groups]] organises samples into named groups (e.g., case vs. control) for cohort studies.
[[sample_groups]]
name = "control"
samples = ["CTRL_001", "CTRL_002", "CTRL_003"]
[[sample_groups]]
name = "case"
samples = ["CASE_001", "CASE_002"]
| Field | Type | Required | Description |
|---|---|---|---|
| name | String | Yes | Group name |
| samples | Array of strings | Yes | Sample identifiers in this group |
| metadata | Table | No | Arbitrary group-level metadata |
Any rule that references {sample} or {group} is expanded once per (group, sample) pair across all groups.
Expanded rule naming: {rule_name}_{group}_{sample} (e.g., align_control_CTRL_001).
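The (group, sample) expansion is a nested loop over groups and their samples. This short Python sketch, with a hypothetical helper name, shows the generated rule names and the wildcard values each instance would receive.

```python
def expand_sample_groups(rule_name, sample_groups):
    """Yield (expanded_name, wildcards) for every (group, sample) pair."""
    instances = []
    for group in sample_groups:
        for sample in group["samples"]:
            name = f"{rule_name}_{group['name']}_{sample}"
            instances.append((name, {"group": group["name"], "sample": sample}))
    return instances


sample_groups = [
    {"name": "control", "samples": ["CTRL_001", "CTRL_002", "CTRL_003"]},
    {"name": "case", "samples": ["CASE_001", "CASE_002"]},
]
instances = expand_sample_groups("align", sample_groups)
for name, wildcards in instances:
    print(name, wildcards)
```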
Loading groups from external file#
For large cohorts, use sample_groups_file in [workflow]:
TSV format (samples can be comma-separated within the field):
name samples
control CTRL_001,CTRL_002,CTRL_003
case CASE_001,CASE_002,CASE_003
treatment TX_001,TX_002
JSON format:
[
{"name": "control", "samples": ["CTRL_001", "CTRL_002"]},
{"name": "case", "samples": ["CASE_001", "CASE_002"]}
]
Example#
[[sample_groups]]
name = "treatment"
samples = ["S001", "S002"]
[[rules]]
name = "align"
input = ["raw/{sample}_R1.fq.gz"]
output = ["aligned/{sample}.bam"]
shell = "bwa mem ref.fa {input[0]} > {output[0]}"
Produces align_treatment_S001 and align_treatment_S002.
See examples/cohort_analysis.oxoflow for a complete cohort study pipeline.
when — Conditional Rule Execution (WF-01)#
The optional when field on a rule contains an expression evaluated against [config] values. When the expression evaluates to false the rule is skipped entirely and removed from the DAG.
[[rules]]
name = "fastqc"
when = "config.run_qc"
input = ["raw/sample_R1.fq.gz"]
output = ["qc/sample_fastqc.html"]
shell = "fastqc {input[0]} -o qc/"
Expression syntax#
| Form | Example | Description |
|---|---|---|
| config.<key> | config.run_qc | Truthy check (true, non-zero, non-empty string) |
| config.<key> == "value" | config.mode == "WGS" | String equality |
| config.<key> != "value" | config.mode != "WES" | String inequality |
| config.<key> == true\|false | config.skip == false | Boolean equality |
| config.<key> > N | config.min_cov >= 20 | Numeric comparison (>, >=, <, <=) |
| file_exists("path") | file_exists("panel.bed") | File existence test |
| !<expr> | !config.skip | Logical NOT |
| <expr> && <expr> | config.run_qc && config.min_cov >= 20 | Logical AND |
| <expr> \|\| <expr> | config.wgs \|\| config.wes | Logical OR |
| (<expr>) | (config.a && config.b) \|\| config.c | Grouping |
Example#
[config]
run_annotation = true
min_coverage = 30
mode = "WGS"
[[rules]]
name = "vep_annotate"
when = 'config.run_annotation && config.min_coverage >= 20'
# ...
[[rules]]
name = "wgs_coverage"
when = 'config.mode == "WGS"'
# ...
See examples/conditional_workflow.oxoflow for a full example.
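To show how the two when expressions in the example resolve, the sketch below evaluates them by hand in Python. It is not an expression parser: the truthy helper encodes the truthy rule from the syntax table, and treating other value types as false is an assumption.

```python
def truthy(value):
    """Truthy check from the syntax table: true, non-zero, or non-empty string."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return value != 0
    if isinstance(value, str):
        return value != ""
    return False  # assumption: any other type counts as false


config = {"run_annotation": True, "min_coverage": 30, "mode": "WGS"}

# when = 'config.run_annotation && config.min_coverage >= 20'
vep_annotate_runs = truthy(config["run_annotation"]) and config["min_coverage"] >= 20

# when = 'config.mode == "WGS"'
wgs_coverage_runs = config["mode"] == "WGS"

print(vep_annotate_runs, wgs_coverage_runs)
```

Both rules stay in the DAG for this configuration; setting run_annotation to false would remove vep_annotate.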
Dependency Resolution#
Dependencies are inferred automatically: if rule B lists a file in its input that appears in rule A's output, then B depends on A.
[[rules]]
name = "step1"
output = ["intermediate.txt"]
# ...
[[rules]]
name = "step2"
input = ["intermediate.txt"] # depends on step1
# ...
No explicit dependency declaration is needed.
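File-based inference reduces to a lookup from each output path to the rule that produces it. The Python sketch below shows the idea on concrete paths only; it ignores wildcards and depends_on, and infer_dependencies is a hypothetical name, not an oxo-flow API.

```python
def infer_dependencies(rules):
    """Map each rule name to the sorted names of the rules producing its inputs."""
    producer = {}
    for rule in rules:
        for path in rule.get("output", []):
            producer[path] = rule["name"]
    deps = {}
    for rule in rules:
        found = {producer[p] for p in rule.get("input", []) if p in producer}
        found.discard(rule["name"])  # a rule never depends on itself
        deps[rule["name"]] = sorted(found)
    return deps


rules = [
    {"name": "step1", "output": ["intermediate.txt"]},
    {"name": "step2", "input": ["intermediate.txt"]},
]
dependencies = infer_dependencies(rules)
print(dependencies)
```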
transform — Unified Scatter-Gather Operator#
The transform operator unifies split → map → combine patterns into a single rule declaration, similar to dplyr's group_by() %>% summarize() or pandas' groupby().apply().
Structure#
[[rules]]
name = "variant_calling"
input = ["aligned/sample.bam"]
output = ["variants/sample.vcf.gz"]
[rules.transform.split]
by = "chr"
values_from = "config.chromosomes"
[rules.transform]
map = "gatk HaplotypeCaller -I {input} -L {chr} -O .oxo-flow/chunks/{chr}.g.vcf.gz"
cleanup = true
[rules.transform.combine]
shell = "gatk GatherVcfs {chunks} -O {output}"
Split Configuration#
| Field | Type | Description |
|---|---|---|
| by | String | Required. Variable name for splitting (e.g., "chr", "sample") |
| values | Array | Direct list of split values |
| values_from | String | Reference to config variable (e.g., "config.chromosomes") |
| n | String | Number of chunks (generates indices 0, 1, ..., n-1) |
| glob | String | Glob pattern to find split values from files |
Priority: values → values_from → n → glob
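The priority order can be read as a chain of fallbacks. The following Python sketch resolves split values in that order. Several details are assumptions, since the page does not specify them: indices generated from n are returned as strings, glob returns the matched paths themselves, and resolve_split_values is a hypothetical name.

```python
import glob as globlib


def resolve_split_values(split, config):
    """Resolve split values using the order: values, values_from, n, glob."""
    if "values" in split:
        return list(split["values"])
    if "values_from" in split:
        key = split["values_from"].removeprefix("config.")
        return list(config[key])
    if "n" in split:
        return [str(i) for i in range(int(split["n"]))]  # 0, 1, ..., n-1
    if "glob" in split:
        return sorted(globlib.glob(split["glob"]))  # assumption: paths are the values
    raise ValueError("split needs one of: values, values_from, n, glob")


config = {"chromosomes": ["chr1", "chr2"]}
print(resolve_split_values({"by": "chr", "values_from": "config.chromosomes"}, config))
print(resolve_split_values({"by": "chunk", "n": "5"}, config))
```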
Combine Configuration#
| Field | Type | Description |
|---|---|---|
| shell | String | Shell command to combine chunks |
| aggregate | Boolean | Enable automatic aggregation |
| method | String | Aggregation method: "concat" or "json_merge" |
| header | String | Header line for concat aggregation |
Built-in Variables#
| Variable | Expands to |
|---|---|
| {split_var} | Current split value (e.g., {chr} → "chr1") |
| {chunks} | Space-separated list of all chunk outputs |
| {input} | Original rule input (in combine) |
| {output} | Original rule output (in combine) |
Modes#
Mode A: Split → Map → Combine
Classic scatter-gather with explicit combine command:
[rules.transform.split]
by = "chr"
values_from = "config.chromosomes"
[rules.transform]
map = "gatk HaplotypeCaller -I {input} -L {chr} -O .oxo-flow/chunks/{chr}.g.vcf.gz"
[rules.transform.combine]
shell = "gatk GatherVcfs {chunks} -O {output}"
Mode B: Split → Map → Aggregate
Automatic aggregation (concat or json_merge):
[rules.transform.split]
by = "chunk"
n = "5"
[rules.transform]
map = "process {input} > .oxo-flow/chunks/{chunk}.txt"
[rules.transform.combine]
aggregate = true
method = "concat"
Mode C: Split → Map (No Combine)
Parallel processing without merging — each split produces independent output:
[rules.transform.split]
by = "chr"
values_from = "config.chromosomes"
[rules.transform]
map = "samtools flagstat {input} > qc/{chr}.flagstat.txt"
# No combine section
Cleanup#
When cleanup = true, chunk files are automatically cleaned up after combine succeeds:
Failure and Retry Logic#
In a scatter-gather process, failures are handled at the chunk (map) level:
- If a single chunk fails, only that specific chunk is retried according to the rule's retries setting.
- Sibling chunks continue to process in parallel.
- The combine step will not execute until all chunks succeed. If any chunk fails exhaustively (after all retries), the combine step is cancelled.
Expanded Rule Naming#
Transform rules expand into:
- Map rules: {rule_name}_{split_value} (e.g., variant_calling_chr1)
- Combine rule: {rule_name}_combine (e.g., variant_calling_combine)
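As a small illustration of the naming scheme, the Python sketch below lists the rule names a transform would expand into. The helper name and the has_combine flag (standing in for Mode C, which has no combine section) are assumptions for this example.

```python
def transform_rule_names(rule_name, split_values, has_combine=True):
    """List the expanded rule names: one map rule per split value, then combine."""
    names = [f"{rule_name}_{value}" for value in split_values]
    if has_combine:
        names.append(f"{rule_name}_combine")
    return names


names = transform_rule_names("variant_calling", ["chr1", "chr2"])
print(names)
```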
Multi-line Strings#
Use triple quotes for multi-line shell commands:
shell = """
mkdir -p results
bwa mem -t {threads} ref.fa {input} | \
samtools sort -@ {threads} -o {output}
"""
Complete Example#
[workflow]
name = "ngs-pipeline"
version = "2.0.0"
description = "Complete NGS analysis pipeline"
author = "Genomics Core <core@example.org>"
[config]
reference = "/data/ref/hg38.fa"
known_sites = "/data/ref/known_sites.vcf.gz"
results = "results"
[defaults]
threads = 4
memory = "8G"
environment = { conda = "envs/base.yaml" }
[report]
format = ["html"]
[[rules]]
name = "fastqc"
input = ["raw/{sample}_R1.fastq.gz", "raw/{sample}_R2.fastq.gz"]
output = ["{config.results}/qc/{sample}_R1_fastqc.html"]
shell = "fastqc {input} -o {config.results}/qc/ -t {threads}"
[[rules]]
name = "trim"
input = ["raw/{sample}_R1.fastq.gz", "raw/{sample}_R2.fastq.gz"]
output = ["{config.results}/trimmed/{sample}_R1.fastq.gz"]
environment = { docker = "biocontainers/fastp:0.23.4" }
shell = "fastp --in1 {input[0]} --in2 {input[1]} --out1 {output[0]} --thread {threads}"
[[rules]]
name = "align"
input = ["{config.results}/trimmed/{sample}_R1.fastq.gz"]
output = ["{config.results}/aligned/{sample}.bam"]
threads = 16
memory = "32G"
environment = { conda = "envs/alignment.yaml" }
shell = "bwa mem -t {threads} {config.reference} {input} | samtools sort -o {output}"
JSON Schema#
oxo-flow provides a comprehensive JSON Schema for the .oxoflow format. This can be used for automated validation in your CI/CD pipelines or for real-time autocompletion and error checking in your IDE (like VS Code or IntelliJ).
Getting the Schema#
You can output the schema directly from the CLI:
IDE Configuration (VS Code)#
To enable validation in VS Code, add the following to your settings.json:
"yaml.schemas": {
"https://traitome.github.io/oxo-flow/schema/oxoflow-v1.schema.json": "*.oxoflow"
}
(Note: Although .oxoflow is TOML, many VS Code extensions can apply JSON schemas to multiple formats).
See Also#
- Create a Workflow — practical authoring guide
- DAG Engine — how dependencies are resolved
- Environment System — environment specification details