Skip to content

Workflow Format#

The .oxoflow file format is oxo-flow's TOML-based workflow definition language. This page provides the complete specification, design philosophy, and syntax rules.


Design Principles#

The .oxoflow format is built on four core principles:

  1. Declarative over Imperative — Define what should happen (inputs, outputs, tools), not how to orchestrate it. The engine handles the execution logic.
  2. Explicit is better than Implicit — Every dependency and environment should be clearly visible. No hidden global state.
  3. Composition over Inheritance — Reuse logic through modular include directives and rule templates rather than complex inheritance hierarchies.
  4. Traceability by Default — The format structure directly supports generating clinical-grade provenance and audit trails.

TOML Primer#

oxo-flow uses the TOML (Tom's Obvious, Minimal Language) format. If you are new to TOML, here are the three essential concepts used in .oxoflow files:

  1. Key-Value Pairs: key = "value". Strings must be in quotes.
  2. Tables: [name] defines a section (an object/map).
  3. Arrays of Tables: [[name]] defines a list of sections. In oxo-flow, rules are defined using double brackets because a workflow contains multiple rules.

For more details, see the Official TOML Specification.


File Extension#

Workflow files must use the .oxoflow extension (e.g., qc_pipeline.oxoflow).


Top-level Structure#

[workflow]          # Required: metadata
[config]            # Optional: user variables
[defaults]          # Optional: rule defaults
[report]            # Optional: report configuration
[[include]]         # Optional: include external workflow files
[[rules]]           # Required: one or more rules
[[pairs]]           # Optional: experiment-control pairs (WC-01)
[[sample_groups]]   # Optional: multi-sample groups (WC-02)
[resource_budget]   # Optional: resource limits
[env_groups]        # Optional: named reusable environment specs
[resource_groups]   # Optional: shared resource pools (API limits, DB connections)
[wildcard_constraints] # Optional: regex patterns to constrain wildcard values
[[execution_group]] # Optional: explicit sequential/parallel rule ordering
[cluster]           # Optional: HPC cluster profile (SLURM, PBS, SGE, LSF)
[[reference_db]]    # Optional: tracked reference database versions
[citation]          # Optional: citation metadata (DOI, authors, etc.)
[plugins]           # Optional: plugin configuration

[[include]] — Modular Workflow Composition#

Include external workflow files to enable modular, reusable workflow design:

[[include]]
path = "common/qc.oxoflow"
namespace = "qc"

[[include]]
path = "align.oxoflow"
Field Type Required Description
path String Yes Path to the included .oxoflow file
namespace String No Optional namespace prefix for included rule names

Namespace Behavior#

When a namespace is specified:

  1. All rule names from the included file are prefixed: namespace::rule_name
  2. Internal depends_on references within the included file are automatically prefixed
  3. External depends_on references (to rules outside the included file) remain unchanged

Example:

# qc.oxoflow
[[rules]]
name = "fastqc"
input = ["{sample}.fastq.gz"]
output = ["qc/{sample}_fastqc.html"]
shell = "fastqc {input}"

[[rules]]
name = "trim"
input = ["{sample}.fastq.gz"]
output = ["trimmed/{sample}.fastq.gz"]
depends_on = ["fastqc"]  # Internal reference - will become "qc::fastqc"
shell = "fastp {input} -o {output}"
# main.oxoflow
[[include]]
path = "qc.oxoflow"
namespace = "qc"

[[rules]]
name = "align"
input = ["trimmed/{sample}.fastq.gz"]
depends_on = ["qc::trim"]  # Reference to included rule with namespace
shell = "bwa mem ref.fa {input} > aligned/{sample}.bam"

Resulting rules: qc::fastqc, qc::trim, align


[workflow] — Metadata#

[workflow]
name = "my-pipeline"
version = "1.0.0"
description = "A short description"
author = "Your Name"
Field Type Required Default Description
name String Yes Pipeline name
version String No "0.1.0" Semantic version
description String No Human-readable description
author String No Author name or email
interpreter_map Table No {} Custom interpreter mapping for script extensions
genome_build String No Genome reference build identifier (e.g., "GRCh38", "hg38")
min_version String No Minimum oxo-flow version required to run this workflow
format_version String No Format specification version for compatibility
pairs_file String No External TSV/CSV/JSON file defining experiment-control pairs
sample_groups_file String No External TSV/JSON file defining sample groups
pairs_pattern String No File glob pattern for auto-discovering pairs (e.g., "aligned/{pair_id}/{exp}_vs_{ctrl}.bam")
sample_pattern String No File glob pattern for auto-discovering samples (e.g., "raw/{sample}_R1.fastq.gz")
format_version String No .oxoflow format specification version for compatibility checking

Custom Interpreters (interpreter_map)#

By default, oxo-flow auto-detects interpreters based on file extensions:

  • .pypython3
  • .R, .rRscript
  • .shsh
  • .jljulia

You can override or extend this mapping in the [workflow] section:

[workflow]
name = "custom-interpreters"

[workflow.interpreter_map]
".m" = "octave"
".sas" = "sas"
".py" = "/opt/conda/bin/python"  # Override default

This mapping applies only to the script field.


[config] — Configuration Variables#

User-defined key-value pairs accessible in rules as {config.<key>}:

[config]
reference = "/data/ref/hg38.fa"
samples_dir = "raw_data"
results_dir = "results"
min_quality = "30"

Values are TOML strings, integers, booleans, or arrays. String interpolation in rules uses {config.key} syntax.


[defaults] — Default Settings#

Applied to all rules unless explicitly overridden:

[defaults]
threads = 4
memory = "8G"
environment = { conda = "envs/base.yaml" }
Field Type Description
threads Integer Default CPU thread count
memory String Default memory allocation
environment Table Default environment specification

[report] — Report Configuration#

[report]
template = "clinical"
format = ["html", "json"]
sections = ["summary", "variants", "quality"]
Field Type Description
template String Report template name
format Array Output formats to generate
sections Array Report sections to include

[[rules]] — Rule Definitions#

Each [[rules]] entry defines a pipeline step. The double brackets indicate a TOML array of tables.

Basic example#

[[rules]]
name = "align"
input = ["{sample}_R1.fastq.gz", "{sample}_R2.fastq.gz"]
output = ["aligned/{sample}.bam"]
threads = 16
memory = "32G"
environment = { conda = "envs/alignment.yaml" }
shell = "bwa mem -t {threads} {config.reference} {input} | samtools sort -o {output}"

All fields#

Field Type Required Description
name String Yes Unique rule identifier
input Array of strings Yes Input file paths
output Array of strings Yes Output file paths
shell String No Shell command to execute
script String No Script file path (auto-detects interpreter)
description String No Human-readable description of what this rule does
threads Integer No (Deprecated) CPU threads — use resources.threads instead
memory String No (Deprecated) Memory allocation — use resources.memory instead
resources Table No Full resource specification (threads, memory, gpu, disk, time_limit, partition, groups)
environment Table No Environment specification
transform Table No Unified scatter-gather operator (split → map → combine)
when String No Conditional expression — skip rule when false
envvars Table No Dictionary of environment variables to inject
params Table No User-defined parameters for shell templates
pre_exec String No Command to run before the main shell command
on_success String No Command to run after rule succeeds
on_failure String No Command to run after rule fails (all retries exhausted)
retries Integer No Number of retry attempts on failure (default: 0)
interpreter String No Explicit interpreter for script execution
checkpoint Boolean No Rebuild DAG after this rule completes
scatter Table No Fan-out parallel execution over a variable with optional gather
expand_inputs Table No Cartesian product expansion of input patterns
priority Integer No Execution priority (higher = runs first; default: 0)
target Boolean No Mark as default target (built when no explicit -t given)
required Boolean No Pipeline fails if this rule fails, even without downstream deps
optional Boolean No Rule is skipped if its inputs don't exist (no error)
benchmark String No Benchmark output path for performance data
log String No Log file path for rule execution output
group String No Job group label for cluster submission grouping
cache_key String No Content-based cache key for output reuse
input_function String No Dynamic input resolver function name
rule_metadata Table No Arbitrary domain-specific metadata (assay, organism, etc.)
env_group String No Reference to a named environment in [env_groups]
depends_on Array No Explicit rule-level dependencies (by rule name)
extends String No Inherit settings from a base rule
retry_delay String No Delay between retries (e.g., "5s", "30s", "2m")
workdir String No Per-rule working directory override
temp_output Array No Temporary outputs cleaned up after downstream rules complete
protected_output Array No Outputs that must never be overwritten or deleted
tags Array No Categorization tags (e.g., ["qc", "alignment"])
shadow String No Shadow directory mode: "minimal", "shallow", or "full"
ancient Array No Inputs that never trigger re-execution (e.g., reference files)
localrule Boolean No Always run locally — never submit to a cluster scheduler
format_hint Array No File format hints for I/O optimization ("bam", "vcf", "fastq.gz")
pipe Boolean No Enable FIFO streaming mode for input/output
checksum String No Output integrity verification ("md5" or "sha256")
resource_hint Table No Resource estimation hints for dynamic scheduling

Note: At least one of shell or script must be provided. If both are defined, they execute sequentially: shell first, then script.

Environment specification#

# Conda
environment = { conda = "envs/tools.yaml" }

# Pixi
environment = { pixi = "envs/pixi.toml" }

# Docker
environment = { docker = "biocontainers/bwa:0.7.17" }

# Singularity
environment = { singularity = "docker://biocontainers/bwa:0.7.17" }

# Python venv
environment = { venv = "envs/requirements.txt" }

# HPC modules
environment = { modules = ["gcc/11.2.0", "openmpi/4.1.1"] }

# Conda with custom prefix
environment = { conda = "envs/qc.yaml", conda_prefix = ".oxo-flow/envs" }

# venv with custom requirements file
environment = { venv = ".venv/", venv_requirements = "envs/dev-requirements.txt" }

# Reference a named environment group (defined in [env_groups])
env_group = "qc_env"

Named Environment Groups ([env_groups])#

Instead of repeating the same environment spec across multiple rules, define named groups once in [env_groups] and reference them via env_group:

[env_groups.qc_env]
conda = "envs/qc.yaml"

[env_groups.align_env]
conda = "envs/alignment.yaml"
docker = "biocontainers/bwa:0.7.17"  # fallback

[[rules]]
name = "fastqc"
env_group = "qc_env"
input = ["raw/{sample}.fastq.gz"]
output = ["qc/{sample}_fastqc.html"]
shell = "fastqc {input} -o qc/"

Rules using env_group inherit the full environment specification from the named group. If a rule also defines an inline [rules.environment], the inline spec takes precedence.

Environment Variables (envvars)#

Inject rule-specific environment variables directly into the execution context:

[[rules]]
name = "deep_learning"
shell = "python train.py"

[rules.envvars]
CUDA_VISIBLE_DEVICES = "0"
PYTHONPATH = "./src"

Variables defined here are available to the main shell command as well as all lifecycle hooks (pre_exec, etc.).

Parameters (params)#

Define custom variables for use in shell templates. Unlike [config], which is global, params are specific to a single rule and take precedence during interpolation:

[[rules]]
name = "count_reads"
shell = "samtools view -c -q {params.min_qual} {input} > {output}"

[rules.params]
min_qual = 20

Script Execution (script)#

The script field allows you to execute external script files (Python, R, etc.) with automatic interpreter detection.

[[rules]]
name = "analyze"
script = "scripts/analysis.py --min-quality {params.q}"
interpreter = "python3" # Optional: overrides auto-detection

Interpreter Detection Order: 1. Explicit interpreter field on the rule. 2. Custom [workflow.interpreter_map] in the metadata. 3. Built-in defaults based on file extension. 4. Shebang line (if file is executable).

Lifecycle Hooks#

Hooks allow you to run auxiliary logic at different stages of a rule's life:

[[rules]]
name = "process_data"
shell = "python process.py"
pre_exec = "mkdir -p tmp_workspace"
on_success = "echo 'Success!' | slack-notify"
on_failure = "rm -rf tmp_workspace && echo 'Cleanup done'"
retries = 3
  • pre_exec: Runs before the main command. If it fails, the rule is aborted.
  • on_success: Runs only after the main command completes with exit code 0.
  • on_failure: Runs if the main command fails, after all retries have been exhausted.

Resources (extended)#

For rules needing GPU, disk, or time limits, use the resources sub-table:

[[rules]]
name = "gpu_task"
input = ["data.h5"]
output = ["model.pt"]
threads = 8
memory = "64G"
shell = "python train.py"

[rules.resources]
gpu = 1
disk = "200G"
time_limit = "48h"
Field Type Example Description
threads Integer 8 Number of CPU threads
memory String "16G" Memory allocation
gpu Integer 1 Number of GPUs
disk String "200G" Local disk space
time_limit String "48h" Wall-time limit
partition String "gpu" HPC partition/queue to submit to
groups Table {db_conn = 1} Resource group consumption tracking

Resource Management#

Declaration vs Enforcement#

oxo-flow tracks declared resources for scheduling but does not strictly enforce them in local execution. On HPC clusters, resources are enforced by the scheduler.

Local execution: - Resources are tracked to prevent over-allocation - Warnings emitted when declaring resources exceeding system capacity - Jobs may oversubscribe if user intentionally requests more than available

HPC clusters: - Resources translated to scheduler directives (SLURM, PBS, SGE, LSF) - Scheduler enforces limits - jobs requesting more than allocated will fail

Platform Detection#

Platform Thread Detection Memory Detection
Linux num_cpus crate sysinfo crate
macOS num_cpus crate sysinfo crate

Validation Warnings#

When a rule declares resources exceeding system capacity, oxo-flow emits warnings during validation but does not block execution:

⚠️  rule 'bwa_align' requests 128 threads but system has 64 (will oversubscribe)
⚠️  rule 'big_sort' requests 128GB but system has 32GB (may OOM)

This allows intentional oversubscription for testing or when user knows better.

Cleanup Behavior#

oxo-flow automatically cleans up temporary outputs:

| Scenario | Cleanup | |---|---|---| | Success + temp_output | Cleaned after successful completion | | Failure + temp_output | Cleaned to prevent stale partial files | | Transform with cleanup=true | Chunk files cleaned after combine succeeds |

Timeout Enforcement#

On Unix systems (Linux, macOS), timeout kills the entire process group, ensuring child processes don't survive:

[rules.resources]
time_limit = "4h"  # SIGKILL sent to process group after 4 hours

GPU Specification#

For detailed GPU requirements:

[rules.resources.gpu_spec]
count = 2
model = "A100"           # SLURM: --gres=gpu:a100:2
memory_gb = 40           # SLURM: --mem-per-gpu=40G
compute_capability = "8.0"  # For filtering (not scheduler directive)

Note: PBS/SGE GPU syntax varies by site. Use extra_args for site-specific flags.

Resource Hints#

When exact requirements unknown, provide hints for estimation:

[rules.resource_hint]
input_size = "medium"     # small (~1GB), medium (~10GB), large (~100GB), xlarge (~500GB)
memory_scale = 2.0        # Estimated memory = input_size × scale
runtime = "slow"          # fast (<10min), medium (10min-1h), slow (>1h)
io_bound = true           # true = I/O bound, false = CPU bound

Memory estimation formula: estimated_mb = input_size_mb × memory_scale


Script Execution#

Script Field#

Execute a script file instead of (or in addition to) a shell command:

[[rules]]
name = "analysis"
input = ["data.csv"]
output = ["results.json"]
script = "scripts/analyze.py"  # Auto-detects interpreter from extension

When both shell and script are defined, they execute sequentially: shell first, then script.

[[rules]]
name = "qc_and_report"
shell = "fastqc {input} -o qc/"
script = "reports/qc_report.qmd"  # Runs after shell completes

Interpreter Detection#

oxo-flow automatically detects the interpreter from script file extension:

Extension Interpreter Notes
.py python Python script
.R / .r Rscript R script
.jl julia Julia script
.sh / .bash bash Shell script
.pl perl Perl script
.rb ruby Ruby script
.qmd quarto render Quarto document
.Rmd / .rmd quarto render R Markdown
.ipynb jupyter nbconvert --to notebook --execute Jupyter notebook
.smk snakemake Snakemake workflow
.nextflow nextflow run Nextflow script
.wdl miniwdl run WDL workflow

Explicit Interpreter Override#

Override auto-detection with interpreter field:

[[rules]]
name = "custom_python"
script = "analyze.py3"
interpreter = "python3.11"  # Override default python

Custom Interpreter Map#

Configure custom interpreter mappings at workflow level:

[workflow]
name = "pipeline"

[workflow.interpreter_map]
".m" = "octave"        # MATLAB/Octave
".sas" = "sas"         # SAS
".do" = "stata-mp"     # Stata
".stan" = "cmdstan"    # Stan

Additional Rule Fields#

Output Management#

Field Type Description
temp_output Array Temporary outputs cleaned after downstream rules complete
protected_output Array Protected outputs never overwritten or deleted
[[rules]]
name = "align"
output = ["aligned/{sample}.bam", "aligned/{sample}.bam.bai"]
temp_output = ["aligned/{sample}.tmp.bam"]  # Cleaned after downstream use

Execution Control#

Field Type Default Description
depends_on Array Explicit rule dependencies (not inferred from files)
localrule Boolean false Always run locally, never submit to cluster
workdir String Per-rule working directory override
shadow String Atomic execution mode: "minimal", "shallow", "full"
checkpoint Boolean false Enable dynamic DAG modification
[[rules]]
name = "setup"
shell = "mkdir -p results"
depends_on = []  # Run first, before file-based dependencies

[[rules]]
name = "local_only"
shell = "echo 'local task'"
localrule = true  # Never submitted to HPC cluster

Retry Configuration#

Field Type Default Description
retries Integer 0 Number of automatic retry attempts
retry_delay String Delay between retries ("5s", "30s", "2m")
[[rules]]
name = "network_task"
shell = "curl https://api.example.com/data"
retries = 3
retry_delay = "30s"

Input/Output Hints#

Field Type Description
ancient Array Inputs that never trigger re-execution (reference files)
format_hint Array File format hints for I/O optimization ("bam", "vcf")
pipe Boolean Enable FIFO streaming mode for inputs
checksum String Output checksum algorithm ("md5", "sha256")
[[rules]]
name = "align"
input = ["reads/{sample}.fastq.gz", "ref/hg38.fa"]
ancient = ["ref/hg38.fa"]  # Reference never triggers rebuild
format_hint = ["bam"]
checksum = "sha256"

Organization#

Field Type Description
tags Array Categorization tags (["qc", "alignment"])
extends String Base rule to inherit settings from
[[rules]]
name = "align_default"
threads = 8
memory = "32G"
tags = ["alignment", "production"]

[[rules]]
name = "align_fast"
extends = "align_default"  # Inherits threads, memory, tags
threads = 16  # Override inherited value

Priority and Targeting#

Field Type Description
priority Integer Execution priority (higher runs first; default: 0)
target Boolean Mark as default target — built when no explicit -t given
[[rules]]
name = "critical_step"
priority = 10   # Runs ahead of lower-priority rules
target = true   # Included when running without -t

Optional and Required Rules#

Field Type Description
optional Boolean If true, missing inputs become warnings instead of errors
required Boolean If true, pipeline fails if this rule fails even without dependents
[[rules]]
name = "experimental"
optional = true    # Skip if input data is absent
required = true    # But if it runs, failure stops the pipeline

Logging and Benchmarking#

Field Type Description
log String File path for capturing rule stdout/stderr
benchmark String File path for performance metrics (wall-time, memory, CPU)
[[rules]]
name = "align"
log = "logs/align_{sample}.log"
benchmark = "benchmarks/align_{sample}.tsv"

Job Grouping and Caching#

Field Type Description
group String Job group label for cluster submission grouping
cache_key String Content-based cache key for reusing previous outputs
[[rules]]
name = "variant_call"
group = "variant_calling"       # Submit as a group on cluster
cache_key = "vc_v2.0"           # Cache key for output reuse

Dynamic Input Resolution#

Field Type Description
input_function String Name of a dynamic input resolver function called at runtime

Arbitrary Metadata#

Field Type Description
rule_metadata Table Domain-specific metadata (assay type, organism, protocol, etc.)
[[rules]]
name = "wgs_align"
[rules.rule_metadata]
assay = "WGS"
organism = "Homo sapiens"
protocol = "Illumina_NovaSeq_6000"

Scatter-Gather (Legacy)#

The scatter field provides fan-out parallelism over a variable with optional gather. For new workflows, prefer the unified transform operator.

Field Type Description
scatter.variable String Variable to scatter over (e.g., "chr")
scatter.values Array Values to scatter across
scatter.values_from String Config variable reference for values
scatter.gather String Name of the gather rule
[[rules]]
name = "per_chr"
scatter = { variable = "chr", values = ["chr1", "chr2", "chr3"] }

Expand Inputs#

The expand_inputs field generates additional input combinations via Cartesian product expansion.

Field Type Description
expand_inputs[].pattern String Input pattern with variables
expand_inputs[].variables Table Variable name → list of values or config reference
[[rules]]
name = "multi_ref_align"
expand_inputs = [
  { pattern = "refs/{ref_genome}.fa", variables = { ref_genome = ["hg38", "t2t"] } }
]

Wildcards#

Wildcards enable dynamic, pattern-based pipeline definitions. For a detailed guide on how they are discovered, expanded, and constrained, see the Wildcards Reference.

Basic Syntax#

Use {name} in file paths for dynamic expansion:

input = ["raw/{sample}.fastq.gz"]
output = ["aligned/{sample}.bam"]

Built-in Placeholders#

Built-in placeholders use the same syntax but have reserved meanings:

Placeholder Expands to
{input} Space-separated list of all input files
{input[N]} The Nth input file (0-indexed)
{input.name} The input file named name from named_input
{output} Space-separated list of all output files
{output[N]} The Nth output file (0-indexed)
{output.name} The output file named name from named_output
{threads} Thread count assigned to this rule
{memory} Memory allocation assigned to this rule
{config.*} Value from the [config] section

Named Input & Output#

For complex rules with many files, use named_input and named_output to improve readability:

[[rules]]
name = "align"

[rules.named_input]
reads1 = "raw/{sample}_R1.fastq.gz"
reads2 = "raw/{sample}_R2.fastq.gz"

[rules.named_output]
bam = "aligned/{sample}.bam"

shell = "bwa mem {input.reads1} {input.reads2} > {output.bam}"

Custom Wildcards#

Any {name} pattern not matching a built-in placeholder is treated as a wildcard. oxo-flow expands these based on: 1. File discovery: Scanning for matching files in the input path. 2. Explicit lists: Defined in [[pairs]] or [[sample_groups]].


[[pairs]] — Experiment-Control Pairing (WC-01)#

[[pairs]] defines experiment-control sample pairs for somatic variant calling and other comparative analyses.

[[pairs]]
pair_id = "CASE_001"
experiment = "EXP_01"
control    = "CTRL_01"

[[pairs]]
pair_id = "CASE_002"
experiment = "EXP_02"
control    = "CTRL_02"
Field Type Required Description
pair_id String Yes Unique identifier for this pair
experiment String Yes Experiment sample name (alias: tumor)
control String Yes Matched control sample name (alias: normal)
experiment_type String No Optional cohort label (alias: tumor_type)
metadata Table No Arbitrary key-value pairs (each key becomes a wildcard)

Any rule that references {experiment}, {control}, or {pair_id} in its input, output, or shell fields is automatically expanded into one concrete rule instance per pair. Rules that do not reference any pair wildcard are kept as-is.

Expanded rule naming: {rule_name}_{pair_id} (e.g., mutect2_CASE_001).

Loading pairs from external file#

For large cohort studies with hundreds or thousands of pairs, use pairs_file in [workflow]:

[workflow]
name = "somatic-calling"
pairs_file = "metadata/pairs.tsv"  # or .csv, .json

TSV format (tab-separated, header required):

pair_id    experiment    control    experiment_type
CASE_001   EXP_01        CTRL_01    lung_adenocarcinoma
CASE_002   EXP_02        CTRL_02    colorectal
CASE_003   EXP_03        CTRL_03    breast_cancer

CSV format (comma-separated):

pair_id,experiment,control,experiment_type
CASE_001,EXP_01,CTRL_01,lung_adenocarcinoma
CASE_002,EXP_02,CTRL_02,colorectal

JSON format:

[
  {"pair_id": "CASE_001", "experiment": "EXP_01", "control": "CTRL_01"},
  {"pair_id": "CASE_002", "experiment": "EXP_02", "control": "CTRL_02"}
]

Inline [[pairs]] and pairs_file can be used together; entries from both sources are merged.

Auto-discovery from file pattern#

For workflows with existing paired files, use pairs_pattern in [workflow] to auto-discover pairs by scanning the filesystem:

[workflow]
name = "somatic-calling"
pairs_pattern = "aligned/{pair_id}/{experiment}_vs_{control}.bam"

oxo-flow scans files matching this pattern and extracts wildcards from paths. For a file:

aligned/CASE_001/EXP_01_vs_CTRL_01.bam

Creates pair:

  • pair_id = CASE_001
  • experiment = EXP_01
  • control = CTRL_01

Pattern requirements: - Must contain {pair_id}, {experiment}, and {control} wildcards - Optional {experiment_type} wildcard also extracted - Pattern is converted to glob (*) for filesystem scan

This eliminates the need for manual pair lists or external files when working with pre-organized directory structures.

Example#

[[pairs]]
pair_id = "CASE_001"
experiment = "EXP_01"
control    = "CTRL_01"

[[rules]]
name   = "mutect2"
input  = ["aligned/{experiment}.bam", "aligned/{control}.bam"]
output = ["variants/{pair_id}.vcf.gz"]
shell  = "gatk Mutect2 -I {input[0]} -I {input[1]} -normal {control} -O {output[0]}"

Produces rule mutect2_CASE_001 with concrete file paths.

See examples/paired_experiment_control_pairs.oxoflow for a full clinical somatic calling pipeline.


[[sample_groups]] — Multi-Sample Cohorts (WC-02)#

[[sample_groups]] organises samples into named groups (e.g., case vs. control) for cohort studies.

[[sample_groups]]
name    = "control"
samples = ["CTRL_001", "CTRL_002", "CTRL_003"]

[[sample_groups]]
name    = "case"
samples = ["CASE_001", "CASE_002"]
Field Type Required Description
name String Yes Group name
samples Array of strings Yes Sample identifiers in this group
metadata Table No Arbitrary group-level metadata

Any rule that references {sample} or {group} is expanded once per (group, sample) pair across all groups.

Expanded rule naming: {rule_name}_{group}_{sample} (e.g., align_control_CTRL_001).

Loading groups from external file#

For large cohorts, use sample_groups_file in [workflow]:

[workflow]
name = "cohort-analysis"
sample_groups_file = "metadata/groups.tsv"  # or .csv, .json

TSV format (samples can be comma-separated within the field):

name       samples
control    CTRL_001,CTRL_002,CTRL_003
case       CASE_001,CASE_002,CASE_003
treatment  TX_001,TX_002

JSON format:

[
  {"name": "control", "samples": ["CTRL_001", "CTRL_002"]},
  {"name": "case", "samples": ["CASE_001", "CASE_002"]}
]

Example#

[[sample_groups]]
name    = "treatment"
samples = ["S001", "S002"]

[[rules]]
name   = "align"
input  = ["raw/{sample}_R1.fq.gz"]
output = ["aligned/{sample}.bam"]
shell  = "bwa mem ref.fa {input[0]} > {output[0]}"

Produces align_treatment_S001 and align_treatment_S002.

See examples/cohort_analysis.oxoflow for a complete cohort study pipeline.


when — Conditional Rule Execution (WF-01)#

The optional when field on a rule contains an expression evaluated against [config] values. When the expression evaluates to false the rule is skipped entirely and removed from the DAG.

[[rules]]
name  = "fastqc"
when  = "config.run_qc"
input = ["raw/sample_R1.fq.gz"]
output = ["qc/sample_fastqc.html"]
shell = "fastqc {input[0]} -o qc/"

Expression syntax#

Form Example Description
config.<key> config.run_qc Truthy check (true, non-zero, non-empty string)
config.<key> == "value" config.mode == "WGS" String equality
config.<key> != "value" config.mode != "WES" String inequality
config.<key> == true\|false config.skip == false Boolean equality
config.<key> > N config.min_cov >= 20 Numeric comparison (>, >=, <, <=)
file_exists("path") file_exists("panel.bed") File existence test
!<expr> !config.skip Logical NOT
<expr> && <expr> config.run_qc && config.min_cov >= 20 Logical AND
<expr> \|\| <expr> config.wgs \|\| config.wes Logical OR
(<expr>) (config.a && config.b) \|\| config.c Grouping

Example#

[config]
run_annotation = true
min_coverage   = 30
mode           = "WGS"

[[rules]]
name = "vep_annotate"
when = 'config.run_annotation && config.min_coverage >= 20'
# ...

[[rules]]
name = "wgs_coverage"
when = 'config.mode == "WGS"'
# ...

See examples/conditional_workflow.oxoflow for a full example.


Dependency Resolution#

Dependencies are inferred automatically: if rule B lists a file in its input that appears in rule A's output, then B depends on A.

[[rules]]
name = "step1"
output = ["intermediate.txt"]
# ...

[[rules]]
name = "step2"
input = ["intermediate.txt"]   # depends on step1
# ...

No explicit dependency declaration is needed.


transform — Unified Scatter-Gather Operator#

The transform operator unifies split → map → combine patterns into a single rule declaration, similar to dplyr's group_by() %>% summarize() or pandas' groupby().apply().

Structure#

[[rules]]
name = "variant_calling"
input = ["aligned/sample.bam"]
output = ["variants/sample.vcf.gz"]

[rules.transform.split]
by = "chr"
values_from = "config.chromosomes"

[rules.transform]
map = "gatk HaplotypeCaller -I {input} -L {chr} -O .oxo-flow/chunks/{chr}.g.vcf.gz"
cleanup = true

[rules.transform.combine]
shell = "gatk GatherVcfs {chunks} -O {output}"

Split Configuration#

Field Type Description
by String Required. Variable name for splitting (e.g., "chr", "sample")
values Array Direct list of split values
values_from String Reference to config variable (e.g., "config.chromosomes")
n String Number of chunks (generates indices 0, 1, ..., n-1)
glob String Glob pattern to find split values from files

Priority: valuesvalues_fromnglob

Combine Configuration#

Field Type Description
shell String Shell command to combine chunks
aggregate Boolean Enable automatic aggregation
method String Aggregation method: "concat" or "json_merge"
header String Header line for concat aggregation

Built-in Variables#

Variable Expands to
{split_var} Current split value (e.g., {chr}"chr1")
{chunks} Space-separated list of all chunk outputs
{input} Original rule input (in combine)
{output} Original rule output (in combine)

Modes#

Mode A: Split → Map → Combine

Classic scatter-gather with explicit combine command:

[rules.transform.split]
by = "chr"
values_from = "config.chromosomes"

[rules.transform]
map = "gatk HaplotypeCaller -I {input} -L {chr} -O .oxo-flow/chunks/{chr}.g.vcf.gz"

[rules.transform.combine]
shell = "gatk GatherVcfs {chunks} -O {output}"

Mode B: Split → Map → Aggregate

Automatic aggregation (concat or json_merge):

[rules.transform.split]
by = "chunk"
n = "5"

[rules.transform]
map = "process {input} > .oxo-flow/chunks/{chunk}.txt"

[rules.transform.combine]
aggregate = true
method = "concat"

Mode C: Split → Map (No Combine)

Parallel processing without merging — each split produces independent output:

[rules.transform.split]
by = "chr"
values_from = "config.chromosomes"

[rules.transform]
map = "samtools flagstat {input} > qc/{chr}.flagstat.txt"
# No combine section

Cleanup#

When cleanup = true, chunk files are automatically cleaned up after combine succeeds:

[rules.transform]
cleanup = true

Failure and Retry Logic#

In a scatter-gather process, failures are handled at the chunk (map) level:

  • If a single chunk fails, only that specific chunk is retried according to the rule's retries setting.
  • Sibling chunks continue to process in parallel.
  • The combine step will not execute until all chunks succeed. If any chunk fails exhaustively (after all retries), the combine step is cancelled.

Expanded Rule Naming#

Transform rules expand into:

  • Map rules: {rule_name}_{split_value} (e.g., variant_calling_chr1)
  • Combine rule: {rule_name}_combine (e.g., variant_calling_combine)

Multi-line Strings#

Use triple quotes for multi-line shell commands:

shell = """
mkdir -p results
bwa mem -t {threads} ref.fa {input} | \
  samtools sort -@ {threads} -o {output}
"""

Complete Example#

[workflow]
name = "ngs-pipeline"
version = "2.0.0"
description = "Complete NGS analysis pipeline"
author = "Genomics Core <core@example.org>"

[config]
reference = "/data/ref/hg38.fa"
known_sites = "/data/ref/known_sites.vcf.gz"
results = "results"

[defaults]
threads = 4
memory = "8G"
environment = { conda = "envs/base.yaml" }

[report]
format = ["html"]

[[rules]]
name = "fastqc"
input = ["raw/{sample}_R1.fastq.gz", "raw/{sample}_R2.fastq.gz"]
output = ["{config.results}/qc/{sample}_R1_fastqc.html"]
shell = "fastqc {input} -o {config.results}/qc/ -t {threads}"

[[rules]]
name = "trim"
input = ["raw/{sample}_R1.fastq.gz", "raw/{sample}_R2.fastq.gz"]
output = ["{config.results}/trimmed/{sample}_R1.fastq.gz"]
environment = { docker = "biocontainers/fastp:0.23.4" }
shell = "fastp --in1 {input[0]} --in2 {input[1]} --out1 {output[0]} --thread {threads}"

[[rules]]
name = "align"
input = ["{config.results}/trimmed/{sample}_R1.fastq.gz"]
output = ["{config.results}/aligned/{sample}.bam"]
threads = 16
memory = "32G"
environment = { conda = "envs/alignment.yaml" }
shell = "bwa mem -t {threads} {config.reference} {input} | samtools sort -o {output}"

JSON Schema#

oxo-flow provides a comprehensive JSON Schema for the .oxoflow format. This can be used for automated validation in your CI/CD pipelines or for real-time autocompletion and error checking in your IDE (like VS Code or IntelliJ).

Getting the Schema#

You can output the schema directly from the CLI:

oxo-flow schema > oxoflow.schema.json

IDE Configuration (VS Code)#

To enable validation in VS Code, add the following to your settings.json:

"yaml.schemas": {
    "https://traitome.github.io/oxo-flow/schema/oxoflow-v1.schema.json": "*.oxoflow"
}

(Note: Although .oxoflow is TOML, many VS Code extensions can apply JSON schemas to multiple formats).


See Also#