Workflow Builder Tutorial#
This tutorial teaches you how to use the oxo-call native workflow engine to build, validate, and run reproducible multi-sample bioinformatics pipelines. You will convert the manual RNA-seq steps from the previous tutorial into a single automated .oxo.toml file.
Time to complete: 20–30 minutes
Prerequisites: oxo-call configured, RNA-seq walkthrough completed (recommended)
You will learn: .oxo.toml format, wildcards, dependencies, dry-run, DAG visualization
Why Use the Workflow Engine?#
Running commands manually works for a single sample. For a real experiment with 10–100 samples, you need:
- Reproducibility: every sample processed identically
- Parallelism: independent samples run at the same time
- Caching: skip steps whose outputs already exist
- Auditability: a single file describes the entire pipeline
The native .oxo.toml workflow engine provides all of this with no external dependencies.
The Workflow File Format#
A .oxo.toml file has four sections:
[workflow] # name and description
[wildcards] # variables that expand per sample
[params] # shared configuration values
[[step]] # repeated for each pipeline step
A minimal example#
[workflow]
name = "my-pipeline"
description = "A simple two-step pipeline"
[wildcards]
sample = ["sample1", "sample2"]
[params]
threads = "4"
[[step]]
name = "qc"
cmd = "fastp --in1 data/{sample}_R1.fq.gz --in2 data/{sample}_R2.fq.gz \
--out1 trimmed/{sample}_R1.fq.gz --out2 trimmed/{sample}_R2.fq.gz \
--thread {params.threads} --html qc/{sample}.html"
inputs = ["data/{sample}_R1.fq.gz", "data/{sample}_R2.fq.gz"]
outputs = ["trimmed/{sample}_R1.fq.gz", "qc/{sample}.html"]
[[step]]
name = "align"
depends_on = ["qc"]
cmd = "STAR --genomeDir /data/star_index \
--readFilesIn trimmed/{sample}_R1.fq.gz trimmed/{sample}_R2.fq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix aligned/{sample}/ \
--runThreadN {params.threads}"
inputs = ["trimmed/{sample}_R1.fq.gz", "trimmed/{sample}_R2.fq.gz"]
outputs = ["aligned/{sample}/Aligned.sortedByCoord.out.bam"]
When you run this with sample = ["sample1", "sample2"]:
qcruns for both samples in parallelalignruns for each sample after itsqcstep completes
Step 1: Explore the Built-in RNA-seq Template#
Start by examining what a production-ready template looks like:
This prints the full .oxo.toml for the built-in RNA-seq template. Notice:
[wildcards]withsample = [...][params]forthreads,star_index, andgtf- Steps:
fastp→star→multiqc(gather) →featurecounts - The
multiqcstep hasgather = true— it runs once after all samples finish
Visualize the dependency graph:
Output:
◆ workflow 'rnaseq' — 4 step(s), 1 wildcard(s)
Phase 1 (parallel):
fastp [per-sample: sample1, sample2, sample3]
Phase 2 (parallel):
star [per-sample: sample1, sample2, sample3]
Phase 3 (gather):
multiqc [gather across all samples]
Phase 4 (parallel):
featurecounts [per-sample: sample1, sample2, sample3]
Step 2: Customize a Template for Your Data#
Save the template to a file and edit it:
Open my_rnaseq.toml and edit the wildcards and params sections:
[wildcards]
sample = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"] # your sample names
[params]
threads = "8"
star_index = "/data/star_hg38" # your STAR index
gtf = "/data/gencode.v44.gtf" # your GTF file
Also update the inputs paths in each step to match your data layout. For example, if your data is in /data/fastq/{sample}_R1.fq.gz:
[[step]]
name = "fastp"
cmd = "fastp --in1 /data/fastq/{sample}_R1.fq.gz ..."
inputs = ["/data/fastq/{sample}_R1.fq.gz", "/data/fastq/{sample}_R2.fq.gz"]
Step 3: Validate Before Running#
Always validate your workflow file before running it:
This checks for:
- Malformed TOML
- References to undefined wildcards or params
- Unknown
depends_onsteps - Step ordering violations (depending on a step defined later)
- DAG cycles
Example valid output:
Example error output:
◆ workflow 'rnaseq' — 4 step(s), 1 wildcard(s)
✗ Step 'star' depends on 'qc' which is not defined
✗ {params.star_index} is used but 'star_index' is not in [params]
Fix any errors before proceeding.
Step 4: Preview with Dry-Run#
Do a full dry-run to see every expanded command before executing:
This shows:
- DAG phase diagram
- Every expanded command (with wildcards substituted)
- Dependencies and output paths
- Which steps would be cached (outputs already newer than inputs)
Example dry-run output:
◆ Workflow: rnaseq (4 steps, 4 samples)
Phase 1 — fastp [ctrl_1]
Command: fastp --in1 /data/fastq/ctrl_1_R1.fq.gz ...
Inputs: /data/fastq/ctrl_1_R1.fq.gz, /data/fastq/ctrl_1_R2.fq.gz
Outputs: trimmed/ctrl_1_R1.fq.gz, qc/ctrl_1.html
Phase 1 — fastp [ctrl_2]
Command: fastp --in1 /data/fastq/ctrl_2_R1.fq.gz ...
...
[SKIP] Phase 2 — star [ctrl_1] (outputs up-to-date)
The [SKIP] lines tell you which steps will be cached.
Step 5: Format for Readability#
Auto-format the workflow file for consistent style:
This normalizes key alignment and quoting. Use --stdout to preview changes without modifying the file:
Step 6: Run the Workflow#
Once everything looks correct, execute:
The engine will:
- Expand wildcards for all samples
- Build the DAG
- Run Phase 1 steps (fastp) in parallel across all samples
- When all Phase 1 steps finish, run Phase 2 (STAR) in parallel
- After STAR finishes, run MultiQC as a gather step (once)
- Run featureCounts in parallel for all samples
Progress output:
[1/16] fastp ctrl_1 ... done (12.3s)
[2/16] fastp ctrl_2 ... done (11.8s)
[3/16] fastp treat_1 ... done (13.1s)
[4/16] fastp treat_2 ... done (12.7s)
[5/16] star ctrl_1 ... done (4m 12s)
...
[13/16] multiqc ... done (3.2s)
[14/16] featurecounts ctrl_1 ... done (45.2s)
...
✓ Workflow complete in 18m 32s
Step 7: Export to Snakemake or Nextflow#
If your HPC cluster requires Snakemake or Nextflow:
# Export to Snakemake
oxo-call workflow export my_rnaseq.toml --to snakemake -o Snakefile
# Export to Nextflow DSL2
oxo-call workflow export my_rnaseq.toml --to nextflow -o main.nf
The exported files preserve all sample wildcards and dependency structure.
Generate a New Workflow with LLM#
You can also ask the LLM to generate a workflow from scratch:
oxo-call workflow generate \
"ChIP-seq pipeline for H3K27ac, paired-end, with bowtie2 alignment, \
picard duplicate marking, and macs3 peak calling against input control" \
-o chipseq_h3k27ac.toml
Always validate and dry-run LLM-generated workflows before executing:
Workflow Design Tips#
Keep steps focused#
Each [[step]] should do one thing. Avoid chaining multiple tools with && unless they are tightly coupled (e.g., samtools sort && samtools index).
Always specify inputs and outputs#
The engine uses inputs and outputs for cache checking. A step without outputs will always re-run.
Use gather = true for aggregation steps#
Steps that aggregate across all samples (MultiQC, count matrix merging) should have gather = true to ensure they run after all sample instances complete.
Step order matters#
Steps must be defined in order — a step can only reference dependencies that appear before it in the file.
# ✓ CORRECT: align is defined after qc
[[step]]
name = "qc"
...
[[step]]
name = "align"
depends_on = ["qc"]
...
# ✗ WRONG: align references qc which is defined after it
[[step]]
name = "align"
depends_on = ["qc"]
...
[[step]]
name = "qc"
...
What You Learned#
- How to write a
.oxo.tomlworkflow file from scratch - How wildcards expand per-sample commands
- How
gather = trueenables aggregation steps like MultiQC - How to validate, visualize, dry-run, and execute a workflow
- How to export to Snakemake or Nextflow
- How to generate a workflow from natural language
Next steps: - Build pipeline how-to — advanced pipeline patterns - Workflow Engine reference — complete format specification - workflow command reference — all subcommands