Workflow Builder Tutorial#

This tutorial teaches you how to use the oxo-call native workflow engine to build, validate, and run reproducible multi-sample bioinformatics pipelines. You will convert the manual RNA-seq steps from the previous tutorial into a single automated .oxo.toml file.

Time to complete: 20–30 minutes
Prerequisites: oxo-call configured, RNA-seq walkthrough completed (recommended)
You will learn: .oxo.toml format, wildcards, dependencies, dry-run, DAG visualization


Why Use the Workflow Engine?#

Running commands manually works for a single sample. For a real experiment with 10–100 samples, you need:

  • Reproducibility: every sample processed identically
  • Parallelism: independent samples run at the same time
  • Caching: skip steps whose outputs already exist
  • Auditability: a single file describes the entire pipeline

The native .oxo.toml workflow engine provides all of this with no external dependencies.


The Workflow File Format#

A .oxo.toml file has four sections:

[workflow]       # name and description
[wildcards]      # variables that expand per sample
[params]         # shared configuration values
[[step]]         # repeated for each pipeline step

A minimal example#

[workflow]
name        = "my-pipeline"
description = "A simple two-step pipeline"

[wildcards]
sample = ["sample1", "sample2"]

[params]
threads = "4"

[[step]]
name    = "qc"
# Triple-quoted string: TOML only allows backslash line continuations in multi-line strings
cmd     = """fastp --in1 data/{sample}_R1.fq.gz --in2 data/{sample}_R2.fq.gz \
           --out1 trimmed/{sample}_R1.fq.gz --out2 trimmed/{sample}_R2.fq.gz \
           --thread {params.threads} --html qc/{sample}.html"""
inputs  = ["data/{sample}_R1.fq.gz", "data/{sample}_R2.fq.gz"]
outputs = ["trimmed/{sample}_R1.fq.gz", "qc/{sample}.html"]

[[step]]
name       = "align"
depends_on = ["qc"]
# Triple-quoted string: TOML only allows backslash line continuations in multi-line strings
cmd        = """STAR --genomeDir /data/star_index \
              --readFilesIn trimmed/{sample}_R1.fq.gz trimmed/{sample}_R2.fq.gz \
              --readFilesCommand zcat \
              --outSAMtype BAM SortedByCoordinate \
              --outFileNamePrefix aligned/{sample}/ \
              --runThreadN {params.threads}"""
inputs     = ["trimmed/{sample}_R1.fq.gz", "trimmed/{sample}_R2.fq.gz"]
outputs    = ["aligned/{sample}/Aligned.sortedByCoord.out.bam"]

When you run this with sample = ["sample1", "sample2"]:

  • qc runs for both samples in parallel
  • align runs for each sample after its qc step completes
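
To make the expansion concrete, here is a minimal Python sketch of how a command template could be turned into one command per sample. It is an illustration only, not the oxo-call implementation: the function name expand_command and the plain string-replacement strategy are assumptions made for this example.

```python
def expand_command(template, wildcard, values, params):
    """Return one concrete command per wildcard value (illustrative helper)."""
    # Shared params are identical for every sample, so substitute them once.
    for key, value in params.items():
        template = template.replace("{params." + key + "}", str(value))
    # Then substitute the wildcard once per value.
    return [template.replace("{" + wildcard + "}", v) for v in values]


template = "fastp --in1 data/{sample}_R1.fq.gz --thread {params.threads}"
commands = expand_command(template, "sample", ["sample1", "sample2"], {"threads": "4"})
for command in commands:
    print(command)
```

Each sample gets its own fully resolved command, which is why independent samples can run side by side.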

Step 1: Explore the Built-in RNA-seq Template#

Start by examining what a production-ready template looks like:

oxo-call workflow show rnaseq

This prints the full .oxo.toml for the built-in RNA-seq template. Notice:

  • [wildcards] with sample = [...]
  • [params] for threads, star_index, and gtf
  • Steps: fastp → star → multiqc (gather) → featurecounts
  • The multiqc step has gather = true — it runs once after all samples finish

Visualize the dependency graph:

oxo-call workflow vis rnaseq

Output:

◆ workflow 'rnaseq' — 4 step(s), 1 wildcard(s)

Phase 1 (parallel):
  fastp  [per-sample: sample1, sample2, sample3]

Phase 2 (parallel):
  star  [per-sample: sample1, sample2, sample3]

Phase 3 (gather):
  multiqc  [gather across all samples]

Phase 4 (parallel):
  featurecounts  [per-sample: sample1, sample2, sample3]
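
The phase grouping above can be reproduced with a short Python sketch. This is a hedged illustration rather than the engine's actual algorithm: compute_phases is a hypothetical helper, and the depends_on values below are assumed from the phase order shown, not copied from the built-in template.

```python
def compute_phases(steps):
    """Group step names into phases: a step lands one phase after its deepest dependency."""
    phase_of = {}
    for step in steps:  # assumes every step is listed after its dependencies
        deps = step.get("depends_on", [])
        phase_of[step["name"]] = 1 + max((phase_of[d] for d in deps), default=0)
    phases = {}
    for name, phase in phase_of.items():
        phases.setdefault(phase, []).append(name)
    return [phases[p] for p in sorted(phases)]


# Assumed dependency chain for illustration only.
steps = [
    {"name": "fastp"},
    {"name": "star", "depends_on": ["fastp"]},
    {"name": "multiqc", "depends_on": ["star"], "gather": True},
    {"name": "featurecounts", "depends_on": ["multiqc"]},
]
phases = compute_phases(steps)
for number, names in enumerate(phases, start=1):
    print(f"Phase {number}: {', '.join(names)}")
```

Steps with no dependency between them would fall into the same phase and could run at the same time.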

Step 2: Customize a Template for Your Data#

Save the template to a file and edit it:

oxo-call workflow show rnaseq > my_rnaseq.toml

Open my_rnaseq.toml and edit the wildcards and params sections:

[wildcards]
sample = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]   # your sample names

[params]
threads    = "8"
star_index = "/data/star_hg38"                          # your STAR index
gtf        = "/data/gencode.v44.gtf"                    # your GTF file

Also update the paths in each step's cmd and inputs to match your data layout. For example, if your FASTQ files follow the pattern /data/fastq/{sample}_R1.fq.gz:

[[step]]
name   = "fastp"
cmd    = "fastp --in1 /data/fastq/{sample}_R1.fq.gz ..."
inputs = ["/data/fastq/{sample}_R1.fq.gz", "/data/fastq/{sample}_R2.fq.gz"]

Step 3: Validate Before Running#

Always validate your workflow file before running it:

oxo-call workflow verify my_rnaseq.toml

This checks for:

  • Malformed TOML
  • References to undefined wildcards or params
  • Unknown depends_on steps
  • Step ordering violations (depending on a step defined later)
  • DAG cycles

Example valid output:

◆ workflow 'rnaseq' — 4 step(s), 1 wildcard(s)
✓ No issues found — workflow is valid

Example error output:

◆ workflow 'rnaseq' — 4 step(s), 1 wildcard(s)
✗ Step 'star' depends on 'qc' which is not defined
✗ {params.star_index} is used but 'star_index' is not in [params]

Fix any errors before proceeding.
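
The checks listed above are simple to reason about once written out. The following Python sketch is an assumption-laden illustration of that kind of validation, not the code behind oxo-call workflow verify: the verify function, the dictionary layout, and the wording of the "defined later" and wildcard messages are hypothetical.

```python
import re


def verify(workflow):
    """Return a list of problems; an empty list means the workflow looks valid."""
    errors = []
    wildcards = set(workflow.get("wildcards", {}))
    params = set(workflow.get("params", {}))
    all_names = [step["name"] for step in workflow["steps"]]
    seen = set()  # names of steps defined so far, in file order
    for step in workflow["steps"]:
        for dep in step.get("depends_on", []):
            if dep not in all_names:
                errors.append(f"Step '{step['name']}' depends on '{dep}' which is not defined")
            elif dep not in seen:
                errors.append(f"Step '{step['name']}' depends on '{dep}' which is defined later")
        # Find every {placeholder} in the command and check that it is declared.
        for ref in re.findall(r"\{([A-Za-z_][\w.]*)\}", step.get("cmd", "")):
            if ref.startswith("params."):
                key = ref[len("params."):]
                if key not in params:
                    errors.append(f"{{params.{key}}} is used but '{key}' is not in [params]")
            elif ref not in wildcards:
                errors.append(f"{{{ref}}} is used but '{ref}' is not in [wildcards]")
        seen.add(step["name"])
    return errors


workflow = {
    "wildcards": {"sample": ["ctrl_1", "ctrl_2"]},
    "params": {"threads": "8"},
    "steps": [
        {"name": "fastp", "cmd": "fastp --in1 {sample}_R1.fq.gz --thread {params.threads}"},
        {"name": "star", "depends_on": ["qc"], "cmd": "STAR --genomeDir {params.star_index}"},
    ],
}
errors = verify(workflow)
for error in errors:
    print("✗", error)
```

Because a step may only depend on steps defined before it, a workflow that passes the ordering check in this sketch cannot contain a dependency cycle.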


Step 4: Preview with Dry-Run#

Do a full dry-run to see every expanded command before executing:

oxo-call workflow dry-run my_rnaseq.toml

This shows:

  • DAG phase diagram
  • Every expanded command (with wildcards substituted)
  • Dependencies and output paths
  • Which steps would be skipped because their outputs are already newer than their inputs

Example dry-run output:

◆ Workflow: rnaseq (4 steps, 4 samples)

Phase 1 — fastp [ctrl_1]
  Command: fastp --in1 /data/fastq/ctrl_1_R1.fq.gz ...
  Inputs:  /data/fastq/ctrl_1_R1.fq.gz, /data/fastq/ctrl_1_R2.fq.gz
  Outputs: trimmed/ctrl_1_R1.fq.gz, qc/ctrl_1.html

Phase 1 — fastp [ctrl_2]
  Command: fastp --in1 /data/fastq/ctrl_2_R1.fq.gz ...
  ...

[SKIP] Phase 2 — star [ctrl_1]  (outputs up-to-date)

The [SKIP] lines mark steps that will not be re-run because their outputs are already up to date.
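
A timestamp-based freshness check of this kind can be sketched in a few lines of Python. This is only an illustration of the "outputs newer than inputs" rule described above. The helper is_up_to_date and its handling of edge cases (missing files, equal timestamps) are assumptions, and the real engine may decide differently.

```python
import os
import tempfile


def is_up_to_date(inputs, outputs):
    """True when every output exists and is at least as new as the newest input."""
    if not outputs or not all(os.path.exists(p) for p in outputs):
        return False  # nothing declared, or an output is missing: the step must run
    if not all(os.path.exists(p) for p in inputs):
        return False  # cannot compare against a missing input
    newest_input = max((os.path.getmtime(p) for p in inputs), default=0.0)
    oldest_output = min(os.path.getmtime(p) for p in outputs)
    return oldest_output >= newest_input


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.fq.gz")
    out = os.path.join(tmp, "out.bam")
    open(src, "w").close()
    open(out, "w").close()
    os.utime(src, (1000, 1000))   # input last modified at t=1000
    os.utime(out, (2000, 2000))   # output written later, at t=2000
    fresh = is_up_to_date([src], [out])
    os.utime(src, (3000, 3000))   # input changed after the output was written
    stale = is_up_to_date([src], [out])
    no_outputs = is_up_to_date([src], [])

print(fresh, stale, no_outputs)
```

In this sketch a step with no declared outputs is never considered up to date, so it would run every time.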


Step 5: Format for Readability#

Auto-format the workflow file for consistent style:

oxo-call workflow fmt my_rnaseq.toml

This normalizes key alignment and quoting. Use --stdout to preview changes without modifying the file:

oxo-call workflow fmt my_rnaseq.toml --stdout

Step 6: Run the Workflow#

Once everything looks correct, execute:

oxo-call workflow run my_rnaseq.toml

The engine will:

  1. Expand wildcards for all samples
  2. Build the DAG
  3. Run Phase 1 steps (fastp) in parallel across all samples
  4. When all Phase 1 steps finish, run Phase 2 (STAR) in parallel
  5. After STAR finishes, run MultiQC as a gather step (once)
  6. Run featureCounts in parallel for all samples

Progress output:

[1/13] fastp ctrl_1        ... done (12.3s)
[2/13] fastp ctrl_2        ... done (11.8s)
[3/13] fastp treat_1       ... done (13.1s)
[4/13] fastp treat_2       ... done (12.7s)
[5/13] star ctrl_1         ... done (4m 12s)
...
[9/13] multiqc             ... done (3.2s)
[10/13] featurecounts ctrl_1  ... done (45.2s)
...
✓ Workflow complete in 18m 32s

Step 7: Export to Snakemake or Nextflow#

If your HPC cluster requires Snakemake or Nextflow, export the workflow to either format:

# Export to Snakemake
oxo-call workflow export my_rnaseq.toml --to snakemake -o Snakefile

# Export to Nextflow DSL2
oxo-call workflow export my_rnaseq.toml --to nextflow -o main.nf

The exported files preserve all sample wildcards and dependency structure.


Generate a New Workflow with LLM#

You can also ask the LLM to generate a workflow from scratch:

oxo-call workflow generate \
  "ChIP-seq pipeline for H3K27ac, paired-end, with bowtie2 alignment, \
   picard duplicate marking, and macs3 peak calling against input control" \
  -o chipseq_h3k27ac.toml

Always validate and dry-run LLM-generated workflows before executing:

oxo-call workflow verify chipseq_h3k27ac.toml
oxo-call workflow dry-run chipseq_h3k27ac.toml

Workflow Design Tips#

Keep steps focused#

Each [[step]] should do one thing. Avoid chaining multiple tools with && unless they are tightly coupled (e.g., samtools sort && samtools index).

Always specify inputs and outputs#

The engine uses inputs and outputs for cache checking. A step without outputs will always re-run.

Use gather = true for aggregation steps#

Steps that aggregate across all samples (MultiQC, count matrix merging) should have gather = true to ensure they run after all sample instances complete.
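
The difference between per-sample and gather steps is easiest to see by listing the tasks each one produces. The Python sketch below is a hypothetical illustration (expand_tasks is not an oxo-call function), assuming a gather step yields exactly one task regardless of the number of samples.

```python
def expand_tasks(steps, samples):
    """Return (step_name, sample) pairs; a gather step yields one task with sample=None."""
    tasks = []
    for step in steps:
        if step.get("gather", False):
            tasks.append((step["name"], None))  # runs once, across all samples
        else:
            tasks.extend((step["name"], sample) for sample in samples)
    return tasks


steps = [
    {"name": "fastp"},
    {"name": "star"},
    {"name": "multiqc", "gather": True},
]
tasks = expand_tasks(steps, ["sample1", "sample2"])
for name, sample in tasks:
    print(name, sample if sample is not None else "(all samples)")
```

With two samples, the two per-sample steps contribute four tasks and the gather step contributes one, for five tasks in total.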

Step order matters#

Steps must be defined in order — a step can only reference dependencies that appear before it in the file.

# ✓ CORRECT: align is defined after qc
[[step]]
name = "qc"
...

[[step]]
name       = "align"
depends_on = ["qc"]
...

# ✗ WRONG: align references qc which is defined after it
[[step]]
name       = "align"
depends_on = ["qc"]
...

[[step]]
name = "qc"
...

What You Learned#

  • How to write a .oxo.toml workflow file from scratch
  • How wildcards expand per-sample commands
  • How gather = true enables aggregation steps like MultiQC
  • How to validate, visualize, dry-run, and execute a workflow
  • How to export to Snakemake or Nextflow
  • How to generate a workflow from natural language

Next steps:

  • Build pipeline how-to: advanced pipeline patterns
  • Workflow Engine reference: complete format specification
  • workflow command reference: all subcommands