How-to: Build a Production Pipeline#
This guide covers advanced patterns for building robust, production-ready bioinformatics pipelines with the oxo-call workflow engine.
Pipeline Design Checklist#
Before writing a .oxo.toml file for a real project:
- Define all samples in
[wildcards] - Extract shared configuration into
[params] - Specify
inputsandoutputsfor every step (enables caching) - Use
depends_onto express all dependencies explicitly - Use
gather = truefor aggregation steps - Run
oxo-call workflow verifybefore any run - Run
oxo-call workflow dry-runto inspect every expanded command - Run
oxo-call workflow visto confirm the DAG looks correct
Pattern 1: Per-Sample Steps with Shared Parameters#
The most common pattern: each sample runs through the same steps, sharing configuration:
[workflow]
name = "wgs-variant-calling"
description = "WGS variant calling pipeline: alignment → BQSR → HaplotypeCaller"
[wildcards]
sample = ["sample_A", "sample_B", "sample_C", "sample_D"]
[params]
threads = "16"
ref = "/data/hg38/hg38.fa"
known = "/data/hg38/dbsnp_146.hg38.vcf.gz"
intervals = "/data/hg38/wgs_calling_regions.hg38.interval_list"
[[step]]
name = "bwa_align"
cmd = "bwa-mem2 mem -t {params.threads} {params.ref} \
data/{sample}_R1.fq.gz data/{sample}_R2.fq.gz \
| samtools sort -@ 4 -o aligned/{sample}.bam && \
samtools index aligned/{sample}.bam"
inputs = ["data/{sample}_R1.fq.gz", "data/{sample}_R2.fq.gz"]
outputs = ["aligned/{sample}.bam", "aligned/{sample}.bam.bai"]
[[step]]
name = "mark_duplicates"
depends_on = ["bwa_align"]
cmd = "picard MarkDuplicates \
I=aligned/{sample}.bam \
O=dedup/{sample}.bam \
M=dedup/{sample}_metrics.txt && \
samtools index dedup/{sample}.bam"
inputs = ["aligned/{sample}.bam"]
outputs = ["dedup/{sample}.bam", "dedup/{sample}.bam.bai"]
[[step]]
name = "bqsr"
depends_on = ["mark_duplicates"]
cmd = "gatk BaseRecalibrator \
-I dedup/{sample}.bam \
-R {params.ref} \
--known-sites {params.known} \
-O bqsr/{sample}.recal.table && \
gatk ApplyBQSR \
-I dedup/{sample}.bam \
-R {params.ref} \
--bqsr-recal-file bqsr/{sample}.recal.table \
-O bqsr/{sample}.bam"
inputs = ["dedup/{sample}.bam"]
outputs = ["bqsr/{sample}.bam"]
[[step]]
name = "haplotypecaller"
depends_on = ["bqsr"]
cmd = "gatk HaplotypeCaller \
-I bqsr/{sample}.bam \
-R {params.ref} \
-L {params.intervals} \
-O gvcf/{sample}.g.vcf.gz \
-ERC GVCF \
--native-pair-hmm-threads 4"
inputs = ["bqsr/{sample}.bam"]
outputs = ["gvcf/{sample}.g.vcf.gz"]
Pattern 2: Gather Steps#
Gather steps aggregate results across all samples and run exactly once after all instances of their dependencies complete:
[[step]]
name = "combine_gvcfs"
gather = true
depends_on = ["haplotypecaller"]
cmd = "gatk CombineGVCFs \
-R {params.ref} \
$(ls gvcf/*.g.vcf.gz | sed 's/^/-V /') \
-O combined/cohort.g.vcf.gz"
inputs = ["gvcf/{sample}.g.vcf.gz"]
outputs = ["combined/cohort.g.vcf.gz"]
[[step]]
name = "genotype_gvcfs"
depends_on = ["combine_gvcfs"]
cmd = "gatk GenotypeGVCFs \
-R {params.ref} \
-V combined/cohort.g.vcf.gz \
-O final/cohort.vcf.gz"
inputs = ["combined/cohort.g.vcf.gz"]
outputs = ["final/cohort.vcf.gz"]
With 4 samples, the execution order is:
bwa_align× 4 (parallel)mark_duplicates× 4 (parallel)bqsr× 4 (parallel)haplotypecaller× 4 (parallel)combine_gvcfs× 1 (gather — waits for all 4)genotype_gvcfs× 1 (sequential after combine)
Pattern 3: Mixed Wildcards#
Some steps may process pairs of conditions rather than individual samples:
Steps using {sample} expand per sample. Steps using {condition} expand per condition.
Note: Mixed wildcards are advanced and require care. Use
workflow verifyto check for expansion errors.
Pattern 4: Conditional Output Paths#
Use parameter substitution to control output organization:
[params]
outdir = "results/v2"
threads = "8"
[[step]]
name = "fastp"
cmd = "fastp --in1 data/{sample}_R1.fq.gz --in2 data/{sample}_R2.fq.gz \
--out1 {params.outdir}/trimmed/{sample}_R1.fq.gz \
--out2 {params.outdir}/trimmed/{sample}_R2.fq.gz \
--thread {params.threads}"
outputs = ["{params.outdir}/trimmed/{sample}_R1.fq.gz"]
Changing outdir in [params] moves all output paths to a new versioned directory.
Restarting a Failed Pipeline#
If a step fails mid-run, fix the issue and re-run:
The engine automatically skips steps whose outputs are already newer than their inputs. Only failed and downstream steps will re-run.
To force a specific step to re-run, delete its output files:
rm aligned/sample_A.bam aligned/sample_A.bam.bai
oxo-call workflow run my_pipeline.toml
# Only bwa_align for sample_A and its downstream steps will re-run
HPC/Cluster Submission#
The native engine runs on the current machine. For HPC cluster execution, export to Snakemake with a cluster profile:
oxo-call workflow export my_pipeline.toml --to snakemake -o Snakefile
# Run on SLURM cluster
snakemake --cluster "sbatch -p short -c {threads} --mem=16G" --jobs 50
# Run with Singularity containers (if workflow has container: directives)
snakemake --use-singularity --cluster "sbatch ..." --jobs 50
Or export to Nextflow for cloud execution:
oxo-call workflow export my_pipeline.toml --to nextflow -o main.nf
# Run locally
nextflow run main.nf
# Run on AWS
nextflow run main.nf -profile aws
Troubleshooting Pipeline Issues#
Verify first#
Fix all reported errors before running.
Check the DAG#
Confirm the phases and dependencies look correct.
Dry-run to inspect commands#
Common errors#
| Error | Cause | Fix |
|---|---|---|
step 'X' depends on 'Y' which is not defined |
Typo in depends_on |
Check step names |
{params.X} is used but 'X' is not in [params] |
Missing param key | Add to [params] |
forward reference to step 'Y' |
Step ordering | Move the referenced step before the current step |
DAG cycle detected |
Circular dependency | Break the cycle |
| Step always re-runs | No outputs defined |
Add outputs list |
Related#
- Workflow Builder tutorial — step-by-step for beginners
- Workflow Engine reference — complete format specification
- workflow command reference — all subcommands