Your First Workflow

This tutorial walks you through building a realistic bioinformatics workflow from scratch. You will create a quality-control pipeline that processes FASTQ files through FastQC and fastp, then generates a summary report.

Prerequisites

oxo-flow installed
Paired-end FASTQ files (or the willingness to create test files)
conda or mamba available for environment management (if you have neither, we recommend Miniforge or Mambaforge)

1. Set up the project

oxo-flow init qc-pipeline
cd qc-pipeline

2. Create environment files

Create a conda environment file for the QC tools:

# envs/qc.yaml
name: qc
channels:
  - bioconda
  - conda-forge
dependencies:
  - fastqc=0.12.1
  - fastp=0.23.4
  - multiqc=1.21

3. Write the workflow

!!! info "Configuration Syntax"
    `{config.samples_dir}` refers to the `samples_dir` variable defined in the `[config]` section. Defining paths and settings there lets you change them in one place.

Replace qc-pipeline.oxoflow with:

[workflow]
name = "qc-pipeline"
version = "1.0.0"
description = "Quality control for paired-end sequencing data"
author = "Your Name"

[config]
samples_dir = "raw_data"
results_dir = "results"

[defaults]
threads = 4
memory = "8G"

!!! info "Wildcard Patterns"
    The `{sample}` in the file paths below is a **wildcard**. oxo-flow will scan your `raw_data` directory for files matching the pattern `{sample}_R1.fastq.gz`, extract the sample name, and automatically generate a task for every sample it finds.

[[rules]]
name = "fastqc_raw"
input = [
    "{config.samples_dir}/{sample}_R1.fastq.gz",
    "{config.samples_dir}/{sample}_R2.fastq.gz"
]
output = [
    "{config.results_dir}/fastqc/{sample}_R1_fastqc.html",
    "{config.results_dir}/fastqc/{sample}_R1_fastqc.zip",
    "{config.results_dir}/fastqc/{sample}_R2_fastqc.html",
    "{config.results_dir}/fastqc/{sample}_R2_fastqc.zip"
]
environment = { conda = "envs/qc.yaml" }
shell = """
mkdir -p {config.results_dir}/fastqc
fastqc {input} -o {config.results_dir}/fastqc -t {threads}
"""

[[rules]]
name = "fastp_trim"
input = [
    "{config.samples_dir}/{sample}_R1.fastq.gz",
    "{config.samples_dir}/{sample}_R2.fastq.gz"
]
output = [
    "{config.results_dir}/trimmed/{sample}_R1.fastq.gz",
    "{config.results_dir}/trimmed/{sample}_R2.fastq.gz",
    "{config.results_dir}/trimmed/{sample}_fastp.html",
    "{config.results_dir}/trimmed/{sample}_fastp.json"
]
environment = { conda = "envs/qc.yaml" }
shell = """
mkdir -p {config.results_dir}/trimmed
fastp \
  --in1 {config.samples_dir}/{sample}_R1.fastq.gz \
  --in2 {config.samples_dir}/{sample}_R2.fastq.gz \
  --out1 {config.results_dir}/trimmed/{sample}_R1.fastq.gz \
  --out2 {config.results_dir}/trimmed/{sample}_R2.fastq.gz \
  --html {config.results_dir}/trimmed/{sample}_fastp.html \
  --json {config.results_dir}/trimmed/{sample}_fastp.json \
  --thread {threads}
"""

[[rules]]
name = "fastqc_trimmed"
input = [
    "{config.results_dir}/trimmed/{sample}_R1.fastq.gz",
    "{config.results_dir}/trimmed/{sample}_R2.fastq.gz"
]
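output = [
    "{config.results_dir}/fastqc_trimmed/{sample}_R1_fastqc.html",
    "{config.results_dir}/fastqc_trimmed/{sample}_R1_fastqc.zip",
    "{config.results_dir}/fastqc_trimmed/{sample}_R2_fastqc.html",
    "{config.results_dir}/fastqc_trimmed/{sample}_R2_fastqc.zip"
]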
environment = { conda = "envs/qc.yaml" }
shell = """
mkdir -p {config.results_dir}/fastqc_trimmed
fastqc {input} -o {config.results_dir}/fastqc_trimmed -t {threads}
"""

[[rules]]
name = "multiqc"
input = [
    "{config.results_dir}/fastqc/",
    "{config.results_dir}/fastqc_trimmed/",
    "{config.results_dir}/trimmed/"
]
output = [
    "{config.results_dir}/multiqc/multiqc_report.html"
]
threads = 1
environment = { conda = "envs/qc.yaml" }
shell = """
mkdir -p {config.results_dir}/multiqc
multiqc {config.results_dir} -o {config.results_dir}/multiqc --force
"""
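To make the two kinds of placeholders concrete, the shell sketch below mimics what wildcard discovery and config substitution amount to for the `fastqc_raw` inputs. It is only an illustration under the tutorial's directory layout, not oxo-flow's implementation; `list_samples` and `expand_pattern` are hypothetical helpers.

```bash
# Hypothetical helper: print one sample name per line for every file
# in <dir> that matches <sample>_R1.fastq.gz.
list_samples() {
    local dir="$1" path name
    for path in "$dir"/*_R1.fastq.gz; do
        [ -e "$path" ] || continue          # the glob matched nothing
        name="${path##*/}"                  # drop the directory part
        printf '%s\n' "${name%_R1.fastq.gz}"
    done
}

# Hypothetical helper: substitute the tutorial's placeholders in a pattern.
# Usage: expand_pattern <pattern> <samples_dir> <results_dir> <sample>
# Assumes the values contain no "|", "&", or backslash characters.
expand_pattern() {
    printf '%s\n' "$1" | sed \
        -e "s|{config\.samples_dir}|$2|g" \
        -e "s|{config\.results_dir}|$3|g" \
        -e "s|{sample}|$4|g"
}

# Print the concrete fastqc_raw inputs for every sample found in raw_data.
for sample in $(list_samples raw_data); do
    expand_pattern '{config.samples_dir}/{sample}_R1.fastq.gz' raw_data results "$sample"
    expand_pattern '{config.samples_dir}/{sample}_R2.fastq.gz' raw_data results "$sample"
done
```

Once `raw_data` holds R1/R2 pairs for two samples, the loop prints four paths, one per input file. With an empty or missing `raw_data` it prints nothing.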

4. Understand the dependency graph

The workflow forms this DAG:

graph TD
    A[fastqc_raw] --> D[multiqc]
    B[fastp_trim] --> C[fastqc_trimmed]
    B --> D
    C --> D

fastqc_raw and fastp_trim can run in parallel (no dependency between them)
fastqc_trimmed depends on fastp_trim's output
multiqc depends on all three upstream rules
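One way to see the ordering this graph implies is to feed its edges to the standard `tsort` utility, which prints one valid topological order. This only illustrates the ordering constraint; it makes no claim about how oxo-flow schedules jobs internally.

```bash
# Each line is "upstream downstream", copied from the edges of the graph above.
edges='fastqc_raw multiqc
fastp_trim fastqc_trimmed
fastp_trim multiqc
fastqc_trimmed multiqc'

# tsort prints the rules so that every upstream rule comes before
# the rules that depend on it.
order="$(printf '%s\n' "$edges" | tsort)"
printf '%s\n' "$order"
```

More than one order satisfies the graph, so the position of `fastqc_raw` may vary between systems. In every valid order, `fastp_trim` precedes `fastqc_trimmed` and `multiqc` comes last.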

5. Prepare test data

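For this tutorial, create placeholder files so oxo-flow has something to match. Each file contains a single header line rather than a complete four-line FASTQ record, which is enough for validating and previewing the workflow but will likely cause FastQC and fastp to fail in a real run: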

mkdir -p raw_data
# Create dummy compressed fastq files
echo "@test1" | gzip > raw_data/sample1_R1.fastq.gz
echo "@test1" | gzip > raw_data/sample1_R2.fastq.gz
echo "@test2" | gzip > raw_data/sample2_R1.fastq.gz
echo "@test2" | gzip > raw_data/sample2_R2.fastq.gz
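For a run that actually invokes FastQC and fastp, the inputs need complete four-line FASTQ records. The sketch below is one way to write small, well-formed paired files. `make_fastq` is a hypothetical helper, the read sequence is arbitrary, and the resulting data is not biologically meaningful. It has not been checked against specific tool versions.

```bash
# Hypothetical helper: write <count> four-line FASTQ records to a gzipped file.
# Usage: make_fastq <output.fastq.gz> <read-name-prefix> <count>
make_fastq() {
    local out="$1" prefix="$2" count="$3" i=1
    local seq='ACGTACGTACGTACGTACGT'
    local qual
    # One quality character per base, so both lines have the same length.
    qual="$(printf '%s' "$seq" | tr 'ACGT' 'IIII')"
    while [ "$i" -le "$count" ]; do
        # header, sequence, separator, per-base quality
        printf '@%s_%d\n%s\n+\n%s\n' "$prefix" "$i" "$seq" "$qual"
        i=$((i + 1))
    done | gzip > "$out"
}

mkdir -p raw_data
for sample in sample1 sample2; do
    # R1 and R2 share read names so the files pair up record by record.
    make_fastq "raw_data/${sample}_R1.fastq.gz" "$sample" 100
    make_fastq "raw_data/${sample}_R2.fastq.gz" "$sample" 100
done
```

These files overwrite the one-line placeholders and keep the same names, so the workflow's `{sample}` patterns still match `sample1` and `sample2`.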

6. Validate and preview

oxo-flow validate qc-pipeline.oxoflow
# ✓ qc-pipeline.oxoflow — 4 rules, 4 dependencies

oxo-flow dry-run qc-pipeline.oxoflow

7. Visualize the DAG

oxo-flow graph qc-pipeline.oxoflow

8. Run with parallel execution

oxo-flow run qc-pipeline.oxoflow -j 4

The -j 4 flag allows up to 4 jobs to run concurrently. oxo-flow will execute fastqc_raw and fastp_trim in parallel, then fastqc_trimmed, then multiqc.
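After the run finishes, a quick check that the declared outputs exist can catch silent failures. The sketch below assumes the default `results_dir = "results"` and the two tutorial samples. `missing_outputs` is a hypothetical helper, not an oxo-flow command, and it checks only a subset of the outputs declared in the rules.

```bash
# Hypothetical helper: print every expected output file that is missing.
# Usage: missing_outputs <results_dir> <sample>...
missing_outputs() {
    local results="$1" sample path
    shift
    for sample in "$@"; do
        for path in \
            "$results/fastqc/${sample}_R1_fastqc.html" \
            "$results/fastqc/${sample}_R2_fastqc.html" \
            "$results/trimmed/${sample}_R1.fastq.gz" \
            "$results/trimmed/${sample}_R2.fastq.gz" \
            "$results/trimmed/${sample}_fastp.json" \
            "$results/fastqc_trimmed/${sample}_R1_fastqc.html"
        do
            [ -e "$path" ] || printf '%s\n' "$path"
        done
    done
    # The aggregate report is produced once, not per sample.
    path="$results/multiqc/multiqc_report.html"
    [ -e "$path" ] || printf '%s\n' "$path"
    return 0
}

missing_outputs results sample1 sample2
```

No output means every listed file is present. Any path printed points at the rule whose output directory it sits in.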

Key Concepts Covered

| Concept | Where you saw it |
|---|---|
| Workflow metadata | `[workflow]` section with name, version, description |
| Configuration variables | `[config]` section referenced as `{config.samples_dir}` |
| Defaults | `[defaults]` section applied to all rules |
| Per-rule overrides | `multiqc` rule sets `threads = 1`, overriding the default of 4 |
| Environment specs | `environment = { conda = "envs/qc.yaml" }` |
| Wildcard patterns | `{sample}` in file paths |
| Multi-line shell | Triple-quoted strings with `"""` |
| Automatic dependencies | Input/output matching across rules |

Next Steps

Variant Calling Pipeline — build a complete NGS analysis workflow
Environment Management — use docker, singularity, and more
Create a Workflow — reference guide for .oxoflow authoring