Your First Workflow#
This tutorial walks you through building a realistic bioinformatics workflow from scratch. You will create a quality-control pipeline that processes FASTQ files through FastQC and fastp, then generates a summary report.
Prerequisites#
- oxo-flow installed
- Paired-end FASTQ files (or the willingness to create test files)
- conda or mamba available for environment management.
- If you don't have either, we recommend Miniforge or Mambaforge.
1. Set up the project#
2. Create environment files#
Create a conda environment file for the QC tools:
# envs/qc.yaml
name: qc
channels:
- bioconda
- conda-forge
dependencies:
- fastqc=0.12.1
- fastp=0.23.4
- multiqc=1.21
3. Write the workflow#
Replace qc-pipeline.oxoflow with:
Configuration Syntax
{config.samples_dir} refers to the samples_dir variable defined in the [config] section. This allows you to centralize paths and settings.
[workflow]
name = "qc-pipeline"
version = "1.0.0"
description = "Quality control for paired-end sequencing data"
author = "Your Name"
[config]
samples_dir = "raw_data"
results_dir = "results"
[defaults]
threads = 4
memory = "8G"
!!! info "Wildcard Patterns"
The `{sample}` in the file paths below is a **wildcard**. oxo-flow will scan your `raw_data` directory for files matching the pattern `{sample}_R1.fastq.gz`, extract the sample name, and automatically generate a task for every sample it finds.
[[rules]]
name = "fastqc_raw"
input = [
"{config.samples_dir}/{sample}_R1.fastq.gz",
"{config.samples_dir}/{sample}_R2.fastq.gz"
]
output = [
"{config.results_dir}/fastqc/{sample}_R1_fastqc.html",
"{config.results_dir}/fastqc/{sample}_R1_fastqc.zip",
"{config.results_dir}/fastqc/{sample}_R2_fastqc.html",
"{config.results_dir}/fastqc/{sample}_R2_fastqc.zip"
]
environment = { conda = "envs/qc.yaml" }
shell = """
mkdir -p {config.results_dir}/fastqc
fastqc {input} -o {config.results_dir}/fastqc -t {threads}
"""
[[rules]]
name = "fastp_trim"
input = [
"{config.samples_dir}/{sample}_R1.fastq.gz",
"{config.samples_dir}/{sample}_R2.fastq.gz"
]
output = [
"{config.results_dir}/trimmed/{sample}_R1.fastq.gz",
"{config.results_dir}/trimmed/{sample}_R2.fastq.gz",
"{config.results_dir}/trimmed/{sample}_fastp.html",
"{config.results_dir}/trimmed/{sample}_fastp.json"
]
environment = { conda = "envs/qc.yaml" }
shell = """
mkdir -p {config.results_dir}/trimmed
fastp \
--in1 {config.samples_dir}/{sample}_R1.fastq.gz \
--in2 {config.samples_dir}/{sample}_R2.fastq.gz \
--out1 {config.results_dir}/trimmed/{sample}_R1.fastq.gz \
--out2 {config.results_dir}/trimmed/{sample}_R2.fastq.gz \
--html {config.results_dir}/trimmed/{sample}_fastp.html \
--json {config.results_dir}/trimmed/{sample}_fastp.json \
--thread {threads}
"""
[[rules]]
name = "fastqc_trimmed"
input = [
"{config.results_dir}/trimmed/{sample}_R1.fastq.gz",
"{config.results_dir}/trimmed/{sample}_R2.fastq.gz"
]
output = [
"{config.results_dir}/fastqc_trimmed/{sample}_R1_fastqc.html",
"{config.results_dir}/fastqc_trimmed/{sample}_R1_fastqc.zip"
]
environment = { conda = "envs/qc.yaml" }
shell = """
mkdir -p {config.results_dir}/fastqc_trimmed
fastqc {input} -o {config.results_dir}/fastqc_trimmed -t {threads}
"""
[[rules]]
name = "multiqc"
input = [
"{config.results_dir}/fastqc/",
"{config.results_dir}/fastqc_trimmed/",
"{config.results_dir}/trimmed/"
]
output = [
"{config.results_dir}/multiqc/multiqc_report.html"
]
threads = 1
environment = { conda = "envs/qc.yaml" }
shell = """
mkdir -p {config.results_dir}/multiqc
multiqc {config.results_dir} -o {config.results_dir}/multiqc --force
"""
4. Understand the dependency graph#
The workflow forms this DAG:
graph TD
A[fastqc_raw] --> D[multiqc]
B[fastp_trim] --> C[fastqc_trimmed]
B --> D
C --> D
fastqc_rawandfastp_trimcan run in parallel (no dependency between them)fastqc_trimmeddepends onfastp_trim's outputmultiqcdepends on all three upstream rules
5. Prepare Test Data#
For this tutorial, create minimal test files so oxo-flow has something to process:
mkdir -p raw_data
# Create dummy compressed fastq files
echo "@test1" | gzip > raw_data/sample1_R1.fastq.gz
echo "@test1" | gzip > raw_data/sample1_R2.fastq.gz
echo "@test2" | gzip > raw_data/sample2_R1.fastq.gz
echo "@test2" | gzip > raw_data/sample2_R2.fastq.gz
6. Validate and preview#
oxo-flow validate qc-pipeline.oxoflow
# ✓ qc-pipeline.oxoflow — 4 rules, 4 dependencies
oxo-flow dry-run qc-pipeline.oxoflow
7. Visualize the DAG#
8. Run with parallel execution#
The -j 4 flag allows up to 4 jobs to run concurrently. oxo-flow will execute fastqc_raw and fastp_trim in parallel, then fastqc_trimmed, then multiqc.
Key Concepts Covered#
| Concept | Where you saw it |
|---|---|
| Workflow metadata | [workflow] section with name, version, description |
| Configuration variables | [config] section referenced as {config.samples_dir} |
| Defaults | [defaults] section applied to all rules |
| Per-rule overrides | multiqc rule overrides threads = 1 |
| Environment specs | environment = { conda = "envs/qc.yaml" } |
| Wildcard patterns | {sample} in file paths |
| Multi-line shell | Triple-quoted strings with """ |
| Automatic dependencies | Input/output matching across rules |
Next Steps#
- Variant Calling Pipeline — build a complete NGS analysis workflow
- Environment Management — use docker, singularity, and more
- Create a Workflow — reference guide for
.oxoflowauthoring