08 — Multi-Omics Integration#
Integrate whole-genome sequencing (WGS), RNA-seq, and bisulfite sequencing (methylation) data into a unified analysis. This represents the most complex DAG topology in the gallery, with three independent processing branches that converge at an integration step.
Concepts Covered
- Complex branching DAG topology with three independent data branches
- Cross-omics data integration (WGS + RNA-seq + Methylation)
- Multiple environment backends in a single pipeline
- Fan-in convergence from independent branches
- Clinical-grade multi-omics reporting
Pipeline Overview#
graph TD
subgraph WGS Branch
A1[wgs_align] --> A2[wgs_call_variants]
end
subgraph RNA-seq Branch
B1[rnaseq_align] --> B2[rnaseq_quantify]
end
subgraph Methylation Branch
C1[bismark_align] --> C2[methylation_extract]
end
A2 --> D[integrate_omics]
B2 --> D
C2 --> D
D --> E[generate_report]
Branches:
- WGS Branch — Alignment → Variant calling (DNA mutations)
- RNA-seq Branch — Splice-aware alignment → Gene expression quantification
- Methylation Branch — Bisulfite alignment → CpG methylation extraction
Convergence:
- Integration — Combine variant, expression, and methylation data per sample
- Report — Generate a multi-omics summary report
Workflow Definition#
# examples/gallery/08_multiomics_integration.oxoflow
[workflow]
name = "multiomics-integration"
version = "1.0.0"
description = "Multi-omics data integration: WGS + RNA-seq + Methylation"
author = "oxo-flow examples"
[config]
reference = "/data/references/GRCh38/genome.fa"
gene_annotation = "/data/references/GRCh38/genes.gtf"
samples = "samples.csv"
genome_build = "GRCh38"
[defaults]
threads = 4
memory = "8G"
# === WGS Branch ===
[[rules]]
name = "wgs_align"
input = ["wgs/{sample}_R1.fastq.gz", "wgs/{sample}_R2.fastq.gz"]
output = ["wgs_aligned/{sample}.sorted.bam"]
threads = 16
memory = "32G"
description = "WGS read alignment with BWA-MEM2"
shell = """
mkdir -p wgs_aligned
bwa-mem2 mem -t {threads} {config.reference} {input[0]} {input[1]} \
| samtools sort -@ 4 -o {output[0]}
samtools index {output[0]}
"""
[rules.environment]
docker = "biocontainers/bwa-mem2:2.2.1"
[[rules]]
name = "wgs_call_variants"
input = ["wgs_aligned/{sample}.sorted.bam"]
output = ["wgs_variants/{sample}.vcf.gz"]
threads = 8
memory = "16G"
description = "Variant calling with GATK HaplotypeCaller"
shell = """
mkdir -p wgs_variants
gatk HaplotypeCaller \
-I {input[0]} -R {config.reference} \
-O {output[0]} --native-pair-hmm-threads {threads}
"""
[rules.environment]
singularity = "docker://broadinstitute/gatk:4.5.0.0"
# === RNA-seq Branch ===
[[rules]]
name = "rnaseq_align"
input = ["rnaseq/{sample}_R1.fastq.gz", "rnaseq/{sample}_R2.fastq.gz"]
output = ["rnaseq_aligned/{sample}/Aligned.sortedByCoord.out.bam"]
threads = 16
memory = "32G"
description = "RNA-seq splice-aware alignment with STAR"
shell = """
mkdir -p rnaseq_aligned/{sample}
STAR --runThreadN {threads} \
--genomeDir /data/references/GRCh38/star_index \
--readFilesIn {input[0]} {input[1]} \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix rnaseq_aligned/{sample}/
"""
[rules.environment]
conda = "envs/star.yaml"
[[rules]]
name = "rnaseq_quantify"
input = ["rnaseq_aligned/{sample}/Aligned.sortedByCoord.out.bam"]
output = ["expression/{sample}.counts.txt"]
threads = 4
description = "Gene expression quantification with featureCounts"
shell = """
mkdir -p expression
featureCounts -T {threads} \
-a {config.gene_annotation} \
-o {output[0]} -p --countReadPairs \
{input[0]}
"""
[rules.environment]
conda = "envs/subread.yaml"
# === Methylation Branch ===
[[rules]]
name = "bismark_align"
input = ["methyl/{sample}_R1.fastq.gz", "methyl/{sample}_R2.fastq.gz"]
output = ["methyl_aligned/{sample}.deduplicated.bam"]
threads = 8
memory = "32G"
description = "Bisulfite-seq alignment with Bismark"
shell = """
mkdir -p methyl_aligned
bismark --genome /data/references/GRCh38/bismark_index \
-1 {input[0]} -2 {input[1]} \
--parallel {threads} -o methyl_aligned/
deduplicate_bismark --bam methyl_aligned/{sample}_R1_bismark_bt2_pe.bam \
-o methyl_aligned/{sample}
"""
[rules.environment]
conda = "envs/bismark.yaml"
[[rules]]
name = "methylation_extract"
input = ["methyl_aligned/{sample}.deduplicated.bam"]
output = ["methylation/{sample}.CpG_report.txt"]
threads = 4
memory = "16G"
description = "Extract CpG methylation calls"
shell = """
mkdir -p methylation
bismark_methylation_extractor --paired-end --comprehensive \
--genome_folder /data/references/GRCh38/bismark_index \
--parallel {threads} --CX --cytosine_report \
-o methylation/ {input[0]}
"""
[rules.environment]
conda = "envs/bismark.yaml"
# === Integration ===
[[rules]]
name = "integrate_omics"
input = [
"wgs_variants/{sample}.vcf.gz",
"expression/{sample}.counts.txt",
"methylation/{sample}.CpG_report.txt"
]
output = ["integration/{sample}.integrated.json"]
threads = 4
memory = "16G"
description = "Integrate multi-omics data layers for each sample"
shell = """
mkdir -p integration
echo '{' > {output[0]}
echo ' "sample": "{sample}",' >> {output[0]}
echo ' "genome_build": "{config.genome_build}",' >> {output[0]}
echo ' "data_layers": ["wgs_variants", "rnaseq_expression", "methylation"],' >> {output[0]}
echo ' "variant_file": "{input[0]}",' >> {output[0]}
echo ' "expression_file": "{input[1]}",' >> {output[0]}
echo ' "methylation_file": "{input[2]}",' >> {output[0]}
echo ' "status": "integrated"' >> {output[0]}
echo '}' >> {output[0]}
"""
[[rules]]
name = "generate_report"
input = ["integration/{sample}.integrated.json"]
output = ["results/{sample}.multiomics_report.html"]
description = "Generate multi-omics integration report"
shell = """
mkdir -p results
echo '<html><body>' > {output[0]}
echo '<h1>Multi-Omics Integration Report</h1>' >> {output[0]}
echo '<h2>Sample: {sample}</h2>' >> {output[0]}
echo '<p>Genome Build: {config.genome_build}</p>' >> {output[0]}
echo '<p>Data layers integrated: WGS, RNA-seq, Methylation</p>' >> {output[0]}
echo '</body></html>' >> {output[0]}
"""
[report]
template = "multiomics_report"
format = ["html", "json"]
sections = ["summary", "variants", "expression", "methylation", "integration", "provenance"]
Scientific Context#
Why Multi-Omics?#
Single-omics analyses provide incomplete pictures:
| Data Type | Information | Limitation |
|---|---|---|
| WGS | DNA mutations, structural variants | Cannot reveal functional impact |
| RNA-seq | Gene expression levels | Cannot identify causal mutations |
| Methylation | Epigenetic regulation | Cannot directly show gene activity |
Integrating all three layers enables:
- Variant-to-expression correlation — Do mutations affect gene expression?
- Epigenetic-expression coupling — Does promoter methylation silence gene expression?
- Multi-layer biomarker discovery — Combine signals for stronger clinical predictions
DAG Parallelism#
The three branches (WGS, RNA-seq, Methylation) are entirely independent and execute in parallel. With sufficient resources (-j 6), all six alignment and analysis steps can run simultaneously:
# Execute with 6 parallel jobs to maximize branch parallelism
oxo-flow run examples/gallery/08_multiomics_integration.oxoflow -j 6
Running the Workflow#
Validate#
$ oxo-flow validate examples/gallery/08_multiomics_integration.oxoflow
✓ examples/gallery/08_multiomics_integration.oxoflow — 8 rules, 7 dependencies
Resource Summary#
| Rule | Threads | Memory | Environment | Branch |
|---|---|---|---|---|
| wgs_align | 16 | 32G | docker | WGS |
| wgs_call_variants | 8 | 16G | singularity | WGS |
| rnaseq_align | 16 | 32G | conda | RNA-seq |
| rnaseq_quantify | 4 | 8G | conda | RNA-seq |
| bismark_align | 8 | 32G | conda | Methylation |
| methylation_extract | 4 | 16G | conda | Methylation |
| integrate_omics | 4 | 16G | system | Integration |
| generate_report | 4 | 8G | system | Report |
Further Reading#
- Venus Pipeline — Clinical-grade somatic variant calling built on oxo-flow
- DAG Engine — How oxo-flow resolves dependencies and optimizes parallel execution
- Environment System — Technical details on environment backend isolation