08 — Multi-Omics Integration#

Integrate whole-genome sequencing (WGS), RNA-seq, and bisulfite sequencing (methylation) data into a unified analysis. This represents the most complex DAG topology in the gallery, with three independent processing branches that converge at an integration step.

Concepts Covered

Complex branching DAG topology with three independent data branches
Cross-omics data integration (WGS + RNA-seq + Methylation)
Multiple environment backends in a single pipeline
Fan-in convergence from independent branches
Clinical-grade multi-omics reporting

Pipeline Overview#

graph TD
    subgraph WGS Branch
        A1[wgs_align] --> A2[wgs_call_variants]
    end
    subgraph RNA-seq Branch
        B1[rnaseq_align] --> B2[rnaseq_quantify]
    end
    subgraph Methylation Branch
        C1[bismark_align] --> C2[methylation_extract]
    end
    A2 --> D[integrate_omics]
    B2 --> D
    C2 --> D
    D --> E[generate_report]

Branches:

WGS Branch — Alignment → Variant calling (DNA mutations)
RNA-seq Branch — Splice-aware alignment → Gene expression quantification
Methylation Branch — Bisulfite alignment → CpG methylation extraction

Convergence:

Integration — Combine variant, expression, and methylation data per sample
Report — Generate a multi-omics summary report

Workflow Definition#

# examples/gallery/08_multiomics_integration.oxoflow

[workflow]
name = "multiomics-integration"
version = "1.0.0"
description = "Multi-omics data integration: WGS + RNA-seq + Methylation"
author = "oxo-flow examples"

[config]
reference = "/data/references/GRCh38/genome.fa"
gene_annotation = "/data/references/GRCh38/genes.gtf"
samples = "samples.csv"
genome_build = "GRCh38"

[defaults]
threads = 4
memory = "8G"

# === WGS Branch ===

[[rules]]
name = "wgs_align"
input = ["wgs/{sample}_R1.fastq.gz", "wgs/{sample}_R2.fastq.gz"]
output = ["wgs_aligned/{sample}.sorted.bam"]
threads = 16
memory = "32G"
description = "WGS read alignment with BWA-MEM2"
shell = """
mkdir -p wgs_aligned
bwa-mem2 mem -t {threads} {config.reference} {input[0]} {input[1]} \
    | samtools sort -@ 4 -o {output[0]}
samtools index {output[0]}
"""

[rules.environment]
docker = "biocontainers/bwa-mem2:2.2.1"

[[rules]]
name = "wgs_call_variants"
input = ["wgs_aligned/{sample}.sorted.bam"]
output = ["wgs_variants/{sample}.vcf.gz"]
threads = 8
memory = "16G"
description = "Variant calling with GATK HaplotypeCaller"
shell = """
mkdir -p wgs_variants
gatk HaplotypeCaller \
    -I {input[0]} -R {config.reference} \
    -O {output[0]} --native-pair-hmm-threads {threads}
"""

[rules.environment]
singularity = "docker://broadinstitute/gatk:4.5.0.0"

# === RNA-seq Branch ===

[[rules]]
name = "rnaseq_align"
input = ["rnaseq/{sample}_R1.fastq.gz", "rnaseq/{sample}_R2.fastq.gz"]
output = ["rnaseq_aligned/{sample}/Aligned.sortedByCoord.out.bam"]
threads = 16
memory = "32G"
description = "RNA-seq splice-aware alignment with STAR"
shell = """
mkdir -p rnaseq_aligned/{sample}
STAR --runThreadN {threads} \
     --genomeDir /data/references/GRCh38/star_index \
     --readFilesIn {input[0]} {input[1]} \
     --readFilesCommand zcat \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix rnaseq_aligned/{sample}/
"""

[rules.environment]
conda = "envs/star.yaml"

[[rules]]
name = "rnaseq_quantify"
input = ["rnaseq_aligned/{sample}/Aligned.sortedByCoord.out.bam"]
output = ["expression/{sample}.counts.txt"]
threads = 4
description = "Gene expression quantification with featureCounts"
shell = """
mkdir -p expression
featureCounts -T {threads} \
              -a {config.gene_annotation} \
              -o {output[0]} -p --countReadPairs \
              {input[0]}
"""

[rules.environment]
conda = "envs/subread.yaml"

# === Methylation Branch ===

[[rules]]
name = "bismark_align"
input = ["methyl/{sample}_R1.fastq.gz", "methyl/{sample}_R2.fastq.gz"]
output = ["methyl_aligned/{sample}.deduplicated.bam"]
threads = 8
memory = "32G"
description = "Bisulfite-seq alignment with Bismark"
shell = """
mkdir -p methyl_aligned
bismark --genome /data/references/GRCh38/bismark_index \
        -1 {input[0]} -2 {input[1]} \
        --parallel {threads} -o methyl_aligned/
deduplicate_bismark --bam methyl_aligned/{sample}_R1_bismark_bt2_pe.bam \
        -o methyl_aligned/{sample}
"""

[rules.environment]
conda = "envs/bismark.yaml"

[[rules]]
name = "methylation_extract"
input = ["methyl_aligned/{sample}.deduplicated.bam"]
output = ["methylation/{sample}.CpG_report.txt"]
threads = 4
memory = "16G"
description = "Extract CpG methylation calls"
shell = """
mkdir -p methylation
bismark_methylation_extractor --paired-end --comprehensive \
    --genome_folder /data/references/GRCh38/bismark_index \
    --parallel {threads} --CX --cytosine_report \
    -o methylation/ {input[0]}
"""

[rules.environment]
conda = "envs/bismark.yaml"

# === Integration ===

[[rules]]
name = "integrate_omics"
input = [
    "wgs_variants/{sample}.vcf.gz",
    "expression/{sample}.counts.txt",
    "methylation/{sample}.CpG_report.txt"
]
output = ["integration/{sample}.integrated.json"]
threads = 4
memory = "16G"
description = "Integrate multi-omics data layers for each sample"
shell = """
mkdir -p integration
echo '{' > {output[0]}
echo '  "sample": "{sample}",' >> {output[0]}
echo '  "genome_build": "{config.genome_build}",' >> {output[0]}
echo '  "data_layers": ["wgs_variants", "rnaseq_expression", "methylation"],' >> {output[0]}
echo '  "variant_file": "{input[0]}",' >> {output[0]}
echo '  "expression_file": "{input[1]}",' >> {output[0]}
echo '  "methylation_file": "{input[2]}",' >> {output[0]}
echo '  "status": "integrated"' >> {output[0]}
echo '}' >> {output[0]}
"""

[[rules]]
name = "generate_report"
input = ["integration/{sample}.integrated.json"]
output = ["results/{sample}.multiomics_report.html"]
description = "Generate multi-omics integration report"
shell = """
mkdir -p results
echo '<html><body>' > {output[0]}
echo '<h1>Multi-Omics Integration Report</h1>' >> {output[0]}
echo '<h2>Sample: {sample}</h2>' >> {output[0]}
echo '<p>Genome Build: {config.genome_build}</p>' >> {output[0]}
echo '<p>Data layers integrated: WGS, RNA-seq, Methylation</p>' >> {output[0]}
echo '</body></html>' >> {output[0]}
"""

[report]
template = "multiomics_report"
format = ["html", "json"]
sections = ["summary", "variants", "expression", "methylation", "integration", "provenance"]

Scientific Context#

Why Multi-Omics?#

Single-omics analyses provide incomplete pictures:

Data Type	Information	Limitation
WGS	DNA mutations, structural variants	Cannot reveal functional impact
RNA-seq	Gene expression levels	Cannot identify causal mutations
Methylation	Epigenetic regulation	Cannot directly show gene activity

Integrating all three layers enables:

Variant-to-expression correlation — Do mutations affect gene expression?
Epigenetic-expression coupling — Does promoter methylation silence gene expression?
Multi-layer biomarker discovery — Combine signals for stronger clinical predictions

DAG Parallelism#

The three branches (WGS, RNA-seq, Methylation) are entirely independent and execute in parallel. With sufficient resources (-j 6), all six alignment and analysis steps can run simultaneously:

# Execute with 6 parallel jobs to maximize branch parallelism
oxo-flow run examples/gallery/08_multiomics_integration.oxoflow -j 6

Running the Workflow#

Validate#

$ oxo-flow validate examples/gallery/08_multiomics_integration.oxoflow
✓ examples/gallery/08_multiomics_integration.oxoflow — 8 rules, 7 dependencies

Resource Summary#

Rule	Threads	Memory	Environment	Branch
wgs_align	16	32G	docker	WGS
wgs_call_variants	8	16G	singularity	WGS
rnaseq_align	16	32G	conda	RNA-seq
rnaseq_quantify	4	8G	conda	RNA-seq
bismark_align	8	32G	conda	Methylation
methylation_extract	4	16G	conda	Methylation
integrate_omics	4	16G	system	Integration
generate_report	4	8G	system	Report