07 — WGS Germline Variant Calling#
A complete whole-genome sequencing (WGS) germline variant calling pipeline following GATK best practices: QC → alignment → deduplication → BQSR → variant calling → filtration → annotation.
Concepts Covered
- GATK best-practices workflow
- Eight-step DAG with strict linear dependencies
- Mixed environments (conda, docker, singularity)
- Clinical-grade variant annotation with VEP
- Report configuration with provenance tracking
Pipeline Overview#
graph TD
A[fastp_qc] --> B[bwa_mem2_align]
B --> C[mark_duplicates]
C --> D[base_recalibration]
D --> E[haplotype_caller]
E --> F[genotype_gvcfs]
F --> G[variant_filtration]
G --> H[annotate_variants]
Steps:
- fastp_qc — Read quality control and adapter trimming
- bwa_mem2_align — Paired-end alignment with BWA-MEM2 (faster than BWA-MEM)
- mark_duplicates — Mark PCR and optical duplicates with GATK MarkDuplicates
- base_recalibration — Base quality score recalibration (BQSR) using known variant sites
- haplotype_caller — Per-sample variant calling in GVCF mode
- genotype_gvcfs — Genotype GVCFs to produce final variant calls
- variant_filtration — Hard-filter variants using GATK recommended thresholds
- annotate_variants — Functional annotation with Ensembl VEP
Workflow Definition#
# examples/gallery/07_wgs_germline.oxoflow
[workflow]
name = "wgs-germline-calling"
version = "1.0.0"
description = "GATK best-practices whole-genome germline variant calling pipeline"
author = "oxo-flow examples"
[config]
reference = "/data/references/GRCh38/genome.fa"
known_sites = "/data/references/GRCh38/dbsnp_146.hg38.vcf.gz"
known_indels = "/data/references/GRCh38/Mills_and_1000G.indels.hg38.vcf.gz"
intervals = "/data/references/GRCh38/wgs_calling_regions.hg38.interval_list"
samples = "samples.csv"
[defaults]
threads = 4
memory = "8G"
[[rules]]
name = "fastp_qc"
input = ["raw/{sample}_R1.fastq.gz", "raw/{sample}_R2.fastq.gz"]
output = [
"trimmed/{sample}_R1.fastq.gz",
"trimmed/{sample}_R2.fastq.gz",
"qc/{sample}_fastp.json"
]
threads = 8
memory = "16G"
description = "Read QC and adapter trimming"
shell = """
mkdir -p trimmed qc
fastp -i {input[0]} -I {input[1]} \
-o {output[0]} -O {output[1]} \
--json {output[2]} --thread {threads} \
--qualified_quality_phred 20 --length_required 50
"""
[rules.environment]
conda = "envs/fastp.yaml"
[[rules]]
name = "bwa_mem2_align"
input = ["trimmed/{sample}_R1.fastq.gz", "trimmed/{sample}_R2.fastq.gz"]
output = ["aligned/{sample}.sorted.bam"]
threads = 16
memory = "32G"
description = "Paired-end alignment with BWA-MEM2 and coordinate sorting"
shell = """
mkdir -p aligned
bwa-mem2 mem -t {threads} \
-R '@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA' \
{config.reference} {input[0]} {input[1]} \
| samtools sort -@ 4 -m 2G -o {output[0]}
samtools index {output[0]}
"""
[rules.environment]
docker = "biocontainers/bwa-mem2:2.2.1"
Read Group Escape Sequences
The \\t in the read group string (-R '@RG\\tID:...') is a double-escaped tab character. TOML requires \\ to represent a literal backslash, which the shell then interprets as \t (tab) — the delimiter BWA expects between read group fields.
[[rules]]
name = "mark_duplicates"
input = ["aligned/{sample}.sorted.bam"]
output = ["dedup/{sample}.dedup.bam", "dedup/{sample}.dedup.metrics.txt"]
threads = 4
memory = "16G"
description = "Mark PCR and optical duplicates"
shell = """
mkdir -p dedup
gatk MarkDuplicates \
-I {input[0]} -O {output[0]} -M {output[1]} \
--CREATE_INDEX true
"""
[rules.environment]
singularity = "docker://broadinstitute/gatk:4.5.0.0"
[[rules]]
name = "base_recalibration"
input = ["dedup/{sample}.dedup.bam"]
output = ["bqsr/{sample}.recal.bam"]
threads = 4
memory = "16G"
description = "Base quality score recalibration (BQSR)"
shell = """
mkdir -p bqsr
gatk BaseRecalibrator \
-I {input[0]} -R {config.reference} \
--known-sites {config.known_sites} \
--known-sites {config.known_indels} \
-O bqsr/{sample}.recal_data.table
gatk ApplyBQSR \
-I {input[0]} -R {config.reference} \
--bqsr-recal-file bqsr/{sample}.recal_data.table \
-O {output[0]}
"""
[rules.environment]
singularity = "docker://broadinstitute/gatk:4.5.0.0"
[[rules]]
name = "haplotype_caller"
input = ["bqsr/{sample}.recal.bam"]
output = ["variants/{sample}.g.vcf.gz"]
threads = 4
memory = "16G"
description = "Per-sample variant calling in GVCF mode"
shell = """
mkdir -p variants
gatk HaplotypeCaller \
-I {input[0]} -R {config.reference} \
-O {output[0]} -ERC GVCF \
--native-pair-hmm-threads {threads} \
-L {config.intervals}
"""
[rules.environment]
singularity = "docker://broadinstitute/gatk:4.5.0.0"
[[rules]]
name = "genotype_gvcfs"
input = ["variants/{sample}.g.vcf.gz"]
output = ["variants/{sample}.genotyped.vcf.gz"]
threads = 4
memory = "16G"
description = "Genotype GVCFs to produce final variant calls"
shell = """
gatk GenotypeGVCFs \
-R {config.reference} \
-V {input[0]} -O {output[0]}
"""
[rules.environment]
singularity = "docker://broadinstitute/gatk:4.5.0.0"
[[rules]]
name = "variant_filtration"
input = ["variants/{sample}.genotyped.vcf.gz"]
output = ["variants/{sample}.filtered.vcf.gz"]
threads = 2
memory = "8G"
description = "Hard-filter variants with GATK recommended thresholds"
shell = """
gatk VariantFiltration \
-R {config.reference} -V {input[0]} -O {output[0]} \
--filter-expression 'QD < 2.0' --filter-name 'LowQD' \
--filter-expression 'MQ < 40.0' --filter-name 'LowMQ' \
--filter-expression 'FS > 60.0' --filter-name 'HighFS' \
--filter-expression 'SOR > 3.0' --filter-name 'HighSOR'
"""
[rules.environment]
singularity = "docker://broadinstitute/gatk:4.5.0.0"
[[rules]]
name = "annotate_variants"
input = ["variants/{sample}.filtered.vcf.gz"]
output = ["annotation/{sample}.annotated.vcf.gz"]
threads = 4
memory = "16G"
description = "Functional variant annotation with VEP"
shell = """
mkdir -p annotation
vep --input_file {input[0]} --output_file {output[0]} \
--format vcf --vcf --compress_output bgzip \
--assembly GRCh38 --offline --cache \
--sift b --polyphen b --symbol --numbers --biotype \
--total_length --canonical --ccds \
--force_overwrite --fork {threads}
"""
[rules.environment]
conda = "envs/vep.yaml"
[report]
template = "germline_report"
format = ["html", "json"]
sections = ["summary", "qc_metrics", "coverage", "variants", "annotations", "provenance"]
Clinical Considerations#
BQSR (Base Quality Score Recalibration)#
BQSR corrects systematic errors in base quality scores assigned by the sequencer. It uses known variant sites (dbSNP, Mills indels) to distinguish true variants from sequencing artifacts. This step is critical for clinical-grade variant calling accuracy.
GVCF Mode#
HaplotypeCaller runs in GVCF mode (-ERC GVCF) to produce genomic VCFs that contain information about both variant and reference-confident sites. This enables downstream joint genotyping across cohorts without re-running variant calling.
Hard Filtering Thresholds#
The variant filtration step applies GATK-recommended hard filters:
| Filter | Threshold | Meaning |
|---|---|---|
| QD | < 2.0 | Variant quality normalized by depth |
| MQ | < 40.0 | Root mean square mapping quality |
| FS | > 60.0 | Fisher strand bias |
| SOR | > 3.0 | Symmetric odds ratio strand bias |
For production use, consider VQSR (Variant Quality Score Recalibration) instead of hard filters when sufficient training data is available.
Running the Workflow#
Validate#
$ oxo-flow validate examples/gallery/07_wgs_germline.oxoflow
✓ examples/gallery/07_wgs_germline.oxoflow — 8 rules, 8 dependencies
Resource Summary#
| Rule | Threads | Memory | Environment |
|---|---|---|---|
| fastp_qc | 8 | 16G | conda |
| bwa_mem2_align | 16 | 32G | docker |
| mark_duplicates | 4 | 16G | singularity |
| base_recalibration | 4 | 16G | singularity |
| haplotype_caller | 4 | 16G | singularity |
| genotype_gvcfs | 4 | 16G | singularity |
| variant_filtration | 2 | 8G | singularity |
| annotate_variants | 4 | 16G | conda |
What's Next?#
Move on to Multi-Omics Integration for a complex pipeline that combines WGS, RNA-seq, and methylation data.