Cohort Studies with Sample Groups#

This guide explains how to use [[sample_groups]] (WC-02) to run per-sample and per-group analyses across an entire cohort.

Problem#

Population-scale studies require the same pipeline steps to run independently for every sample in every group. Manually duplicating rules for 50+ samples is error-prone and unmaintainable.

Solution: `[[sample_groups]]`#

Define one `[[sample_groups]]` block per group. Each block has a `name` and a `samples` list of sample IDs. Rules that use the `{sample}` or `{group}` placeholders are expanded once per (group, sample) combination.

[[sample_groups]]
name    = "control"
samples = ["CTRL_001", "CTRL_002", "CTRL_003"]

[[sample_groups]]
name    = "case"
samples = ["CASE_001", "CASE_002"]

Wildcard Placeholders#

| Placeholder | Replaced with |
|---|---|
| `{sample}` | Individual sample name |
| `{group}` | Group name for this sample |
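
The substitution in the table can be pictured with a small Python sketch. The function name `fill_placeholders` and the second template path are hypothetical illustrations, not part of the engine. Plain string replacement is used here so that other brace expressions, such as `{input[0]}`, are left untouched.

```python
def fill_placeholders(template: str, group: str, sample: str) -> str:
    """Substitute the two wildcard placeholders in a template string.

    Illustrative only: this mimics the documented behaviour, it is not
    the engine's own code.
    """
    return template.replace("{group}", group).replace("{sample}", sample)


# Filled in for sample CTRL_001 of group "control".
bam_path = fill_placeholders("aligned/{sample}.bam", "control", "CTRL_001")
vcf_path = fill_placeholders("results/{group}/{sample}.vcf.gz", "control", "CTRL_001")

print(bam_path)  # aligned/CTRL_001.bam
print(vcf_path)  # results/control/CTRL_001.vcf.gz
```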

Expanded Rule Names#

For a rule named `align`, expanding across the groups above produces the following rule names:

align_control_CTRL_001
align_control_CTRL_002
align_control_CTRL_003
align_case_CASE_001
align_case_CASE_002

Rules that do not reference `{sample}` or `{group}` (e.g., a `multiqc` step that takes the whole `qc/` directory) are not expanded: they keep their original name and run once.
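
As a rough illustration of the naming scheme, the sketch below reproduces the names listed above in Python. It assumes the pattern is `<rule>_<group>_<sample>`, as the listing suggests. It also assumes that a rule counts as a wildcard rule when any of its `input`, `output`, or `shell` strings mentions a placeholder. The helper names are hypothetical.

```python
groups = {
    "control": ["CTRL_001", "CTRL_002", "CTRL_003"],
    "case": ["CASE_001", "CASE_002"],
}


def uses_wildcards(rule: dict) -> bool:
    """True if any input, output, or shell string mentions {sample} or {group}.

    Which fields are scanned is an assumption made for this sketch.
    """
    fields = [*rule.get("input", []), *rule.get("output", []), rule.get("shell", "")]
    return any("{sample}" in f or "{group}" in f for f in fields)


def expanded_names(rule: dict, groups: dict) -> list:
    """Return one name per (group, sample) pair, or the bare name for run-once rules."""
    if not uses_wildcards(rule):
        return [rule["name"]]
    return [f"{rule['name']}_{g}_{s}" for g, samples in groups.items() for s in samples]


align = {"name": "align", "input": ["raw/{sample}_R1.fq.gz"], "output": ["aligned/{sample}.bam"]}
multiqc = {"name": "multiqc", "input": ["qc/"], "output": ["reports/multiqc_report.html"]}

print(expanded_names(align, groups))
print(expanded_names(multiqc, groups))  # ['multiqc']
```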

Group Metadata#

Attach arbitrary metadata to each group for use in downstream reporting:

[[sample_groups]]
name    = "treatment_arm_A"
samples = ["PT_A001", "PT_A002"]
[sample_groups.metadata]
drug  = "compound_X"
dose  = "100mg"

Minimal Example#

[workflow]
name = "cohort-minimal"

[config]
reference = "/data/ref/hg38.fa"

[[sample_groups]]
name    = "healthy"
samples = ["H001", "H002"]

[[sample_groups]]
name    = "disease"
samples = ["D001", "D002", "D003"]

[[rules]]
name   = "align"
input  = ["raw/{sample}_R1.fq.gz", "raw/{sample}_R2.fq.gz"]
output = ["aligned/{sample}.bam"]
shell  = "bwa mem -t {threads} {config.reference} {input[0]} {input[1]} | samtools sort -o {output[0]}"
threads = 8

[[rules]]
name   = "haplotype_caller"
input  = ["aligned/{sample}.bam"]
output = ["gvcf/{sample}.g.vcf.gz"]
shell  = "gatk HaplotypeCaller -I {input[0]} -R {config.reference} -O {output[0]} -ERC GVCF"
threads = 4

# Aggregation step — runs ONCE for all samples
[[rules]]
name   = "multiqc"
input  = ["qc/"]
output = ["reports/multiqc_report.html"]
shell  = "multiqc qc/ -o reports/"
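
To make the expansion concrete, here is a Python sketch of what the `align` command could look like once it is rendered for sample `H001`. The `render` helper is hypothetical, and using `str.format` is only an assumption about how `{threads}`, `{config.reference}`, `{input[0]}`, and `{output[0]}` are resolved; the engine may do this differently.

```python
from types import SimpleNamespace

config = SimpleNamespace(reference="/data/ref/hg38.fa")
rule = {
    "input": ["raw/{sample}_R1.fq.gz", "raw/{sample}_R2.fq.gz"],
    "output": ["aligned/{sample}.bam"],
    "shell": "bwa mem -t {threads} {config.reference} {input[0]} {input[1]} | samtools sort -o {output[0]}",
    "threads": 8,
}


def render(rule: dict, sample: str, config: SimpleNamespace) -> str:
    """Fill {sample} into the paths, then format the shell template (sketch only)."""
    inputs = [p.replace("{sample}", sample) for p in rule["input"]]
    outputs = [p.replace("{sample}", sample) for p in rule["output"]]
    return rule["shell"].format(
        threads=rule["threads"], config=config, input=inputs, output=outputs
    )


command = render(rule, "H001", config)
print(command)
# bwa mem -t 8 /data/ref/hg38.fa raw/H001_R1.fq.gz raw/H001_R2.fq.gz | samtools sort -o aligned/H001.bam
```

With five samples and two per-sample rules, this workflow expands to ten per-sample rules plus the single `multiqc` rule.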

Combining Groups and Pairs#

You can use both [[sample_groups]] and [[pairs]] in the same workflow. They expand independently: group-wildcard rules are expanded over samples, and pair-wildcard rules are expanded over pairs.

Full Example#

See `examples/cohort_analysis.oxoflow` for a complete population genomics pipeline including QC, alignment, deduplication, variant calling, and MultiQC aggregation.

Loading Groups from External File#

For large cohorts with many groups and samples, use sample_groups_file in [workflow] instead of inline [[sample_groups]]:

[workflow]
name = "cohort-analysis"
sample_groups_file = "metadata/groups.tsv"  # or .csv, .json

Supported formats:

TSV Format#

group   sample
healthy H001
healthy H002
disease D001
disease D002
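
The TSV layout maps onto groups in the obvious way: each row assigns one sample to one group. A minimal Python sketch of such a loader follows. It assumes tab-separated columns with the header names `group` and `sample` shown above; `load_groups_tsv` is a hypothetical helper, not the engine's parser.

```python
import csv
import io


def load_groups_tsv(text: str) -> dict:
    """Collect the `sample` column under each `group`, preserving file order."""
    groups = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        groups.setdefault(row["group"], []).append(row["sample"])
    return groups


tsv_text = (
    "group\tsample\n"
    "healthy\tH001\n"
    "healthy\tH002\n"
    "disease\tD001\n"
    "disease\tD002\n"
)
groups = load_groups_tsv(tsv_text)
print(groups)  # {'healthy': ['H001', 'H002'], 'disease': ['D001', 'D002']}
```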

The TSV layout above lists only group and sample names. To attach metadata to groups, use the JSON format.

JSON Format#

[
  {
    "name": "treatment_arm_A",
    "samples": ["PT_A001", "PT_A002"],
    "metadata": {
      "drug": "compound_X",
      "dose": "100mg"
    }
  },
  {
    "name": "treatment_arm_B",
    "samples": ["PT_B001", "PT_B002"],
    "metadata": {
      "drug": "compound_Y",
      "dose": "50mg"
    }
  }
]

You can combine inline `[[sample_groups]]` with `sample_groups_file`; entries from both sources are merged.
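
This guide does not spell out how the merge handles a group name that appears in both sources. The Python sketch below shows one plausible policy, stated purely as an assumption: groups are keyed by name, sample lists are unioned in order of first appearance, and later metadata keys overwrite earlier ones. `merge_groups` is a hypothetical helper; check the engine's actual behaviour before relying on duplicate names.

```python
import json


def merge_groups(inline: list, from_file: list) -> list:
    """Merge two lists of group entries by name (assumed policy, see lead-in)."""
    merged = {}
    for entry in [*inline, *from_file]:
        target = merged.setdefault(
            entry["name"], {"name": entry["name"], "samples": [], "metadata": {}}
        )
        for sample in entry["samples"]:
            if sample not in target["samples"]:  # skip duplicate sample IDs
                target["samples"].append(sample)
        target["metadata"].update(entry.get("metadata", {}))
    return list(merged.values())


inline = [{"name": "healthy", "samples": ["H001", "H002"]}]
from_file = json.loads(
    '[{"name": "treatment_arm_A", "samples": ["PT_A001", "PT_A002"],'
    ' "metadata": {"drug": "compound_X", "dose": "100mg"}}]'
)
merged = merge_groups(inline, from_file)
print([g["name"] for g in merged])  # ['healthy', 'treatment_arm_A']
```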

External file benefits:

- Easily manage large cohort definitions
- Share group definitions across multiple workflows
- Update sample lists without modifying workflow files
- Attach metadata for downstream reporting