
Cohort Studies with Sample Groups#

This guide explains how to use [[sample_groups]] (WC-02) to run per-sample and per-group analyses across an entire cohort.

Problem#

Population-scale studies require the same pipeline steps to run independently for every sample in every group. Manually duplicating rules for 50+ samples is error-prone and unmaintainable.

Solution: [[sample_groups]]#

Define one [[sample_groups]] block per group. Each block contains a list of sample IDs. Rules that use {sample} or {group} placeholders are expanded once per (group, sample) combination.

[[sample_groups]]
name    = "control"
samples = ["CTRL_001", "CTRL_002", "CTRL_003"]

[[sample_groups]]
name    = "case"
samples = ["CASE_001", "CASE_002"]

Wildcard Placeholders#

Placeholder   Replaced with
{sample}      Individual sample name
{group}       Group name for this sample
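
The substitution in the table above can be pictured with a few lines of Python. This is only an illustrative sketch, not the engine's implementation: the function name fill_placeholders and the results/{group}/{sample}.vcf path are hypothetical. Plain string replacement is used so that other braces in a rule, such as {input[0]} or {config.reference}, pass through untouched.

```python
def fill_placeholders(template: str, sample: str, group: str) -> str:
    """Replace {sample} and {group} in a template string.

    str.replace is used rather than str.format so that unrelated
    braces like {input[0]} are left exactly as written.
    """
    return template.replace("{sample}", sample).replace("{group}", group)


# Path taken from the "align" rule in this guide.
print(fill_placeholders("aligned/{sample}.bam", "CTRL_001", "control"))

# Hypothetical path that uses both placeholders.
print(fill_placeholders("results/{group}/{sample}.vcf", "CASE_002", "case"))
```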

Expanded Rule Names#

For a rule named align, expanded across the two groups above, the engine produces:

  • align_control_CTRL_001
  • align_control_CTRL_002
  • align_control_CTRL_003
  • align_case_CASE_001
  • align_case_CASE_002

Rules that do not reference {sample} or {group} (e.g., a multiqc step that takes the whole qc/ directory) run once and are kept as-is.
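
The naming scheme and the run-once behaviour can be sketched in Python as follows. The rule_group_sample pattern comes from the list above; everything else (the helper names uses_wildcards and expanded_names, and representing rules as dictionaries) is an assumption made for illustration and may differ from what the engine does internally.

```python
GROUPS = {
    "control": ["CTRL_001", "CTRL_002", "CTRL_003"],
    "case": ["CASE_001", "CASE_002"],
}


def uses_wildcards(rule: dict) -> bool:
    """True if any input, output, or shell string mentions {sample} or {group}."""
    texts = [*rule.get("input", []), *rule.get("output", []), rule.get("shell", "")]
    return any("{sample}" in t or "{group}" in t for t in texts)


def expanded_names(rule: dict, groups: dict) -> list:
    """One name per (group, sample) pair, or the bare name for run-once rules."""
    if not uses_wildcards(rule):
        return [rule["name"]]
    return [
        f"{rule['name']}_{group}_{sample}"
        for group, samples in groups.items()
        for sample in samples
    ]


align = {
    "name": "align",
    "input": ["raw/{sample}_R1.fq.gz", "raw/{sample}_R2.fq.gz"],
    "output": ["aligned/{sample}.bam"],
}
multiqc = {
    "name": "multiqc",
    "input": ["qc/"],
    "output": ["reports/multiqc_report.html"],
}

print(expanded_names(align, GROUPS))    # five names, one per sample
print(expanded_names(multiqc, GROUPS))  # a single, unexpanded name
```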

Group Metadata#

Attach arbitrary metadata to each group for use in downstream reporting:

[[sample_groups]]
name    = "treatment_arm_A"
samples = ["PT_A001", "PT_A002"]
[sample_groups.metadata]
drug  = "compound_X"
dose  = "100mg"

Minimal Example#

[workflow]
name = "cohort-minimal"

[config]
reference = "/data/ref/hg38.fa"

[[sample_groups]]
name    = "healthy"
samples = ["H001", "H002"]

[[sample_groups]]
name    = "disease"
samples = ["D001", "D002", "D003"]

[[rules]]
name   = "align"
input  = ["raw/{sample}_R1.fq.gz", "raw/{sample}_R2.fq.gz"]
output = ["aligned/{sample}.bam"]
shell  = "bwa mem -t {threads} {config.reference} {input[0]} {input[1]} | samtools sort -o {output[0]}"
threads = 8

[[rules]]
name   = "haplotype_caller"
input  = ["aligned/{sample}.bam"]
output = ["gvcf/{sample}.g.vcf.gz"]
shell  = "gatk HaplotypeCaller -I {input[0]} -R {config.reference} -O {output[0]} -ERC GVCF"
threads = 4

# Aggregation step — runs ONCE for all samples
[[rules]]
name   = "multiqc"
input  = ["qc/"]
output = ["reports/multiqc_report.html"]
shell  = "multiqc qc/ -o reports/"

Combining Groups and Pairs#

You can use both [[sample_groups]] and [[pairs]] in the same workflow. They expand independently: group-wildcard rules are expanded over samples, and pair-wildcard rules are expanded over pairs.

Full Example#

See examples/cohort_analysis.oxoflow for a complete population genomics pipeline including QC, alignment, deduplication, variant calling, and MultiQC aggregation.


Loading Groups from External File#

For large cohorts with many groups and samples, use sample_groups_file in [workflow] instead of inline [[sample_groups]]:

[workflow]
name = "cohort-analysis"
sample_groups_file = "metadata/groups.tsv"  # or .csv, .json

Supported formats:

TSV Format#

group   sample
healthy H001
healthy H002
disease D001
disease D002

To attach metadata to groups, use the JSON format instead.
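
To check a TSV file before handing it to the workflow, the two-column layout above can be read with a short Python script. This is a sketch under stated assumptions: the columns are tab-separated, the header row is exactly group and sample, and read_groups_tsv is a hypothetical helper rather than part of the tool.

```python
import csv
import io

# Same rows as the TSV example above, with explicit tab separators.
TSV_TEXT = (
    "group\tsample\n"
    "healthy\tH001\n"
    "healthy\tH002\n"
    "disease\tD001\n"
    "disease\tD002\n"
)


def read_groups_tsv(handle) -> dict:
    """Collect sample IDs under their group name, preserving file order."""
    groups = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        groups.setdefault(row["group"], []).append(row["sample"])
    return groups


groups = read_groups_tsv(io.StringIO(TSV_TEXT))
print(groups)
```

For a real file, pass an open file handle (for example, open("metadata/groups.tsv", newline="")) in place of the io.StringIO object.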

JSON Format#

[
  {
    "name": "treatment_arm_A",
    "samples": ["PT_A001", "PT_A002"],
    "metadata": {
      "drug": "compound_X",
      "dose": "100mg"
    }
  },
  {
    "name": "treatment_arm_B",
    "samples": ["PT_B001", "PT_B002"],
    "metadata": {
      "drug": "compound_Y",
      "dose": "50mg"
    }
  }
]

You can combine inline [[sample_groups]] with sample_groups_file; entries from both sources are merged.
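
This guide does not spell out the merge order or what happens when the same group name appears in both sources. The Python sketch below therefore assumes the simplest reading, plain concatenation with inline entries first, and adds a duplicate-name check that can be run over your own definitions before relying on the merge. The names merge_groups and duplicate_names are hypothetical.

```python
import json

# Inline groups, as they would appear after parsing the workflow file.
INLINE_GROUPS = [{"name": "healthy", "samples": ["H001", "H002"]}]

# Contents of a JSON groups file in the format shown above.
GROUPS_JSON = """
[
  {
    "name": "treatment_arm_A",
    "samples": ["PT_A001", "PT_A002"],
    "metadata": {"drug": "compound_X", "dose": "100mg"}
  }
]
"""


def merge_groups(inline: list, from_file: list) -> list:
    """Assumed merge: concatenate, inline entries first."""
    return [*inline, *from_file]


def duplicate_names(groups: list) -> list:
    """Group names that occur more than once, sorted alphabetically."""
    names = [g["name"] for g in groups]
    return sorted({n for n in names if names.count(n) > 1})


merged = merge_groups(INLINE_GROUPS, json.loads(GROUPS_JSON))
print([g["name"] for g in merged])
print(duplicate_names(merged))  # an empty list means no name clashes
```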

External file benefits

  • Easily manage large cohort definitions
  • Share group definitions across multiple workflows
  • Update sample lists without modifying workflow files
  • Supports metadata for downstream reporting