Run on a Cluster#

This guide explains how to execute oxo-flow workflows on HPC clusters using SLURM, PBS, SGE, and LSF backends.

Overview#

oxo-flow's cluster module translates each rule into a cluster job submission. Resource requirements declared in the .oxoflow file (threads, memory, gpu, disk, time_limit) are mapped to the appropriate scheduler directives.

Environment wrapping is applied automatically: commands that run in conda, docker, singularity, pixi, venv, or module-based environments are wrapped accordingly in the generated scripts.

Supported Schedulers#

| Scheduler | Status | Directive prefix |
|---|---|---|
| SLURM | Supported | `#SBATCH` |
| PBS/Torque | Supported | `#PBS` |
| SGE | Supported | `#$` |
| LSF | Supported | `#BSUB` |

Declaring Resources#

Set resource requirements per rule:

[[rules]]
name = "align"
input = ["{sample}_R1.fastq.gz"]
output = ["aligned/{sample}.bam"]
threads = 16
memory = "32G"
environment = { singularity = "docker://biocontainers/bwa:0.7.17" }
shell = "bwa mem -t {threads} ref.fa {input} | samtools sort -o {output}"

[rules.resources]
gpu = 0
disk = "100G"
time_limit = "24h"

Resource fields#

| Field | Type | Example | Description |
|---|---|---|---|
| `threads` | Integer | `16` | Number of CPU cores |
| `memory` | String | `"32G"` | RAM allocation |
| `gpu` | Integer | `1` | Number of GPUs (simple count) |
| `gpu_spec` | Table | See below | Detailed GPU specification |
| `disk` | String | `"100G"` | Local disk space |
| `time_limit` | String | `"24h"` | Wall-time limit |
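
The generated scripts in this guide render `time_limit = "24h"` as `24:00:00`. The sketch below shows one way such a conversion could work. The helper name `to_walltime` is hypothetical and not part of oxo-flow, and it assumes only unpadded integer values with an `h` or `m` suffix; the formats oxo-flow actually accepts may differ.

```shell
# Hypothetical helper (not part of oxo-flow): convert "24h" or "90m"
# into the HH:MM:SS form used by scheduler wall-time directives.
# Assumes an unpadded integer followed by "h" or "m".
to_walltime() {
  local value="${1%[hm]}"      # numeric part, e.g. 24
  local unit="${1#"$value"}"   # trailing unit, e.g. h
  local minutes
  case "$value" in
    ''|*[!0-9]*) echo "unsupported time limit: $1" >&2; return 1 ;;
  esac
  case "$unit" in
    h) minutes=$((value * 60)) ;;
    m) minutes=$value ;;
    *) echo "unsupported time limit: $1" >&2; return 1 ;;
  esac
  printf '%02d:%02d:00\n' "$((minutes / 60))" "$((minutes % 60))"
}

to_walltime 24h   # prints 24:00:00
to_walltime 90m   # prints 01:30:00
```

Inputs without a recognized unit are rejected with a non-zero exit status rather than guessed at.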

GPU Specification#

For basic GPU requests, use the gpu field:

[rules.resources]
gpu = 2  # Request 2 GPUs

For advanced GPU configuration (SLURM only), use gpu_spec:

[rules.resources.gpu_spec]
count = 2
model = "a100"       # GPU model (optional, SLURM only)
memory_gb = 40       # Per-GPU memory in GB (optional, SLURM only)

Different schedulers handle GPU requests differently:

| Scheduler | GPU Directive | Notes |
|---|---|---|
| SLURM | `--gres=gpu:2` or `--gres=gpu:a100:2:40g` | Full support for model and memory spec |
| PBS | `gpu=2` | Basic count only; model selection varies by site |
| SGE | `-l gpu=2` | Basic count only; requires queue configuration |
| LSF | `-gpu 2` | Basic count only |
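
As a rough illustration of the table above, the following sketch builds the GPU fragment for each scheduler. The function `gpu_directive` is a hypothetical name, and the output strings are copied from the table rather than taken from oxo-flow's implementation; the model and per-GPU memory arguments only affect the SLURM case.

```shell
# Hypothetical helper (not part of oxo-flow): print the GPU request
# fragment listed in the table for a given scheduler.
# Usage: gpu_directive <scheduler> <count> [model] [memory_gb]
gpu_directive() {
  local scheduler="$1" count="$2" model="${3:-}" memory_gb="${4:-}"
  local gres="gpu"
  case "$scheduler" in
    slurm)
      # Only SLURM receives the optional model and per-GPU memory.
      if [ -n "$model" ]; then gres="$gres:$model"; fi
      gres="$gres:$count"
      if [ -n "$memory_gb" ]; then gres="$gres:${memory_gb}g"; fi
      echo "--gres=$gres"
      ;;
    pbs) echo "gpu=$count" ;;
    sge) echo "-l gpu=$count" ;;
    lsf) echo "-gpu $count" ;;
    *) echo "unknown scheduler: $scheduler" >&2; return 1 ;;
  esac
}

gpu_directive slurm 2            # prints --gres=gpu:2
gpu_directive slurm 2 a100 40    # prints --gres=gpu:a100:2:40g
gpu_directive sge 2              # prints -l gpu=2
```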

SLURM Example#

oxo-flow generates SLURM job scripts automatically. For the align rule above, the generated script looks like:

#!/bin/bash
#SBATCH --job-name=align
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --time=24:00:00
#SBATCH --output=logs/align_%j.out
#SBATCH --error=logs/align_%j.err

# Environment wrapping (automatically applied)
singularity exec docker://biocontainers/bwa:0.7.17 \
  bwa mem -t 16 ref.fa sample1_R1.fastq.gz | samtools sort -o aligned/sample1.bam
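
To make the mapping from rule fields to directives concrete, here is a minimal sketch that reproduces the header shown above from four values. The function `slurm_header` is a hypothetical name used for illustration; it is not oxo-flow's generator and it ignores fields such as `gpu` and `disk`.

```shell
# Hypothetical helper (not part of oxo-flow): print a SLURM header that
# matches the example script for the given rule name and resources.
# Usage: slurm_header <name> <threads> <memory> <walltime>
slurm_header() {
  local name="$1" threads="$2" memory="$3" walltime="$4"
  cat <<EOF
#!/bin/bash
#SBATCH --job-name=$name
#SBATCH --cpus-per-task=$threads
#SBATCH --mem=$memory
#SBATCH --time=$walltime
#SBATCH --output=logs/${name}_%j.out
#SBATCH --error=logs/${name}_%j.err
EOF
}

slurm_header align 16 32G 24:00:00
```

SLURM expands `%j` in the output and error paths to the job ID at run time, so each submission writes to its own log files.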

PBS Example#

#!/bin/bash
#PBS -N align
#PBS -l ncpus=16
#PBS -l mem=32gb
#PBS -l walltime=24:00:00
#PBS -o logs/align.out
#PBS -e logs/align.err

cd $PBS_O_WORKDIR

# Environment wrapping (automatically applied)
singularity exec docker://biocontainers/bwa:0.7.17 \
  bwa mem -t 16 ref.fa sample1_R1.fastq.gz | samtools sort -o aligned/sample1.bam
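
Note that the PBS script writes the memory request as `32gb`, while the rule declares `memory = "32G"`. A conversion along these lines would produce that form; `pbs_mem` is a hypothetical helper, and it assumes the declared value is a number followed by a single unit letter.

```shell
# Hypothetical helper (not part of oxo-flow): turn "32G" into the
# lowercase "32gb" form used in the PBS example.
pbs_mem() {
  printf '%sb\n' "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
}

pbs_mem 32G    # prints 32gb
pbs_mem 512M   # prints 512mb
```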

SGE Example#

#!/bin/bash
#$ -N align
#$ -pe smp 16
#$ -l h_vmem=2G
#$ -l h_rt=24:00:00
#$ -o logs/align.out
#$ -e logs/align.err
#$ -cwd

# Environment wrapping (automatically applied)
singularity exec docker://biocontainers/bwa:0.7.17 \
  bwa mem -t 16 ref.fa sample1_R1.fastq.gz | samtools sort -o aligned/sample1.bam
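
The SGE script requests `h_vmem=2G` even though the rule declares `memory = "32G"`. This is consistent with per-slot accounting: on many Grid Engine installations `h_vmem` applies to each slot of the parallel environment, so 32G spread over 16 slots is 2G per slot. The sketch below shows that arithmetic. The helper `per_slot_mem` is hypothetical, and whether oxo-flow computes the value exactly this way is an assumption.

```shell
# Hypothetical helper (not part of oxo-flow): divide a total memory
# request across slots, as a per-slot h_vmem limit requires.
# Usage: per_slot_mem <total, e.g. 32G> <slots>
per_slot_mem() {
  local total="$1" slots="$2"
  local amount="${total%?}"            # numeric part, e.g. 32
  local unit="${total#"$amount"}"      # last character, e.g. G
  # Fall back to megabytes when gigabytes do not divide evenly.
  if [ "$unit" = "G" ] && [ $((amount % slots)) -ne 0 ]; then
    amount=$((amount * 1024))
    unit="M"
  fi
  echo "$((amount / slots))$unit"
}

per_slot_mem 32G 16   # prints 2G
per_slot_mem 20G 16   # prints 1280M
```

Sites that configure `h_vmem` as a per-job limit instead would need the undivided value, so check the local queue configuration before relying on this.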

Environment Wrapping#

When generating cluster scripts, oxo-flow automatically wraps commands through the environment resolver:

| Backend | Wrapping |
|---|---|
| Conda | `conda activate <env>; <command>` |
| Docker | `docker run --rm -v ... <image> <command>` |
| Singularity | `singularity exec <image> <command>` |
| Pixi | `pixi run <command>` |
| Venv | `source <venv>/bin/activate; <command>` |
| Modules | `module load <mod1> <mod2>; <command>` |
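
The wrapping patterns can be sketched as a simple lookup. The function `wrap_command` is a hypothetical name and only mirrors the patterns listed for each backend; it is not oxo-flow's environment resolver. Docker is left out because its volume arguments are elided in the source.

```shell
# Hypothetical helper (not part of oxo-flow): wrap a command using the
# pattern listed for each backend.
# Usage: wrap_command <backend> <env spec> <command>
wrap_command() {
  local backend="$1" spec="$2" cmd="$3"
  case "$backend" in
    conda)       echo "conda activate $spec; $cmd" ;;
    singularity) echo "singularity exec $spec $cmd" ;;
    pixi)        echo "pixi run $cmd" ;;   # spec is unused in this pattern
    venv)        echo "source $spec/bin/activate; $cmd" ;;
    modules)     echo "module load $spec; $cmd" ;;
    *) echo "unsupported backend: $backend" >&2; return 1 ;;
  esac
}

wrap_command singularity docker://biocontainers/bwa:0.7.17 "bwa mem -t 16 ref.fa reads.fq"
wrap_command modules "bwa/0.7.17 samtools/1.17" "bwa mem -t 32 ref.fa reads.fq"
```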

Environment Examples#

Conda with GPU for deep learning:

[[rules]]
name = "train_model"
input = ["data/train.h5"]
output = ["models/trained.pt"]
threads = 8
memory = "64G"
environment = { conda = "envs/pytorch.yaml" }
shell = "python train.py --input {input} --output {output} --gpus {resources.gpu}"

[rules.resources]
gpu = 2
time_limit = "24h"

Singularity with Modules (common on HPC):

[[rules]]
name = "variant_call"
input = ["aligned/{sample}.bam"]
output = ["variants/{sample}.vcf"]
threads = 16
memory = "32G"
# The CUDA module is loaded first
environment = { singularity = "docker://broadinstitute/gatk:4.4.0.0", modules = ["cuda/11.8"] }
shell = "gatk HaplotypeCaller -I {input} -O {output}"

Pixi for reproducible environments:

[[rules]]
name = "qc_check"
input = ["{sample}.fastq.gz"]
output = ["qc/{sample}_fastqc.html"]
threads = 4
environment = { pixi = "pixi.toml" }
shell = "fastqc -t {threads} -o qc/ {input}"

Pure Module-based (traditional HPC):

[[rules]]
name = "align"
input = ["reads/{sample}.fq"]
output = ["aligned/{sample}.bam"]
threads = 32
memory = "64G"
environment = { modules = ["bwa/0.7.17", "samtools/1.17", "gcc/11"] }
shell = "bwa mem -t {threads} ref.fa {input} | samtools sort -o {output}"

Pre-build environments on cluster nodes

Ensure your conda environments, docker images, or singularity containers are available on all cluster nodes before submitting jobs. Use --skip-env-setup when environments are pre-built.

Resource Enforcement#

Local Execution#

When running locally (oxo-flow run), resource constraints are enforced:

1. Check: before execution, verify that the required resources are available.
2. Reserve: reserve the resources before starting the job.
3. Release: release the resources after completion, including on failure or timeout.

# Limit to 16 threads and 32GB memory for local execution
oxo-flow run pipeline.oxoflow --max-threads 16 --max-memory 32768
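
The check, reserve, and release cycle can be illustrated with a small thread pool. This is a minimal sketch under the assumption of a single counter for threads; the function names are hypothetical and oxo-flow's internal bookkeeping is not shown in this guide.

```shell
free_threads=16   # mirrors --max-threads 16

reserve_threads() {
  # Check: refuse the request if it exceeds what is currently free.
  [ "$1" -le "$free_threads" ] || return 1
  # Reserve: deduct from the pool before the job starts.
  free_threads=$((free_threads - $1))
}

release_threads() {
  free_threads=$((free_threads + $1))
}

# Run a command with N reserved threads and always give them back.
run_with_threads() {
  local n="$1" status=0
  shift
  if ! reserve_threads "$n"; then
    echo "not enough free threads for: $*" >&2
    return 1
  fi
  "$@" || status=$?
  release_threads "$n"   # Release: runs on success and on failure
  return "$status"
}

run_with_threads 8 echo "aligning with 8 threads"
```

Because the release step runs regardless of the command's exit status, a failed job does not leave threads permanently reserved.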

Cluster Execution#

On clusters, the scheduler enforces resources based on the generated directives. oxo-flow itself does not check, reserve, or release resources during cluster execution.

Best Practices#

Use Singularity on clusters

Most HPC clusters do not allow Docker. Use Singularity instead: when you specify singularity = "docker://...", oxo-flow handles the conversion of the Docker image automatically.

Set realistic time limits

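Generous wall-time limits prevent premature job termination, but long requests may lower scheduling priority and leave the scheduler fewer opportunities to backfill the job. Profile your jobs first, then request a limit with a modest safety margin.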

Use --keep-going for large batches

When running hundreds of samples, use oxo-flow run -k so that a single failure does not abort the entire run.
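
The idea behind keep-going can be shown with a plain loop that records failures instead of stopping at the first one. This is only an illustration of the behavior described above, not how oxo-flow schedules jobs; `process_sample` is a hypothetical stand-in that fails for one sample.

```shell
# Hypothetical stand-in for running one sample; fails for sample "s2".
process_sample() {
  [ "$1" != "s2" ]
}

# Process every sample, count failures, and report at the end.
run_keep_going() {
  local failed=0 sample
  for sample in "$@"; do
    if process_sample "$sample"; then
      echo "ok: $sample"
    else
      echo "failed: $sample"
      failed=$((failed + 1))
    fi
  done
  echo "$failed of $# samples failed"
  [ "$failed" -eq 0 ]   # non-zero exit status if anything failed
}

run_keep_going s1 s2 s3 || true
```

The run still ends with a non-zero status when any sample fails, so the failure stays visible to calling scripts even though the remaining samples were processed.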

Check resource availability

Use sinfo (SLURM), pbsnodes (PBS), qhost (SGE), or bhosts (LSF) to verify available resources before submitting.

Cache environment setup

Use --cache-dir to persist environment setup state across runs for faster startup.

Monitoring Jobs#

After submission, use your cluster's native tools:

# SLURM
squeue -u $USER

# PBS
qstat -u $USER

# SGE
qstat

# LSF
bjobs
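
For unattended runs it can be handy to block until the queue is empty. The sketch below polls any job-listing command until it prints nothing. `wait_for_jobs` is a hypothetical helper, and it assumes the listing command prints no header, for example `squeue -h -u "$USER"` on SLURM.

```shell
# Hypothetical helper: re-run a job-listing command every <interval>
# seconds until it produces no output, meaning no jobs remain.
# Usage: wait_for_jobs <interval seconds> <listing command...>
wait_for_jobs() {
  local interval="$1"
  shift
  while [ -n "$("$@")" ]; do
    sleep "$interval"
  done
}

# Example (SLURM): -h suppresses the header so an empty queue prints nothing.
# wait_for_jobs 30 squeue -h -u "$USER"
```

Listing commands that always print a header would keep this loop running forever, so strip the header first on schedulers that lack a no-header option.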

Or use oxo-flow's status command with a checkpoint file:

oxo-flow status .oxo-flow/checkpoint.json