SAM/BAM Processing Tutorial#
This tutorial walks through a complete SAM/BAM processing pipeline using samtools. By the end, you will have sorted, indexed, filtered, and summarized a BAM file — all with natural-language commands.
Time to complete: 20–30 minutes
Prerequisites: oxo-call configured, samtools installed
You will learn: multi-step BAM workflows, dry-run validation, docs enrichment
Overview: The BAM Processing Pipeline#
A typical BAM processing workflow looks like this:
raw alignment (.bam)
│
▼ sort by coordinate
sorted.bam
│
▼ index
sorted.bam.bai
│
▼ filter (keep primary, mapped reads)
filtered.bam + filtered.bam.bai
│
▼ statistics
flagstat.txt + idxstats.txt
We will use oxo-call at every step, demonstrating how natural-language commands map to correct samtools flags.
Setting Up#
Check that samtools is available#
Add rich documentation (optional)#
oxo-call fetches --help automatically, but you can add the samtools manual for richer context:
Verify the documentation is indexed:
Check the built-in skill#
The skill contains concepts like "BAM MUST be coordinate-sorted BEFORE indexing" — this knowledge guides the LLM to generate the right flags in the right order.
Step 1: Sort by Coordinate#
Most downstream tools require a coordinate-sorted BAM. Let us generate the command:
Expected output:
Command: samtools sort -@ 4 -o sorted.bam aligned.bam
Explanation: -@ 4 uses 4 worker threads; -o specifies the output file; coordinate sort is the default.
Now execute it:
Why does
-@ 4appear?
The samtools skill includes the concept: "Use -@ N for N additional threads." This guides the LLM to include threading flags whenever you mention thread count.
Step 2: Index the Sorted BAM#
Indexing is required for random-access queries and most genome browsers:
Execute:
This creates sorted.bam.bai.
Note: The skill reminds the LLM that you must sort before indexing. If you try to index an unsorted BAM, samtools will error. Try running
oxo-call dry-run samtools "index unsorted.bam"— the explanation will mention sorting as a prerequisite.
Step 3: Filter to Primary Mapped Reads#
Remove secondary alignments, supplementary alignments, and unmapped reads:
oxo-call dry-run samtools \
"extract only primary mapped reads from sorted.bam into filtered.bam, then index filtered.bam"
Expected:
Command: samtools view -F 0x904 -b -o filtered.bam sorted.bam && samtools index filtered.bam
Explanation: -F 0x904 excludes unmapped (0x4), secondary (0x100), and supplementary (0x800) reads; -b writes BAM format.
Run it:
oxo-call run samtools \
"extract only primary mapped reads from sorted.bam into filtered.bam, then index filtered.bam"
Understanding
-F 0x904:
0x904 = 0x4 | 0x100 | 0x800— this bitmask excludes unmapped, secondary, and supplementary alignments. The samtools skill includes this as a common pitfall: "Use -F 0x904 to keep only primary mapped reads." Without the skill, the LLM might use a less complete flag set.
Step 4: Check Alignment Statistics#
Get a summary of the alignment quality:
oxo-call dry-run samtools "show alignment statistics for filtered.bam and save to flagstat.txt"
# → samtools flagstat filtered.bam > flagstat.txt
oxo-call run samtools "show alignment statistics for filtered.bam and save to flagstat.txt"
cat flagstat.txt
Example output:
10234567 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
10234567 + 0 mapped (100.00% : N/A)
10234567 + 0 paired in sequencing
5117284 + 0 read1
5117283 + 0 read2
...
Get per-chromosome read counts:
oxo-call dry-run samtools "show per-chromosome read counts for filtered.bam and save to idxstats.txt"
# → samtools idxstats filtered.bam > idxstats.txt
Step 5: Mark Duplicates (Optional)#
For most variant-calling pipelines, duplicate reads should be marked:
oxo-call dry-run picard \
"mark duplicates in filtered.bam, output to dedup.bam, write metrics to dedup_metrics.txt"
Expected:
Command: picard MarkDuplicates I=filtered.bam O=dedup.bam M=dedup_metrics.txt
Explanation: I/O are input/output; M writes duplication metrics; duplicates are marked not removed by default.
Picard vs samtools markdup?
Ask oxo-call to explain:oxo-call dry-run samtools "mark duplicate reads in filtered.bam"will generatesamtools markdup. Both approaches are valid; the choice depends on your pipeline. Try both dry-runs to compare.
Reviewing the Full Pipeline#
Look at everything that ran:
You will see each command with its exit code and provenance metadata, giving you a complete audit trail of the BAM processing steps.
Putting It Together with a Workflow#
All these steps can be automated with the native workflow engine. See the Workflow Builder tutorial to learn how to turn this pipeline into a reproducible .oxo.toml file.
What You Learned#
- How to run a multi-step BAM processing pipeline with oxo-call
- Why skills matter —
samtoolsskill providessort-before-indexand bitmask guidance - How to enrich documentation with remote URLs
- How to review full execution history with provenance