RNA-seq Analysis Walkthrough#
This tutorial walks through a complete bulk RNA-seq analysis pipeline — from raw FASTQ reads to a count matrix — using oxo-call to generate every command. This mirrors a real-world analysis workflow.
Time to complete: 30–45 minutes (reading) + compute time
Prerequisites: oxo-call configured, tools installed: fastp, STAR, featureCounts (or salmon)
You will learn: end-to-end pipeline construction, workflow integration, skill-driven accuracy
RNA-seq Pipeline Overview#
Raw FASTQ reads
│
▼ Quality Control (fastp)
Trimmed FASTQ + QC report
│
▼ Alignment (STAR)
Aligned BAM (coordinate-sorted)
│
▼ QC Aggregation (MultiQC)
Combined QC report
│
▼ Quantification (featureCounts)
Count matrix (gene × sample)
We will use oxo-call at each step. The commands shown use example filenames — adapt them to your data.
Setup#
Sample data assumptions#
This tutorial assumes:
- Paired-end RNA-seq data:
sample1_R1.fastq.gz,sample1_R2.fastq.gz - STAR genome index at:
/data/star_hg38/ - GTF annotation at:
/data/gencode.v44.gtf
Verify tools and skills#
# Check tools are installed
fastp --version
STAR --version
featureCounts -v
# Check built-in skills
oxo-call skill list | grep -E "fastp|star|featurecounts"
# fastp ✓ built-in
# star ✓ built-in
# featurecounts ✓ built-in
Add remote documentation (optional but recommended)#
oxo-call docs add fastp --url https://github.com/OpenGene/fastp#usage
oxo-call docs add star --url https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf
Step 1: Quality Control with fastp#
fastp trims adapters, removes low-quality reads, and generates a QC report in one step.
Dry-run first#
oxo-call dry-run fastp \
"trim adapters from paired-end reads sample1_R1.fastq.gz and sample1_R2.fastq.gz, \
output trimmed reads to trimmed/sample1_R1.fastq.gz and trimmed/sample1_R2.fastq.gz, \
use 4 threads, generate HTML report at qc/sample1.html and JSON at qc/sample1.json"
Expected:
Command: fastp \
--in1 sample1_R1.fastq.gz --in2 sample1_R2.fastq.gz \
--out1 trimmed/sample1_R1.fastq.gz --out2 trimmed/sample1_R2.fastq.gz \
--thread 4 \
--html qc/sample1.html --json qc/sample1.json
Explanation: fastp auto-detects adapters; --thread 4 for parallelism; HTML/JSON for MultiQC compatibility.
Execute#
mkdir -p trimmed qc
oxo-call run fastp \
"trim adapters from paired-end reads sample1_R1.fastq.gz and sample1_R2.fastq.gz, \
output trimmed reads to trimmed/sample1_R1.fastq.gz and trimmed/sample1_R2.fastq.gz, \
use 4 threads, generate HTML report at qc/sample1.html and JSON at qc/sample1.json"
Why no
--adapter_sequence?
The fastp skill includes the concept: "fastp auto-detects adapters — do not hardcode adapter sequences unless they are being missed." This is a common mistake when migrating from trimmomatic.
Step 2: Alignment with STAR#
STAR is the gold-standard splice-aware aligner for RNA-seq. It requires a pre-built genome index.
Build the index (if you have not already)#
oxo-call dry-run STAR \
"build a genome index for hg38, genome fasta at /data/hg38.fa, \
GTF at /data/gencode.v44.gtf, output to /data/star_hg38, use 8 threads, \
sjdbOverhang 100 (read length minus 1)"
Expected:
Command: STAR \
--runMode genomeGenerate \
--genomeDir /data/star_hg38 \
--genomeFastaFiles /data/hg38.fa \
--sjdbGTFfile /data/gencode.v44.gtf \
--sjdbOverhang 100 \
--runThreadN 8
Align trimmed reads#
oxo-call dry-run STAR \
"align trimmed paired-end reads trimmed/sample1_R1.fastq.gz and trimmed/sample1_R2.fastq.gz \
to genome index at /data/star_hg38, output to aligned/sample1/, \
sort BAM by coordinate, use 8 threads, generate a BAM file"
Expected:
Command: STAR \
--genomeDir /data/star_hg38 \
--readFilesIn trimmed/sample1_R1.fastq.gz trimmed/sample1_R2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix aligned/sample1/ \
--runThreadN 8
Explanation: --readFilesCommand zcat handles .gz files; SortedByCoordinate produces a ready-to-use BAM.
The
--readFilesCommand zcatpitfall:
Forgetting this flag when using.gzinput is one of the most common STAR mistakes. The STAR skill includes it as a pitfall: "Always add--readFilesCommand zcatfor compressed FASTQ."
Execute:
mkdir -p aligned/sample1
oxo-call run STAR \
"align trimmed paired-end reads trimmed/sample1_R1.fastq.gz and trimmed/sample1_R2.fastq.gz \
to genome index at /data/star_hg38, output to aligned/sample1/, \
sort BAM by coordinate, use 8 threads, generate a BAM file"
STAR Two-Pass Mode:
For novel splice junction discovery (e.g., tumor RNA-seq, rare transcripts), consider using STAR's two-pass mode. In the first pass, STAR discovers splice junctions; in the second pass, it re-maps reads using the discovered junctions for improved sensitivity. Add--twopassMode Basicto your alignment task description. Note that two-pass mode increases runtime and memory usage. For standard differential expression analyses with well-annotated genomes (e.g., human, mouse), one-pass mode with a comprehensive GTF annotation is usually sufficient.
Index the BAM#
Step 3: Aggregate QC with MultiQC#
MultiQC reads all QC outputs and creates a single interactive report. It should be run after all samples are processed.
oxo-call dry-run multiqc \
"aggregate QC reports from qc/ and STAR log files from aligned/ into a single report at results/multiqc/"
Expected:
Command: multiqc qc/ aligned/ -o results/multiqc/
Explanation: multiqc automatically discovers fastp JSON reports and STAR Log.final.out files in the specified directories.
mkdir -p results/multiqc
oxo-call run multiqc \
"aggregate QC reports from qc/ and STAR log files from aligned/ into a single report at results/multiqc/"
Open results/multiqc/multiqc_report.html in a browser to review:
- Per-sample read quality before/after trimming
- Alignment rates
- Duplication rates
- Any samples that need attention
Step 4: Quantification with featureCounts#
featureCounts counts reads per gene. It is part of the subread package.
oxo-call dry-run featureCounts \
"count reads per gene in aligned/sample1/Aligned.sortedByCoord.out.bam \
using GTF at /data/gencode.v44.gtf, for paired-end data, \
use 4 threads, output to results/counts.txt"
Expected:
Command: featureCounts \
-a /data/gencode.v44.gtf \
-o results/counts.txt \
-p \
-T 4 \
aligned/sample1/Aligned.sortedByCoord.out.bam
Explanation: -p specifies paired-end mode; -a is the GTF annotation; -T sets threads.
Using Salmon for quantification instead?
featureCounts works on aligned BAMs (alignment-based). Salmon uses pseudo-alignment directly on FASTQ files (alignment-free) and is faster for many use cases. Try:
oxo-call dry-run salmon "quantify sample1_R1.fastq.gz and sample1_R2.fastq.gz against index at /data/salmon_hg38_index, output to quant/sample1/"
Execute:
mkdir -p results
oxo-call run featureCounts \
"count reads per gene in aligned/sample1/Aligned.sortedByCoord.out.bam \
using GTF at /data/gencode.v44.gtf, for paired-end data, \
use 4 threads, output to results/counts.txt"
Step 5: Running Multiple Samples#
For a real experiment you will have many samples. Instead of repeating commands manually, use the workflow engine:
# View the built-in RNA-seq template
oxo-call workflow show rnaseq
# Dry-run the complete pipeline
oxo-call workflow dry-run rnaseq
To customize for your samples and paths, see the Workflow Builder tutorial.
Checking the Full Run History#
This shows every command in the order it was executed, with exit codes, timestamps, and provenance metadata.
What You Learned#
- How to run a complete RNA-seq pipeline step-by-step with oxo-call
- How built-in skills prevent common mistakes (
--readFilesCommand zcat,-pfor paired-end) - How to add remote documentation for richer LLM context
- How MultiQC integrates naturally into the oxo-call workflow
- Where to go next: the workflow engine for multi-sample automation
Next steps: - Workflow Builder tutorial — automate this for multiple samples - featureCounts command reference — full options - Skill System reference — how skills improve accuracy