RNA-seq Analysis Walkthrough#

This tutorial walks through a complete bulk RNA-seq analysis pipeline — from raw FASTQ reads to a count matrix — using oxo-call to generate every command. This mirrors a real-world analysis workflow.

Time to complete: 30–45 minutes (reading) + compute time Prerequisites: oxo-call configured, tools installed: fastp, STAR, featureCounts (or salmon) You will learn: end-to-end pipeline construction, workflow integration, skill-driven accuracy

RNA-seq Pipeline Overview#

Raw FASTQ reads
       │
  ▼ Quality Control (fastp)
Trimmed FASTQ + QC report
       │
  ▼ Alignment (STAR)
Aligned BAM (coordinate-sorted)
       │
  ▼ QC Aggregation (MultiQC)
Combined QC report
       │
  ▼ Quantification (featureCounts)
Count matrix (gene × sample)

We will use oxo-call at each step. The commands shown use example filenames — adapt them to your data.

Setup#

Sample data assumptions#

This tutorial assumes:

Paired-end RNA-seq data: sample1_R1.fastq.gz, sample1_R2.fastq.gz
STAR genome index at: /data/star_hg38/
GTF annotation at: /data/gencode.v44.gtf

Verify tools and skills#

# Check tools are installed
fastp --version
STAR --version
featureCounts -v

# Check built-in skills
oxo-call skill list | grep -E "fastp|star|featurecounts"
# fastp        ✓ built-in
# star         ✓ built-in
# featurecounts ✓ built-in

Add remote documentation (optional but recommended)#

oxo-call docs add fastp --url https://github.com/OpenGene/fastp#usage
oxo-call docs add star --url https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf

Step 1: Quality Control with fastp#

fastp trims adapters, removes low-quality reads, and generates a QC report in one step.

Dry-run first#

oxo-call dry-run fastp \
  "trim adapters from paired-end reads sample1_R1.fastq.gz and sample1_R2.fastq.gz, \
   output trimmed reads to trimmed/sample1_R1.fastq.gz and trimmed/sample1_R2.fastq.gz, \
   use 4 threads, generate HTML report at qc/sample1.html and JSON at qc/sample1.json"

Expected:

Command: fastp \
  --in1 sample1_R1.fastq.gz --in2 sample1_R2.fastq.gz \
  --out1 trimmed/sample1_R1.fastq.gz --out2 trimmed/sample1_R2.fastq.gz \
  --thread 4 \
  --html qc/sample1.html --json qc/sample1.json
Explanation: fastp auto-detects adapters; --thread 4 for parallelism; HTML/JSON for MultiQC compatibility.

Execute#

mkdir -p trimmed qc
oxo-call run fastp \
  "trim adapters from paired-end reads sample1_R1.fastq.gz and sample1_R2.fastq.gz, \
   output trimmed reads to trimmed/sample1_R1.fastq.gz and trimmed/sample1_R2.fastq.gz, \
   use 4 threads, generate HTML report at qc/sample1.html and JSON at qc/sample1.json"

Why no --adapter_sequence?
The fastp skill includes the concept: "fastp auto-detects adapters — do not hardcode adapter sequences unless they are being missed." This is a common mistake when migrating from trimmomatic.

Step 2: Alignment with STAR#

STAR is the gold-standard splice-aware aligner for RNA-seq. It requires a pre-built genome index.

Build the index (if you have not already)#

oxo-call dry-run STAR \
  "build a genome index for hg38, genome fasta at /data/hg38.fa, \
   GTF at /data/gencode.v44.gtf, output to /data/star_hg38, use 8 threads, \
   sjdbOverhang 100 (read length minus 1)"

Expected:

Command: STAR \
  --runMode genomeGenerate \
  --genomeDir /data/star_hg38 \
  --genomeFastaFiles /data/hg38.fa \
  --sjdbGTFfile /data/gencode.v44.gtf \
  --sjdbOverhang 100 \
  --runThreadN 8

Align trimmed reads#

oxo-call dry-run STAR \
  "align trimmed paired-end reads trimmed/sample1_R1.fastq.gz and trimmed/sample1_R2.fastq.gz \
   to genome index at /data/star_hg38, output to aligned/sample1/, \
   sort BAM by coordinate, use 8 threads, generate a BAM file"

Expected:

Command: STAR \
  --genomeDir /data/star_hg38 \
  --readFilesIn trimmed/sample1_R1.fastq.gz trimmed/sample1_R2.fastq.gz \
  --readFilesCommand zcat \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix aligned/sample1/ \
  --runThreadN 8
Explanation: --readFilesCommand zcat handles .gz files; SortedByCoordinate produces a ready-to-use BAM.

The --readFilesCommand zcat pitfall:
Forgetting this flag when using .gz input is one of the most common STAR mistakes. The STAR skill includes it as a pitfall: "Always add --readFilesCommand zcat for compressed FASTQ."

Execute:

mkdir -p aligned/sample1
oxo-call run STAR \
  "align trimmed paired-end reads trimmed/sample1_R1.fastq.gz and trimmed/sample1_R2.fastq.gz \
   to genome index at /data/star_hg38, output to aligned/sample1/, \
   sort BAM by coordinate, use 8 threads, generate a BAM file"

STAR Two-Pass Mode:
For novel splice junction discovery (e.g., tumor RNA-seq, rare transcripts), consider using STAR's two-pass mode. In the first pass, STAR discovers splice junctions; in the second pass, it re-maps reads using the discovered junctions for improved sensitivity. Add --twopassMode Basic to your alignment task description. Note that two-pass mode increases runtime and memory usage. For standard differential expression analyses with well-annotated genomes (e.g., human, mouse), one-pass mode with a comprehensive GTF annotation is usually sufficient.

Index the BAM#

oxo-call run samtools "index aligned/sample1/Aligned.sortedByCoord.out.bam"

Step 3: Aggregate QC with MultiQC#

MultiQC reads all QC outputs and creates a single interactive report. It should be run after all samples are processed.

oxo-call dry-run multiqc \
  "aggregate QC reports from qc/ and STAR log files from aligned/ into a single report at results/multiqc/"

Expected:

Command: multiqc qc/ aligned/ -o results/multiqc/
Explanation: multiqc automatically discovers fastp JSON reports and STAR Log.final.out files in the specified directories.

mkdir -p results/multiqc
oxo-call run multiqc \
  "aggregate QC reports from qc/ and STAR log files from aligned/ into a single report at results/multiqc/"

Open results/multiqc/multiqc_report.html in a browser to review:

Per-sample read quality before/after trimming
Alignment rates
Duplication rates
Any samples that need attention

Step 4: Quantification with featureCounts#

featureCounts counts reads per gene. It is part of the subread package.

oxo-call dry-run featureCounts \
  "count reads per gene in aligned/sample1/Aligned.sortedByCoord.out.bam \
   using GTF at /data/gencode.v44.gtf, for paired-end data, \
   use 4 threads, output to results/counts.txt"

Expected:

Command: featureCounts \
  -a /data/gencode.v44.gtf \
  -o results/counts.txt \
  -p \
  -T 4 \
  aligned/sample1/Aligned.sortedByCoord.out.bam
Explanation: -p specifies paired-end mode; -a is the GTF annotation; -T sets threads.

Using Salmon for quantification instead?
featureCounts works on aligned BAMs (alignment-based). Salmon uses pseudo-alignment directly on FASTQ files (alignment-free) and is faster for many use cases. Try:
oxo-call dry-run salmon "quantify sample1_R1.fastq.gz and sample1_R2.fastq.gz against index at /data/salmon_hg38_index, output to quant/sample1/"

Execute:

mkdir -p results
oxo-call run featureCounts \
  "count reads per gene in aligned/sample1/Aligned.sortedByCoord.out.bam \
   using GTF at /data/gencode.v44.gtf, for paired-end data, \
   use 4 threads, output to results/counts.txt"

Step 5: Running Multiple Samples#

For a real experiment you will have many samples. Instead of repeating commands manually, use the workflow engine:

# View the built-in RNA-seq template
oxo-call workflow show rnaseq

# Dry-run the complete pipeline
oxo-call workflow dry-run rnaseq

To customize for your samples and paths, see the Workflow Builder tutorial.

Checking the Full Run History#

oxo-call history list | head -20

This shows every command in the order it was executed, with exit codes, timestamps, and provenance metadata.

What You Learned#

How to run a complete RNA-seq pipeline step-by-step with oxo-call
How built-in skills prevent common mistakes (--readFilesCommand zcat, -p for paired-end)
How to add remote documentation for richer LLM context
How MultiQC integrates naturally into the oxo-call workflow
Where to go next: the workflow engine for multi-sample automation

Next steps: - Workflow Builder tutorial — automate this for multiple samples - featureCounts command reference — full options - Skill System reference — how skills improve accuracy