Skip to content

LLM Integration#

LLM Prompt Architecture and Grounding Strategy

Overview#

oxo-call supports multiple LLM providers for command generation:

Provider Default Model Token Required
GitHub Copilot gpt-5-mini (⭐ free tier) Yes (GitHub App token via oxo-call config login)
OpenAI gpt-4.1 (1M context) Yes
Anthropic claude-sonnet-4-6-20250514 (1M context) Yes
Ollama llama3.2 No (local)
DeepSeek deepseek-chat (128K context) Yes (OpenAI-compatible API)
Moonshot AI (Kimi) moonshot-v1-128k Yes (OpenAI-compatible API)
ZhipuAI (GLM) glm-4 (128K context) Yes (OpenAI-compatible API)
MiniMax minimax-chat (1M context) Yes (OpenAI-compatible API)

LLM Roles#

oxo-call uses the LLM in up to three distinct roles per invocation:

Role Trigger System Prompt
Command generation (always) Every run / dry-run Expert bioinformatics command generator
Task optimization (automatic) Pre-generation step Expand and clarify the user's task description
Result verification (--verify) Post-execution step Expert bioinformatics QC analyst

Each role uses a separate system prompt so the LLM behaves appropriately for the job.

Command Generation Prompt#

System Prompt#

The command generation system prompt uses 10 concise rules that are optimised for reliability across all model sizes. Three variants exist (see Adaptive Prompt Compression below); this section describes the Full variant used for large models.

Format Rule

  1. Respond with EXACTLY two labeled lines: ARGS: and EXPLANATION:. No other text.

Invocation Rules (2–5)

  1. NEVER start ARGS with the tool name (auto-prepended by system).
  2. First token = subcommand (sort, view, mem, index, etc), NEVER a flag.
  3. Companion binaries (e.g. bowtie2-build) or scripts (e.g. bbduk.sh) go as first token when skill docs say so.
  4. Multi-step: join with &&. Tool name auto-prepended ONLY to first segment — later commands MUST include their full binary name.

Accuracy Rules (6–7)

  1. Use ONLY flags from docs or skill examples — never invent flags.
  2. Include every file/path from the task. Prefer skill example flags. Include threads (-@/-t/--threads) and output (-o) when applicable.

Convention Rules (8–10)

  1. Default conventions: paired-end, coordinate-sorted BAM, hg38, gzipped FASTQ, Phred+33.
  2. Match format flags to actual types (BAM/SAM/CRAM, gzipped/plain, paired/single, FASTA/FASTQ).
  3. If no arguments needed: ARGS: (none).

Response Format#

The LLM must respond with exactly two labeled lines:

ARGS: <generated arguments>
EXPLANATION: <brief explanation of why these arguments were chosen>

If the response doesn't match this format, oxo-call retries the request.

Raw Prompt Example#

Below is an example of the complete user prompt sent to the LLM for a samtools sort task using the Full tier. The system prompt (shown above) is sent separately; the user prompt focuses on context and task.

# Tool: `samtools`

## Expert Knowledge (from skill)

### Key Concepts
- BAM files MUST be coordinate-sorted before indexing with samtools index
- Use -@ to set additional threads for parallel processing
- samtools view -F 0x904 filters out unmapped, secondary, and supplementary reads

### Common Pitfalls
- Forgetting to index after sorting — samtools index requires a coordinate-sorted BAM
- Using -q without -b — quality filtering without BAM output produces SAM to stdout
- Not specifying -o — output goes to stdout by default, which can corrupt terminal

### Worked Examples
Task: sort a BAM file by coordinate
Args: sort -o sorted.bam input.bam
Explanation: coordinate sort is the default; -o specifies output file

Task: index a sorted BAM file
Args: index sorted.bam
Explanation: creates .bai index required for random access

## Tool Documentation
<captured --help output and cached documentation>

## Task
sort input.bam by coordinate and output to sorted.bam

## Output
ARGS: <subcommand then flags, NO tool name>
EXPLANATION: <brief>

For the Compact tier (used with ≤3B models), the prompt uses a few-shot format:

Tool: samtools

Task: Sort a BAM file by coordinate

---FEW-SHOT---

ARGS: sort -@ 4 -o sorted.bam input.bam
EXPLANATION: Sort BAM by coordinate with 4 threads.

---FEW-SHOT---

Tool: samtools
Task: sort bam by coordinate

Use --verbose mode to see the actual prompt for any command:

oxo-call dry-run --verbose samtools "sort input.bam by coordinate"

Automatic Task Normalization#

oxo-call uses a doc-enriched prompting strategy that works in a single LLM call by default:

  1. Structured Doc Extraction (deterministic, no LLM): When documentation is fetched, oxo-call extracts a structured FlagEntry catalog and concrete command examples from the help text. This is injected directly into the prompt.

  2. Default (Fast) mode: A single LLM call with the doc-enriched prompt. The flag catalog prevents hallucinated flags and doc-extracted examples serve as few-shot demonstrations — critical for small models (≤3B).

  3. Quality mode (via --scenario full): Multi-stage pipeline with optional task normalization, mini-skill generation, and doc cleaning. Activated only when explicitly requested or when the orchestrator determines high complexity and no skill is available. When a skill is available, the orchestrator always selects Fast mode since the skill already provides the grounding that Quality mode would generate.

When Quality mode is active:

  • Task standardization and mini-skill generation run concurrently via tokio::join! when both are needed, reducing wall-clock latency by up to 50%.
  • Mini-skill cache is keyed by (tool, doc_hash) rather than (tool, task, doc_hash), so the second invocation for the same tool is always a cache hit regardless of the user's task description.
  • Task standardization only triggers when the task is shorter than 10 characters, contains vague keywords (e.g., "just", "simply"), or contains non-ASCII characters.

When the task is normalized, the enriched version:

  • Expands ambiguous terms into specific operations (e.g., "sort bam" → "sort BAM file input.bam by genomic coordinate and write to sorted.bam")
  • Infers bioinformatics defaults (paired-end reads, hg38, 8 threads, gzipped output, Phred+33 encoding)
  • Specifies output file names when omitted (derived from input names)
  • Preserves all file names, paths, and sample identifiers from the original task
  • Responds in the same language as the original task

Doc-Enriched Prompt Architecture#

The key innovation for doc-only accuracy: DocProcessor deterministically extracts structured knowledge from --help output and injects it into the prompt.

Flag Catalog Extraction

# From: samtools sort --help
OPTIONS:
  -o FILE    Write final output to FILE
  -@ INT     Number of additional threads
  -n         Sort by read name

# Extracted FlagEntry catalog:
-o FILE → "Write final output to FILE"
-@ INT  → "Number of additional threads"
-n      → "Sort by read name"

This catalog is injected into the prompt as "Valid Flags" — the LLM learns which flags actually exist and avoids hallucinating non-existent ones.

Doc-Extracted Examples

Command-line examples from the EXAMPLES section (lines with $ prompts, shell pipes, or flag patterns) are extracted and used as few-shot demonstrations:

# Extracted from help text:
$ samtools sort -o sorted.bam input.bam
$ samtools sort -@ 4 -o out.bam in.bam

For Compact/Medium prompt tiers, these are injected as ---FEW-SHOT--- user/assistant pairs, which is critical for ≤3B models that learn better from examples than from rules.

Quality Score

A deterministic 0.0–1.0 score computed from doc completeness:

Component Weight
USAGE section present 0.25
EXAMPLES section present 0.25
Extracted command examples (up to 5) 0.25
Flag catalog entries (up to 10) 0.15
Subcommands present 0.05
Quick reference flags 0.05

Result Verification (--verify)#

When --verify is set on run or workflow run, an extra LLM call is made after execution. The LLM acts as a bioinformatics QC analyst and analyses:

  • The exit code (with awareness that some tools use non-zero for warnings, exit 137 = OOM, exit 139 = segfault)
  • Error signals in stderr (ERROR, FATAL, Exception, Traceback, Segmentation fault, OOM, Permission denied, etc.)
  • Declared output files — their existence and sizes (zero-byte = suspicious)
  • Tool-specific patterns (e.g., samtools truncated-BAM warnings, STAR alignment rate, GATK exceptions, BWA reference errors)
  • Distinguishes fatal failures from harmless noise (progress bars, INFO/NOTE messages, version banners)

The structured response includes:

  • STATUS: success | warning | failure
  • SUMMARY: a one-sentence verdict in the same language as the task
  • ISSUES: a list of detected problems (empty when clean)
  • SUGGESTIONS: actionable fixes

Verification is advisory — it never changes the process exit code. In JSON mode (--json), a verification block is appended to the output.

Provider Configuration#

See the Configuration tutorial for setup instructions.

Grounding Strategy#

oxo-call uses a "docs-first" grounding strategy with three layers:

  1. Doc-extracted grounding (deterministic): Flag catalog and command examples extracted from --help output are injected into every prompt. This prevents flag hallucination even without skill files.
  2. Skill grounding (when available): Expert-authored concepts, pitfalls, and examples from skill files provide deeper domain knowledge.
  3. Self-learning cache: Successful commands are cached by tool+task+doc hash, becoming implicit few-shot examples for future similar tasks.

This layered approach means:

  • With skill file: Best accuracy — skill examples + doc grounding + cache
  • Without skill file: High accuracy — doc-extracted examples + flag catalog + cache
  • Without docs: Degraded but functional — model knowledge + cache only

The doc-enriched prompt is the key innovation for achieving reliable results with small models (≤3B parameters) using only --help output.

Adaptive Prompt Compression#

When llm.context_window is configured (or auto-detected from the model name), oxo-call automatically compresses prompts to fit the model's context budget. Three tiers are used, each with a purpose-built system prompt and user prompt strategy:

Tier Context Window System Prompt User Prompt Strategy Target Models
Full ≥ 16k or unknown 10 rules (~450 tokens) Skill → Docs → Task → concise Output 7B+, ≥16K context
Medium 4k – 16k Medium-specific (~120 tokens) Skill(5 examples) → truncated Docs → Task → Output 4–7B, 4K–16K context
Compact ≤ 4k Concrete example + 3 rules (~80 tokens) Few-shot(2 examples or fallback) → optional Docs → Task ≤3B, any context

Tier Design Philosophy#

Full — For models that can effectively use all available context. The system prompt contains 10 comprehensive rules; the user prompt injects full skill knowledge and complete documentation before the task.

Medium — For mid-range models with limited but usable context. Uses a dedicated, shorter system prompt. Documentation is truncated to fit the remaining budget after skill examples (up to 5, task-relevant selection) are included. Docs are placed after skill but before task, so the model focuses on expert knowledge first.

Compact — For small models (≤3B) that suffer from context overflow. Key design decisions:

  1. Few-shot > instructions: Small models imitate better than they follow rules. The ---FEW-SHOT--- markers create user/assistant/user turns that demonstrate the exact output format.
  2. Doc-extracted examples as few-shot: When no skill is loaded but documentation is available, command examples extracted from the help text are used as few-shot demonstrations. This grounds the model in the tool's actual flag format without needing a skill file.
  3. Concrete examples > abstract placeholders: The system prompt uses ARGS: sort -@ 4 -o out.bam in.bam instead of ARGS: <subcommand then flags>, because some models (e.g., starcoder2) would literally output the placeholder text.
  4. Flag catalog injection: Even in Compact tier, a brief "Valid flags" line is included to prevent flag hallucination.
  5. No format template in the final user message: Including Output:\nARGS: sort... causes some models to output empty — they interpret the template as the answer already being provided.
  6. Selective documentation injection: When no skill examples are available, a heavily truncated doc section is injected as the only grounding source.

Auto-Detection#

The context window is inferred from common model name patterns:

Model Name Pattern Detected Context Tier
qwen2.5-coder:0.5b, phi-3:3b 2,048 Compact
llama3:8b, deepseek-coder-v2:16b 8,192 Medium
qwen2.5:72b, llama3:70b 32,768 Full
gpt-4.1, gpt-4.1-mini, gpt-4.1-nano 1,047,576 (~1M) Full
gpt-4o, gpt-5-mini, gpt-4o-mini 128,000 Full
claude-opus-4-6, claude-opus-4-7, claude-sonnet-4-6 1,000,000 Full
claude-3-5-sonnet, claude-4 200,000 Full
gemini-2.5, gemini-3 1,000,000 Full
deepseek-v3, deepseek-r1, deepseek-v2 131,072 (128K) Full
deepseek-coder 16,384 Medium
moonshot-v1-8k 8,000 Medium
moonshot-v1-32k 32,768 Full
moonshot-v1-128k, kimi-* 128,000 Full
kimi-k2.5 256,000 Full
glm-4, glm-4-flash, chatglm-* 128,000 Full
glm-4-long, glm-5.1 1,000,000+ Full
minimax-m2.7 1,000,000 Full

Manual Configuration#

Override auto-detection via config.toml or environment variables:

[llm]
context_window = 4096   # force Medium tier
prompt_tier = "compact"  # force Compact tier regardless of context_window

Or per-invocation:

# Force a specific tier
oxo-call config set llm.prompt_tier compact    # ≤3B models
oxo-call config set llm.prompt_tier auto       # auto-detect (default)

# Override context window
oxo-call config set llm.context_window 2048    # force Compact

# Per-invocation via environment
OXO_CALL_LLM_PROMPT_TIER=compact oxo-call dry-run samtools "sort bam"

Design Rationale#

Mini models (≤ 3B parameters) suffer from context overflow — when the prompt exceeds their effective context, the output quality degrades sharply (empty output, format violations, hallucinated flags). The Compact tier addresses this by:

  1. Reducing system prompt from ~1,600 characters (16 rules) to ~200 characters (concrete example + 3 rules)
  2. Using few-shot instead of format instructions — small models imitate better than they follow abstract rules
  3. Limiting to 2 most-relevant examples as few-shot assistant messages
  4. Omitting raw documentation unless no skill examples are available
  5. Avoiding format templates in the user message that confuse small models

Small Model Performance#

After the 3-tier prompt system redesign, small model accuracy improved dramatically:

Model Parameters Before After
qwen2.5-coder 0.5B ~0% 83–100%
deepseek-coder 1.3B ~20% 75–100%
llama3.2 3B ~0% 100%
starcoder2 3B ~0% 91%
ministral 3B ~0% 100%

"Before" = original Full-tier prompt on all models. "After" = automatic tier selection with the redesigned prompt system.

LLM Response Caching#

oxo-call can cache LLM responses to avoid redundant API calls for similar or identical prompts. This is particularly useful for:

  • Repeated tasks with the same tool and similar descriptions
  • Development and testing workflows
  • Cost optimization when using paid LLM APIs

How It Works#

The cache uses a semantic hash computed from:

  • Tool name
  • Task description (normalized)
  • Documentation hash
  • Skill name (if used)
  • Model identifier

When cache is enabled and a matching entry exists, the cached response is returned immediately without an LLM API call.

Configuration#

Cache is disabled by default for independent benchmarking. Enable it via:

# Enable caching
oxo-call config set llm.cache_enabled true

# Check cache status
oxo-call config get llm.cache_enabled

Cache Behavior#

Setting Behavior
llm.cache_enabled = true Cache hits return cached responses; misses are stored after LLM call
llm.cache_enabled = false All requests go to LLM (default)
--no-cache flag Bypasses cache for this invocation (fetches fresh docs, ignores cached LLM response)

Cache Storage#

  • Location: ~/.local/share/oxo-call/llm_cache.jsonl
  • Format: JSONL (one JSON object per line)
  • Expiration: 7 days
  • Metadata: Each entry stores tool, task, args, explanation, model, timestamp, and hit count

Cache Priority#

When cache is enabled:

  1. Cache hit (exact semantic match) → return cached response
  2. User preferences from command history → inform the LLM prompt
  3. Fresh LLM call → generate and cache new response

Streaming Response (SSE)#

By default, oxo-call uses Server-Sent Events (SSE) streaming for all LLM API calls. This means tokens are printed to stderr as they arrive, reducing perceived latency for the user — especially for large models and complex tasks.

How it works#

  1. The HTTP request is sent with "stream": true.
  2. The API returns a stream of data: lines (SSE protocol).
  3. Each chunk is parsed for content deltas and printed to stderr immediately.
  4. After the stream completes, the full collected response is parsed as usual (e.g., ARGS: / EXPLANATION: extraction).

Because streaming output goes to stderr, stdout remains clean — JSON output, piped commands, and scripting work exactly as before.

Disabling streaming#

Streaming can be disabled in two ways:

Per-invocation — pass --no-stream to any LLM-backed command:

oxo-call run --no-stream samtools "sort bam by coordinate"
oxo-call dry-run --no-stream bwa "align reads"
oxo-call chat --no-stream samtools "how to sort"
oxo-call workflow generate --no-stream "RNA-seq pipeline"
oxo-call server run --no-stream mycluster samtools "sort bam"

Globally — set the llm.stream config key:

# Disable streaming globally
oxo-call config set llm.stream false

# Re-enable streaming
oxo-call config set llm.stream true

Performance considerations#

Streaming adds minimal overhead:

  • Network: SSE chunks are typically the same size as a non-streaming response; the total data transferred is identical.
  • CPU: Parsing individual JSON chunks is lightweight (a few microseconds per chunk).
  • Latency: First-token latency is significantly reduced because the user sees output as soon as the first token is generated, rather than waiting for the full response.

Disable streaming (--no-stream or llm.stream = false) when:

  • Running in non-interactive environments (CI, batch scripts) where stderr output is undesirable
  • Benchmarking LLM response times (streaming adds small per-chunk overhead that affects timing measurements)
  • Using providers that don't support SSE (rare — all major providers support it)