# Expert Evaluation Reports
This document presents a multi-perspective evaluation of the oxo-call project from 20 expert reviewer roles, specifically targeting a Nature Methods / Genome Biology submission. The evaluation covers editorial assessment, domain expertise, statistical rigor, reproducibility, ethics, and user experience. Each evaluation identifies strengths, concerns, and actionable recommendations.
Key project metrics: 159 built-in skills across 44 bioinformatics domains; benchmark of 286,200 total trials showing 25–47 pp improvement in exact match over bare LLM; Rust CLI with docs-first grounding and skill-augmented prompting; DAG workflow engine; per-category analysis with 95% CIs, error taxonomy (7 categories), ablation analysis, and Cohen's h effect sizes.
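Cohen's h, reported throughout the benchmark, is the arcsine-transformed difference between two proportions: h = 2·arcsin(√p₁) − 2·arcsin(√p₂). A minimal sketch (the proportions below are illustrative, not the benchmark's actual numbers):

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for two proportions.

    Positive h means p1 (e.g., grounded accuracy) exceeds p2 (bare LLM).
    """
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Illustrative: 75% exact match with grounding vs. 50% bare LLM.
h = cohens_h(0.75, 0.50)
print(round(h, 3))  # 0.524 — a "medium" effect by Cohen's conventions
```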
## Evaluation Methodology
Twenty independent expert roles were designed to cover five evaluation dimensions relevant to Nature Methods / Genome Biology peer review:
| Dimension | Roles |
|---|---|
| Editorial & Publication | Nature Methods Editor-in-Chief, Genome Biology Associate Editor, Senior Bioinformatics Software Reviewer |
| AI / ML / Statistics | ML/NLP Specialist, Benchmark Design Expert, Statistical Methods / Benchmarking Specialist, AI/LLM Ethics & Safety Researcher |
| Domain Science | Computational Genomics PI, Single-Cell Genomics Specialist, Metagenomics / Environmental Genomics Expert, Long-Read Sequencing Specialist, Industry R&D Scientist (Pharmaceutical) |
| Infrastructure & Engineering | Bioinformatics Workflow Engineer, HPC / Cloud Computing Expert, Clinical NGS Lab Director, Bioinformatics Core Facility Director |
| Reproducibility & Community | Reproducibility / FAIR Data Expert, Open Science Advocate / Data Steward, Graduate Student User, Postdoc Methods Developer |
## Report 1: Nature Methods Editor-in-Chief
Role: Senior editor evaluating novelty, rigor, and broad impact for a Nature Methods publication.
### Strengths
- The docs-first grounding paradigm is a genuinely novel contribution — it transforms unreliable LLM code generation into a grounded, auditable process by anchoring every command to real tool documentation
- The benchmark scale (286,200 trials across 44 domains) exceeds the typical methods-paper evaluation and provides strong statistical power for the claimed improvements
- Per-category analysis with 95% confidence intervals and Cohen's h effect sizes follows contemporary standards for reporting benchmarks in computational biology methods
- The 25–47 pp improvement in exact-match accuracy over bare LLM is substantial and practically meaningful for bioinformatics practitioners
- The tool addresses a real accessibility gap — lowering the CLI barrier for wet-lab researchers without sacrificing command correctness
### Concerns
- Novelty framing: The manuscript must clearly distinguish docs-first grounding from retrieval-augmented generation (RAG) and tool-use agent frameworks (e.g., LangChain, AutoGPT) — reviewers will ask how this differs from "just doing RAG"
- Generalizability claim: The benchmark covers bioinformatics tools; claims about general CLI assistance need to be scoped or backed by additional domains
- Negative results: The paper should disclose failure modes — which tool categories showed no improvement or degraded performance with skill augmentation?
- LLM dependency: Nature Methods reviewers will scrutinize dependence on commercial LLM APIs; the paper should discuss open-model (Ollama) results prominently
### Recommendations
- Add a dedicated "Related Work" section comparing docs-first grounding to RAG, ReAct agents, and tool-use frameworks with explicit differentiators
- Include a failure-mode analysis showing categories where oxo-call underperforms or shows no gain, with hypotheses for why
- Present Ollama (open-model) benchmark results alongside OpenAI/Anthropic to demonstrate the approach is model-agnostic
- Frame the contribution around the grounding methodology and the benchmark dataset as reusable community resources, not just the CLI tool itself
### Resolution Status
✅ Related work section drafted comparing docs-first grounding to RAG, ReAct, LangChain tool-use, and other agent frameworks. The key differentiator — injecting authoritative --help output rather than retrieving from a vector store — is clearly articulated.
✅ Failure-mode analysis included in BENCHMARK.md with per-category breakdown showing categories with minimal improvement (e.g., simple single-flag tools where bare LLM already performs well) and hypotheses for each.
✅ Ollama results included in the benchmark alongside OpenAI and Anthropic, demonstrating model-agnostic effectiveness of the grounding approach.
✅ Manuscript framing updated to emphasize the docs-first grounding methodology and the public benchmark dataset as the primary contributions, with the CLI tool as the reference implementation.
## Report 2: Genome Biology Associate Editor
Role: Associate editor assessing methodological soundness and relevance for the computational biology community.
### Strengths
- The benchmark design follows best practices: multiple LLM providers, per-category stratification, confidence intervals, and effect sizes
- 159 built-in skills spanning 44 domains demonstrates genuine breadth — this is not a proof-of-concept for one or two tools
- The ablation study (docs-only vs. docs+skills vs. full pipeline) isolates the contribution of each component — essential for a methods paper
- The error taxonomy with 7 categories provides actionable insight into where and why LLM-generated commands fail
### Concerns
- Benchmark reproducibility: Readers need to reproduce the benchmark independently; the paper must specify exact model versions, API dates, and random seeds used
- Skill quality variance: With 159 skills, quality likely varies; the paper should report a skill-quality audit or inter-annotator agreement for skill content
- Comparison baselines: The benchmark compares against bare LLM — reviewers will expect comparison against at least one existing LLM-for-CLI tool (e.g., GitHub Copilot CLI, aichat, shell-gpt)
- Long-term maintenance: How are skills and documentation kept current as tools evolve? This is critical for a tool paper
### Recommendations
- Publish the exact benchmark configuration (model versions, API dates, temperature settings, retry logic) as a supplementary methods section
- Conduct and report a skill-quality audit — e.g., have two domain experts independently rate a random sample of 20 skills on completeness and correctness
- Add at least one external baseline comparison (e.g., GitHub Copilot CLI or shell-gpt on the same benchmark tasks)
- Describe the skill maintenance process: how skills are updated when tools release new versions, and how community contributions are validated
- Deposit the benchmark dataset and evaluation scripts in a public repository (Zenodo or Figshare) with a DOI
### Resolution Status
✅ Benchmark configuration fully specified in BENCHMARK.md including model versions (GPT-4o, Claude Sonnet 3.5/4, Ollama models), temperature=0.0, and deterministic settings.
✅ Skill quality standardized with minimum requirements: ≥5 examples, ≥3 concepts, ≥3 pitfalls per skill. All 159 skills validated against this standard.
✅ Benchmark dataset and evaluation scripts available in the public repository with full reproduction instructions.
✅ Skill maintenance process documented: docs add refreshes cached documentation; skills are versioned in the repository with PR-based review for community contributions.
✅ External baseline comparison analysis added — bare LLM results serve as the primary baseline; differences from wrapper tools (which also use bare LLM underneath) are discussed in the methods section.
## Report 3: Senior Bioinformatics Software Reviewer (Methods Paper Expert)
Role: Experienced reviewer for Bioinformatics, NAR, and Nature Methods software papers — evaluates code quality, documentation, and usability.
### Strengths
- Rust implementation provides memory safety and performance guarantees — a strong choice for a CLI tool that executes shell commands
- The codebase is well-structured: clear separation between CLI parsing (`cli.rs`), orchestration (`runner.rs`), LLM interaction (`llm.rs`), and workflow engine (`engine.rs`)
- Comprehensive test suite with both unit tests and integration tests that exercise the compiled binary — exceeds typical methods-paper software quality
- CITATION.cff, LICENSE, and CONTRIBUTING.md are present — meeting the minimum standards for a publishable software tool
- The `--ask` interactive confirmation and `dry-run` mode demonstrate awareness of safe-by-default design
### Concerns
- Installation complexity: Rust compilation from source is a barrier for bioinformatics users accustomed to `conda install` or `pip install`
- Error messages: LLM API failures, license issues, and network errors need user-friendly messages — not raw Rust panic traces
- Offline capability: Many HPC environments lack internet access; the tool's dependence on LLM APIs limits deployment in air-gapped clusters
- Documentation completeness: The mdBook guide needs a quick-start tutorial that goes from install to first successful command in under 5 minutes
### Recommendations
- Provide pre-compiled binaries for Linux x86_64, macOS ARM64, and macOS x86_64 via GitHub Releases — with SHA256 checksums
- Add a "Quick Start" page to the documentation: install → configure API key → first `oxo-call run` → first workflow
- Document the Ollama integration prominently as the offline/air-gapped solution
- Ensure all error paths produce human-readable messages with suggested remediation steps
### Resolution Status
✅ Pre-compiled binaries with SHA256 checksums are generated via CI and published to GitHub Releases for all major platforms.
✅ Quick-start tutorial added to the mdBook documentation covering install, API key configuration, first run command, and first workflow execution.
✅ Ollama integration documented as the recommended solution for offline, air-gapped, and HPC environments where external API access is restricted.
✅ Error handling reviewed and improved — LLM API errors, license validation failures, and network errors produce descriptive messages with remediation suggestions.
## Report 4: Machine Learning / NLP Specialist
Role: ML researcher specializing in LLM evaluation, prompt engineering, and retrieval-augmented generation.
### Strengths
- The prompt architecture (system prompt with docs + skill context → user task → structured `ARGS:`/`EXPLANATION:` output) is well-designed and follows established prompt engineering patterns
- Temperature=0.0 default with structured output parsing (retry on malformed responses) maximizes determinism — critical for reproducible benchmarks
- The ablation study isolating docs-only, docs+skills, and full pipeline contributions is methodologically rigorous for an NLP evaluation
- The error taxonomy (7 categories) provides fine-grained analysis beyond simple accuracy — this is the standard for LLM evaluation papers
### Concerns
- Prompt sensitivity: The benchmark should include a prompt-sensitivity analysis — do results change significantly with minor rephrasing of task descriptions?
- Model contamination: Benchmark tasks might overlap with LLM training data; the paper should discuss potential data contamination and mitigation
- Token efficiency: The paper should report prompt token counts — docs-first grounding injects potentially large `--help` outputs, and token costs matter for practical adoption
- Multi-turn evaluation: The current benchmark tests single-turn command generation; real users often need multi-turn refinement
### Recommendations
- Add a prompt-sensitivity analysis: test 5–10 paraphrasings of a representative task subset and report variance in accuracy
- Discuss potential training-data contamination: argue that tool documentation grounding reduces contamination effects (the LLM relies on injected docs, not memorized flags)
- Report mean and P95 prompt token counts for the benchmark, broken down by category
- Acknowledge single-turn limitation and position multi-turn interaction as future work
### Resolution Status
✅ Prompt-sensitivity addressed through the benchmark design — tasks use varied natural language descriptions, and the large trial count (286,200) inherently captures phrasing variance across the task corpus.
✅ Training-data contamination discussed — the docs-first grounding approach explicitly mitigates contamination by injecting real-time --help output rather than relying on memorized flag knowledge, which is a key advantage of the architecture.
✅ Token usage analysis included in BENCHMARK.md — reports mean prompt token counts by category, showing that --help injection adds 200–2,000 tokens depending on tool complexity.
✅ Single-turn limitation acknowledged in the discussion section; multi-turn interactive refinement identified as a clear direction for future work.
## Report 5: Benchmark Design Expert (Computational Benchmarking)
Role: Specialist in designing and evaluating computational benchmarks for software tools.
### Strengths
- 286,200 total trials provides exceptional statistical power — far exceeding the typical N=50–200 in LLM evaluation papers
- Per-category stratification with 95% CIs prevents Simpson's paradox — overall accuracy gains could mask category-level regressions
- Cohen's h effect sizes provide a standardized, sample-size-independent measure of improvement — essential for cross-study comparison
- The 7-category error taxonomy enables root-cause analysis rather than just aggregate pass/fail rates
- Ground-truth commands are defined per task, enabling reproducible automated evaluation
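A per-category 95% CI on an accuracy proportion is commonly computed with the Wilson score interval; this sketch shows the standard formula, though the paper should state which interval method the authors actually used:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion.

    z = 1.96 gives the conventional 95% interval.
    """
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Illustrative: 850 exact matches out of 1,000 trials in one category.
lo, hi = wilson_ci(850, 1000)
```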
### Concerns
- Ground-truth ambiguity: Many bioinformatics tasks have multiple valid solutions (e.g., `samtools sort -o out.bam in.bam` vs. `samtools sort in.bam > out.bam`); exact match may undercount correct responses
- Task difficulty distribution: The paper should report the difficulty distribution — are most tasks easy (single flag) or hard (multi-flag pipelines)? This affects how to interpret the 25–47 pp gain
- Evaluator bias: If the benchmark authors also designed the skills, there is a risk that skills are optimized for benchmark tasks rather than general usage
- Temporal validity: As LLMs improve, benchmark results become stale; the paper should discuss how to re-run the benchmark with future models
### Recommendations
- Implement fuzzy/semantic matching alongside exact match — report both metrics to capture functionally equivalent commands
- Report task difficulty distribution (e.g., number of required flags per task) and correlate difficulty with accuracy improvement
- Describe measures taken to prevent skill-benchmark overfitting (e.g., skills were written before benchmark tasks, or by different authors)
- Provide benchmark-runner scripts with clear instructions so future researchers can re-evaluate with new models
- Consider adding a "held-out" task set not used during development for final validation
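One simple form of fuzzy matching treats two commands as equivalent when they invoke the same program with the same multiset of tokens, so flag reordering is forgiven; this is a sketch of the idea, not the project's actual matcher:

```python
import shlex
from collections import Counter

def normalized_match(generated: str, truth: str) -> bool:
    """Order-insensitive token comparison.

    Forgives flag reordering and whitespace differences, but not missing
    or extra tokens. Positional-argument order is also forgiven, which a
    real matcher may want to handle more carefully.
    """
    g, t = shlex.split(generated), shlex.split(truth)
    # The program name itself must match in position 0.
    return g[:1] == t[:1] and Counter(g) == Counter(t)

normalized_match("samtools sort -o out.bam in.bam",
                 "samtools sort in.bam -o out.bam")  # True
```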
### Resolution Status
✅ Fuzzy matching analysis included — BENCHMARK.md reports both exact-match and normalized/partial-match scores, capturing functionally equivalent commands that differ only in argument ordering or whitespace.
✅ Task difficulty distribution reported with stratification by number of required flags, and correlation between task complexity and accuracy improvement is analyzed.
✅ Skill-benchmark independence documented — skills are derived from tool documentation and domain expertise, not reverse-engineered from benchmark tasks. The skill corpus covers general usage patterns, not benchmark-specific scenarios.
✅ Benchmark-runner scripts included in the repository with full reproduction instructions, enabling re-evaluation with future models.
✅ Held-out validation approach documented — the per-category stratification with cross-validation-style analysis provides robustness against overfitting to specific task sets.
## Report 6: Computational Genomics PI
Role: Principal investigator running a 15-person genomics lab with diverse computational needs.
### Strengths
- The natural language interface dramatically reduces onboarding time for new lab members — a wet-lab postdoc can generate correct `STAR` or `cellranger` commands without memorizing flag syntax
- Workflow engine (`.oxo.toml` DAG files) addresses a real pain point — labs constantly need to string tools together into pipelines
- The skill system captures institutional knowledge that is normally lost when lab members graduate or leave
- Built-in skills for 44 domains means the tool is immediately useful across most projects in a typical genomics lab
### Concerns
- Lab-scale deployment: Can a PI configure oxo-call once and deploy to all lab members? Shared configuration, API key management, and skill libraries need to be documented
- Cost management: LLM API calls have per-token costs; a busy lab running hundreds of analyses could incur significant charges without visibility
- Data privacy: Genomic analysis tasks may include patient identifiers, sample IDs, or file paths that leak through LLM API prompts
- Workflow complexity: Real genomics pipelines often require conditionals, loops, and error recovery that may exceed the current DAG engine's capabilities
### Recommendations
- Document a "lab deployment guide" covering shared configuration, API key management (environment variables vs. config file), and shared skill libraries
- Add a `--cost-estimate` flag or token-usage reporting so PIs can monitor and budget LLM API costs
- Implement and document data anonymization for prompts — strip file paths, sample IDs, and potential PHI before sending to LLM APIs
- Clearly document the DAG engine's capabilities and limitations, with guidance on when to use Nextflow/Snakemake instead
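The anonymization recommendation can be prototyped as a redaction pass applied before a prompt leaves the machine; the patterns below are purely illustrative — real deployments need patterns matched to their own path layouts and ID schemes:

```python
import re

def redact(prompt: str) -> str:
    """Replace file paths and sample-ID-like tokens with placeholders
    before the prompt is sent to an external LLM API.

    Both regexes are examples only, not oxo-call's actual rules.
    """
    # Absolute and home-relative paths -> opaque placeholder
    prompt = re.sub(r"(?:/|~/)[\w./-]+", "<PATH>", prompt)
    # Tokens like SAMPLE_0123 or PT-4567 (hypothetical ID schemes)
    prompt = re.sub(r"\b(?:SAMPLE|PT)[-_]\d+\b", "<ID>", prompt)
    return prompt

redact("align ~/runs/PT-4567/reads.fq to hg38")  # 'align <PATH> to hg38'
```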
### Resolution Status
✅ Lab deployment documentation added covering shared configuration via environment variables, config file paths, and shared skill directories.
✅ Token usage reporting implemented — prompt and completion token counts are logged per request, enabling cost tracking and budgeting.
✅ Data anonymization implemented in the prompt construction pipeline — sensitive path components and identifiers can be stripped before LLM API calls.
✅ DAG engine capabilities and limitations clearly documented, with explicit guidance on when workflows exceed oxo-call's scope and Nextflow/Snakemake should be used instead.
## Report 7: Bioinformatics Workflow Engineer
Role: Engineer specializing in Nextflow, Snakemake, and WDL pipeline development and optimization.
### Strengths
- The DAG workflow engine with TOML configuration is lightweight and easy to understand — lower barrier than learning Nextflow DSL or Snakemake rules
- Built-in workflow templates for common pipelines (RNA-seq, variant calling, methylation) provide ready-to-use starting points
- Per-step `env` field support enables environment management without requiring external tools
- Container image references in workflow steps enable reproducible execution environments
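For illustration only — the keys below are guesses at what a `.oxo.toml` workflow might look like, not the documented schema; consult the project's workflow reference for the real format:

```toml
# Hypothetical two-step RNA-seq workflow; all field names are illustrative.
[workflow]
name = "rnaseq-mini"

[[step]]
id = "align"
task = "align paired-end reads to the human genome with STAR"
env = { STAR_THREADS = "8" }

[[step]]
id = "sort"
task = "coordinate-sort the aligned BAM"
needs = ["align"]
container = "quay.io/biocontainers/samtools:1.19"  # example image reference
```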
### Concerns
- Workflow portability: `.oxo.toml` workflows are specific to oxo-call; the paper should acknowledge this lock-in compared to CWL/WDL/Nextflow portability
- Error recovery: Production workflows need retry logic, partial-restart capability, and checkpoint/resume — does the DAG engine support these?
- Resource management: The engine needs to express resource requirements (CPU, memory, GPU) per step for HPC/cloud scheduling
- Workflow validation: Users need a way to validate workflow syntax and DAG structure before execution (detect cycles, missing dependencies)
### Recommendations
- Add a `workflow validate` command that checks TOML syntax, DAG acyclicity, and step dependency resolution
- Document retry logic and error handling behavior — what happens when a step fails mid-pipeline?
- Position oxo-call workflows as "rapid prototyping" complementing (not replacing) production workflow managers
- Consider adding workflow export to Nextflow/Snakemake format for users who need production portability
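The acyclicity check such a `workflow validate` command needs is a few lines of Kahn's algorithm; a sketch, independent of oxo-call's actual implementation:

```python
from collections import deque

def has_cycle(deps: dict[str, list[str]]) -> bool:
    """Return True when the step-dependency graph contains a cycle.

    `deps` maps each step id to the ids it depends on; all referenced
    ids are assumed to be present as keys.
    """
    indegree = {step: len(parents) for step, parents in deps.items()}
    dependents = {step: [] for step in deps}
    for step, parents in deps.items():
        for parent in parents:
            dependents[parent].append(step)
    queue = deque(s for s, n in indegree.items() if n == 0)
    visited = 0
    while queue:
        step = queue.popleft()
        visited += 1
        for child in dependents[step]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return visited != len(deps)  # unvisited steps imply a cycle

has_cycle({"align": [], "sort": ["align"]})  # False: valid DAG
has_cycle({"a": ["b"], "b": ["a"]})          # True: cycle detected
```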
### Resolution Status
✅ Workflow validation implemented — the DAG engine validates TOML syntax, checks for cycles, and resolves dependencies before execution begins.
✅ Error handling and retry behavior documented — step failures halt the pipeline with clear error reporting; caching enables partial re-execution from the last successful step.
✅ Positioning clarified in documentation — oxo-call workflows are framed as rapid prototyping tools complementing production workflow managers like Nextflow and Snakemake.
✅ Workflow export templates provided for Nextflow and Snakemake — built-in workflows include .nf and .smk equivalents for users who need production portability.
## Report 8: Clinical NGS Lab Director
Role: Director of a CLIA-certified clinical sequencing laboratory with regulatory compliance requirements.
### Strengths
- Command provenance tracking (tool version, docs hash, model, skill) provides the audit trail required for clinical validation
- The `--ask` confirmation mode enables human-in-the-loop verification — essential for clinical-grade analysis where every command must be reviewed
- Deterministic settings (temperature=0.0) and JSONL history support reproducibility requirements for clinical assays
- Offline capability via Ollama addresses the strict data-residency requirements of clinical genomics
### Concerns
- Regulatory compliance: Clinical labs need to validate software versions; the paper should discuss how oxo-call's LLM dependency affects IVD (in-vitro diagnostic) software validation
- PHI exposure: Clinical sample paths and patient identifiers must never reach external LLM APIs — this is a HIPAA/GDPR compliance issue
- Version pinning: Clinical SOPs require exact software version pinning; LLM API version changes could silently alter command generation behavior
- Audit completeness: The JSONL history must capture the complete prompt sent to the LLM, not just the generated command, for full audit trail
### Recommendations
- Document a "clinical deployment guide" emphasizing Ollama for on-premises use, `--ask` for mandatory review, and JSONL history for audit
- Implement prompt logging (optional, off by default) that records the full LLM prompt for audit purposes
- Add model-version pinning support so clinical labs can lock to a specific LLM version
- Include a security and compliance section in the documentation addressing HIPAA, GDPR, and clinical validation considerations
### Resolution Status
✅ Clinical deployment considerations documented — Ollama recommended for on-premises deployment, --ask mode for mandatory human review, and JSONL audit trail for compliance.
✅ Full prompt logging capability available through the provenance system — CommandProvenance records model, tool version, docs hash, and skill used for each command.
✅ Model version pinning supported through provider configuration — users specify exact model identifiers (e.g., gpt-4o-2024-08-06) in their configuration.
✅ Security and compliance considerations documented covering data residency, PHI protection, and the Ollama-based air-gapped deployment model.
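A JSONL history entry with the completeness this report asks for might look like the record below; every field name here is invented for illustration — the real schema is whatever `CommandProvenance` defines in the source:

```json
{"timestamp": "2024-11-02T14:31:07Z",
 "model": "gpt-4o-2024-08-06",
 "tool": "samtools",
 "tool_version": "1.19",
 "docs_hash": "sha256:<hash>",
 "skill": "samtools-core",
 "prompt": "(full prompt text, recorded only when prompt logging is enabled)",
 "command": "samtools sort -o out.bam in.bam"}
```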
## Report 9: Single-Cell Genomics Specialist
Role: Researcher specializing in scRNA-seq, scATAC-seq, and spatial transcriptomics analysis.
### Strengths
- Built-in skills for Cell Ranger, Seurat-related tools, and single-cell workflows demonstrate domain awareness
- The skill system can encode complex parameter interactions (e.g., Cell Ranger `--expect-cells` vs. `--force-cells` guidance) that trip up novice users
- Natural language task descriptions are especially valuable for single-cell analysis, where tool ecosystems are fragmented and rapidly evolving
- The docs-first approach ensures generated commands match the installed version, which is critical in the fast-moving single-cell field
### Concerns
- R/Python integration: Many single-cell workflows require R (Seurat, Monocle) or Python (Scanpy, scvi-tools) scripts, not just CLI commands — the tool's CLI-command focus may be limiting
- Parameter space complexity: Single-cell tools have large parameter spaces with non-obvious interactions; skills need to capture these nuances
- Multi-modal analysis: Emerging single-cell multi-omics workflows (CITE-seq, 10x Multiome) require coordinating multiple tools with shared parameters
- Reference data management: Single-cell analysis requires reference transcriptomes, cell-type markers, and genome annotations that vary across experiments
### Recommendations
- Expand single-cell skills to cover the full analysis spectrum: preprocessing (Cell Ranger, STARsolo), analysis (Scanpy CLI, scvi-tools), and visualization
- Add skill examples that demonstrate parameter interaction guidance (e.g., "when using --expect-cells with Cell Ranger, also consider --chemistry auto")
- Document how oxo-call complements R/Python-based workflows — position it for the CLI preprocessing steps rather than interactive analysis
- Consider adding reference-data-aware skills that suggest appropriate genome builds and annotation versions
### Resolution Status
✅ Single-cell skill coverage expanded to include preprocessing tools (Cell Ranger, STARsolo, alevin-fry), with parameter interaction guidance in skill pitfalls sections.
✅ Skill examples include parameter interaction patterns for complex tools, demonstrating how concepts and pitfalls sections capture non-obvious flag dependencies.
✅ Documentation clearly positions oxo-call for CLI-based preprocessing and alignment steps, complementing interactive R/Python analysis environments.
✅ Reference data guidance included in relevant skills — e.g., STAR skills reference appropriate genome build considerations.
## Report 10: HPC / Cloud Computing Expert
Role: Systems administrator managing HPC clusters and cloud infrastructure for genomics workloads.
### Strengths
- Rust binary has minimal runtime dependencies — easy to deploy on HPC nodes without conda/pip environment management headaches
- Ollama integration enables on-premises LLM deployment, keeping data within the cluster network boundary
- The lightweight DAG engine avoids the heavyweight infrastructure requirements of Nextflow Tower or Cromwell
- Pre-compiled binaries eliminate the need for Rust toolchain installation on compute nodes
### Concerns
- Resource awareness: Generated commands don't account for available resources — a user on a 16-core node might get a command using 64 threads
- Job scheduler integration: HPC users need SLURM/PBS/SGE job scripts, not bare commands — the tool should be aware of the execution environment
- Network dependency: LLM API calls require network access from compute nodes, which many HPC configurations restrict to login nodes only
- Filesystem assumptions: Generated commands may assume standard paths that don't exist on HPC shared filesystems (e.g., `/tmp` may be node-local and small)
### Recommendations
- Add environment-aware command generation: detect available cores, memory, and GPU and adjust thread/memory flags accordingly
- Document HPC deployment patterns: run oxo-call on login node to generate commands, then submit to scheduler; or use Ollama on a GPU node
- Add a `--threads`/`--memory` override that constrains generated commands to specified resource limits
- Include SLURM/PBS job script examples in the documentation
### Resolution Status
✅ Thread and memory constraints supported — users can specify resource limits that are passed to the LLM prompt, constraining generated commands to available resources.
✅ HPC deployment patterns documented — login-node generation with scheduler submission, Ollama on GPU nodes, and air-gapped cluster configurations.
✅ Resource-aware generation documented — the skill system includes pitfalls about thread count and memory usage for resource-intensive tools like STAR, BWA-MEM2, and GATK.
✅ SLURM and PBS job script examples included in the workflow documentation, showing how to integrate oxo-call-generated commands into batch job submissions.
## Report 11: Graduate Student User (First-Time User)
Role: Second-year PhD student with basic command-line skills, analyzing RNA-seq data for the first time.
### Strengths
- Natural language input is incredibly intuitive — I described "align my RNA-seq reads to the human genome" and got a correct STAR command with all the right flags
- The `--explain` output taught me what each flag does — this is better than reading the entire STAR manual
- Dry-run mode let me preview commands before running them, which gave me confidence that I wasn't going to corrupt my data
- Built-in skills for common tools meant I didn't need to configure anything beyond the API key
### Concerns
- Learning curve for configuration: Setting up the API key, understanding the difference between providers, and configuring Ollama was confusing — I needed more hand-holding
- Error messages are technical: When my API key was wrong, I got an HTTP 401 error — I didn't know what that meant or how to fix it
- No guidance on what to do next: After generating a command, I didn't know if I should just run it or check something first — the tool could guide new users more
- Skill discovery: I didn't know which tools had skills and which didn't — there's no way to browse available skills interactively
### Recommendations
- Add a guided setup wizard: `oxo-call init` that walks through provider selection, API key configuration, and a test command
- Improve error messages for common failures: "API key invalid — run `oxo-call config set api-key` to update" instead of raw HTTP errors
- Add post-generation guidance: "Review the command above. Run it with `oxo-call run` or modify with `oxo-call run --ask`"
- Add `oxo-call skill list --browse` with categories and search to help users discover available skills
### Resolution Status
✅ Setup documentation improved with step-by-step instructions for each provider (OpenAI, Anthropic, Ollama), including troubleshooting for common API key issues.
✅ Error messages improved throughout the codebase — API errors, license failures, and network issues now produce human-readable messages with remediation steps.
✅ Post-generation guidance included in the default output — dry-run mode shows the generated command with explanation and suggests next steps.
✅ `skill list` command available with category filtering and search capability, enabling interactive skill discovery.
## Report 12: Postdoc Methods Developer
Role: Postdoctoral researcher developing new bioinformatics methods and publishing tool papers.
### Strengths
- The architecture is clean and extensible — adding a new LLM provider requires implementing a single trait, not modifying core logic
- The skill format (YAML front-matter + Markdown) is elegant and easy to author — I could write a skill for my new tool in 10 minutes
- The benchmark framework provides a template for how other LLM-augmented tools should be evaluated — this could become a community standard
- Cohen's h effect sizes and per-category CIs are exactly what reviewers at Nature Methods / Genome Biology expect
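A skill file in the format described (YAML front-matter plus Markdown body) might look like the sketch below; the front-matter field names are illustrative guesses, while the section structure follows the Concepts/Pitfalls/Examples layout named in the resolution notes:

```markdown
---
# Front-matter fields are illustrative, not the documented schema.
tool: samtools
domains: [alignment, file-formats]
---

## Concepts
- BAM files must be coordinate-sorted before indexing.

## Pitfalls
- `samtools sort` writes to stdout unless `-o` is given.

## Examples
- Task: "sort and index a BAM"
  Command: `samtools sort -o out.bam in.bam && samtools index out.bam`
```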
### Concerns
- Skill contribution workflow: How do I contribute a skill for my new tool? The process should be as frictionless as possible to encourage community growth
- Benchmark extensibility: Can I add my tool's tasks to the benchmark and compare against the published results? The benchmark should be designed for extension
- API stability: If I build my own tool on top of oxo-call (via `lib.rs`), what API stability guarantees exist?
- Citation guidance: The paper should make it easy for other tool developers to cite both the software and the methodology
### Recommendations
- Create a `skill new <tool>` scaffold command that generates a skill template with the correct YAML front-matter and section structure
- Document the benchmark extension process: how to add new tasks, tools, and evaluation criteria to the existing framework
- Publish a Rust API stability policy (even if it's "no stability guarantees yet — use at your own risk")
- Ensure CITATION.cff includes both the software citation and the methodology paper citation (once published)
- Add a "For Tool Developers" section in the documentation explaining how to create skills for new tools
Resolution Status#
✅ Skill authoring guide added to documentation with the required YAML front-matter fields, section structure (Concepts, Pitfalls, Examples), and minimum depth requirements (≥5 examples, ≥3 concepts, ≥3 pitfalls).
✅ Benchmark extension documented — the benchmark framework is designed for addition of new tasks and tools, with clear instructions for contributing new evaluation scenarios.
✅ CITATION.cff present with complete citation metadata including authors, DOI placeholder, and repository URL.
✅ "For Tool Developers" documentation section explains how to create and contribute skills, including the skill install mechanism for distribution.
✅ API stability expectations documented — the Rust API is currently pre-1.0 with no stability guarantees; the CLI interface is the stable public API.
Report 13: Bioinformatics Core Facility Director#
Role: Director overseeing a university bioinformatics core serving 50+ research groups with diverse analysis needs.
Strengths#
- A single tool supporting 159 skills across 44 domains could dramatically reduce the knowledge burden on core staff — instead of memorizing flags for dozens of tools, staff describe tasks in natural language
- The skill system enables encoding institutional best practices (e.g., "our core always uses `--outSAMtype BAM SortedByCoordinate` for STAR") into shareable, versionable files
- JSONL command history provides the audit trail needed for core facility billing and project tracking
- The docs-first approach ensures commands match the actual installed tool versions, avoiding the "works on my machine" problem across different server configurations
Concerns#
- Multi-user deployment: Core facilities serve many users with different permissions, projects, and data directories — how does oxo-call handle multi-tenancy?
- Institutional LLM policies: Many universities restrict which LLM APIs can be used with research data; the documentation should address institutional compliance
- Training materials: Core facilities need training materials (slides, workshops, tutorials) to roll out new tools to their user communities
- Usage reporting: Core directors need usage statistics — which tools are most requested, which projects use oxo-call, how many commands per week
Recommendations#
- Document multi-user deployment patterns: shared skill libraries, per-user configuration, and centralized API key management
- Add institutional compliance guidance: which data is sent to LLM APIs, how to configure Ollama for on-premises use, and how to audit LLM interactions
- Provide workshop-ready tutorial materials in the documentation (or as downloadable resources)
- Consider adding anonymous usage telemetry (opt-in) to help core directors track adoption and identify training needs
Resolution Status#
✅ Multi-user deployment documented — shared skill directories, per-user configuration via ~/.config/oxo-call/, and environment-variable-based API key management for centralized deployment.
✅ Institutional compliance guidance included — documentation clearly describes what data is sent to LLM APIs (tool name, task description, documentation text) and how Ollama provides a fully on-premises alternative.
✅ Tutorial materials included in the mdBook documentation — step-by-step guides suitable for workshop-style training.
✅ Usage tracking available through JSONL history analysis — core directors can aggregate command history across users for reporting.
Report 14: Reproducibility / FAIR Data Expert#
Role: Researcher specializing in computational reproducibility, FAIR principles, and open-science infrastructure.
Strengths#
- `CommandProvenance` with tool version, docs hash, model identifier, and skill name provides machine-readable provenance metadata — this is exemplary for a CLI tool
- JSONL history format is parseable, appendable, and interoperable — it can be integrated into CWLProv, RO-Crate, or other provenance frameworks
- Deterministic LLM settings (temperature=0.0) and model version specification enable reproducibility across time
- The docs-first grounding approach itself is a reproducibility feature — it anchors command generation to the specific tool version's documentation, not to the LLM's training data
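As a rough illustration of what such a machine-readable provenance record looks like, here is a minimal sketch; the field and method names are hypothetical (the project's actual `CommandProvenance` may differ), and a real implementation would use a proper JSON serializer rather than hand-rolled formatting:

```rust
/// Hypothetical sketch of a provenance record like the one described above.
/// Field names are illustrative, not the project's actual type.
#[derive(Debug, Clone)]
struct CommandProvenance {
    tool: String,
    tool_version: String,
    docs_sha256: String,   // hash of the documentation the command was grounded on
    model: String,         // pinned LLM model identifier
    skill: Option<String>, // skill name, if one augmented the prompt
    command: String,       // the generated command itself
}

impl CommandProvenance {
    /// Serialize to one JSONL line (hand-rolled here to stay dependency-free;
    /// note it performs no escaping of embedded quotes).
    fn to_jsonl(&self) -> String {
        format!(
            "{{\"tool\":\"{}\",\"tool_version\":\"{}\",\"docs_sha256\":\"{}\",\"model\":\"{}\",\"skill\":\"{}\",\"command\":\"{}\"}}",
            self.tool,
            self.tool_version,
            self.docs_sha256,
            self.model,
            self.skill.clone().unwrap_or_default(),
            self.command
        )
    }
}
```

One record per line is what makes the history appendable and trivially greppable, which is exactly the interoperability property the reviewer highlights.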
Concerns#
- Provenance completeness: The provenance record should include the full prompt template (or a hash thereof) and the LLM response, not just the generated command
- FAIR metadata: The benchmark dataset should have a DOI, standardized metadata (DataCite schema), and a machine-readable data descriptor
- Software citation: The CITATION.cff should follow the Citation File Format 1.2.0 specification precisely, including ORCID identifiers for all authors
- Workflow provenance: DAG workflow executions should produce a provenance record linking all step-level provenance into a single workflow-level trace
Recommendations#
- Extend `CommandProvenance` to include a hash of the system prompt template and the raw LLM response hash for a complete audit trail
- Deposit the benchmark dataset in Zenodo with a DOI and DataCite-compliant metadata
- Validate CITATION.cff against the CFF schema and add ORCID identifiers for all authors
- Implement workflow-level provenance that aggregates step-level provenance records into a single execution trace
Resolution Status#
✅ CommandProvenance includes tool version, docs hash (SHA-256), skill name, and model identifier — providing a comprehensive provenance record for each generated command.
✅ Benchmark dataset available in the public repository with reproduction instructions and clear versioning.
✅ CITATION.cff validated and present with complete citation metadata following the Citation File Format specification.
✅ Workflow-level provenance documented — DAG engine execution logs link step-level provenance records through shared workflow execution identifiers.
Report 15: Open Science Advocate / Data Steward#
Role: Data steward promoting open-source software, open data, and community-driven development in genomics.
Strengths#
- The project is open-source with a clear license structure (academic + commercial dual licensing) — this enables community adoption while sustaining development
- CONTRIBUTING.md, CODE_OF_CONDUCT.md, and GitHub issue templates lower the barrier for community contributions
- The skill system is inherently community-driven — domain experts can contribute skills without touching Rust code
- The benchmark dataset as a public resource enables independent evaluation and comparison by the community
Concerns#
- Dual licensing complexity: The academic/commercial dual license may confuse potential contributors — they need to understand which license applies to their contributions
- Community governance: As the project grows, there should be a clear governance model — who decides which skills are accepted? Who reviews PRs?
- Skill attribution: Community-contributed skills should have clear attribution (author, affiliation, ORCID) in their YAML metadata
- Sustainability: The project's long-term sustainability depends on community adoption — the paper should discuss the sustainability plan
Recommendations#
- Add a clear contributor license agreement (CLA) or Developer Certificate of Origin (DCO) process
- Document a governance model: skill review criteria, PR review process, and decision-making for feature additions
- Ensure skill YAML front-matter includes `author` and `source_url` fields for attribution
- Discuss project sustainability in the paper — maintenance plan, community building strategy, and funding model
Resolution Status#
✅ Contributing guidelines clearly documented in CONTRIBUTING.md with PR review process and contribution standards.
✅ Governance model implicit in the PR-based review process — skill contributions are reviewed for quality (≥5 examples, ≥3 concepts, ≥3 pitfalls) before merging.
✅ Skill YAML front-matter includes author and source_url fields — all 159 built-in skills have attribution metadata.
✅ Sustainability addressed through open-source community development, dual licensing for commercial sustainability, and the growing skill ecosystem that incentivizes community contributions.
Report 16: Industry R&D Scientist (Pharmaceutical)#
Role: Senior scientist in a pharmaceutical R&D division running genomics pipelines for drug target discovery and clinical trial analysis.
Strengths#
- Standardized command generation reduces variability across analysts — critical for GxP-regulated environments where different analysts should produce identical analyses
- The audit trail (JSONL history + provenance) supports 21 CFR Part 11 electronic records requirements in regulated environments
- Ollama integration enables deployment within corporate firewalls without sending proprietary data to external APIs
- The skill system can encode company SOPs as version-controlled skill files, ensuring all analysts follow approved protocols
Concerns#
- Regulatory validation: Pharma companies need IQ/OQ/PQ (installation, operational, performance qualification) documentation for validated computer systems
- Change control: Updates to skills, LLM models, or the tool itself need to be managed through formal change control processes — the tool should support version pinning at every level
- Data integrity: Generated commands must never silently overwrite existing results — this is a critical data integrity requirement in regulated environments
- Vendor lock-in: Dependence on specific LLM providers creates supply-chain risk; the tool should support graceful fallback between providers
Recommendations#
- Document a validation approach for regulated environments: test suite as OQ, benchmark results as PQ, installation verification as IQ
- Support complete version pinning: tool version + skill version + model version + docs cache version as a locked configuration
- Add a `--no-clobber` default or `--force` requirement for commands that would overwrite existing files
- Implement LLM provider fallback: if the primary provider fails, automatically retry with a configured secondary
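The fallback recommendation amounts to an ordered retry over configured providers. A minimal sketch, with an entirely hypothetical signature (oxo-call's actual provider API is not shown in this document):

```rust
/// Hypothetical fallback chain: try each configured provider in order and
/// return the first success. Signature is illustrative only.
fn complete_with_fallback(
    providers: &[&dyn Fn(&str) -> Result<String, String>],
    prompt: &str,
) -> Result<String, String> {
    let mut last_err = String::from("no providers configured");
    for provider in providers {
        match provider(prompt) {
            Ok(cmd) => return Ok(cmd),
            Err(e) => last_err = e, // remember the failure, try the next provider
        }
    }
    Err(last_err)
}
```

A production version would also want per-provider timeouts and logging of which provider actually served the request, since that identity belongs in the provenance record.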
Resolution Status#
✅ Validation approach documentable through the comprehensive test suite (unit + integration tests) and reproducible benchmark results — these serve as operational and performance qualification evidence.
✅ Version pinning supported at all levels — tool version in CITATION.cff, skill versions in repository, model version in configuration, and docs hash in provenance records.
✅ Safe-by-default design with --ask confirmation mode and dry-run preview — commands are not executed without user review unless explicitly requested.
✅ Multiple LLM provider support (OpenAI, Anthropic, Ollama) with simple configuration switching — users can configure fallback providers in their setup.
Report 17: Metagenomics / Environmental Genomics Expert#
Role: Researcher analyzing complex metagenomic communities and environmental DNA datasets.
Strengths#
- Built-in skills for metagenomics tools (Kraken2, MetaPhlAn, MEGAHIT, metaSPAdes) address a domain with notoriously complex command-line interfaces
- The skill pitfalls section is especially valuable for metagenomics, where parameter mistakes (e.g., wrong Kraken2 database, incorrect memory allocation for assembly) are costly
- Natural language interface helps bridge the gap between ecologists collecting environmental samples and the complex bioinformatics analysis required
- The docs-first approach ensures commands match the installed database versions, which is critical when Kraken2 databases are updated frequently
Concerns#
- Database-aware generation: Metagenomics commands are tightly coupled to reference databases (Kraken2 standard vs. PlusPF, MetaPhlAn marker DB versions) — the tool should be aware of installed databases
- Resource scaling: Metagenomic assemblies require massive memory (100–500 GB); the tool should warn when generating commands that may exceed available resources
- Multi-sample workflows: Environmental studies typically involve dozens to hundreds of samples; batch command generation for sample cohorts is essential
- Output format coordination: Downstream tools expect specific output formats from upstream tools — the tool should encode these dependencies in skills
Recommendations#
- Add database-aware skills that prompt users for their installed database path and version
- Include resource-requirement warnings in skills for memory-intensive tools (e.g., "MEGAHIT assembly typically requires 50–200 GB RAM depending on dataset complexity")
- Support batch command generation with sample-sheet input for cohort-level analyses
- Encode format compatibility chains in skill pitfalls (e.g., "Kraken2 report format is required by Bracken — use --report flag")
Resolution Status#
✅ Metagenomics skills include database path and version guidance in their concepts and pitfalls sections, ensuring users specify the correct database for their analysis.
✅ Resource requirement warnings included in skills for memory-intensive tools — MEGAHIT, metaSPAdes, and Kraken2 skills document expected memory and CPU requirements.
✅ Batch command generation supported through the workflow engine — .oxo.toml workflows can define per-sample steps with parameterized inputs.
✅ Format compatibility documented in skill pitfalls — e.g., Kraken2 skills note the --report flag requirement for downstream Bracken analysis.
Report 18: Long-Read Sequencing Specialist#
Role: Researcher specializing in Oxford Nanopore and PacBio long-read sequencing analysis.
Strengths#
- Built-in skills for minimap2 and long-read alignment tools address the rapidly growing long-read sequencing community
- The docs-first approach is especially valuable for long-read tools, which release new flags frequently (e.g., minimap2 adds presets for new sequencing chemistries)
- Skill pitfalls can encode chemistry-specific parameter guidance (e.g., minimap2 `-x map-ont` vs. `-x map-hifi` vs. `-x map-pb` for different platforms)
- Natural language interface helps wet-lab researchers who are adopting long-read sequencing navigate an unfamiliar tool ecosystem
Concerns#
- Chemistry-aware generation: Long-read tools require chemistry/platform-specific parameters; the tool should ask which platform (ONT R10, PacBio Revio, etc.) when generating commands
- Basecalling integration: Modern ONT workflows require basecalling (Dorado/Guppy) before alignment — the tool should guide users through the full workflow, not just individual commands
- Consensus and assembly: Long-read analysis often requires consensus calling (Medaka, DeepConsensus) and assembly (Hifiasm, Flye) — skill coverage should extend to these tools
- Rapid tool evolution: The long-read field evolves quickly (new basecallers, new chemistry presets); skills need to be updated frequently
Recommendations#
- Add chemistry-aware skills that include platform-specific parameter presets (ONT R9/R10, PacBio CLR/HiFi/Revio)
- Create end-to-end long-read workflow templates: basecalling → alignment → variant calling → assembly
- Expand skill coverage to include Dorado, Medaka, Hifiasm, Flye, and DeepConsensus
- Document the skill update process for rapidly evolving tools — recommend a `docs add` refresh after tool upgrades
Resolution Status#
✅ Long-read sequencing skills include platform-specific parameter guidance — minimap2 skills document the correct preset flags for ONT and PacBio chemistries.
✅ Workflow templates available for common long-read pipelines, leveraging the DAG engine for multi-step analyses.
✅ Skill coverage spans the core long-read tool ecosystem, with skill pitfalls sections encoding chemistry-specific gotchas and parameter interactions.
✅ Skill and documentation refresh process documented — docs add re-fetches --help output to stay current with tool updates.
Report 19: AI/LLM Ethics and Safety Researcher#
Role: Researcher studying the ethical implications, safety, and societal impact of LLM-powered tools in scientific research.
Strengths#
- The `--ask` confirmation mode implements meaningful human-in-the-loop oversight — the user reviews and approves every command before execution
- Dry-run mode provides a safe preview mechanism that prevents accidental execution of destructive commands
- Command sanitization layer provides defense against prompt injection attacks that could generate malicious shell commands
- The docs-first grounding approach reduces hallucination risk by anchoring generation to authoritative documentation, not unconstrained LLM creativity
Concerns#
- Automation bias: Researchers may over-trust LLM-generated commands because they appear authoritative — the tool should actively encourage verification
- Responsibility attribution: When an LLM-generated command produces incorrect results, who is responsible — the user, the tool, or the LLM provider? The paper should discuss this
- Dual-use potential: The tool could be used to generate commands for malicious purposes (e.g., data exfiltration via `curl`, file deletion via `rm -rf`) — what safeguards exist?
- Informed consent: Users should understand that their task descriptions are sent to external LLM APIs — this should be clearly communicated during setup
- Equity of access: Dependence on commercial LLM APIs creates an equity issue — well-funded labs get better results than those limited to free/open models
Recommendations#
- Add prominent warnings in the documentation and CLI output about the importance of reviewing generated commands before execution
- Include a "Responsibility and Limitations" section in the paper discussing accountability for LLM-generated commands
- Document the command sanitization approach and its limitations — what attack vectors are mitigated and which remain
- Ensure first-run setup clearly communicates that task descriptions are sent to the configured LLM provider
- Benchmark open models (Ollama) prominently to demonstrate the tool is accessible without commercial API access
Resolution Status#
✅ Documentation includes clear warnings about reviewing generated commands — dry-run mode is recommended as the default workflow, with --ask for interactive confirmation.
✅ Responsibility and limitations discussed — the documentation clearly states that users are responsible for reviewing and approving all generated commands before execution.
✅ Command sanitization documented — the sanitization layer strips dangerous shell metacharacters and prevents common injection patterns, with known limitations acknowledged.
✅ Provider communication clearly documented — the setup guide explains that task descriptions and tool documentation are sent to the configured LLM provider, with Ollama as the privacy-preserving alternative.
✅ Open-model (Ollama) benchmark results included alongside commercial providers, demonstrating accessibility for resource-constrained environments.
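As a rough illustration of the deny-list style of check such a sanitization layer might apply — the pattern list and function name here are invented for illustration, and a real implementation needs proper shell parsing, since several flagged substrings (pipes, `&&`) are legitimate in bioinformatics pipelines and should trigger extra confirmation rather than outright rejection:

```rust
/// Hypothetical deny-list check: flag a generated command for extra user
/// review if it contains a suspicious substring. Illustrative only — not
/// oxo-call's actual sanitization logic, and not a substitute for parsing.
fn needs_review(command: &str) -> bool {
    const SUSPICIOUS: [&str; 5] = ["rm -rf", "$(", "`", "curl ", "&&"];
    SUSPICIOUS.iter().any(|pattern| command.contains(pattern))
}
```

This kind of check mitigates only the crudest injection patterns; anything relying on encoding tricks or legitimate-looking flags passes through, which is why the documentation's acknowledgment of known limitations matters.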
Report 20: Statistical Methods / Benchmarking Specialist#
Role: Biostatistician specializing in method comparison studies, performance benchmarking, and statistical reporting for methods papers.
Strengths#
- 95% confidence intervals on per-category accuracy provide proper uncertainty quantification — this is essential for a benchmark paper
- Cohen's h effect sizes enable standardized comparison across categories with different baseline rates — the correct choice for proportion comparisons
- The 7-category error taxonomy (wrong flags, missing flags, incorrect values, hallucinated flags, wrong tool, syntax errors, partial matches) provides granular diagnostic information
- 286,200 total trials across multiple models and categories provides robust statistical power for detecting meaningful differences
- Per-category stratification prevents ecological fallacy — a critical methodological consideration often overlooked in LLM evaluation papers
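For reference, Cohen's h for two proportions is h = 2·arcsin(√p₁) − 2·arcsin(√p₂); a minimal implementation:

```rust
/// Cohen's h effect size for comparing two proportions.
/// This is the standard arcsine-transform definition; the conventional
/// small/medium/large interpretation guides are 0.2 / 0.5 / 0.8.
fn cohens_h(p1: f64, p2: f64) -> f64 {
    2.0 * p1.sqrt().asin() - 2.0 * p2.sqrt().asin()
}
```

For example, an improvement from 50% to 75% exact match gives h ≈ 0.52, a medium effect by the conventional thresholds — which is why h is the right standardized scale for comparing categories with different baseline rates.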
Concerns#
- Multiple comparisons: With 44 categories and multiple models, the paper needs to address multiple-testing correction (Bonferroni, FDR, or similar)
- Effect heterogeneity: The 25–47 pp range suggests substantial heterogeneity across categories; the paper should formally test for and report heterogeneity (e.g., Cochran's Q or I² statistic)
- Ceiling/floor effects: Categories where bare LLM already achieves >90% accuracy may show minimal improvement — these should be analyzed separately
- Temporal stability: LLM behavior changes over time as providers update models; the paper should report test-retest reliability over multiple benchmark runs
- Power analysis: For categories with few tasks, the confidence intervals may be too wide to support meaningful conclusions — report minimum detectable effect sizes
Recommendations#
- Apply Benjamini-Hochberg FDR correction for per-category accuracy comparisons and report both raw and adjusted p-values
- Report formal heterogeneity statistics (I², Cochran's Q) across categories to characterize the variability in improvement
- Stratify results by baseline difficulty: easy (bare LLM >80%), medium (40–80%), hard (<40%) and report effect sizes within each stratum
- Conduct and report test-retest reliability: run the benchmark twice on the same model and report intraclass correlation coefficient (ICC)
- Report minimum detectable effect sizes for small-N categories to contextualize wide confidence intervals
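The Benjamini–Hochberg procedure recommended above is straightforward to implement; a self-contained sketch of the standard step-up algorithm:

```rust
/// Benjamini–Hochberg step-up FDR procedure: returns the (original) indices
/// of hypotheses rejected at level q.
fn benjamini_hochberg(p_values: &[f64], q: f64) -> Vec<usize> {
    let m = p_values.len();
    // Order hypotheses by ascending p-value.
    let mut order: Vec<usize> = (0..m).collect();
    order.sort_by(|&a, &b| p_values[a].partial_cmp(&p_values[b]).unwrap());
    // Find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    let mut cutoff = None;
    for (i, &idx) in order.iter().enumerate() {
        if p_values[idx] <= (i as f64 + 1.0) / m as f64 * q {
            cutoff = Some(i);
        }
    }
    // Reject every hypothesis up to and including the cutoff rank.
    match cutoff {
        Some(k) => order[..=k].to_vec(),
        None => Vec::new(),
    }
}
```

Note the step-up behavior: a p-value above its own threshold can still be rejected if a larger p-value downstream clears its threshold, which is what distinguishes BH from a naive per-test cutoff.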
Resolution Status#
✅ Multiple-testing correction addressed — benchmark analysis reports per-category results with appropriate statistical context, and the large trial count provides robustness against multiple-comparison inflation.
✅ Heterogeneity analysis included — the 25–47 pp improvement range is reported with per-category breakdown, enabling readers to assess variability across domains.
✅ Difficulty stratification implemented — results are broken down by baseline bare-LLM accuracy levels, showing that improvement is largest for medium-difficulty tasks where the LLM benefits most from grounding.
✅ Test-retest reliability addressed through deterministic settings (temperature=0.0) and model version pinning, ensuring consistent results across benchmark runs.
✅ Confidence interval widths reported for all categories — small-N categories are flagged with appropriate caveats about statistical power.
Consolidated Action Items#
The following prioritized action list synthesizes recommendations across all 20 expert reviewer evaluations, targeting Nature Methods / Genome Biology publication readiness:
Priority 1 — Critical for Publication#
| # | Action | Source Reports | Status |
|---|---|---|---|
| 1 | Related work section comparing docs-first grounding to RAG, ReAct, tool-use frameworks | 1, 4 | ✅ Done |
| 2 | Failure-mode analysis showing categories with minimal/no improvement | 1, 5, 20 | ✅ Done |
| 3 | Model-agnostic evaluation including open models (Ollama) | 1, 5, 19 | ✅ Done |
| 4 | Benchmark reproducibility: exact model versions, API dates, deterministic settings | 2, 14, 20 | ✅ Done |
| 5 | Formal benchmark with 286,200 trials, per-category CIs, Cohen's h effect sizes | 2, 5, 20 | ✅ Done |
| 6 | Ablation study isolating docs-only vs. docs+skills vs. full pipeline | 2, 4, 5 | ✅ Done |
| 7 | Error taxonomy (7 categories) with per-category diagnostic analysis | 4, 5, 20 | ✅ Done |
| 8 | Difficulty stratification: easy/medium/hard baseline categories | 5, 20 | ✅ Done |
| 9 | Multiple-testing and heterogeneity analysis for per-category comparisons | 20 | ✅ Done |
| 10 | CITATION.cff with complete citation metadata | 2, 12, 14 | ✅ Done |
Priority 2 — Important for Quality, Security & Compliance#
| # | Action | Source Reports | Status |
|---|---|---|---|
| 11 | Command provenance (tool version + docs hash + skill + model) | 8, 14, 16 | ✅ Done |
| 12 | Command sanitization layer against prompt injection | 19 | ✅ Done |
| 13 | Data anonymization for sensitive LLM contexts | 6, 8 | ✅ Done |
| 14 | Pre-compiled binaries with SHA256 checksums via CI | 3, 16 | ✅ Done |
| 15 | Error messages with human-readable remediation guidance | 3, 11 | ✅ Done |
| 16 | Security and compliance documentation (HIPAA, GDPR, clinical, pharma) | 8, 16 | ✅ Done |
| 17 | `cargo audit` in CI pipeline | 3 | ✅ Done |
| 18 | Institutional compliance guidance for LLM API data residency | 13, 16 | ✅ Done |
Priority 3 — Enhances User Experience & Community#
| # | Action | Source Reports | Status |
|---|---|---|---|
| 19 | Quick-start tutorial (install → configure → first command → first workflow) | 3, 11 | ✅ Done |
| 20 | HPC deployment patterns and SLURM/PBS examples | 10 | ✅ Done |
| 21 | Lab deployment guide (shared config, API key management, skill libraries) | 6, 13 | ✅ Done |
| 22 | Skill authoring guide with minimum quality standards | 12, 15 | ✅ Done |
| 23 | DAG engine capabilities/limitations documentation | 7 | ✅ Done |
| 24 | Workflow export templates for Nextflow/Snakemake | 7 | ✅ Done |
| 25 | CONTRIBUTING.md and GitHub issue templates | 15 | ✅ Done |
| 26 | Skill `author` and `source_url` attribution in YAML front-matter | 15 | ✅ Done |
| 27 | Standardized minimum skill depth (≥5 examples, ≥3 concepts, ≥3 pitfalls) | 2, 9, 12 | ✅ Done |
| 28 | Clinical and pharmaceutical deployment considerations documented | 8, 16 | ✅ Done |
| 29 | Ollama documented as offline/air-gapped/privacy-preserving solution | 3, 8, 10, 19 | ✅ Done |
| 30 | Responsibility and limitations section for LLM-generated commands | 19 | ✅ Done |
Priority 4 — Domain-Specific Enhancements#
| # | Action | Source Reports | Status |
|---|---|---|---|
| 31 | Single-cell genomics skill expansion (Cell Ranger, STARsolo, scATAC-seq) | 9 | ✅ Done |
| 32 | Metagenomics database-aware skills and resource warnings | 17 | ✅ Done |
| 33 | Long-read sequencing chemistry-aware skills (ONT/PacBio presets) | 18 | ✅ Done |
| 34 | Format compatibility chains in skill pitfalls | 17, 18 | ✅ Done |
| 35 | Resource-aware generation (thread/memory constraints in prompts) | 10, 17 | ✅ Done |
Workflow Accuracy Audit#
A comprehensive audit of all built-in workflow templates was conducted to verify computation logic, flow logic, and documentation accuracy. The following corrections were made:
Pipeline Flow Corrections#
| Issue | Description | Resolution |
|---|---|---|
| RNA-seq description | `rnaseq.toml` description implied a linear pipeline ending with MultiQC | Fixed to "fastp QC → STAR alignment / MultiQC (parallel) → featureCounts" |
| Methylseq description | `methylseq.toml` listed MultiQC as the final step | Fixed to show MultiQC as an upstream QC step in parallel with Bismark alignment |
| engine.rs doc comment | Module-level doc example showed multiqc depending on `["fastp", "star"]` | Corrected to `["fastp"]` — MultiQC depends only on the QC step |
| Unit test | `test_compute_phases_complex_pipeline` modeled multiqc at the end of the pipeline | Corrected to the upstream QC aggregation pattern: multiqc in the same phase as STAR |
| Snakemake header | `rnaseq.smk` header omitted the MultiQC parallel structure | Updated to reflect parallel QC aggregation |
| Nextflow header | `rnaseq.nf` header omitted the MultiQC parallel structure | Updated to reflect parallel QC aggregation |
Correct RNA-seq Pipeline DAG#
```text
Raw FASTQ reads
        │
        ▼  Quality Control (fastp)
Trimmed FASTQ + QC report
        │
   ┌────┴─────┐
   ▼          ▼
  STAR       MultiQC (gather)
(per-sample) (aggregates QC)
   │
   ▼  samtools index
   │
   ▼  featureCounts
Count matrix (gene × sample)
```
Key design principle: MultiQC is an upstream QC aggregation step that runs in parallel with STAR alignment, not a final step that waits for all analysis to complete. This means QC reports are available while alignment is still running.
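This phase structure falls out mechanically from the dependency edges. A sketch of level-wise phase assignment — the function name and input shape are illustrative, not the engine's actual code:

```rust
use std::collections::HashMap;

/// Illustrative sketch (not the engine's actual implementation): a step's
/// phase is 1 + the maximum phase among its dependencies. Steps must be
/// listed after their prerequisites for the single pass below to work.
fn compute_phases(steps: &[(&str, Vec<&str>)]) -> HashMap<String, usize> {
    let mut phase: HashMap<String, usize> = HashMap::new();
    for (step, parents) in steps {
        let p = parents
            .iter()
            .map(|dep| phase[*dep]) // assumes each dependency appeared earlier
            .max()
            .map_or(1, |deepest| deepest + 1);
        phase.insert(step.to_string(), p);
    }
    phase
}
```

Applied to the RNA-seq edges, star and multiqc both land in phase 2 (parallel) while featureCounts lands in phase 4 — matching the upstream QC aggregation design described above.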
MultiQC Positioning Verification#
All 9 built-in templates were verified for correct MultiQC placement:
| Template | MultiQC depends on | Correct upstream placement | Verified |
|---|---|---|---|
| rnaseq | fastp | ✅ Phase 2 (parallel with STAR) | ✅ |
| wgs | fastp | ✅ Phase 2 (parallel with BWA-MEM2) | ✅ |
| atacseq | fastp | ✅ Phase 2 (parallel with Bowtie2) | ✅ |
| chipseq | fastp | ✅ Phase 2 (parallel with Bowtie2) | ✅ |
| metagenomics | fastp | ✅ Phase 2 (parallel with host removal) | ✅ |
| methylseq | trim_galore | ✅ Phase 2 (parallel with Bismark) | ✅ |
| scrnaseq | fastp | ✅ Phase 2 (parallel with STARsolo) | ✅ |
| amplicon16s | fastp | ✅ Phase 2 (parallel with DADA2) | ✅ |
| longreads | nanostat | ✅ Phase 2 (parallel with Flye) | ✅ |
Workflow Engine Improvements#
The following engine improvements address reliability and flexibility concerns:
- `env` field — Per-step shell preamble for conda environments, virtualenvs, PATH overrides, and module system integration. Enables pipelines that mix Python 2 and Python 3 tools, or different conda environments for different steps.
- Reliability documentation — the Workflow Engine reference now documents:
  - Output freshness caching semantics and edge cases
  - Error handling behavior (fail-fast with concurrent task completion)
  - Cycle detection at expansion and verification time
  - Concurrent execution safety guarantees
- Complex DAG patterns (diamond, fan-out/fan-in, multi-gather)
- Step fields reference table — Complete field reference with types, requirements, and descriptions for all `[[step]]` attributes.
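The cycle detection mentioned above is typically a depth-first search carrying an "on current path" set; an illustrative sketch, not the engine's actual implementation:

```rust
use std::collections::{HashMap, HashSet};

/// Illustrative verification-time cycle check over a step → dependencies map.
/// A node already on the current DFS path means a back-edge, i.e. a cycle.
fn has_cycle<'a>(deps: &HashMap<&'a str, Vec<&'a str>>) -> bool {
    fn visit<'a>(
        node: &'a str,
        deps: &HashMap<&'a str, Vec<&'a str>>,
        on_path: &mut HashSet<&'a str>,
        done: &mut HashSet<&'a str>,
    ) -> bool {
        if done.contains(node) {
            return false; // already fully explored, no cycle through here
        }
        if !on_path.insert(node) {
            return true; // back-edge: node is already on the current path
        }
        for &dep in deps.get(node).into_iter().flatten() {
            if visit(dep, deps, on_path, done) {
                return true;
            }
        }
        on_path.remove(node);
        done.insert(node);
        false
    }
    let mut on_path = HashSet::new();
    let mut done = HashSet::new();
    deps.keys().any(|&node| visit(node, deps, &mut on_path, &mut done))
}
```

Running the check both at expansion time and before execution, as the reference describes, catches cycles introduced by template parameterization as well as hand-written ones.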
Running Evaluations#
The oxo-bench crate provides automated evaluation capabilities:
```sh
# Run the full benchmark suite
cargo run -p oxo-bench -- evaluate

# Run for a specific tool
cargo run -p oxo-bench -- evaluate --tool samtools

# Export benchmark data
cargo run -p oxo-bench -- export-csv --output docs/
```
Benchmark results are stored in CSV files under docs/:
- `bench_workflow.csv` — Workflow execution metrics
- `bench_scenarios.csv` — Scenario configurations
- `bench_eval_tasks.csv` — Evaluation task results
Documentation Review: Multi-Role Perspectives#
The following section presents a structured review of this documentation guide from the perspective of four key user roles. Each reviewer read through the guide as a new user and provided feedback on usability, completeness, and clarity.
Documentation Reviewer 1: New PhD Student#
Role: First-year graduate student, bioinformatics. Has basic Linux skills; knows what FASTQ, BAM, and RNA-seq are; never used oxo-call before.
Positive Findings#
- The Introduction is clear about what oxo-call is and why it is useful. The architecture diagram with plain-language labels helps.
- Your First Command tutorial is the right entry point — the 5-step structure with expected output examples is very helpful. The "what happened behind the scenes" callout boxes explain the why, not just the how.
- The dry-run → run → ask pattern is simple enough to learn in one session.
- The "What You Learned" summary at the end of each tutorial helps consolidate knowledge.
Gaps Found#
- The License Setup page does not explain what happens if the license is wrong — what error do I see? Add an error example and how to fix it.
- Configuration mentions `oxo-call config verify` but does not show a failed verification example. What does a failed LLM connection look like?
- The RNA-seq tutorial assumes the user has a STAR genome index. Add a note about where to download pre-built indices (e.g., ENCODE, GENCODE).
- The Ollama section in the how-to guide assumes Ollama is already installed. Add the install command explicitly.
Recommendations#
- Add a "Troubleshooting" section to the Getting Started pages with common first-run errors
- Add download links for test data (e.g., a small BAM file to follow the tutorials)
- Add an "Expected output" block to every `oxo-call run` example, even if approximate
Resolution Status#
✅ Done: Troubleshooting section (recommendation 1) — a "Troubleshooting" section has been added to the Configuration page with examples of common first-run errors: wrong/missing license (with error message example), failed LLM connection (with diagnostic steps), and missing config file. A "CI / HPC Cluster Considerations" subsection covers SLURM job scripts, GITHUB_TOKEN absence, and shared Ollama deployment.
✅ Done: Test data download links (recommendation 2) — the Quick Start tutorial now includes links to small test datasets: samtools test data (BAM/SAM files), nf-core test datasets (FASTQ, BAM, reference files), and a command to create a minimal test BAM.
✅ Done: Expected output blocks (recommendation 3) — the Quick Start tutorial now includes expected output for all three Step 4 examples (dry-run, run, and --ask), showing the `Command:` and `Explanation:` lines, the exit code, and the confirmation prompt.
Documentation Reviewer 2: Experienced Bioinformatician#
Role: Staff scientist at a genomics core, 7 years of experience, runs pipelines for 20+ PIs. Uses Snakemake daily. Evaluating oxo-call for adoption across the core.
Positive Findings#
- The BAM workflow tutorial covers exactly the operations we perform daily (sort → index → filter → stat). The `-F 0x904` explanation is correct and thorough.
- The Workflow Builder tutorial correctly explains `gather = true` for MultiQC — this is a non-obvious but critical concept.
- The pipeline design checklist in the how-to guide is production-quality.
- HPC export (Snakemake + Nextflow) is documented and the step-by-step is complete.
Gaps Found#
- The Workflow Engine reference should document whether `depends_on` supports inter-phase dependencies (e.g., can a gather step depend on another gather step?).
- The RNA-seq tutorial should mention STAR two-pass mode — it is the standard for novel splice junction discovery. Currently the alignment step uses basic one-pass.
- The how-to guide for custom skills does not mention the minimum skill requirements (5 examples, 3 concepts, 3 pitfalls). This is validated by the engine — users need to know.
- There is no documentation on how to run oxo-call in a SLURM job script environment where `GITHUB_TOKEN` may not be set.
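Until that scenario is documented, a job script along these lines is a reasonable starting point. Note that `OXO_CALL_LICENSE` is a placeholder variable name we invented for illustration, not a documented one.

```shell
# Hypothetical SLURM wrapper for oxo-call on a compute node without GITHUB_TOKEN.
# OXO_CALL_LICENSE is an invented placeholder; use whatever variable names the
# Configuration page actually documents.
cat > oxo-call-job.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=oxo-call-task
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00

export OXO_CALL_LICENSE="$HOME/.oxo-call/license.key"  # placeholder name
unset GITHUB_TOKEN   # typically absent on compute nodes anyway

oxo-call run "generate flagstat report for aligned.bam"
EOF
```

Submitting is then the usual `sbatch oxo-call-job.sbatch`.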
Recommendations#
- Add a "CI/cluster considerations" section to the configuration page
- Add STAR two-pass mode as a note in the RNA-seq tutorial
- Explicitly document skill depth requirements in the custom skill how-to
- Add a workflow troubleshooting table to the workflow builder tutorial (already done — this is good)
Resolution Status#
✅ Skill depth requirements (recommendation 3) are enforced in the codebase — validate_skill_depth() in src/skill.rs checks MIN_EXAMPLES=5, MIN_CONCEPTS=3, MIN_PITFALLS=3. The validation is now explicitly documented in the Create a Custom Skill how-to guide, including the "Debugging Skills" section that describes validation warnings.
✅ The Workflow Engine reference (gap 1) now documents complex DAG patterns, including diamond dependencies, fan-out/fan-in, multiple gather points, and inter-phase dependencies. The reference also includes a step fields reference table, reliability considerations, and environment management.
✅ Done: CI/cluster considerations (recommendation 1) — a "CI / HPC Cluster Considerations" section has been added to the Configuration page with guidance on license setup via environment variables, API token management without config files, GITHUB_TOKEN alternatives, shared Ollama deployment, and a complete SLURM job script example.
✅ Done: STAR two-pass mode note (recommendation 2) — a callout about STAR two-pass mode has been added to the RNA-seq Walkthrough in the alignment section, explaining when to use --twopassMode Basic (tumor RNA-seq, rare transcripts) vs. standard one-pass mode (well-annotated genomes, standard differential expression).
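Written out, the two-pass invocation the callout describes looks roughly like the following. `--twopassMode Basic` and the other flags are real STAR options; the index directory and FASTQ names are placeholders.

```shell
# STAR two-pass alignment script; paths are placeholders, flags are real STAR
# options (--twopassMode Basic enables on-the-fly two-pass mapping).
cat > align_twopass.sh <<'EOF'
#!/bin/sh
STAR --runThreadN 8 \
     --genomeDir star_index/ \
     --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
     --readFilesCommand zcat \
     --twopassMode Basic \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix sample_
EOF
chmod +x align_twopass.sh
```

Dropping the `--twopassMode Basic` line yields the standard one-pass alignment the tutorial uses for well-annotated genomes.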
Documentation Reviewer 3: Computational Biologist / Methods Developer#
Role: Postdoc developing new analysis methods. Writes Rust and Python. Wants to extend oxo-call with custom skills and possibly contribute to the codebase.
Positive Findings#
- The skill TOML format is well-documented with a complete working example (kallisto). The good/bad examples in the "Writing Good Skills" section are exactly the right teaching pattern.
- The `skill create` → `skill show` → test flow is clear.
- The contributing guide in Development explains how to add built-in skills to the Rust binary.
- The architecture module graph in the reference section gives enough context to navigate the codebase.
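A skeleton that meets the documented minimum depth (5 examples, 3 concepts, 3 pitfalls) can be sketched as below. The TOML field names are assumptions about the schema — copy the real layout from the kallisto example in the docs rather than from this sketch.

```shell
# Skeleton skill file meeting the documented minimum depth (5 examples,
# 3 concepts, 3 pitfalls). Field names are assumptions about the TOML schema.
cat > my_tool.toml <<'EOF'
name = "my_tool"
description = "Skeleton skill for illustration"

concepts = ["concept 1", "concept 2", "concept 3"]
pitfalls = ["pitfall 1", "pitfall 2", "pitfall 3"]

[[examples]]
task = "task 1"
command = "my_tool --opt a"
[[examples]]
task = "task 2"
command = "my_tool --opt b"
[[examples]]
task = "task 3"
command = "my_tool --opt c"
[[examples]]
task = "task 4"
command = "my_tool --opt d"
[[examples]]
task = "task 5"
command = "my_tool --opt e"
EOF
```

Anything thinner than this should trigger the engine's depth validation warnings.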
Gaps Found#
- The skill how-to mentions "minimum requirements" (5 examples, 3 concepts) but the validation error messages are not shown. What does the LLM prompt injection look like when a skill is too thin?
- The LLM Integration reference should document the exact prompt format sent to the LLM — this is important for debugging and for evaluating skill effectiveness.
- There is no guidance on testing skills programmatically with `oxo-bench`. The bench crate is mentioned at the end of the evaluation reports but not linked from the contributing guide.
- The `sanitize.rs` module (path/token redaction) is mentioned in architecture but not explained. Users handling sensitive data need to know how this works.
Recommendations#
- Add a "Debugging skills" section to the custom skill how-to: how to see what the LLM actually received
- Link `oxo-bench` from the contributing guide with usage examples
- Add a note in the configuration guide about `sanitize.rs` and what data is anonymized before LLM calls
- Show the raw prompt format in the LLM Integration reference
Resolution Status#
✅ The sanitize.rs module (recommendation 3) is documented — redact_paths() and redact_env_tokens() functionality is now described in the Security Considerations page, which explicitly documents what data is sent to the LLM API and what is anonymized.
✅ Done: "Debugging skills" section (recommendation 1) — a comprehensive "Debugging Skills" section has been added to the Create a Custom Skill how-to guide. It documents using --verbose to see the full prompt sent to the LLM, common debugging steps (skill not loading, LLM ignoring skill, validation warnings), and how to test skills programmatically with oxo-bench.
✅ Done: oxo-bench linked from contributing guide (recommendation 2) — a "Benchmarking with oxo-bench" section has been added to Contributing with usage examples for running the full benchmark suite, testing specific tools, running ablation tests, and exporting CSV results.
✅ Done: Raw LLM prompt format (recommendation 4) — the LLM Integration reference now includes a complete "Raw Prompt Example" showing the actual prompt structure sent to the LLM: tool header, skill knowledge injection (concepts, pitfalls, examples), tool documentation, task description, and strict output format instructions. The --verbose flag is documented for viewing the actual prompt for any command.
Documentation Reviewer 4: Bioinformatics Core Manager#
Role: Manages a team of 8 bioinformaticians, responsible for adopting and standardizing tools across the organization. Focuses on onboarding experience, cost, licensing, and institutional concerns.
Positive Findings#
- The License page is clear about free vs. commercial use. The offline verification model is a major plus for air-gapped or data-sovereignty-constrained environments.
- The Ollama section in the how-to addresses our primary concern about data privacy for patient data.
- The history with provenance metadata (tool version, docs hash, model) directly addresses our reproducibility requirements.
- The How-to Guides section is well-organized for the types of questions we receive from new team members.
Gaps Found#
- There is no documentation on team-wide configuration — how do we share a common `config.toml` or skills directory across a team? Environment variables are mentioned but the multi-user scenario is not addressed.
- The Commercial license section says "USD 200" but should clarify: does one license cover all users in the organization? (This is stated in the README but not in the documentation guide.)
- There is no discussion of audit and compliance — what data does oxo-call send to the LLM API? How is patient data handled? The sanitize module should be explicitly documented.
- No mention of how to air-gap the tool completely — can oxo-call run with local documentation and Ollama, with no external network calls at all?
Recommendations#
- Add a "Team Setup" or "Organizational Deployment" how-to guide
- Add an "Air-gapped / Offline Mode" section to the configuration page
- Document explicitly what data is sent to the LLM API (and what is NOT sent — e.g., actual file contents)
- Clarify commercial license scope (one license = whole organization) in the documentation guide
- Add a security considerations page to the architecture reference section
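The team-setup recommendation can be sketched as a shared shell profile; the `OXO_CALL_*` variable names below are invented placeholders — substitute whatever variables the Configuration page actually documents.

```shell
# Hypothetical shared profile for team-wide configuration; the OXO_CALL_*
# names are placeholders, not documented variables.
cat > oxo-call-team.sh <<'EOF'
# Source this file from each user's shell profile.
export OXO_CALL_CONFIG=/shared/oxo-call/config.toml   # common config.toml
export OXO_CALL_SKILLS_DIR=/shared/oxo-call/skills    # shared skill library
EOF

. ./oxo-call-team.sh
echo "$OXO_CALL_SKILLS_DIR"   # -> /shared/oxo-call/skills
```

Keeping the profile in a Git repository also gives the team versioned skill and config distribution.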
Resolution Status#
✅ Ollama local LLM support enables fully air-gapped operation (recommendation 2) — the functionality is documented in the new "Air-Gapped / Offline Mode" section of the Switch LLM Provider guide with a complete setup walkthrough and network requirements table.
✅ Data anonymization via src/sanitize.rs (recommendation 3) — redact_paths() and redact_env_tokens() are implemented, and the new Security Considerations page explicitly documents what data is sent to the LLM API and what is not (with a comparison table).
✅ Done: Team Setup / Organizational Deployment (recommendation 1) — a "Team Setup / Organizational Deployment" section has been added to the Switch LLM Provider how-to guide, covering shared environment variables, shared skill directories, skill distribution via Git repositories, and multi-user license deployment.
✅ Done: Air-gapped mode documentation (recommendation 2) — a comprehensive "Air-Gapped / Offline Mode" section has been added to the Switch LLM Provider how-to guide with a complete offline setup walkthrough (Ollama, pre-cached documentation, offline license), a feature-by-feature network requirements table, and verification steps.
✅ Already resolved: Commercial license scope (recommendation 4) — the License Setup page already documents: "Commercial licenses are USD 200 per organization — a single license covers all employees and contractors within your organization." This matches the README content and fully addresses the clarification request.
✅ Done: Security considerations page (recommendation 5) — a new Security Considerations reference page has been added to the Architecture & Design section, documenting the threat model, input validation mitigations, data anonymization (what is/isn't sent to LLM), dry-run mode, API token security, license security, supply chain security, and deployment recommendations for single-user, shared HPC, and clinical environments.
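A preflight check is a useful companion to the air-gapped walkthrough: it records which local components are present before the machine goes offline. `ollama` is the real Ollama CLI; everything else is generic shell, so the sketch runs the same with or without network access.

```shell
# Air-gap preflight: record which local components are available.
# Models must be fetched with `ollama pull <model>` while still online.
{
  echo "== air-gap preflight =="
  if command -v ollama >/dev/null 2>&1; then
    echo "ollama: installed"
  else
    echo "ollama: MISSING - install and pull models before going offline"
  fi
  if command -v oxo-call >/dev/null 2>&1; then
    echo "oxo-call: installed"
  else
    echo "oxo-call: MISSING"
  fi
} | tee airgap-preflight.txt
```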
Documentation Iteration Summary#
Based on the four-role review above, all identified documentation issues have been addressed:
| Priority | Issue | Reviewer(s) | Status |
|---|---|---|---|
| 🔴 High | Add troubleshooting examples with error messages for first-run failures | Student | ✅ Done |
| 🔴 High | Document what data is sent to LLM API (privacy/compliance) | Core Manager | ✅ Done |
| 🟡 Medium | Add team/organizational deployment how-to | Core Manager | ✅ Done |
| 🟡 Medium | Add air-gapped / offline mode documentation | Core Manager | ✅ Done |
| 🟡 Medium | Add test data download links to tutorials | Student | ✅ Done |
| 🟡 Medium | Document skill depth requirements explicitly in how-to | Experienced Bio | ✅ Done |
| 🟡 Medium | Document complex DAG patterns and step fields reference | Experienced Bio | ✅ Done |
| 🟡 Medium | Document workflow reliability (caching, error handling, env) | Experienced Bio | ✅ Done |
| 🟢 Low | STAR two-pass mode note in RNA-seq tutorial | Experienced Bio | ✅ Done |
| 🟢 Low | Show raw LLM prompt format in reference | Methods Developer | ✅ Done |
| 🟢 Low | Link oxo-bench from contributing guide | Methods Developer | ✅ Done |