InteLigent c799173dd0 align context and skills consistency (#192) (#198)		2 months ago
..
results	c8f7103cb6 refactor(evals): consolidate documentation and enhance test infrastructure (#56)	3 months ago
README.md	f669cac34c feat: repository review and MVI context system implementation (#85)	3 months ago
TEMPLATE.md	c8f7103cb6 refactor(evals): consolidate documentation and enhance test infrastructure (#56)	3 months ago
gemini.md	c799173dd0 align context and skills consistency (#192) (#198)	2 months ago
gpt.md	c799173dd0 align context and skills consistency (#192) (#198)	2 months ago
grok.md	c799173dd0 align context and skills consistency (#192) (#198)	2 months ago
llama.md	c799173dd0 align context and skills consistency (#192) (#198)	2 months ago
openrouter.md	c799173dd0 align context and skills consistency (#192) (#198)	2 months ago

OpenAgent Prompt Variants

Model-specific prompt optimizations with comprehensive test results.

🚀 Quick Start

# Test a variant with eval framework
cd evals/framework
npm run eval:sdk -- --agent=openagent --prompt-variant=llama --suite=smoke-test

# Run full core suite
npm run eval:sdk -- --agent=openagent --prompt-variant=llama --suite=core-tests

# View results
open ../results/index.html

📊 Capabilities Matrix

Variant	Model Family	Approval Gate	Context Loading	Stop on Failure	Delegation	Tool Usage	Pass Rate	Status
`default`	Claude	✅	✅	✅	✅	✅	7/7 (100%)	✅ Stable
`gpt`	GPT	✅	✅	✅	✅	✅	7/7 (100%)	✅ Stable
`gemini`	Gemini	✅	✅	✅	✅	✅	7/7 (100%)	✅ Stable
`grok`	Grok	✅	✅	✅	✅	✅	7/7 (100%)	✅ Stable
`llama`	Llama/OSS	✅	✅	✅	✅	✅	7/7 (100%)	✅ Stable

Legend:

✅ Works reliably (passes tests)
⚠️ Partial/inconsistent
❌ Does not work
- Not tested yet

Last Updated: 2025-12-08
Test Suite: Core tests (7 tests)
Model Used: opencode/grok-code-fast (for validation)

📝 Available Variants

`default.md` - Claude Optimized

Target Models:

anthropic/claude-sonnet-4-20250514 (primary)
anthropic/claude-3-5-sonnet-20241022

Optimizations:

Structured with <context> tags for Claude's context handling
Detailed workflow stages with checkpoints
Emphasis on safety rules and approval gates

Test Results:

{
  "total_tests": 7,
  "passed": 7,
  "failed": 0,
  "pass_rate": 100%,
  "avg_duration": "~45s per test"
}

Known Issues: None

Use When: Using Claude models (recommended for production)

`gpt.md` - GPT-4 Optimized

Target Models:

openai/gpt-4o
openai/gpt-4-turbo
openai/gpt-4o-mini

Optimizations:

Structured with clear sections and headers
Explicit instructions for tool usage
Emphasis on step-by-step reasoning

Test Results:

{
  "total_tests": 7,
  "passed": 7,
  "failed": 0,
  "pass_rate": 100%,
  "avg_duration": "~40s per test"
}

Known Issues: None

Use When: Using GPT-4 family models

`gemini.md` - Gemini Optimized

Target Models:

google/gemini-2.0-flash-exp
google/gemini-2.5-flash
google/gemini-pro

Optimizations:

Structured for Gemini's instruction-following
Clear role definitions
Emphasis on safety and validation

Test Results:

{
  "total_tests": 7,
  "passed": 7,
  "failed": 0,
  "pass_rate": 100%,
  "avg_duration": "~35s per test"
}

Known Issues: None

Use When: Using Gemini models

`grok.md` - Grok Optimized

Target Models:

opencode/grok-code-fast (free tier)
x-ai/grok-beta

Optimizations:

Concise, direct instructions
Emphasis on practical execution
Optimized for Grok's coding focus

Test Results:

{
  "total_tests": 7,
  "passed": 7,
  "failed": 0,
  "pass_rate": 100%,
  "avg_duration": "~50s per test"
}

Known Issues: None

Use When: Using Grok models (great for free tier testing)

`llama.md` - Llama/OSS Optimized

Target Models:

ollama/llama3.2
ollama/qwen2.5
ollama/deepseek-r1
Other open-source models

Optimizations:

Clear, structured instructions
Explicit examples and patterns
Optimized for smaller model context windows
Emphasis on tool usage patterns

Test Results:

{
  "total_tests": 7,
  "passed": 7,
  "failed": 0,
  "pass_rate": 100%,
  "avg_duration": "~60s per test"
}

Known Issues: None

Use When: Using open-source models (Llama, Qwen, DeepSeek, etc.)

🧪 Testing Variants

Quick Smoke Test (1 test, ~30s)

cd evals/framework
npm run eval:sdk -- --agent=openagent --prompt-variant=llama --suite=smoke-test

Core Test Suite (7 tests, ~5-8min)

npm run eval:sdk -- --agent=openagent --prompt-variant=llama --suite=core-tests

With Specific Model

npm run eval:sdk -- --agent=openagent --prompt-variant=llama --model=ollama/llama3.2 --suite=core-tests

View Results

# Dashboard
open ../results/index.html

# JSON results
cat ../results/latest.json

# Per-variant results
cat results/llama-results.json

📈 Test Coverage

All variants are tested against the Core Test Suite which validates:

Critical Rules (4 tests)

✅ Approval Gate - Requests approval before execution
✅ Context Loading - Loads required context files
✅ Stop on Failure - Stops and reports errors
✅ Report First - Reports before fixing

Functionality (3 tests)

✅ Simple Tasks - Handles tasks directly (no unnecessary delegation)
✅ Delegation - Delegates appropriately to subagents
✅ Tool Usage - Uses proper tools (read/grep vs bash)

Total: 7 tests covering ~85% of critical functionality

See evals/agents/openagent/config/core-tests.json for details.

🔧 Creating a New Variant

Step 1: Copy Template

cp .opencode/prompts/openagent/TEMPLATE.md .opencode/prompts/openagent/my-variant.md

Step 2: Edit Metadata

---
model_family: oss
recommended_models:
  - ollama/my-model
status: experimental
maintainer: your-name
description: Optimized for my specific use case
tested_with: ollama/my-model
last_tested: 2025-12-08
---

Step 3: Customize Prompt

Edit the prompt content for your target model:

Adjust instruction style
Modify examples
Change emphasis areas
Optimize for model strengths

Step 4: Test

# Smoke test
cd evals/framework
npm run eval:sdk -- --agent=openagent --prompt-variant=my-variant --suite=smoke-test

# Core suite
npm run eval:sdk -- --agent=openagent --prompt-variant=my-variant --suite=core-tests

Step 5: Document Results

Update this README with:

Test results (pass rate, timing)
Known issues or limitations
Recommended use cases
Model-specific notes

Step 6: Submit PR

Include variant file only (don't modify default.md)
Update this README with results
Ensure tests pass (≥85% pass rate)

📊 Understanding Results

Result Files

Per-variant results (results/{variant}-results.json):

{
  "variant": "llama",
  "model": "ollama/llama3.2",
  "timestamp": "2025-12-08T21:43:08.964Z",
  "summary": {
    "total": 7,
    "passed": 7,
    "failed": 0,
    "pass_rate": 1
  },
  "tests": [...]
}

Dashboard (evals/results/index.html):

Filter by variant
Compare pass rates
View detailed test results
Track trends over time

🎯 Best Practices

Choosing a Variant

Match your model family - Use gpt.md for GPT-4, llama.md for OSS
Test before committing - Run core suite to verify
Check compatibility - Some models work better with certain variants
Start with default - If unsure, default.md works well across models

Testing Your Changes

Start with smoke-test - Fast validation (1 test)
Run core-tests - Thorough validation (7 tests)
Test with your target model - Ensure compatibility
Check dashboard - Visual feedback on performance

Contributing Variants

Document thoroughly - Explain optimizations and trade-offs
Test extensively - Run full core suite multiple times
Be honest - Document both improvements and limitations
Share results - Help others by documenting findings

🚀 Advanced Usage

Custom Test Suites

Create custom suites for your variant:

# Create suite
cp evals/agents/openagent/config/smoke-test.json \
   evals/agents/openagent/config/my-suite.json

# Validate
cd evals/framework && npm run validate:suites openagent

# Run
npm run eval:sdk -- --agent=openagent --prompt-variant=my-variant --suite=my-suite

Comparing Models

Test the same variant with different models:

# Test Llama 3.2
npm run eval:sdk -- --agent=openagent --prompt-variant=llama --model=ollama/llama3.2 --suite=core-tests

# Test Qwen 2.5
npm run eval:sdk -- --agent=openagent --prompt-variant=llama --model=ollama/qwen2.5 --suite=core-tests

# Compare in dashboard
open ../results/index.html

🤝 Contributing

What Makes a Good Variant?

✅ Clear target - Specify which model(s) it's optimized for
✅ Documented changes - Explain what you changed and why
✅ Test results - Include real test results (≥85% pass rate)
✅ Honest assessment - Document both improvements and limitations
✅ Proper metadata - Complete YAML frontmatter

Promoting a Variant to Default

A variant can become the new default if it:

Shows significant improvement in test results
Works reliably across multiple models
Has been tested by multiple contributors
Doesn't introduce new critical issues
Maintains ≥95% pass rate on core tests

Maintainers will review test results and community feedback before promoting.

📚 Related Documentation

Main Prompts README - Prompt library overview
Eval Framework Guide - How to run tests
Test Suite Validation - Creating test suites
Contributing Guide - Contribution guidelines

🆘 Troubleshooting

Variant Not Found

# List available variants
ls .opencode/prompts/openagent/*.md

# Verify variant name
npm run eval:sdk -- --agent=openagent --prompt-variant=your-variant --suite=smoke-test

Tests Failing

Check variant metadata (YAML frontmatter)
Verify recommended model is available
Run with debug: npm run eval:sdk -- --debug
Check specific test failures in dashboard

Low Pass Rate

Review failed tests in dashboard
Check if model supports required capabilities
Consider adjusting prompt for model strengths
Test with different models in same family

💡 Tips

Start small - Test with smoke-test first
Iterate quickly - Use smoke-test for rapid iteration
Document everything - Help others learn from your experience
Share results - Update this README with your findings
Ask for help - Open an issue if you're stuck

Questions? See main README or open an issue.

README.md

OpenAgent Prompt Variants

🚀 Quick Start

📊 Capabilities Matrix

📝 Available Variants

default.md - Claude Optimized

gpt.md - GPT-4 Optimized

gemini.md - Gemini Optimized

grok.md - Grok Optimized

llama.md - Llama/OSS Optimized

🧪 Testing Variants

Quick Smoke Test (1 test, ~30s)

Core Test Suite (7 tests, ~5-8min)

With Specific Model

View Results

📈 Test Coverage

Critical Rules (4 tests)

Functionality (3 tests)

🔧 Creating a New Variant

Step 1: Copy Template

Step 2: Edit Metadata

Step 3: Customize Prompt

Step 4: Test

Step 5: Document Results

Step 6: Submit PR

📊 Understanding Results

Result Files

🎯 Best Practices

Choosing a Variant

Testing Your Changes

Contributing Variants

🚀 Advanced Usage

Custom Test Suites

Comparing Models

🤝 Contributing

What Makes a Good Variant?

Promoting a Variant to Default

📚 Related Documentation

🆘 Troubleshooting

Variant Not Found

Tests Failing

Low Pass Rate

💡 Tips

`default.md` - Claude Optimized

`gpt.md` - GPT-4 Optimized

`gemini.md` - Gemini Optimized

`grok.md` - Grok Optimized

`llama.md` - Llama/OSS Optimized