|
|
2 months ago | |
|---|---|---|
| .. | ||
| results | 3 months ago | |
| README.md | 3 months ago | |
| TEMPLATE.md | 3 months ago | |
| gemini.md | 2 months ago | |
| gpt.md | 2 months ago | |
| grok.md | 2 months ago | |
| llama.md | 2 months ago | |
| openrouter.md | 2 months ago | |
Model-specific prompt optimizations with comprehensive test results.
# Test a variant with eval framework
cd evals/framework
npm run eval:sdk -- --agent=openagent --prompt-variant=llama --suite=smoke-test
# Run full core suite
npm run eval:sdk -- --agent=openagent --prompt-variant=llama --suite=core-tests
# View results
open ../results/index.html
| Variant | Model Family | Approval Gate | Context Loading | Stop on Failure | Delegation | Tool Usage | Pass Rate | Status |
|---|---|---|---|---|---|---|---|---|
default |
Claude | ✅ | ✅ | ✅ | ✅ | ✅ | 7/7 (100%) | ✅ Stable |
gpt |
GPT | ✅ | ✅ | ✅ | ✅ | ✅ | 7/7 (100%) | ✅ Stable |
gemini |
Gemini | ✅ | ✅ | ✅ | ✅ | ✅ | 7/7 (100%) | ✅ Stable |
grok |
Grok | ✅ | ✅ | ✅ | ✅ | ✅ | 7/7 (100%) | ✅ Stable |
llama |
Llama/OSS | ✅ | ✅ | ✅ | ✅ | ✅ | 7/7 (100%) | ✅ Stable |
Legend:
- Not tested yetLast Updated: 2025-12-08
Test Suite: Core tests (7 tests)
Model Used: opencode/grok-code-fast (for validation)
default.md - Claude OptimizedTarget Models:
anthropic/claude-sonnet-4-20250514 (primary)anthropic/claude-3-5-sonnet-20241022Optimizations:
<context> tags for Claude's context handlingTest Results:
{
"total_tests": 7,
"passed": 7,
"failed": 0,
"pass_rate": 100%,
"avg_duration": "~45s per test"
}
Known Issues: None
Use When: Using Claude models (recommended for production)
gpt.md - GPT-4 OptimizedTarget Models:
openai/gpt-4oopenai/gpt-4-turboopenai/gpt-4o-miniOptimizations:
Test Results:
{
"total_tests": 7,
"passed": 7,
"failed": 0,
"pass_rate": 100%,
"avg_duration": "~40s per test"
}
Known Issues: None
Use When: Using GPT-4 family models
gemini.md - Gemini OptimizedTarget Models:
google/gemini-2.0-flash-expgoogle/gemini-2.5-flashgoogle/gemini-proOptimizations:
Test Results:
{
"total_tests": 7,
"passed": 7,
"failed": 0,
"pass_rate": 100%,
"avg_duration": "~35s per test"
}
Known Issues: None
Use When: Using Gemini models
grok.md - Grok OptimizedTarget Models:
opencode/grok-code-fast (free tier)x-ai/grok-betaOptimizations:
Test Results:
{
"total_tests": 7,
"passed": 7,
"failed": 0,
"pass_rate": 100%,
"avg_duration": "~50s per test"
}
Known Issues: None
Use When: Using Grok models (great for free tier testing)
llama.md - Llama/OSS OptimizedTarget Models:
ollama/llama3.2ollama/qwen2.5ollama/deepseek-r1Optimizations:
Test Results:
{
"total_tests": 7,
"passed": 7,
"failed": 0,
"pass_rate": 100%,
"avg_duration": "~60s per test"
}
Known Issues: None
Use When: Using open-source models (Llama, Qwen, DeepSeek, etc.)
cd evals/framework
npm run eval:sdk -- --agent=openagent --prompt-variant=llama --suite=smoke-test
npm run eval:sdk -- --agent=openagent --prompt-variant=llama --suite=core-tests
npm run eval:sdk -- --agent=openagent --prompt-variant=llama --model=ollama/llama3.2 --suite=core-tests
# Dashboard
open ../results/index.html
# JSON results
cat ../results/latest.json
# Per-variant results
cat results/llama-results.json
All variants are tested against the Core Test Suite which validates:
Total: 7 tests covering ~85% of critical functionality
See evals/agents/openagent/config/core-tests.json for details.
cp .opencode/prompts/openagent/TEMPLATE.md .opencode/prompts/openagent/my-variant.md
---
model_family: oss
recommended_models:
- ollama/my-model
status: experimental
maintainer: your-name
description: Optimized for my specific use case
tested_with: ollama/my-model
last_tested: 2025-12-08
---
Edit the prompt content for your target model:
# Smoke test
cd evals/framework
npm run eval:sdk -- --agent=openagent --prompt-variant=my-variant --suite=smoke-test
# Core suite
npm run eval:sdk -- --agent=openagent --prompt-variant=my-variant --suite=core-tests
Update this README with:
Per-variant results (results/{variant}-results.json):
{
"variant": "llama",
"model": "ollama/llama3.2",
"timestamp": "2025-12-08T21:43:08.964Z",
"summary": {
"total": 7,
"passed": 7,
"failed": 0,
"pass_rate": 1
},
"tests": [...]
}
Dashboard (evals/results/index.html):
Create custom suites for your variant:
# Create suite
cp evals/agents/openagent/config/smoke-test.json \
evals/agents/openagent/config/my-suite.json
# Validate
cd evals/framework && npm run validate:suites openagent
# Run
npm run eval:sdk -- --agent=openagent --prompt-variant=my-variant --suite=my-suite
Test the same variant with different models:
# Test Llama 3.2
npm run eval:sdk -- --agent=openagent --prompt-variant=llama --model=ollama/llama3.2 --suite=core-tests
# Test Qwen 2.5
npm run eval:sdk -- --agent=openagent --prompt-variant=llama --model=ollama/qwen2.5 --suite=core-tests
# Compare in dashboard
open ../results/index.html
A variant can become the new default if it:
Maintainers will review test results and community feedback before promoting.
# List available variants
ls .opencode/prompts/openagent/*.md
# Verify variant name
npm run eval:sdk -- --agent=openagent --prompt-variant=your-variant --suite=smoke-test
npm run eval:sdk -- --debugQuestions? See main README or open an issue.