New Agent Creation System

Research-backed agent creation following Anthropic 2025 best practices

Overview

This command system helps you create production-ready OpenCode agents with:

  • Minimal prompts (~500 tokens at "right altitude")
  • Single agent + tools (not multi-agent for coding)
  • Just-in-time context (loaded on demand, not pre-loaded)
  • Clear tool definitions (purpose, when to use, when not to use)
  • Comprehensive testing (8 essential test types)

Quick Start

Create a New Agent

# Interactive agent creation
/create-agent my-agent-name

# Or specify in prompt
"Create a new agent called 'python-dev' for Python development"

Generate Test Suite

# Generate 8 comprehensive tests for an existing agent
/create-tests my-agent-name

Research-Backed Principles

1. Single Agent + Tools > Multi-Agent for Coding

Finding: "Most coding tasks involve fewer truly parallelizable tasks than research" (Anthropic 2025)

Why this matters:

  • Code changes are deeply dependent on each other
  • Sub-agents can't coordinate edits to the same file
  • Agents waste time duplicating work
  • Multi-agent excels at research (90.2% improvement) because searches are independent
  • Code is sequential, so those gains don't transfer

Application:

  • Use ONE lead agent with tool-based sub-functions
  • NOT autonomous sub-agents for coding
  • Multi-agent only for truly independent tasks:
    • Static analysis (no coordination needed)
    • Test execution
    • Code search/retrieval
  • NOT for: refactoring, architecture decisions, multi-file changes
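
For illustration, in the frontmatter format used by the template below, the truly independent tasks map onto tools owned by one primary agent rather than autonomous sub-agents:

# One lead agent; independent tasks become tool calls, not sub-agents
description: "Lead coding agent"
mode: primary
tools:
  grep: true   # code search/retrieval
  glob: true   # file discovery
  bash: true   # static analysis, test execution
  edit: true   # sequential code changes stay with the lead agent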

2. Right Altitude: Minimal Prompts

Finding: "Find the smallest possible set of high-signal tokens that maximize likelihood of desired outcome"

The Balance:

| Too Vague | Right Altitude ✅ | Too Rigid |
|-----------|------------------|-----------|
| "Write good code" | Clear heuristics + examples | 50-line prompt with edge cases |
| Fails to guide behavior | Flexible but specific | Brittle, hard to maintain |

Application:

  • System prompt: Minimal (~500 tokens)
  • Clear heuristics, not exhaustive rules
  • Examples > edge case lists
  • Show ONE canonical example, not 20 scenarios

3. Just-in-Time Context Loading

Finding: "Agents discover context layer by layer. File metadata guides behavior. Prevents drowning in irrelevant information"

Context Management Layers:

  1. System prompt: Minimal (~500 tokens). Clear heuristics, not exhaustive rules.
  2. Just-in-time retrieval: Tools that agents call to load context on demand (file paths, not full content)
  3. Working memory: Keep only what's needed for the current task

Why this beats pre-loading:

  • Agents discover context layer by layer
  • File metadata (size, name, timestamps) guides behavior
  • Prevents "drowning in irrelevant information"
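
As a sketch in the tool-definition style of section 5 (list_files is a hypothetical tool name), a retrieval tool that surfaces metadata first so the agent decides what to load:

tool: "list_files"
purpose: "Return file paths and metadata (name, size, timestamp) for a directory"
when_to_use: "You need to discover what exists before deciding what to read"
when_not_to_use: "You already know the exact file to load (use read_file instead)"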

4. CLAUDE.md Pattern

Finding: Anthropic's Claude Code uses this in production

Create a project context file automatically loaded into every session:

# Project Context

## Bash Commands
- npm run test: Run unit tests
- npm run lint: Check code style
- npm run typecheck: Check TypeScript

## Code Style
- Use ES modules (import/export)
- Destructure imports when possible
- Use async/await, not callbacks

## Common Files & Patterns
- API handlers in src/handlers/
- Business logic in src/logic/
- Tests mirror source structure

## Workflow Rules
- Always run typecheck before committing
- Don't modify test files when writing implementation
- Use git history to understand WHY, not WHAT

Benefits:

  • Eliminates repetitive context-loading
  • Shared across team (check into git)
  • Tuned like any prompt (run through prompt improvers)

5. Tool Clarity

Finding: "Tool ambiguity is one of the biggest failure modes"

Bad tool design:

tool: "search_code"
description: "search code"  # Ambiguous!

Good tool design:

tool: "read_file"
purpose: "Load a specific file for analysis or modification"
when_to_use: "You need to examine or edit a file"
when_not_to_use: "You already have the file content in context"

Key principle: If a human engineer can't definitively say which tool to use, neither can the agent.

6. Extended Thinking for Decomposition

Finding: "Improved instruction-following and reasoning efficiency for complex decomposition"

Before jumping to code, trigger extended thinking:

"Think about how to approach this problem. What files need to change? 
What are the dependencies? What should we test?"

Phrases mapped to thinking budget:

  • "think" = basic
  • "think hard" = 2x budget
  • "think harder" = 3x budget
  • "ultrathink" = maximum

7. Parallel Tool Calling

Finding: "Parallel tool calling cut research time by up to 90% for complex queries"

Design workflows where the agent can call multiple tools simultaneously:

Can do in parallel:

  • Run linter
  • Execute tests
  • Check type errors

NOT in parallel (sequential):

  • Apply fix, then test
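
As an illustration (this workflow schema is hypothetical, not part of the framework), batch the independent checks and keep the dependent fix-then-test loop sequential:

# Hypothetical workflow sketch
verify:
  parallel:           # one batch of simultaneous tool calls
    - bash: "npm run lint"
    - bash: "npm test"
    - bash: "npm run typecheck"
fix:
  sequential:         # each step depends on the previous result
    - edit: "apply fix"
    - bash: "npm test"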

8. Outcome-Focused Evaluation

Finding: "Token usage explains 80% of performance variance. Number of tool calls ~10%. Model choice ~10%"

What to measure:

  • ✅ Does it solve the task?
  • ✅ Token usage reasonable?
  • ✅ Tool calls appropriate?
  • ❌ NOT: "Did it follow exact steps I imagined?"

Application:

  • Give the agent enough tokens to solve the problem
  • Don't minimize tool calls (some redundancy is fine)
  • Evaluate on real failure cases, not synthetic tests
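
A sketch of outcome-focused assertions in a test file (field names are illustrative, not the framework's actual schema):

# Assert outcomes, not the exact steps you imagined
assertions:
  task_solved: true     # does the final state pass the real check?
  max_tokens: 50000     # a generous ceiling, not a target to minimize
  max_tool_calls: 30    # loose bound; some redundancy is fine
  # deliberately no assertion on step order or tool sequence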

Agent Structure

Minimal System Prompt Template (~500 tokens)

---
description: "{one-line purpose}"
mode: primary
temperature: 0.1-0.7
tools:
  read: true
  write: true
  edit: true
  bash: true
  glob: true
  grep: true
---

# {Agent Name}

<role>
{Clear, concise role - what this agent does}
</role>

<approach>
1. {First step - usually read/understand}
2. {Second step - usually think/plan}
3. {Third step - usually implement/execute}
4. {Fourth step - usually verify/test}
5. {Fifth step - usually complete/handoff}
</approach>

<heuristics>
- {Key heuristic 1 - how to approach problems}
- {Key heuristic 2 - when to use tools}
- {Key heuristic 3 - how to verify work}
- {Key heuristic 4 - when to stop/report}
</heuristics>

<output>
Always include:
- What you did
- Why you did it that way
- {Domain-specific output requirement}
</output>

<examples>
  <example name="{Canonical Use Case}">
    **User**: "{typical request}"
    
    **Agent**:
    1. {Step 1 with tool usage}
    2. {Step 2 with reasoning}
    3. {Step 3 with output}
    
    **Result**: {Expected outcome}
  </example>
</examples>

Test Suite (8 Essential Tests)

Every agent gets 8 comprehensive tests:

  1. Planning & Approval - Verify plan-first approach
  2. Context Loading - Ensure just-in-time context retrieval
  3. Incremental Implementation - Verify step-by-step execution
  4. Tool Usage - Check correct tool selection and usage
  5. Error Handling - Verify stop-on-failure behavior
  6. Extended Thinking - Check decomposition before coding
  7. Compaction - Verify summarization for long sessions
  8. Completion - Check proper output and handoff
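
For example, a planning test such as planning-approval-001.yaml might look like this (a sketch; adapt the fields to the framework's actual schema):

# evals/agents/{agent-name}/tests/planning/planning-approval-001.yaml (sketch)
id: planning-approval-001
category: planning
prompt: "Add input validation to the signup endpoint"
expect:
  plan_presented_before_edits: true   # agent proposes a plan first
  approval_requested: true            # and waits for sign-off before coding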

What NOT to Do

Based on failure modes found in production:

Don't:

  • ❌ Create sub-agents for dependent tasks (code is sequential)
  • ❌ Pre-load entire codebase into context (use just-in-time retrieval)
  • ❌ Write exhaustive edge case lists in prompts (brittle, hard to maintain)
  • ❌ Give vague tool descriptions (major failure mode)
  • ❌ Use multi-agent if you could use single agent + tools
  • ❌ Hardcode complex logic in prompts (use tools instead)
  • ❌ Minimize tool calls (some redundancy is fine)

Do:

  • ✅ Let agents discover context via tools
  • ✅ Use examples instead of rules
  • ✅ Keep system prompt minimal (~500 tokens)
  • ✅ Be explicit about effort budgets ("3-5 tool calls, not 50")
  • ✅ Evaluate on real failure cases, not synthetic tests
  • ✅ Measure outcomes: Does it solve the task?

Files Created

When you create a new agent, the system generates:

.opencode/agent/{agent-name}.md
  └─ Minimal system prompt (~500 tokens)

.opencode/context/project/{agent-name}-context.md
  └─ Project context (CLAUDE.md pattern)

evals/agents/{agent-name}/
  ├─ config/
  │   └─ config.yaml
  └─ tests/
      ├─ planning/
      │   └─ planning-approval-001.yaml
      ├─ context-loading/
      │   └─ context-before-code-001.yaml
      ├─ implementation/
      │   ├─ incremental-001.yaml
      │   ├─ tool-usage-001.yaml
      │   └─ extended-thinking-001.yaml
      ├─ error-handling/
      │   └─ stop-on-failure-001.yaml
      ├─ long-horizon/
      │   └─ compaction-001.yaml
      └─ completion/
          └─ handoff-001.yaml

registry.json (updated)

Usage Examples

Example 1: Create Python Development Agent

User: "Create a new agent for Python development with testing and linting"

System creates:
- Agent: python-dev
- System prompt: ~500 tokens
- Tools: read, write, edit, bash, glob, grep
- Context file: Python-specific commands and patterns
- 8 comprehensive tests

Example 2: Create API Testing Agent

User: "Create an agent for API endpoint testing"

System creates:
- Agent: api-tester
- System prompt: ~500 tokens
- Tools: read, bash, glob, grep (no write/edit)
- Context file: API testing patterns and commands
- 8 comprehensive tests

Running Tests

# Run all tests for an agent
cd evals/framework
npm test -- --agent=my-agent-name

# Run specific category
npm test -- --agent=my-agent-name --category=planning

# Run single test
npm test -- --agent=my-agent-name --test=planning-approval-001

Iteration and Improvement

  1. Test with real use cases (not just synthetic tests)
  2. Measure outcomes: Does it solve the task?
  3. Iterate based on actual failures (not imagined edge cases)
  4. Update status to "stable" when proven in production

Research References

  • Anthropic Multi-Agent Research (Sept-Dec 2025)

    • Single agent + tools > multi-agent for coding
    • Token usage explains 80% of performance variance
  • Context Engineering Best Practices (Sept 2025)

    • "Find the smallest possible set of high-signal tokens"
    • Just-in-time retrieval beats pre-loading
  • Claude Code Production Patterns

    • CLAUDE.md pattern for project context
    • Extended thinking for complex decomposition
    • Compaction for long-horizon tasks

Support

For questions or issues:

  1. Check existing agents:
    • Core agents: .opencode/agent/core/openagent.md, .opencode/agent/core/opencoder.md
    • Development agents: .opencode/agent/development/frontend-specialist.md
    • Content agents: .opencode/agent/content/copywriter.md
  2. Review test examples: evals/agents/openagent/tests/
  3. See research docs: docs/agents/research-backed-prompt-design.md