New Agent Creation System

Research-backed agent creation following Anthropic 2025 best practices

Overview

This command system helps you create production-ready OpenCode agents with:

  • Minimal prompts (~500 tokens at "right altitude")
  • Single agent + tools (not multi-agent for coding)
  • Just-in-time context (loaded on demand, not pre-loaded)
  • Clear tool definitions (purpose, when to use, when not to use)
  • Comprehensive testing (8 essential test types)

Quick Start

Create a New Agent

# Interactive agent creation
/create-agent my-agent-name

# Or specify in prompt
"Create a new agent called 'python-dev' for Python development"

Generate Test Suite

# Generate 8 comprehensive tests for an existing agent
/create-tests my-agent-name

Research-Backed Principles

1. Single Agent + Tools > Multi-Agent for Coding

Finding: "Most coding tasks involve fewer truly parallelizable tasks than research" (Anthropic 2025)

Why this matters:

  • Code changes are deeply dependent on each other
  • Sub-agents can't coordinate edits to the same file
  • Agents waste time duplicating work
  • Multi-agent excels at research (90.2% improvement) because searches are independent
  • Code is sequential, so those gains don't transfer

Application:

  • Use ONE lead agent with tool-based sub-functions
  • NOT autonomous sub-agents for coding
  • Multi-agent only for truly independent tasks:
    • Static analysis (no coordination needed)
    • Test execution
    • Code search/retrieval
  • NOT for: refactoring, architecture decisions, multi-file changes
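
For illustration, in the frontmatter format used by the template below, the truly independent tasks map onto tools owned by one primary agent rather than autonomous sub-agents:

# One lead agent; independent tasks become tool calls, not sub-agents
description: "Lead coding agent"
mode: primary
tools:
  grep: true   # code search/retrieval
  glob: true   # file discovery
  bash: true   # static analysis, test execution
  edit: true   # sequential code changes stay with the lead agent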

2. Right Altitude: Minimal Prompts

Finding: "Find the smallest possible set of high-signal tokens that maximize likelihood of desired outcome"

The Balance:

| Too Vague | Right Altitude ✅ | Too Rigid |
|-----------|------------------|-----------|
| "Write good code" | Clear heuristics + examples | 50-line prompt with edge cases |
| Fails to guide behavior | Flexible but specific | Brittle, hard to maintain |

Application:

  • System prompt: Minimal (~500 tokens)
  • Clear heuristics, not exhaustive rules
  • Examples > edge case lists
  • Show ONE canonical example, not 20 scenarios

3. Just-in-Time Context Loading

Finding: "Agents discover context layer by layer. File metadata guides behavior. Prevents drowning in irrelevant information"

Context Management Layers:

  1. System prompt: Minimal (~500 tokens). Clear heuristics, not exhaustive rules.
  2. Just-in-time retrieval: Tools that agents call to load context on demand (file paths, not full content)
  3. Working memory: Keep only what's needed for the current task

Why this beats pre-loading:

  • Agents discover context layer by layer
  • File metadata (size, name, timestamps) guides behavior
  • Prevents "drowning in irrelevant information"
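
As a sketch in the tool-definition style of section 5 (list_files is a hypothetical tool name), a retrieval tool that surfaces metadata first so the agent decides what to load:

tool: "list_files"
purpose: "Return file paths and metadata (name, size, timestamp) for a directory"
when_to_use: "You need to discover what exists before deciding what to read"
when_not_to_use: "You already know the exact file to load (use read_file instead)"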

4. CLAUDE.md Pattern

Finding: Anthropic's Claude Code uses this in production

Create a project context file automatically loaded into every session:

# Project Context

## Bash Commands
- npm run test: Run unit tests
- npm run lint: Check code style
- npm run typecheck: Check TypeScript

## Code Style
- Use ES modules (import/export)
- Destructure imports when possible
- Use async/await, not callbacks

## Common Files & Patterns
- API handlers in src/handlers/
- Business logic in src/logic/
- Tests mirror source structure

## Workflow Rules
- Always run typecheck before committing
- Don't modify test files when writing implementation
- Use git history to understand WHY, not WHAT

Benefits:

  • Eliminates repetitive context-loading
  • Shared across team (check into git)
  • Tuned like any prompt (run through prompt improvers)

5. Tool Clarity

Finding: "Tool ambiguity is one of the biggest failure modes"

Bad tool design:

tool: "search_code"
description: "search code"  # Ambiguous!

Good tool design:

tool: "read_file"
purpose: "Load a specific file for analysis or modification"
when_to_use: "You need to examine or edit a file"
when_not_to_use: "You already have the file content in context"

Key principle: If a human engineer can't definitively say which tool to use, neither can the agent.

6. Extended Thinking for Decomposition

Finding: "Improved instruction-following and reasoning efficiency for complex decomposition"

Before jumping to code, trigger extended thinking:

"Think about how to approach this problem. What files need to change? 
What are the dependencies? What should we test?"

Phrases mapped to thinking budget:

  • "think" = basic
  • "think hard" = 2x budget
  • "think harder" = 3x budget
  • "ultrathink" = maximum

7. Parallel Tool Calling

Finding: "Parallel tool calling cut research time by up to 90% for complex queries"

Design workflows where the agent can call multiple tools simultaneously:

Can do in parallel:

  • Run linter
  • Execute tests
  • Check type errors

NOT in parallel (sequential):

  • Apply fix, then test
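
As an illustration (this workflow schema is hypothetical, not part of the framework), batch the independent checks and keep the dependent fix-then-test loop sequential:

# Hypothetical workflow sketch
verify:
  parallel:           # one batch of simultaneous tool calls
    - bash: "npm run lint"
    - bash: "npm test"
    - bash: "npm run typecheck"
fix:
  sequential:         # each step depends on the previous result
    - edit: "apply fix"
    - bash: "npm test"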

8. Outcome-Focused Evaluation

Finding: "Token usage explains 80% of performance variance. Number of tool calls ~10%. Model choice ~10%"

What to measure:

  • ✅ Does it solve the task?
  • ✅ Token usage reasonable?
  • ✅ Tool calls appropriate?
  • ❌ NOT: "Did it follow exact steps I imagined?"

Application:

  • Give the agent enough tokens to solve the problem
  • Don't minimize tool calls (some redundancy is fine)
  • Evaluate on real failure cases, not synthetic tests
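
A sketch of outcome-focused assertions in a test file (field names are illustrative, not the framework's actual schema):

# Assert outcomes, not the exact steps you imagined
assertions:
  task_solved: true     # does the final state pass the real check?
  max_tokens: 50000     # a generous ceiling, not a target to minimize
  max_tool_calls: 30    # loose bound; some redundancy is fine
  # deliberately no assertion on step order or tool sequence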

Agent Structure

Minimal System Prompt Template (~500 tokens)

---
description: "{one-line purpose}"
mode: primary
temperature: 0.1-0.7
tools:
  read: true
  write: true
  edit: true
  bash: true
  glob: true
  grep: true
---

# {Agent Name}

<role>
{Clear, concise role - what this agent does}
</role>

<approach>
1. {First step - usually read/understand}
2. {Second step - usually think/plan}
3. {Third step - usually implement/execute}
4. {Fourth step - usually verify/test}
5. {Fifth step - usually complete/handoff}
</approach>

<heuristics>
- {Key heuristic 1 - how to approach problems}
- {Key heuristic 2 - when to use tools}
- {Key heuristic 3 - how to verify work}
- {Key heuristic 4 - when to stop/report}
</heuristics>

<output>
Always include:
- What you did
- Why you did it that way
- {Domain-specific output requirement}
</output>

<examples>
  <example name="{Canonical Use Case}">
    **User**: "{typical request}"
    
    **Agent**:
    1. {Step 1 with tool usage}
    2. {Step 2 with reasoning}
    3. {Step 3 with output}
    
    **Result**: {Expected outcome}
  </example>
</examples>

Test Suite (8 Essential Tests)

Every agent gets 8 comprehensive tests:

  1. Planning & Approval - Verify plan-first approach
  2. Context Loading - Ensure just-in-time context retrieval
  3. Incremental Implementation - Verify step-by-step execution
  4. Tool Usage - Check correct tool selection and usage
  5. Error Handling - Verify stop-on-failure behavior
  6. Extended Thinking - Check decomposition before coding
  7. Compaction - Verify summarization for long sessions
  8. Completion - Check proper output and handoff
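
For example, a planning test such as planning-approval-001.yaml might look like this (a sketch; adapt the fields to the framework's actual schema):

# evals/agents/{agent-name}/tests/planning/planning-approval-001.yaml (sketch)
id: planning-approval-001
category: planning
prompt: "Add input validation to the signup endpoint"
expect:
  plan_presented_before_edits: true   # agent proposes a plan first
  approval_requested: true            # and waits for sign-off before coding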

What NOT to Do

Based on failure modes found in production:

Don't:

  • ❌ Create sub-agents for dependent tasks (code is sequential)
  • ❌ Pre-load entire codebase into context (use just-in-time retrieval)
  • ❌ Write exhaustive edge case lists in prompts (brittle, hard to maintain)
  • ❌ Give vague tool descriptions (major failure mode)
  • ❌ Use multi-agent if you could use single agent + tools
  • ❌ Hardcode complex logic in prompts (use tools instead)
  • ❌ Minimize tool calls (some redundancy is fine)

Do:

  • ✅ Let agents discover context via tools
  • ✅ Use examples instead of rules
  • ✅ Keep system prompt minimal (~500 tokens)
  • ✅ Be explicit about effort budgets ("3-5 tool calls, not 50")
  • ✅ Evaluate on real failure cases, not synthetic tests
  • ✅ Measure outcomes: Does it solve the task?

Files Created

When you create a new agent, the system generates:

.opencode/agent/{agent-name}.md
  └─ Minimal system prompt (~500 tokens)

.opencode/context/project/{agent-name}-context.md
  └─ Project context (CLAUDE.md pattern)

evals/agents/{agent-name}/
  ├─ config/
  │   └─ config.yaml
  └─ tests/
      ├─ planning/
      │   └─ planning-approval-001.yaml
      ├─ context-loading/
      │   └─ context-before-code-001.yaml
      ├─ implementation/
      │   ├─ incremental-001.yaml
      │   ├─ tool-usage-001.yaml
      │   └─ extended-thinking-001.yaml
      ├─ error-handling/
      │   └─ stop-on-failure-001.yaml
      ├─ long-horizon/
      │   └─ compaction-001.yaml
      └─ completion/
          └─ handoff-001.yaml

registry.json (updated)

Usage Examples

Example 1: Create Python Development Agent

User: "Create a new agent for Python development with testing and linting"

System creates:
- Agent: python-dev
- System prompt: ~500 tokens
- Tools: read, write, edit, bash, glob, grep
- Context file: Python-specific commands and patterns
- 8 comprehensive tests

Example 2: Create API Testing Agent

User: "Create an agent for API endpoint testing"

System creates:
- Agent: api-tester
- System prompt: ~500 tokens
- Tools: read, bash, glob, grep (no write/edit)
- Context file: API testing patterns and commands
- 8 comprehensive tests

Running Tests

# Run all tests for an agent
cd evals/framework
npm test -- --agent=my-agent-name

# Run specific category
npm test -- --agent=my-agent-name --category=planning

# Run single test
npm test -- --agent=my-agent-name --test=planning-approval-001

Iteration and Improvement

  1. Test with real use cases (not just synthetic tests)
  2. Measure outcomes: Does it solve the task?
  3. Iterate based on actual failures (not imagined edge cases)
  4. Update status to "stable" when proven in production

Research References

  • Anthropic Multi-Agent Research (Sept-Dec 2025)

    • Single agent + tools > multi-agent for coding
    • Token usage explains 80% of performance variance
  • Context Engineering Best Practices (Sept 2025)

    • "Find the smallest possible set of high-signal tokens"
    • Just-in-time retrieval beats pre-loading
  • Claude Code Production Patterns

    • CLAUDE.md pattern for project context
    • Extended thinking for complex decomposition
    • Compaction for long-horizon tasks

Support

For questions or issues:

  1. Check existing agents:
    • Core agents: .opencode/agent/core/openagent.md, .opencode/agent/core/opencoder.md
    • Development agents: .opencode/agent/development/frontend-specialist.md
    • Content agents: .opencode/agent/content/copywriter.md
  2. Review test examples: evals/agents/openagent/tests/
  3. See research docs: docs/agents/research-backed-prompt-design.md