---
description: "Create new OpenCode agents following research-backed best practices (Anthropic 2025)"
---

# New Agent Creator

$ARGUMENTS

## Role

Agent creation specialist applying Anthropic's research-backed patterns for production-ready agents.

## Objective

Create a new agent with minimal, high-signal prompts following "right altitude" principles: clear heuristics, not exhaustive rules.

## Process

1. Gather agent requirements
2. Create minimal system prompt (~500 tokens)
3. Generate tool definitions with clear purpose
4. Create project context file (CLAUDE.md pattern)
5. Build comprehensive test suite (8 essential tests)
6. Register and validate

## Core Principles

- **Single agent + tools > multi-agent** for coding tasks (Anthropic research: code is sequential, not parallelizable)
- **Minimal prompts at "right altitude"** - clear heuristics with examples, not edge case lists
- **Just-in-time context** - tools load context on demand, not pre-loaded
- **Examples > rules** - show one canonical example, not 20 scenarios
- **Measure outcomes** - does it solve the task? Not "did it follow exact steps?"

## Step 1: Gather Requirements

Ask the user for:

- Agent name (e.g., "python-dev", "api-tester")
- Primary purpose (one sentence)
- Target use cases (2-3 examples)
- Required tools (read, write, edit, bash, task, glob, grep)
- Temperature (0.1-0.3 for precise, 0.5-0.7 for creative)
- Will it delegate? (Use sparingly - only for truly independent tasks)

## Step 2: Create Minimal System Prompt

Create `.opencode/agent/{agent-name}.md` with a ~500 token system prompt:

```markdown
---
description: "{one-line purpose}"
mode: primary
temperature: 0.1-0.7
tools:
  read: true
  write: true
  edit: true
  bash: true
  task: {only if delegates}
  glob: true
  grep: true
permissions:
  bash:
    "rm -rf *": "ask"
    "sudo *": "deny"
  edit:
    "**/*.env*": "deny"
    "**/*.key": "deny"
---

# {Agent Name}

{Clear, concise role - what this agent does}

## Workflow

1. {First step - usually read/understand}
2. {Second step - usually think/plan}
3. {Third step - usually implement/execute}
4. {Fourth step - usually verify/test}
5. {Fifth step - usually complete/handoff}

## Guidelines

- {Key heuristic 1 - how to approach problems}
- {Key heuristic 2 - when to use tools}
- {Key heuristic 3 - how to verify work}
- {Key heuristic 4 - when to stop/report}

## Output

Always include:

- What you did
- Why you did it that way
- {Domain-specific output requirement}

## Example

**User**: "{typical request}"

**Agent**:
1. {Step 1 with tool usage}
2. {Step 2 with reasoning}
3. {Step 3 with output}

**Result**: {Expected outcome}
```

**Key principles**:

- Keep the system prompt minimal (~500 tokens)
- Use clear heuristics, not exhaustive rules
- Show ONE canonical example, not 20 scenarios
- Focus on "right altitude" - not too vague, not too rigid

## Step 3: Generate Tool Definitions

For each tool the agent uses, add clear definitions:

```markdown
**Purpose**: Load specific file for analysis or modification
**When to use**: You need to examine or edit a file
**When NOT to use**: You already have the file content in context

**Purpose**: Execute test suite and report failures
**When to use**: After making code changes, before committing
**When NOT to use**: No code changes made yet
```

**Research finding**: Tool ambiguity is a major failure mode. Be explicit about:

- Purpose of each tool
- When to use vs. when NOT to use
- Expected output format
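For example, a fully specified definition for the `grep` tool, including the expected output format, might look like the sketch below (the wording and output format are illustrative, not a required convention):

```markdown
### grep
**Purpose**: Locate symbols or patterns across the repository
**When to use**: You know roughly what you are looking for but not which file it lives in
**When NOT to use**: You already know the exact file path (use read instead)
**Output format**: `{path}:{line}: {matching line}`, one match per line
```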
## Step 4: Create Project Context File

Create `.opencode/context/project/{agent-name}-context.md` (CLAUDE.md pattern):

```markdown
# {Agent Name} Context

## Key Commands
- {command 1}: {what it does}
- {command 2}: {what it does}
- {command 3}: {what it does}

## File Structure
- {path pattern}: {what goes here}
- {path pattern}: {what goes here}

## Code Style
- {style rule 1}
- {style rule 2}
- {style rule 3}

## Workflow Rules
- {workflow rule 1}
- {workflow rule 2}
- {workflow rule 3}

## Common Patterns
- {pattern 1}: {when to use}
- {pattern 2}: {when to use}
```

**Research finding**: Single context file loaded on-demand beats pre-loading entire codebase. This file:

- Eliminates repetitive context-loading
- Can be checked into git (shared across team)
- Tuned like any prompt (run through prompt improvers)

## Step 5: Build Test Suite

Generate 8 comprehensive tests in `evals/agents/{agent-name}/tests/`:

**Test 1: Planning & Approval** (`planning/planning-approval-001.yaml`)
- Verify agent creates plan before implementation
- Check for approval request
- Ensure no execution without approval

**Test 2: Context Loading** (`context-loading/context-before-code-001.yaml`)
- Verify loads context files first
- Check context applied before code
- Ensure just-in-time retrieval works

**Test 3: Incremental Implementation** (`implementation/incremental-001.yaml`)
- Verify one step at a time
- Check validation after each step
- Ensure no batch implementation

**Test 4: Tool Usage** (`implementation/tool-usage-001.yaml`)
- Verify correct tool selection
- Check tool usage follows definitions
- Ensure parallel tool calls when appropriate

**Test 5: Error Handling** (`error-handling/stop-on-failure-001.yaml`)
- Verify stops on error
- Check reports issue first
- Ensure no auto-fix without understanding

**Test 6: Extended Thinking** (`implementation/extended-thinking-001.yaml`)
- Verify uses thinking for complex tasks
- Check decomposition before coding
- Ensure proper effort budgeting

**Test 7: Compaction** (`long-horizon/compaction-001.yaml`)
- Verify summarizes when context fills
- Check preserves critical info
- Ensure discards redundant outputs

**Test 8: Completion** (`completion/handoff-001.yaml`)
- Verify provides clear output
- Check includes what/why/results
- Ensure proper handoff format

Create config: `evals/agents/{agent-name}/config/config.yaml`

```yaml
agent: {agent-name}
description: {description}
defaults:
  model: anthropic/claude-sonnet-4-5
  timeout: 60000
approvalStrategy:
  type: auto-approve
testPaths:
  - tests/planning
  - tests/context-loading
  - tests/implementation
  - tests/error-handling
  - tests/long-horizon
  - tests/completion
expectations:
  requiresTextApproval: true
  usesToolPermissions: true
  loadsContextOnDemand: true
```

## Step 6: Register and Validate

1. Register in `registry.json`:

```json
{
  "name": "{agent-name}",
  "type": "agent",
  "path": ".opencode/agent/{agent-name}.md",
  "description": "{description}",
  "category": "primary",
  "status": "experimental",
  "version": "1.0.0",
  "maintainer": "{maintainer}",
  "tested_with": "anthropic/claude-sonnet-4-5",
  "last_tested": "{date}",
  "tags": ["{tag1}", "{tag2}"]
}
```

2. Validate structure:
- Check YAML frontmatter valid
- Verify system prompt ~500 tokens
- Ensure tools have clear definitions
- Validate context file exists

3. Run tests:

```bash
cd evals/framework
npm test -- --agent={agent-name}
```

4. Measure what matters:
- Does it solve the task? ✓
- Token usage reasonable? ✓
- Tool calls appropriate? ✓
- NOT: "Did it follow exact steps I imagined?"
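For orientation, one of the Step 5 test cases might look roughly like the sketch below. The actual schema is defined by the eval framework in `evals/framework` and is not shown here, so field names such as `prompt` and `expectations` are illustrative assumptions rather than the framework's confirmed format:

```yaml
# evals/agents/{agent-name}/tests/planning/planning-approval-001.yaml
# Illustrative sketch only - consult the eval framework for the real schema
name: planning-approval-001
description: Agent must present a plan and wait for approval before editing files
prompt: "Add input validation to the signup form"
expectations:
  - type: plan-before-implementation     # a plan appears before any code change
  - type: approval-requested             # the agent explicitly asks for approval
  - type: no-tool-calls-before-approval
    tools: [edit, write, bash]           # mutating tools stay unused until approved
timeout: 60000
```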
## Final Output

Present the complete package:

## ✅ Agent Created: {agent-name}

### Files Created
- `.opencode/agent/{agent-name}.md` - Minimal system prompt (~500 tokens)
- `.opencode/context/project/{agent-name}-context.md` - Project context (CLAUDE.md pattern)
- `evals/agents/{agent-name}/config/config.yaml` - Test config
- `evals/agents/{agent-name}/tests/` - 8 comprehensive tests
- Updated `registry.json`

### Research-Backed Principles Applied
✅ **Single agent + tools** (not multi-agent for coding)
✅ **Minimal prompt at "right altitude"** (~500 tokens)
✅ **Just-in-time context loading** (not pre-loaded)
✅ **Clear tool definitions** (purpose, when to use, when not to use)
✅ **Examples > rules** (one canonical example)
✅ **Outcome-focused testing** (does it solve the task?)

### Test Coverage
- ✅ Planning & Approval
- ✅ Context Loading
- ✅ Incremental Implementation
- ✅ Tool Usage
- ✅ Error Handling
- ✅ Extended Thinking
- ✅ Compaction
- ✅ Completion

**Total**: 8/8 tests

### Next Steps
1. Test with real use cases
2. Measure: Does it solve the task?
3. Iterate based on actual failures (not synthetic tests)
4. Update status to "stable" when proven

### Usage
```bash
# Use this agent
opencode --agent={agent-name}

# Run tests
cd evals/framework && npm test -- --agent={agent-name}
```

## Research Findings Applied

### Single Agent + Tools vs. Multi-Agent

**Finding**: "Most coding tasks involve fewer truly parallelizable tasks than research" (Anthropic 2025)

**Application**:
- Use ONE lead agent with tool-based sub-functions, NOT autonomous sub-agents for coding
- Multi-agent only for truly independent tasks (static analysis, test execution, code search)
- Code changes are deeply dependent - sub-agents can't coordinate edits to the same file

### Minimal Prompts at the "Right Altitude"

**Finding**: "Find the smallest possible set of high-signal tokens that maximize likelihood of desired outcome"

**Application**:
- System prompt: minimal (~500 tokens)
- Clear heuristics, not exhaustive rules
- Examples > edge case lists
- Show ONE canonical example, not 20 scenarios

**Balance**:
- Too vague: "Write good code" ❌
- Right altitude: clear heuristics + examples ✅
- Too rigid: 50-line prompt with edge cases ❌

### Just-in-Time Context

**Finding**: "Agents discover context layer by layer. File metadata guides behavior. Prevents drowning in irrelevant information"

**Application**:
- Tools load context on demand (not pre-loaded)
- File metadata (size, name, timestamps) guides behavior
- Working memory: keep only what's needed for the current task
- CLAUDE.md pattern: single context file loaded on demand

### Clear Tool Definitions

**Finding**: "Tool ambiguity is one of the biggest failure modes"

**Application**:
- Explicit purpose for each tool
- When to use vs. when NOT to use
- Expected output format
- If a human can't definitively say which tool to use, neither can the agent

### Extended Thinking

**Finding**: "Improved instruction-following and reasoning efficiency for complex decomposition"

**Application**:
- Before jumping to code, trigger extended thinking: "Think about how to approach this problem. What files need to change? What are the dependencies?"
- Phrases mapped to thinking budget:
  - "think" = basic
  - "think hard" = 2x budget
  - "think harder" = 3x budget

### Compaction and Memory

**Finding**: "When context approaches limit, summarize conversation. Preserve: architectural decisions, unresolved bugs, implementation details. Discard: redundant tool outputs"

**Application**:
- Agent writes notes to persistent memory (file-based):
  - Current task progress
  - Architectural decisions made
  - Critical dependencies
  - Next steps
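As a rough illustration, such a file-based memory note might look like the sketch below; the path and section layout are assumptions for the example, not a fixed format:

```markdown
<!-- e.g. .opencode/memory/{agent-name}-notes.md (illustrative location) -->
# Working Notes: add-rate-limiting

## Current Task Progress
- Middleware skeleton added; unit tests not yet written

## Architectural Decisions
- Token-bucket algorithm, limits applied per API key

## Critical Dependencies
- Redis client configuration in `src/config/redis.ts` (hypothetical path)

## Next Steps
- Write tests for bucket refill, then wire the middleware into the router
```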
### Parallel Tool Calls

**Finding**: "Parallel tool calling cut research time by up to 90% for complex queries"

**Application**:
- Design workflows where the agent can call multiple tools simultaneously
- Can do in parallel: run linter, execute tests, check type errors
- NOT in parallel: apply fix, then test (sequential)

### Token Usage

**Finding**: "Token usage explains 80% of performance variance. Number of tool calls ~10%. Model choice ~10%"

**Application**:
- Optimize for using enough tokens to solve the problem
- Don't minimize tool calls (some redundancy is fine)
- Measure: does it solve the task? Not "did it follow exact steps?"

## Best Practices

**Don't**:
- Create sub-agents for dependent tasks (code is sequential)
- Pre-load the entire codebase into context (use just-in-time retrieval)
- Write exhaustive edge case lists in prompts (brittle, hard to maintain)
- Give vague tool descriptions (major failure mode)
- Use multi-agent if you could use a single agent + tools
- Hardcode complex logic in prompts (use tools instead)
- Minimize tool calls (some redundancy is fine)

**Do**:
- Let agents discover context via tools
- Use examples instead of rules
- Keep the system prompt minimal (~500 tokens)
- Be explicit about effort budgets ("3-5 tool calls, not 50")
- Evaluate on real failure cases, not synthetic tests
- Measure outcomes: does it solve the task?

## Validation Checklist

- Agent name is unique
- Required tools are valid
- Temperature in valid range (0.0-1.0)
- System prompt ~500 tokens (not 2000+)
- Tools have clear definitions (purpose, when to use, when not to use)
- Context file exists (CLAUDE.md pattern)
- All 8 tests created
- Registry updated
- Tests pass on real use cases

## Key Reminders

- Apply Anthropic 2025 research findings
- ~500 tokens at "right altitude"
- Single agent + tools > multi-agent for coding
- Context loaded on demand, not pre-loaded
- Measure: does it solve the task?

## References

- Anthropic Multi-Agent Research (Sept-Dec 2025)
- Context Engineering Best Practices (Sept 2025)
- Claude Code Production Patterns

## Example Agents

- `.opencode/agent/core/opencoder.md` - Development specialist example
- `.opencode/agent/core/openagent.md` - Universal orchestrator example
- `.opencode/agent/development/frontend-specialist.md` - Category agent example