This document describes the context loading tests created to verify that OpenAgent correctly loads context files before responding to user queries and executing tasks.
Test Location: evals/agents/openagent/tests/context-loading/
Total Tests: 5 (3 simple, 2 complex multi-turn)
Run Date: 2025-11-26
Pass Rate: 3/5 (60%)
Total Duration: 430 seconds (~7 minutes)
| Test ID | Type | Status | Duration | Notes |
|---|---|---|---|---|
| ctx-simple-testing-approach | Simple | ✅ PASS | 35s | Loaded testing docs correctly |
| ctx-simple-documentation-format | Simple | ✅ PASS | 19s | Loaded docs.md correctly |
| ctx-simple-coding-standards | Simple | ✅ PASS | 20s | Loaded code.md correctly |
| ctx-multi-standards-to-docs | Complex | ❌ FAIL | 109s | No context loaded before execution |
| ctx-multi-error-handling-to-tests | Complex | ❌ FAIL | 246s | Timeout on prompt 4 |
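
The summary figures can be recomputed from the per-test rows. The sketch below is illustrative only: the `TestResult` shape and the `summarize` helper are hypothetical names, not part of the eval framework. The row durations sum to 429 seconds, which is consistent with the reported total of roughly 430 seconds.

```typescript
// Per-test results transcribed from the table above.
type TestStatus = "PASS" | "FAIL";

interface TestResult {
  id: string;
  status: TestStatus;
  durationSec: number;
}

const tableRows: TestResult[] = [
  { id: "ctx-simple-testing-approach", status: "PASS", durationSec: 35 },
  { id: "ctx-simple-documentation-format", status: "PASS", durationSec: 19 },
  { id: "ctx-simple-coding-standards", status: "PASS", durationSec: 20 },
  { id: "ctx-multi-standards-to-docs", status: "FAIL", durationSec: 109 },
  { id: "ctx-multi-error-handling-to-tests", status: "FAIL", durationSec: 246 },
];

// Hypothetical helper: derive pass count, pass rate, and total duration.
function summarize(rows: TestResult[]): {
  passed: number;
  total: number;
  passRatePct: number;
  totalSec: number;
} {
  const passed = rows.filter((r) => r.status === "PASS").length;
  const totalSec = rows.reduce((sum, r) => sum + r.durationSec, 0);
  const passRatePct =
    rows.length === 0 ? 0 : Math.round((passed / rows.length) * 100);
  return { passed, total: rows.length, passRatePct, totalSec };
}

const runSummary = summarize(tableRows);
// runSummary.passed === 3, runSummary.total === 5
// runSummary.passRatePct === 60, runSummary.totalSec === 429
```
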
ctx-simple-coding-standards.yaml
Prompt: "What are our coding standards for this project?"
Expected Behavior:
- Loads code.md or standards.md before responding
Result: ✅ PASSED
Context files read:
- .opencode/context/core/standards/code.md

ctx-simple-documentation-format.yaml
Prompt: "What format should I use for documentation in this project?"
Expected Behavior:
- Loads docs.md or documentation.md before responding
Result: ✅ PASSED
Context files read:
- .opencode/context/core/standards/docs.md

ctx-simple-testing-approach.yaml
Prompt: "What's our testing strategy for this project?"
Expected Behavior:
- Loads tests.md or testing.md before responding
Result: ✅ PASSED
Context files read:
- evals/HOW_TESTS_WORK.md
- evals/README.md
- evals/TESTING_CONFIDENCE.md
- evals/agents/AGENT_TESTING_GUIDE.md

ctx-multi-standards-to-docs.yaml
Scenario: Standards question → Documentation request → Format question
Turn 1: "What are our coding standards?"
- Expected context: standards.md or code.md
Turn 2: "Can you create documentation about these standards in evals/test_tmp/coding-standards-doc.md?"
- Expected context: docs.md (documentation format)
- Output location: evals/test_tmp/
Turn 3: "What will the documentation structure look like?"
Result: ❌ FAILED
Context files read:
- .opencode/context/core/standards/code.md (2x)
- .opencode/context/core/standards/docs.md (1x)
Files Created: evals/test_tmp/coding-standards-doc.md (cleaned up after test)

ctx-multi-error-handling-to-tests.yaml
Scenario: Error handling question → Test request → Coverage policy
Turn 1: "How should we handle errors in this project?"
- Expected context: standards.md or processes.md
Turn 2: "Can you write tests for error handling in evals/test_tmp/error-handling.test.ts?"
- Expected context: tests.md (testing standards)
- Output location: evals/test_tmp/
Turn 3: "What's our test coverage policy?"
Result: ❌ FAILED
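
The multi-turn scenarios above depend on ordering: a context file has to be read before the agent starts writing or executing. A minimal sketch of such an ordering check follows. It is not the framework's evaluator; the `ToolEvent` shape, the set of tools counted as execution, and the `.opencode/context/` prefix test are all assumptions made for illustration.

```typescript
// Hypothetical event shape; the real framework's event types may differ.
type ToolName = "read" | "write" | "edit" | "bash";

interface ToolEvent {
  tool: ToolName;
  path?: string;
  timestampMs: number;
}

// Tools treated as "execution" in this sketch (an assumption).
const EXECUTION_TOOLS: ToolName[] = ["write", "edit", "bash"];

function isContextFile(path: string | undefined, contextDir: string): boolean {
  return path !== undefined && path.indexOf(contextDir) === 0;
}

// Returns true when a context file was read before the first execution tool
// call. A run with no execution tool calls is treated as non-violating.
function contextLoadedBeforeExecution(
  toolEvents: ToolEvent[],
  contextDir: string = ".opencode/context/"
): boolean {
  // Sort a copy by timestamp so the caller's array is left untouched.
  const ordered = toolEvents.slice().sort((a, b) => a.timestampMs - b.timestampMs);
  let contextSeen = false;
  for (const toolEvent of ordered) {
    if (toolEvent.tool === "read" && isContextFile(toolEvent.path, contextDir)) {
      contextSeen = true;
    } else if (EXECUTION_TOOLS.indexOf(toolEvent.tool) !== -1) {
      return contextSeen;
    }
  }
  return true;
}
```

A check like this is sensitive to how timestamps are recorded: if a read and a write carry timestamps in the wrong order, it will report a violation even though the context was loaded.
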
✅ Cleanup System Working Correctly
Before Tests:
After Tests:
test_tmp/ contains only:
- .gitignore
- README.md
Cleanup Logic: evals/framework/src/sdk/run-sdk-tests.ts
- Preserves .gitignore and README.md
Simple Context Loading Works: All 3 simple tests passed
Cleanup System Reliable:
- test_tmp/ directory isolation working
Context File Discovery:
- .opencode/context/core/standards/
Multi-Turn Context Loading:
Timeout on Complex Tests:
False Positive Warning:
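
The cleanup behavior reported above (test_tmp/ ends up holding only .gitignore and README.md) can be pictured as a filter over the directory listing. The sketch below is a hypothetical illustration and does not reproduce the logic in run-sdk-tests.ts; `PRESERVED_ENTRIES` and `entriesToRemove` are made-up names.

```typescript
// Entries that should survive cleanup, per the report above.
const PRESERVED_ENTRIES: string[] = [".gitignore", "README.md"];

// Hypothetical helper: given the names found in test_tmp/, return the ones
// that a cleanup pass would delete.
function entriesToRemove(
  dirEntries: string[],
  preserved: string[] = PRESERVED_ENTRIES
): string[] {
  return dirEntries.filter((entry) => preserved.indexOf(entry) === -1);
}

const leftoverFiles = entriesToRemove([
  ".gitignore",
  "README.md",
  "coding-standards-doc.md",
  "error-handling.test.ts",
]);
// leftoverFiles is ["coding-standards-doc.md", "error-handling.test.ts"]
```

Keeping the decision in a pure function separates "what to delete" from the filesystem calls that perform the deletion, which makes the rule easy to test without touching disk.
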
Increase Timeout for Complex Tests
Fix Context Loading Evaluator
Simplify Complex Tests
Add More Edge Cases
Add Performance Metrics
Batch Test Execution
cd evals/framework
npm run eval:sdk -- --agent=openagent --pattern="context-loading/*.yaml"
npm run eval:sdk -- --agent=openagent --pattern="context-loading/ctx-simple-coding-standards.yaml"
npm run eval:sdk -- --agent=openagent --pattern="context-loading/*.yaml" --debug
cd ../results
./serve.sh
Each test follows this structure:
id: test-id
name: "Test Name"
description: |
  Detailed description of what the test validates
category: developer
agent: openagent
model: anthropic/claude-sonnet-4-5

# Single prompt OR multi-turn prompts
prompt: "Single prompt text"
# OR
prompts:
  - text: "First prompt"
    expectContext: true
    contextFile: "standards.md"
  - text: "approve"
    delayMs: 2000

# Expected behavior
behavior:
  mustUseTools: [read, write]
  requiresContext: true
  minToolCalls: 1

# Expected violations
expectedViolations:
  - rule: context-loading
    shouldViolate: false
    severity: error

# Approval strategy
approvalStrategy:
  type: auto-approve

timeout: 60000

tags:
  - context-loading
  - simple-test
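
The structure above allows either a single `prompt` or a list of `prompts`. One way to make that constraint explicit is a small type plus a validation function, sketched below. The interfaces cover only a subset of the fields shown, and both validation rules (exactly one prompt form; `expectContext` implies `contextFile`) are assumptions for illustration, not documented framework behavior.

```typescript
// Hypothetical types mirroring part of the YAML structure above.
interface PromptTurn {
  text: string;
  expectContext?: boolean;
  contextFile?: string;
  delayMs?: number;
}

interface TestDefinition {
  id: string;
  name: string;
  prompt?: string;
  prompts?: PromptTurn[];
  timeout?: number;
}

// Returns a list of problems; an empty list means the definition looks valid
// under the assumed rules.
function validatePrompts(def: TestDefinition): string[] {
  const errors: string[] = [];
  const hasSingle = typeof def.prompt === "string" && def.prompt.length > 0;
  const hasMulti = Array.isArray(def.prompts) && def.prompts.length > 0;

  if (hasSingle && hasMulti) {
    errors.push("specify either prompt or prompts, not both");
  }
  if (!hasSingle && !hasMulti) {
    errors.push("one of prompt or prompts is required");
  }
  (def.prompts || []).forEach((turn, i) => {
    if (turn.expectContext && !turn.contextFile) {
      errors.push("prompts[" + i + "]: expectContext is set but contextFile is missing");
    }
  });
  return errors;
}
```

Returning a list of messages rather than throwing on the first problem lets a test runner report every issue in a definition at once.
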
Last Updated: 2025-11-26
Test Framework Version: 0.1.0
OpenAgent Version: Latest
Next Review: After fixing context loading evaluator timing logic