
chore: cleanup documentation - remove temporary and redundant files

Cleanup Summary:
- Deleted .tmp/ directory (20+ temporary development files)
- Deleted tasks/ directory (old task tracking files)
- Removed 10 redundant/historical markdown files from evals/

Files Deleted:
- evals/DOCUMENTATION_CLEANUP.md (historical)
- evals/SCRIPTS_ORGANIZATION.md (historical)
- evals/agents/AGENT_TESTING_GUIDE.md (redundant)
- evals/agents/openagent/CONTEXT_LOADING_COVERAGE.md (historical)
- evals/agents/openagent/IMPLEMENTATION_SUMMARY.md (historical)
- evals/agents/opencoder/README.md (redundant)
- evals/agents/shared/README.md (redundant)
- evals/framework/scripts/README.md (redundant)
- evals/results/README.md (redundant)
- evals/test_tmp/README.md (redundant)

Result:
- Reduced from 120+ markdown files to 92 (excluding node_modules)
- Reduced evals/ from 20 to 10 markdown files
- Kept essential documentation:
  - evals/README.md (main guide)
  - evals/GETTING_STARTED.md (quick start)
  - evals/HOW_TESTS_WORK.md (technical details)
  - evals/ARCHITECTURE.md (system architecture)
  - evals/framework/README.md (framework docs)
  - evals/framework/SDK_EVAL_README.md (SDK guide)
  - evals/agents/openagent/README.md (test suite)
- Kept dev/ directory (important OpenCode internals documentation)

Benefits:
- Clearer documentation structure
- Less confusion about where to find information
- Removed outdated/historical files
- Easier to navigate and maintain
darrenhinde 4 months ago
parent
commit
f98a23c1d4

+ 0 - 273
evals/DOCUMENTATION_CLEANUP.md

@@ -1,273 +0,0 @@
-# Documentation Cleanup Summary
-
-**Date**: 2025-11-26  
-**Status**: ✅ Complete
-
----
-
-## Changes Made
-
-### Files Deleted (3)
-
-1. **`evals/framework/SESSION_STORAGE_FIX.md`** (173 lines)
-   - **Reason**: Historical fix documentation, no longer relevant
-   - **Status**: ✅ Deleted
-
-2. **`evals/TESTING_CONFIDENCE.md`** (121 lines)
-   - **Reason**: Outdated, superseded by IMPLEMENTATION_SUMMARY.md
-   - **Content**: Old test confidence assessment from before context loading fixes
-   - **Status**: ✅ Deleted
-
-3. **`evals/agents/openagent/TEST_REVIEW.md`** (325 lines)
-   - **Reason**: Outdated test review from Nov 25 (before context loading fixes)
-   - **Content**: Old test results, superseded by CONTEXT_LOADING_COVERAGE.md and IMPLEMENTATION_SUMMARY.md
-   - **Status**: ✅ Deleted
-
-### Files Renamed (1)
-
-1. **`evals/SYSTEM_REVIEW.md` → `evals/ARCHITECTURE.md`**
-   - **Reason**: More descriptive name for system architecture review
-   - **Content**: Comprehensive architecture review (456 lines)
-   - **Status**: ✅ Renamed
-
-### Files Created (2)
-
-1. **`evals/GETTING_STARTED.md`** (NEW - 450 lines)
-   - **Purpose**: Consolidated quick start guide
-   - **Content**: 
-     - Running tests
-     - Understanding results
-     - Creating new tests
-     - Debugging
-     - Common issues
-   - **Replaces**: Scattered information from README.md and HOW_TESTS_WORK.md
-   - **Status**: ✅ Created
-
-2. **`evals/DOCUMENTATION_CLEANUP.md`** (THIS FILE)
-   - **Purpose**: Track documentation cleanup changes
-   - **Status**: ✅ Created
-
-### Files Updated (3)
-
-1. **`evals/README.md`** (322 → 280 lines)
-   - **Changes**:
-     - More concise overview
-     - Points to GETTING_STARTED.md for details
-     - Updated with recent achievements (Nov 26)
-     - Added context loading tests section
-     - Added smart timeout system section
-     - Updated test coverage numbers
-   - **Status**: ✅ Updated
-
-2. **`evals/agents/openagent/README.md`** (85 → 350 lines)
-   - **Changes**:
-     - Comprehensive test coverage section
-     - Detailed context loading tests documentation
-     - Test structure overview
-     - Running instructions
-     - Test design examples
-     - Troubleshooting section
-   - **Status**: ✅ Updated
-
-3. **`evals/HOW_TESTS_WORK.md`** (308 lines)
-   - **Changes**: None (kept as-is for detailed technical reference)
-   - **Status**: ✅ Kept
-
----
-
-## Documentation Structure (After Cleanup)
-
-### Top-Level Documentation
-
-```
-evals/
-├── README.md                     # System overview (UPDATED)
-├── GETTING_STARTED.md            # Quick start guide (NEW)
-├── HOW_TESTS_WORK.md             # Detailed test execution guide
-├── ARCHITECTURE.md               # System architecture review (RENAMED)
-└── DOCUMENTATION_CLEANUP.md      # This file (NEW)
-```
-
-### Framework Documentation
-
-```
-evals/framework/
-├── README.md                     # Framework documentation
-├── SDK_EVAL_README.md            # Complete SDK guide
-├── docs/
-│   ├── architecture-overview.md # Framework architecture
-│   └── test-design-guide.md     # Test design philosophy
-└── run-tests-batch.sh            # Batch test runner
-```
-
-### Agent Documentation
-
-```
-evals/agents/openagent/
-├── README.md                     # OpenAgent test suite (UPDATED)
-├── CONTEXT_LOADING_COVERAGE.md   # Context loading tests
-├── IMPLEMENTATION_SUMMARY.md     # Recent implementation
-└── docs/
-    └── OPENAGENT_RULES.md        # OpenAgent rules reference
-```
-
-### Results Documentation
-
-```
-evals/results/
-├── README.md                     # Results dashboard guide
-├── index.html                    # Interactive dashboard
-└── serve.sh                      # One-command server
-```
-
----
-
-## Documentation Flow
-
-### For New Users
-
-1. **Start**: `README.md` - System overview
-2. **Next**: `GETTING_STARTED.md` - Quick start guide
-3. **Then**: Run tests and view results
-4. **Deep Dive**: `HOW_TESTS_WORK.md` - Detailed explanations
-
-### For Test Authors
-
-1. **Start**: `GETTING_STARTED.md` - Creating tests section
-2. **Reference**: `framework/docs/test-design-guide.md` - Design philosophy
-3. **Examples**: `agents/openagent/README.md` - Test examples
-4. **Rules**: `agents/openagent/docs/OPENAGENT_RULES.md` - Agent rules
-
-### For Developers
-
-1. **Start**: `ARCHITECTURE.md` - System architecture
-2. **Framework**: `framework/SDK_EVAL_README.md` - Complete SDK guide
-3. **Implementation**: `agents/openagent/IMPLEMENTATION_SUMMARY.md` - Recent changes
-4. **Technical**: `HOW_TESTS_WORK.md` - Execution details
-
----
-
-## Benefits of Cleanup
-
-### Before Cleanup
-
-- ❌ 19 markdown files (excluding node_modules)
-- ❌ Outdated information (Nov 25 test reviews)
-- ❌ Duplicate content (testing confidence in multiple places)
-- ❌ Unclear entry point for new users
-- ❌ Historical fix documentation cluttering framework/
-
-### After Cleanup
-
-- ✅ 16 markdown files (3 deleted, 2 new, net -1)
-- ✅ All information current (Nov 26)
-- ✅ No duplicate content
-- ✅ Clear entry point (GETTING_STARTED.md)
-- ✅ Clean framework directory
-- ✅ Better organization
-
----
-
-## Documentation Quality Metrics
-
-### Coverage
-
-| Audience | Documentation | Status |
-|----------|---------------|--------|
-| New Users | GETTING_STARTED.md | ✅ Complete |
-| Test Authors | test-design-guide.md | ✅ Complete |
-| Developers | ARCHITECTURE.md | ✅ Complete |
-| OpenAgent Users | agents/openagent/README.md | ✅ Complete |
-| Results Users | results/README.md | ✅ Complete |
-
-### Accuracy
-
-| Document | Last Updated | Accuracy |
-|----------|--------------|----------|
-| README.md | 2025-11-26 | ✅ Current |
-| GETTING_STARTED.md | 2025-11-26 | ✅ Current |
-| HOW_TESTS_WORK.md | 2025-11-26 | ✅ Current |
-| ARCHITECTURE.md | 2025-11-26 | ✅ Current |
-| agents/openagent/README.md | 2025-11-26 | ✅ Current |
-| CONTEXT_LOADING_COVERAGE.md | 2025-11-26 | ✅ Current |
-| IMPLEMENTATION_SUMMARY.md | 2025-11-26 | ✅ Current |
-
-### Maintainability
-
-- ✅ Clear naming conventions
-- ✅ Logical organization
-- ✅ No duplicate content
-- ✅ Cross-references between docs
-- ✅ Easy to find information
-- ✅ Easy to update
-
----
-
-## Maintenance Guidelines
-
-### When to Update Documentation
-
-1. **After Major Features**
-   - Update README.md with new features
-   - Update GETTING_STARTED.md with new usage examples
-   - Create/update implementation summaries
-
-2. **After Bug Fixes**
-   - Update relevant documentation
-   - Add to troubleshooting sections if needed
-
-3. **Monthly Review**
-   - Check for outdated information
-   - Update test coverage numbers
-   - Review and consolidate if needed
-
-### What to Delete
-
-- Historical fix documentation (after 3 months)
-- Outdated test reviews (superseded by new ones)
-- Duplicate content (consolidate instead)
-- Temporary investigation notes
-
-### What to Keep
-
-- Architecture documentation
-- Test design guides
-- Getting started guides
-- Current implementation summaries
-- Troubleshooting guides
-
----
-
-## Next Review
-
-**Scheduled**: 2025-12-26 (1 month)
-
-**Review Checklist**:
-- [ ] Check for outdated information
-- [ ] Update test coverage numbers
-- [ ] Review new features added
-- [ ] Check for duplicate content
-- [ ] Verify all links work
-- [ ] Update "Last Updated" dates
-
----
-
-## Summary
-
-✅ **3 files deleted** (outdated/duplicate content)  
-✅ **1 file renamed** (better clarity)  
-✅ **2 files created** (better organization)  
-✅ **3 files updated** (current information)  
-✅ **Net result**: Cleaner, more organized, more maintainable documentation
-
-**Documentation is now**:
-- Current (all Nov 26, 2025)
-- Well-organized (clear structure)
-- Easy to navigate (clear entry points)
-- Comprehensive (covers all audiences)
-- Maintainable (no duplicates, clear guidelines)
-
----
-
-**Cleanup Completed**: 2025-11-26  
-**Next Review**: 2025-12-26

+ 0 - 367
evals/SCRIPTS_ORGANIZATION.md

@@ -1,367 +0,0 @@
-# Scripts Organization Summary
-
-**Date**: 2025-11-26  
-**Status**: ✅ Complete
-
----
-
-## Changes Made
-
-### Before Organization
-
-```
-evals/framework/
-├── check-agent.mjs
-├── debug-claude-session.mjs
-├── debug-session.mjs
-├── debug-session.ts
-├── inspect-session.mjs
-├── run-tests-batch.sh
-├── test-agent-direct.ts
-├── test-event-inspector.js
-├── test-session-reader.mjs
-├── test-simplified-approach.mjs
-├── test-timeline.ts
-├── verify-timeline.ts
-└── ... (other framework files)
-```
-
-**Issues**:
-- ❌ 12 scripts cluttering framework root
-- ❌ No clear organization
-- ❌ Hard to find specific scripts
-- ❌ Unclear which scripts are for what purpose
-
----
-
-### After Organization
-
-```
-evals/framework/
-├── scripts/
-│   ├── debug/                    # Debugging scripts (4 files)
-│   │   ├── debug-session.mjs
-│   │   ├── debug-session.ts
-│   │   ├── debug-claude-session.mjs
-│   │   └── inspect-session.mjs
-│   │
-│   ├── test/                     # Test scripts (6 files)
-│   │   ├── test-agent-direct.ts
-│   │   ├── test-event-inspector.js
-│   │   ├── test-session-reader.mjs
-│   │   ├── test-simplified-approach.mjs
-│   │   ├── test-timeline.ts
-│   │   └── verify-timeline.ts
-│   │
-│   ├── utils/                    # Utility scripts (2 files)
-│   │   ├── run-tests-batch.sh
-│   │   └── check-agent.mjs
-│   │
-│   └── README.md                 # Script documentation
-│
-└── ... (other framework files)
-```
-
-**Benefits**:
-- ✅ Clean framework root
-- ✅ Clear organization by purpose
-- ✅ Easy to find scripts
-- ✅ Comprehensive documentation
-
----
-
-## Script Categories
-
-### Debug Scripts (4 files)
-
-Scripts for debugging sessions, events, and agent behavior.
-
-| Script | Purpose | Lines |
-|--------|---------|-------|
-| `debug-session.mjs` | Debug session data and timeline | ~40 |
-| `debug-session.ts` | TypeScript version of session debugger | ~100 |
-| `debug-claude-session.mjs` | Debug Claude-specific sessions | ~50 |
-| `inspect-session.mjs` | Inspect most recent session events | ~80 |
-
-**Usage**:
-```bash
-node scripts/debug/inspect-session.mjs
-node scripts/debug/debug-session.mjs <session-id>
-npx tsx scripts/debug/debug-session.ts <session-id>
-```
-
----
-
-### Test Scripts (6 files)
-
-Scripts for testing framework components during development.
-
-| Script | Purpose | Lines |
-|--------|---------|-------|
-| `test-agent-direct.ts` | Direct agent execution test | ~150 |
-| `test-event-inspector.js` | Test event capture system | ~40 |
-| `test-session-reader.mjs` | Test session reader | ~60 |
-| `test-simplified-approach.mjs` | Exercise the simplified testing approach | ~100 |
-| `test-timeline.ts` | Test timeline builder | ~90 |
-| `verify-timeline.ts` | Verify timeline accuracy | ~100 |
-
-**Usage**:
-```bash
-npx tsx scripts/test/test-agent-direct.ts
-node scripts/test/test-event-inspector.js
-npx tsx scripts/test/verify-timeline.ts
-```
-
----
-
-### Utility Scripts (2 files)
-
-General utility scripts for running tests and managing the framework.
-
-| Script | Purpose | Lines |
-|--------|---------|-------|
-| `run-tests-batch.sh` | Run tests in batches | ~100 |
-| `check-agent.mjs` | Check agent availability | ~30 |
-
-**Usage**:
-```bash
-./scripts/utils/run-tests-batch.sh openagent 3 10
-node scripts/utils/check-agent.mjs
-```
-
----
-
-## Documentation Updates
-
-### Files Updated
-
-1. **`evals/README.md`**
-   - Updated `run-tests-batch.sh` path references
-   - Updated directory structure
-
-2. **`evals/GETTING_STARTED.md`**
-   - Updated batch execution examples
-   - Updated script paths
-
-3. **`evals/agents/openagent/README.md`**
-   - Updated batch execution examples
-   - Updated script paths
-
-4. **`evals/agents/openagent/IMPLEMENTATION_SUMMARY.md`**
-   - Updated script references
-   - Updated directory structure
-
-5. **`evals/DOCUMENTATION_CLEANUP.md`**
-   - Updated directory structure
-
-6. **`evals/framework/README.md`**
-   - Added scripts section
-   - Added quick examples
-
-### New Documentation
-
-1. **`evals/framework/scripts/README.md`** (NEW - 200 lines)
-   - Comprehensive script documentation
-   - Usage examples for all scripts
-   - Development workflow guide
-   - Script templates
-
----
-
-## Path Changes
-
-### Old Paths → New Paths
-
-| Old Path | New Path |
-|----------|----------|
-| `run-tests-batch.sh` | `scripts/utils/run-tests-batch.sh` |
-| `check-agent.mjs` | `scripts/utils/check-agent.mjs` |
-| `debug-session.mjs` | `scripts/debug/debug-session.mjs` |
-| `debug-session.ts` | `scripts/debug/debug-session.ts` |
-| `debug-claude-session.mjs` | `scripts/debug/debug-claude-session.mjs` |
-| `inspect-session.mjs` | `scripts/debug/inspect-session.mjs` |
-| `test-agent-direct.ts` | `scripts/test/test-agent-direct.ts` |
-| `test-event-inspector.js` | `scripts/test/test-event-inspector.js` |
-| `test-session-reader.mjs` | `scripts/test/test-session-reader.mjs` |
-| `test-simplified-approach.mjs` | `scripts/test/test-simplified-approach.mjs` |
-| `test-timeline.ts` | `scripts/test/test-timeline.ts` |
-| `verify-timeline.ts` | `scripts/test/verify-timeline.ts` |
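
The path table above can also be expressed as a lookup, which may help when rewriting references in custom scripts. This is a minimal sketch: the file names and categories are copied from the table, while `migrateScriptPath` is a hypothetical helper that does not exist in the framework.

```typescript
// Category for each script, copied from the "Old Paths → New Paths" table.
type ScriptCategory = "debug" | "test" | "utils";

const SCRIPT_CATEGORIES = new Map<string, ScriptCategory>([
  ["run-tests-batch.sh", "utils"],
  ["check-agent.mjs", "utils"],
  ["debug-session.mjs", "debug"],
  ["debug-session.ts", "debug"],
  ["debug-claude-session.mjs", "debug"],
  ["inspect-session.mjs", "debug"],
  ["test-agent-direct.ts", "test"],
  ["test-event-inspector.js", "test"],
  ["test-session-reader.mjs", "test"],
  ["test-simplified-approach.mjs", "test"],
  ["test-timeline.ts", "test"],
  ["verify-timeline.ts", "test"],
]);

// Hypothetical helper: maps an old framework-root path to its new location.
// Returns undefined for names that are not in the table.
function migrateScriptPath(oldPath: string): string | undefined {
  const name = oldPath.replace(/^\.\//, ""); // tolerate a leading "./"
  const category = SCRIPT_CATEGORIES.get(name);
  return category === undefined ? undefined : `scripts/${category}/${name}`;
}

console.log(migrateScriptPath("./run-tests-batch.sh"));
```

Unknown names return `undefined` rather than a guessed path, so callers can leave unrelated references untouched.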
-
----
-
-## Migration Guide
-
-### For Users
-
-If you have scripts or documentation referencing the old paths:
-
-```bash
-# Old
-./run-tests-batch.sh openagent 3 10
-
-# New
-./scripts/utils/run-tests-batch.sh openagent 3 10
-```
-
-### For Developers
-
-If you have custom scripts importing from these files:
-
-```javascript
-// Old
-import { SessionReader } from './dist/collector/session-reader.js';
-
-// New (from scripts directory)
-// (the script now lives two levels below the framework root, e.g. scripts/debug/)
-import { SessionReader } from '../../dist/collector/session-reader.js';
-```
-
----
-
-## Benefits
-
-### Organization
-
-- ✅ **Clear structure** - Scripts grouped by purpose
-- ✅ **Easy navigation** - Know where to find scripts
-- ✅ **Clean root** - Framework root no longer cluttered
-- ✅ **Scalable** - Easy to add new scripts
-
-### Documentation
-
-- ✅ **Comprehensive README** - All scripts documented
-- ✅ **Usage examples** - Clear examples for each script
-- ✅ **Development workflow** - Guide for using scripts
-- ✅ **Templates** - Easy to create new scripts
-
-### Maintainability
-
-- ✅ **Easier to maintain** - Clear organization
-- ✅ **Easier to find** - Logical grouping
-- ✅ **Easier to update** - Centralized documentation
-- ✅ **Easier to extend** - Clear patterns
-
----
-
-## Statistics
-
-### Before
-
-- **Total scripts**: 12
-- **In framework root**: 12
-- **Organized**: 0
-- **Documented**: Minimal
-
-### After
-
-- **Total scripts**: 12 (same)
-- **In framework root**: 0
-- **Organized**: 12 (100%)
-- **Documented**: Comprehensive (200+ lines)
-
-### File Count
-
-- **Debug scripts**: 4
-- **Test scripts**: 6
-- **Utility scripts**: 2
-- **Documentation**: 1 (README.md)
-- **Total**: 13 files (12 scripts + 1 doc)
-
----
-
-## Maintenance Guidelines
-
-### Adding New Scripts
-
-1. **Determine category**:
-   - Debug? → `scripts/debug/`
-   - Test? → `scripts/test/`
-   - Utility? → `scripts/utils/`
-
-2. **Create script** in appropriate directory
-
-3. **Update `scripts/README.md`**:
-   - Add to table
-   - Add usage example
-
-4. **Test the script**:
-   ```bash
-   npm run build
-   node scripts/debug/my-script.mjs
-   ```
-
-### Removing Obsolete Scripts
-
-1. **Delete the script file**
-
-2. **Update `scripts/README.md`**:
-   - Remove from table
-   - Remove usage example
-
-3. **Check for references**:
-   ```bash
-   rg "my-script" --type md
-   ```
-
-### Updating Scripts
-
-1. **Make changes to script**
-
-2. **Test changes**:
-   ```bash
-   npm run build
-   node scripts/debug/my-script.mjs
-   ```
-
-3. **Update documentation** if usage changed
-
----
-
-## Next Steps
-
-### Immediate
-
-- ✅ Scripts organized
-- ✅ Documentation updated
-- ✅ References updated
-- ✅ README created
-
-### Future Enhancements
-
-1. **Add more debug scripts**
-   - Session comparison tool
-   - Event diff tool
-   - Performance profiler
-
-2. **Add more test scripts**
-   - Integration test runner
-   - Performance benchmarks
-   - Stress tests
-
-3. **Add more utilities**
-   - Test result analyzer
-   - Coverage reporter
-   - Cleanup utilities
-
----
-
-## Summary
-
-✅ **12 scripts organized** into 3 categories  
-✅ **Framework root cleaned** (0 scripts remaining)  
-✅ **Comprehensive documentation** (200+ lines)  
-✅ **All references updated** (6 files)  
-✅ **Clear structure** for future additions
-
-**Organization is now**:
-- Clean and organized
-- Well-documented
-- Easy to navigate
-- Easy to maintain
-- Easy to extend
-
----
-
-**Organization Completed**: 2025-11-26  
-**Scripts Organized**: 12  
-**Documentation Created**: 1 README (200+ lines)  
-**Files Updated**: 6

+ 0 - 417
evals/agents/AGENT_TESTING_GUIDE.md

@@ -1,417 +0,0 @@
-# Agent Testing Guide - Agent-Agnostic Architecture
-
-## Overview
-
-Our evaluation framework is designed to be **agent-agnostic**, making it easy to test multiple agents with the same infrastructure.
-
----
-
-## Architecture Layers
-
-### **Layer 1: Framework (Agent-Agnostic)**
-```
-evals/framework/
-├── src/
-│   ├── sdk/              # Test runner (works with any agent)
-│   ├── evaluators/       # Generic behavior checks
-│   └── types/            # Shared types
-```
-
-**Purpose:** Shared infrastructure that works with **any agent**
-
-**Key Components:**
-- `TestRunner` - Executes tests for any agent
-- `Evaluators` - Check generic behaviors (approval, context, tools)
-- `EventStreamHandler` - Captures events from any agent
-- `TestCaseSchema` - Universal test format
-
----
-
-### **Layer 2: Agent-Specific Tests**
-```
-evals/agents/
-├── openagent/           # OpenAgent-specific tests
-│   ├── tests/
-│   └── docs/
-├── opencoder/           # OpenCoder-specific tests (future)
-│   ├── tests/
-│   └── docs/
-└── shared/              # Tests for ANY agent
-    └── tests/
-```
-
-**Purpose:** Organize tests by agent for easy management
-
----
-
-## Directory Structure
-
-```
-evals/
-├── framework/                          # SHARED FRAMEWORK
-│   ├── src/
-│   │   ├── sdk/
-│   │   │   ├── test-runner.ts         # Reads 'agent' field from YAML
-│   │   │   ├── client-manager.ts      # Routes to correct agent
-│   │   │   └── test-case-schema.ts    # Universal schema
-│   │   └── evaluators/
-│   │       ├── approval-gate-evaluator.ts    # Works for any agent
-│   │       ├── context-loading-evaluator.ts  # Works for any agent
-│   │       └── tool-usage-evaluator.ts       # Works for any agent
-│   └── package.json
-│
-├── agents/
-│   ├── openagent/                      # OPENAGENT TESTS
-│   │   ├── tests/
-│   │   │   ├── developer/
-│   │   │   │   ├── task-simple-001.yaml      # agent: openagent
-│   │   │   │   ├── ctx-code-001.yaml         # agent: openagent
-│   │   │   │   └── ctx-docs-001.yaml         # agent: openagent
-│   │   │   ├── business/
-│   │   │   │   └── conv-simple-001.yaml      # agent: openagent
-│   │   │   └── edge-case/
-│   │   │       └── fail-stop-001.yaml        # agent: openagent
-│   │   └── docs/
-│   │       └── OPENAGENT_RULES.md            # OpenAgent-specific rules
-│   │
-│   ├── opencoder/                      # OPENCODER TESTS (future)
-│   │   ├── tests/
-│   │   │   ├── developer/
-│   │   │   │   ├── refactor-code-001.yaml    # agent: opencoder
-│   │   │   │   └── optimize-perf-001.yaml    # agent: opencoder
-│   │   └── docs/
-│   │       └── OPENCODER_RULES.md            # OpenCoder-specific rules
-│   │
-│   └── shared/                         # SHARED TESTS (any agent)
-│       ├── tests/
-│       │   └── common/
-│       │       ├── approval-gate-basic.yaml  # agent: ${AGENT}
-│       │       └── tool-usage-basic.yaml     # agent: ${AGENT}
-│       └── README.md
-│
-└── README.md
-```
-
----
-
-## How Agent Selection Works
-
-### **1. Test Specifies Agent**
-
-```yaml
-# openagent/tests/developer/task-simple-001.yaml
-id: task-simple-001
-name: Simple Bash Execution
-agent: openagent              # ← Specifies which agent to test
-prompt: "Run npm install"
-```
-
-### **2. Test Runner Routes to Agent**
-
-```typescript
-// framework/src/sdk/test-runner.ts
-async runTest(testCase: TestCase) {
-  // Get agent from test case
-  const agent = testCase.agent || 'openagent';
-  
-  // Route to specified agent
-  const result = await this.clientManager.sendPrompt(
-    sessionId,
-    testCase.prompt,
-    { agent }  // ← SDK routes to correct agent
-  );
-}
-```
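
The routing snippet above falls back to `openagent` when a test omits the `agent` field, and the guide elsewhere describes shared tests as overridable with `--agent`. A minimal sketch of that resolution follows, assuming the override takes priority over the YAML field. `resolveAgent` and `TestCaseLike` are hypothetical names, not part of the framework.

```typescript
// Hypothetical minimal test case shape; the real TestCase type has more fields.
interface TestCaseLike {
  id: string;
  agent?: string;
}

const DEFAULT_AGENT = "openagent";

// Assumed precedence: CLI override > `agent` field in the YAML > default.
function resolveAgent(testCase: TestCaseLike, cliOverride?: string): string {
  if (cliOverride !== undefined && cliOverride.trim() !== "") {
    return cliOverride.trim();
  }
  // Same fallback as `testCase.agent || 'openagent'` in runTest.
  return testCase.agent || DEFAULT_AGENT;
}

console.log(resolveAgent({ id: "task-simple-001", agent: "openagent" }));
console.log(resolveAgent({ id: "approval-gate-basic" }, "opencoder"));
```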
-
-### **3. Evaluators Check Generic Behaviors**
-
-```typescript
-// framework/src/evaluators/approval-gate-evaluator.ts
-export class ApprovalGateEvaluator extends BaseEvaluator {
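-  async evaluate(timeline: TimelineEvent[]) {
-    // Check if ANY agent asked for approval
-    // Works for openagent, opencoder, or any future agent
-    // Assumption: violations are collected locally and returned;
-    // the BaseEvaluator contract is not shown in this guide
-    const violations: Array<{ type: string; severity: string; message: string }> = [];
-
-    const approvalRequested = timeline.some(event =>
-      event.type === 'approval_request'
-    );
-
-    if (!approvalRequested) {
-      violations.push({
-        type: 'approval-gate-missing',
-        severity: 'error',
-        message: 'Agent executed without requesting approval'
-      });
-    }
-
-    return violations;
-  }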
-}
-```
-
----
-
-## Running Tests Per Agent
-
-### **Run All Tests for Specific Agent**
-
-```bash
-# Run ALL OpenAgent tests
-npm run eval:sdk -- --pattern="openagent/**/*.yaml"
-
-# Run ALL OpenCoder tests
-npm run eval:sdk -- --pattern="opencoder/**/*.yaml"
-```
-
-### **Run Specific Category**
-
-```bash
-# Run OpenAgent developer tests
-npm run eval:sdk -- --pattern="openagent/developer/*.yaml"
-
-# Run OpenCoder developer tests
-npm run eval:sdk -- --pattern="opencoder/developer/*.yaml"
-```
-
-### **Run Shared Tests for Different Agents**
-
-```bash
-# Run shared tests for OpenAgent
-npm run eval:sdk -- --pattern="shared/**/*.yaml" --agent=openagent
-
-# Run shared tests for OpenCoder
-npm run eval:sdk -- --pattern="shared/**/*.yaml" --agent=opencoder
-```
-
-### **Run Single Test**
-
-```bash
-# Run specific test
-npx tsx src/sdk/show-test-details.ts openagent/developer/task-simple-001.yaml
-```
-
----
-
-## Adding a New Agent
-
-### **Step 1: Create Agent Directory**
-
-```bash
-mkdir -p evals/agents/my-new-agent/tests/{developer,business,edge-case}
-mkdir -p evals/agents/my-new-agent/docs
-```
-
-### **Step 2: Create Agent Rules Document**
-
-```bash
-# Document agent-specific rules
-touch evals/agents/my-new-agent/docs/MY_NEW_AGENT_RULES.md
-```
-
-### **Step 3: Copy Shared Tests**
-
-```bash
-# Copy shared tests as starting point
-cp evals/agents/shared/tests/common/*.yaml \
-   evals/agents/my-new-agent/tests/developer/
-
-# Update agent field
-# (GNU sed syntax; BSD/macOS sed needs `sed -i ''`)
-sed -i 's/agent: openagent/agent: my-new-agent/g' \
-  evals/agents/my-new-agent/tests/developer/*.yaml
-```
-
-### **Step 4: Add Agent-Specific Tests**
-
-```yaml
-# my-new-agent/tests/developer/custom-test-001.yaml
-id: custom-test-001
-name: My New Agent Custom Test
-agent: my-new-agent           # ← Your new agent
-prompt: "Agent-specific prompt"
-
-behavior:
-  mustUseTools: [bash]
-  requiresApproval: true
-
-expectedViolations:
-  - rule: approval-gate
-    shouldViolate: false
-```
-
-### **Step 5: Run Tests**
-
-```bash
-npm run eval:sdk -- --pattern="my-new-agent/**/*.yaml"
-```
-
----
-
-## Test Organization Best Practices
-
-### **1. Agent-Specific Tests**
-Put in `agents/{agent}/tests/`
-
-**When to use:**
-- Tests specific to agent's unique features
-- Tests for agent-specific rules
-- Tests that won't work for other agents
-
-**Example:**
-```yaml
-# openagent/tests/developer/ctx-code-001.yaml
-# OpenAgent-specific: Tests context loading from openagent.md
-agent: openagent
-behavior:
-  requiresContext: true  # OpenAgent-specific rule
-```
-
-### **2. Shared Tests**
-Put in `agents/shared/tests/common/`
-
-**When to use:**
-- Tests that work for ANY agent
-- Tests for universal rules (approval, tool usage)
-- Tests you want to run across multiple agents
-
-**Example:**
-```yaml
-# shared/tests/common/approval-gate-basic.yaml
-# Works for ANY agent
-agent: openagent  # Default, can be overridden
-behavior:
-  requiresApproval: true  # Universal rule
-```
-
-### **3. Category Organization**
-
-```
-tests/
-├── developer/      # Developer workflow tests
-├── business/       # Business/analysis tests
-├── creative/       # Content creation tests
-└── edge-case/      # Edge cases and error handling
-```
-
----
-
-## Evaluator Design (Agent-Agnostic)
-
-### **Good: Generic Behavior Check**
-
-```typescript
-// ✅ Works for any agent
-export class ApprovalGateEvaluator extends BaseEvaluator {
-  async evaluate(timeline: TimelineEvent[]) {
-    // Check generic behavior: did agent ask for approval?
-    // Assumption: violations are collected locally and returned
-    const violations: Array<{ type: string; message: string }> = [];
-    const hasApproval = timeline.some(e => e.type === 'approval_request');
-    
-    if (!hasApproval) {
-      violations.push({
-        type: 'approval-gate-missing',
-        message: 'Agent did not request approval'
-      });
-    }
-
-    return violations;
-  }
-}
-```
-
-### **Bad: Agent-Specific Logic**
-
-```typescript
-// ❌ Hardcoded to specific agent
-export class OpenAgentSpecificEvaluator extends BaseEvaluator {
-  async evaluate(timeline: TimelineEvent[]) {
-    // Don't do this - ties evaluator to specific agent
-    if (sessionInfo.agent === 'openagent') {
-      // OpenAgent-specific checks
-    }
-  }
-}
-```
-
----
-
-## Benefits of Agent-Agnostic Design
-
-### **1. Easy to Add New Agents**
-- Copy shared tests
-- Update `agent` field
-- Add agent-specific tests
-- Run tests
-
-### **2. Consistent Behavior Across Agents**
-- Same evaluators check all agents
-- Same test format for all agents
-- Easy to compare agent behaviors
-
-### **3. Reduced Duplication**
-- Shared tests written once
-- Evaluators work for all agents
-- Framework code reused
-
-### **4. Easy Maintenance**
-- Update evaluator once, affects all agents
-- Update shared test once, affects all agents
-- Clear separation of concerns
-
----
-
-## Example: Testing Two Agents
-
-### **OpenAgent Test**
-```yaml
-# openagent/tests/developer/create-file.yaml
-id: openagent-create-file-001
-agent: openagent
-prompt: "Create hello.ts"
-
-behavior:
-  requiresContext: true  # OpenAgent loads code.md
-```
-
-### **OpenCoder Test**
-```yaml
-# opencoder/tests/developer/create-file.yaml
-id: opencoder-create-file-001
-agent: opencoder
-prompt: "Create hello.ts"
-
-behavior:
-  requiresContext: false  # OpenCoder might not need context
-```
-
-### **Shared Test (Works for Both)**
-```yaml
-# shared/tests/common/create-file.yaml
-id: shared-create-file-001
-agent: openagent  # Default
-prompt: "Create hello.ts"
-
-behavior:
-  requiresApproval: true  # Both agents should ask
-```
-
----
-
-## Summary
-
-**Framework Layer:**
-- ✅ Agent-agnostic test runner
-- ✅ Generic evaluators
-- ✅ Universal test schema
-
-**Agent Layer:**
-- ✅ Agent-specific tests in `agents/{agent}/`
-- ✅ Shared tests in `agents/shared/`
-- ✅ Agent-specific rules in `docs/`
-
-**Benefits:**
-- ✅ Easy to add new agents
-- ✅ Consistent behavior validation
-- ✅ Reduced duplication
-- ✅ Clear organization
-
-**To test a new agent:**
-1. Create directory: `agents/my-agent/`
-2. Copy shared tests
-3. Update `agent` field
-4. Add agent-specific tests
-5. Run: `npm run eval:sdk -- --pattern="my-agent/**/*.yaml"`

+ 0 - 298
evals/agents/openagent/CONTEXT_LOADING_COVERAGE.md

@@ -1,298 +0,0 @@
-# Context Loading Test Coverage
-
-## Overview
-
-This document describes the context loading tests created to verify that OpenAgent loads the relevant context files before it responds to user queries or executes tasks.
-
-**Test Location**: `evals/agents/openagent/tests/context-loading/`
-
-**Total Tests**: 5 (3 simple, 2 complex multi-turn)
-
----
-
-## Test Results Summary
-
-**Run Date**: 2025-11-26  
-**Pass Rate**: 3/5 (60%)  
-**Total Duration**: 430 seconds (~7 minutes)
-
-| Test ID | Type | Status | Duration | Notes |
-|---------|------|--------|----------|-------|
-| ctx-simple-testing-approach | Simple | ✅ PASS | 35s | Loaded testing docs correctly |
-| ctx-simple-documentation-format | Simple | ✅ PASS | 19s | Loaded docs.md correctly |
-| ctx-simple-coding-standards | Simple | ✅ PASS | 20s | Loaded code.md correctly |
-| ctx-multi-standards-to-docs | Complex | ❌ FAIL | 109s | No context loaded before execution |
-| ctx-multi-error-handling-to-tests | Complex | ❌ FAIL | 246s | Timeout on prompt 4 |
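
The summary figures can be recomputed from the table rows. The sketch below uses the statuses and durations listed above; `summarize` and `ResultRow` are hypothetical names. The per-test durations sum to 429 s, which is consistent with the reported 430 s if the table values were rounded (an assumption).

```typescript
// Rows copied from the results table above.
interface ResultRow {
  id: string;
  passed: boolean;
  seconds: number;
}

const rows: ResultRow[] = [
  { id: "ctx-simple-testing-approach", passed: true, seconds: 35 },
  { id: "ctx-simple-documentation-format", passed: true, seconds: 19 },
  { id: "ctx-simple-coding-standards", passed: true, seconds: 20 },
  { id: "ctx-multi-standards-to-docs", passed: false, seconds: 109 },
  { id: "ctx-multi-error-handling-to-tests", passed: false, seconds: 246 },
];

// Hypothetical helper: pass count, pass rate in percent, and total duration.
function summarize(results: ResultRow[]) {
  const total = results.length;
  const passed = results.filter((r) => r.passed).length;
  const totalSeconds = results.reduce((sum, r) => sum + r.seconds, 0);
  const passRatePercent = total === 0 ? 0 : Math.round((passed * 100) / total);
  return { passed, total, passRatePercent, totalSeconds };
}

console.log(summarize(rows));
```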
-
----
-
-## Test Descriptions
-
-### Simple Tests (Read-Only)
-
-#### 1. `ctx-simple-coding-standards.yaml`
-**Prompt**: "What are our coding standards for this project?"
-
-**Expected Behavior**:
-- Load `code.md` or `standards.md` before responding
-- Reference project-specific standards
-
-**Result**: ✅ **PASSED**
-- Agent loaded `.opencode/context/core/standards/code.md`
-- 1 read operation performed
-- No violations detected
-
----
-
-#### 2. `ctx-simple-documentation-format.yaml`
-**Prompt**: "What format should I use for documentation in this project?"
-
-**Expected Behavior**:
-- Load `docs.md` or `documentation.md` before responding
-- Reference project-specific documentation standards
-
-**Result**: ✅ **PASSED**
-- Agent loaded `.opencode/context/core/standards/docs.md`
-- 1 read operation performed
-- No violations detected
-
----
-
-#### 3. `ctx-simple-testing-approach.yaml`
-**Prompt**: "What's our testing strategy for this project?"
-
-**Expected Behavior**:
-- Load `tests.md` or `testing.md` before responding
-- Reference project-specific testing standards
-
-**Result**: ✅ **PASSED**
-- Agent loaded multiple testing-related files:
-  - `evals/HOW_TESTS_WORK.md`
-  - `evals/README.md`
-  - `evals/TESTING_CONFIDENCE.md`
-  - `evals/agents/AGENT_TESTING_GUIDE.md`
-- 4 read operations performed
-- No violations detected
-
----
-
-### Complex Tests (Multi-Turn with File Creation)
-
-#### 4. `ctx-multi-standards-to-docs.yaml`
-**Scenario**: Standards question → Documentation request → Format question
-
-**Turn 1**: "What are our coding standards?"
-- Expected: Load `standards.md` or `code.md`
-
-**Turn 2**: "Can you create documentation about these standards in evals/test_tmp/coding-standards-doc.md?"
-- Expected: Load `docs.md` (documentation format)
-- Expected: Write file to `evals/test_tmp/`
-
-**Turn 3**: "What will the documentation structure look like?"
-- Expected: Reference both standards and docs context
-
-**Result**: ❌ **FAILED**
-- Agent loaded context files correctly:
-  - `.opencode/context/core/standards/code.md` (2x)
-  - `.opencode/context/core/standards/docs.md` (1x)
-- Agent wrote file successfully
-- **Violation**: "No context loaded before execution" (warning)
-- **Issue**: Context loading evaluator flagged timing issue
-
-**Files Created**: `evals/test_tmp/coding-standards-doc.md` (cleaned up after test)
-
----
-
-#### 5. `ctx-multi-error-handling-to-tests.yaml`
-**Scenario**: Error handling question → Test request → Coverage policy
-
-**Turn 1**: "How should we handle errors in this project?"
-- Expected: Load `standards.md` or `processes.md`
-
-**Turn 2**: "Can you write tests for error handling in evals/test_tmp/error-handling.test.ts?"
-- Expected: Load `tests.md` (testing standards)
-- Expected: Write test file to `evals/test_tmp/`
-
-**Turn 3**: "What's our test coverage policy?"
-- Expected: Reference test-related context
-
-**Result**: ❌ **FAILED**
-- **Error**: "Prompt 4 execution timed out"
-- Test exceeded 180-second timeout
-- Likely due to complex multi-turn conversation with file creation
-
----
-
-## Cleanup Verification
-
-✅ **Cleanup System Working Correctly**
-
-**Before Tests**:
-- Cleaned up 1 file from previous runs
-
-**After Tests**:
-- Cleaned up 2 files created during tests
-- `test_tmp/` contains only:
-  - `.gitignore`
-  - `README.md`
-
-**Cleanup Logic**: `evals/framework/src/sdk/run-sdk-tests.ts`
-- Runs before test execution
-- Runs after test execution
-- Preserves only `.gitignore` and `README.md`
-
----
-
-## Key Findings
-
-### ✅ Positive Results
-
-1. **Simple Context Loading Works**: All 3 simple tests passed
-   - Agent correctly identifies and loads relevant context files
-   - Agent reads context BEFORE responding
-   - No violations in simple scenarios
-
-2. **Cleanup System Reliable**: 
-   - Files created during tests are properly cleaned up
-   - No test artifacts left in project root
-   - `test_tmp/` directory isolation working
-
-3. **Context File Discovery**:
-   - Agent successfully finds context files in `.opencode/context/core/standards/`
-   - Agent loads multiple relevant files when appropriate
-
-### ⚠️ Issues Identified
-
-1. **Multi-Turn Context Loading**: 
-   - Complex multi-turn tests show timing issues
-   - The context loading evaluator flags warnings even when the context files were loaded
-   - May need to adjust evaluator logic for multi-turn scenarios
-
-2. **Timeout on Complex Tests**:
-   - 180-second timeout insufficient for some multi-turn tests
-   - Test 5 timed out on prompt 4
-   - May need to increase timeout or simplify test scenarios
-
-3. **False Positive Warning**:
-   - Test 4 loaded context correctly but still got "no-context-loaded" warning
-   - Evaluator may not be detecting context loads in multi-turn conversations
-
----
-
-## Recommendations
-
-### Immediate Actions
-
-1. **Increase Timeout for Complex Tests**
-   - Change from 180s to 300s (5 minutes)
-   - Add timeout configuration per test
-
-2. **Fix Context Loading Evaluator**
-   - Review timing detection logic for multi-turn tests
-   - Ensure evaluator tracks context loads across all prompts
-
-3. **Simplify Complex Tests**
-   - Reduce number of turns in multi-turn tests
-   - Focus on specific context loading scenarios
-
-### Future Enhancements
-
-1. **Add More Edge Cases**
-   - Test context loading with missing files
-   - Test context loading with multiple context directories
-   - Test context loading with file attachments
-
-2. **Add Performance Metrics**
-   - Track time between context load and execution
-   - Measure context file read performance
-   - Monitor API rate limits
-
-3. **Batch Test Execution**
-   - Run tests in smaller batches to avoid API timeouts
-   - Add retry logic for transient failures
-   - Implement test result caching
-
----
-
-## Running These Tests
-
-### Run All Context Loading Tests
-```bash
-cd evals/framework
-npm run eval:sdk -- --agent=openagent --pattern="context-loading/*.yaml"
-```
-
-### Run Individual Test
-```bash
-npm run eval:sdk -- --agent=openagent --pattern="context-loading/ctx-simple-coding-standards.yaml"
-```
-
-### Run with Debug Output
-```bash
-npm run eval:sdk -- --agent=openagent --pattern="context-loading/*.yaml" --debug
-```
-
-### View Results Dashboard
-```bash
-cd ../results
-./serve.sh
-```
-
----
-
-## Test File Structure
-
-Each test follows this structure:
-
-```yaml
-id: test-id
-name: "Test Name"
-description: |
-  Detailed description of what the test validates
-  
-category: developer
-agent: openagent
-model: anthropic/claude-sonnet-4-5
-
-# Single prompt OR multi-turn prompts
-prompt: "Single prompt text"
-# OR
-prompts:
-  - text: "First prompt"
-    expectContext: true
-    contextFile: "standards.md"
-  - text: "approve"
-    delayMs: 2000
-
-# Expected behavior
-behavior:
-  mustUseTools: [read, write]
-  requiresContext: true
-  minToolCalls: 1
-
-# Expected violations
-expectedViolations:
-  - rule: context-loading
-    shouldViolate: false
-    severity: error
-
-# Approval strategy
-approvalStrategy:
-  type: auto-approve
-
-timeout: 60000
-
-tags:
-  - context-loading
-  - simple-test
-```
-
----
-
-## Maintenance
-
-**Last Updated**: 2025-11-26  
-**Test Framework Version**: 0.1.0  
-**OpenAgent Version**: Latest  
-
-**Next Review**: After fixing context loading evaluator timing logic
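
The multi-turn false positive described in these coverage notes comes down to one rule: every execution should be preceded by a context read earlier in the same session. A minimal TypeScript sketch of that rule follows. The `ToolEvent` shape, the `isContextFile` helper, and the choice to treat any path under `.opencode/context/` as context are assumptions made for illustration; they are not the framework's actual evaluator.

```typescript
// Hypothetical event shape; the real framework's timeline types may differ.
interface ToolEvent {
  tool: "read" | "write" | "edit" | "bash";
  filePath?: string;
  timestamp: number; // milliseconds since session start
}

// Assumption: a "context file" is any path under .opencode/context/.
const isContextFile = (path: string): boolean =>
  path.includes(".opencode/context/");

// Returns every execution event (anything that is not a read) that happened
// before any context file had been read in the session.
function findUncoveredExecutions(events: ToolEvent[]): ToolEvent[] {
  const sorted = [...events].sort((a, b) => a.timestamp - b.timestamp);
  let contextLoaded = false;
  const uncovered: ToolEvent[] = [];

  for (const ev of sorted) {
    if (ev.tool === "read") {
      // Only reads of context files count as loading context.
      if (ev.filePath !== undefined && isContextFile(ev.filePath)) {
        contextLoaded = true;
      }
    } else if (!contextLoaded) {
      uncovered.push(ev);
    }
  }
  return uncovered;
}
```

Because the flag is tracked over the whole session rather than per prompt, a context read in turn 1 still covers a write in turn 2, which is the behavior the multi-turn tests expect.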

+ 0 - 256
evals/agents/openagent/IMPLEMENTATION_SUMMARY.md

@@ -1,256 +0,0 @@
-# Context Loading Tests - Implementation Summary
-
-**Date**: 2025-11-26  
-**Status**: ✅ **COMPLETE - ALL TESTS PASSING (5/5)**
-
----
-
-## What We Built
-
-### 1. **5 Context Loading Tests** ✅
-Created a comprehensive test suite to verify that OpenAgent loads context files correctly:
-
-**Simple Tests (3)** - Single prompt, read-only
-- `ctx-simple-coding-standards.yaml` - Coding standards query
-- `ctx-simple-documentation-format.yaml` - Documentation format query  
-- `ctx-simple-testing-approach.yaml` - Testing strategy query
-
-**Complex Tests (2)** - Multi-turn with file creation
-- `ctx-multi-standards-to-docs.yaml` - Standards → Documentation creation
-- `ctx-multi-error-handling-to-tests.yaml` - Error handling → Test creation
-
-### 2. **Smart Timeout System** ✅
-Implemented intelligent timeout handling for multi-turn tests:
-- **Activity monitoring**: Checks if events are still streaming
-- **Base timeout**: 300s (5 minutes) of inactivity triggers timeout
-- **Absolute max**: 600s (10 minutes) hard limit
-- **Prevents false timeouts**: Extends timeout while agent is active
-
-**Code**: `evals/framework/src/sdk/test-runner.ts` - `withSmartTimeout()` method
-
-### 3. **Fixed Context Loading Evaluator** ✅
-Corrected evaluator to properly detect context files in multi-turn sessions:
-
-**Issues Fixed**:
-- ❌ **Before**: File paths extracted from wrong location (`tool.data.input.filePath`)
-- ✅ **After**: Correctly extracts from `tool.data.state.input.filePath`
-- ❌ **Before**: Only checked context before FIRST execution
-- ✅ **After**: Checks context for ALL executions requiring it
-- ❌ **Before**: False positives on multi-turn tests
-- ✅ **After**: Properly tracks context across multiple prompts
-
-**Code**: `evals/framework/src/evaluators/context-loading-evaluator.ts`
-
-### 4. **Batch Test Runner** ✅
-Created helper script for running tests in controlled batches:
-- Configurable batch size (default: 3 tests)
-- Configurable delay between batches (default: 10s)
-- Prevents API rate limits
-- Better resource management
-
-**Script**: `evals/framework/scripts/utils/run-tests-batch.sh`
-
-**Usage**:
-```bash
-cd evals/framework
-./scripts/utils/run-tests-batch.sh openagent 3 10
-```
-
-### 5. **Cleanup System Verified** ✅
-Confirmed automatic cleanup working correctly:
-- Cleans `test_tmp/` before tests
-- Cleans `test_tmp/` after tests
-- Preserves only `.gitignore` and `README.md`
-- No test artifacts left behind
-
----
-
-## Test Results
-
-### Final Run: 100% Pass Rate 🎉
-
-| Test | Type | Duration | Status | Context Files Loaded |
-|------|------|----------|--------|---------------------|
-| ctx-simple-testing-approach | Simple | 38s | ✅ PASS | 4 files (README, HOW_TESTS_WORK, etc.) |
-| ctx-simple-documentation-format | Simple | 26s | ✅ PASS | docs.md |
-| ctx-simple-coding-standards | Simple | 21s | ✅ PASS | code.md |
-| ctx-multi-standards-to-docs | Complex | 116s | ✅ PASS | code.md, docs.md (44s before execution) |
-| ctx-multi-error-handling-to-tests | Complex | 148s | ✅ PASS | code.md, tests.md (58s before execution) |
-
-**Total Duration**: 349 seconds (~6 minutes)  
-**Pass Rate**: 5/5 (100%)  
-**Violations**: 0
-
----
-
-## Key Findings
-
-### ✅ **OpenAgent Context Loading Works Correctly**
-
-1. **Simple queries**: Agent loads appropriate context files before responding
-2. **Multi-turn conversations**: Agent loads context for each execution phase
-3. **File creation**: Agent loads both standards AND format context before writing
-4. **Timing**: In the two complex tests, context was loaded 44s and 58s before execution
-
-### ✅ **Test Infrastructure is Solid**
-
-1. **Same session tracking**: Multi-turn tests use single session (verified)
-2. **Smart timeout**: Prevents false timeouts while catching real hangs
-3. **Cleanup**: No test artifacts left behind
-4. **Evaluators**: Accurately detect context loading behavior
-
----
-
-## Technical Details
-
-### Session Tracking (Multi-Turn)
-```typescript
-// Single session created once
-const session = await this.client.createSession({ title: testCase.name });
-sessionId = session.id;
-
-// All prompts use SAME session
-for (let i = 0; i < testCase.prompts.length; i++) {
-  const msg = testCase.prompts[i];
-  await this.client.sendPrompt(sessionId, { text: msg.text, ... });
-}
-```
-
-### Smart Timeout Logic
-```typescript
-// Base timeout: 300s of inactivity
-// Max timeout: 600s absolute
-await this.withSmartTimeout(
-  promptPromise,
-  300000,  // 5 min activity timeout
-  600000,  // 10 min absolute max
-  `Prompt ${i + 1} execution timed out`
-);
-```
-
-### Context File Detection
-```typescript
-// Fixed file path extraction
-const filePath = tool.data?.state?.input?.filePath ||  // ✅ NEW
-                tool.data?.state?.input?.path ||
-                tool.data?.input?.filePath ||          // Old fallback
-                tool.data?.input?.path;
-```
-
----
-
-## Files Modified
-
-### New Files Created
-```
-evals/agents/openagent/tests/context-loading/
-├── ctx-simple-coding-standards.yaml
-├── ctx-simple-documentation-format.yaml
-├── ctx-simple-testing-approach.yaml
-├── ctx-multi-standards-to-docs.yaml
-└── ctx-multi-error-handling-to-tests.yaml
-
-evals/agents/openagent/
-├── CONTEXT_LOADING_COVERAGE.md
-└── IMPLEMENTATION_SUMMARY.md (this file)
-
-evals/framework/
-└── scripts/
-```
-
-### Files Modified
-```
-evals/framework/src/sdk/test-runner.ts
-  - Added withSmartTimeout() method
-  - Updated multi-turn test execution to use smart timeout
-
-evals/framework/src/evaluators/context-loading-evaluator.ts
-  - Fixed file path extraction (tool.data.state.input.filePath)
-  - Added multi-turn execution checking
-  - Improved violation detection
-
-evals/agents/openagent/tests/context-loading/*.yaml
-  - Increased timeout from 180s to 300s for complex tests
-```
-
----
-
-## Recommendations Completed
-
-### ✅ Recommendation 1: Fix Timeout Issue
-- **Status**: COMPLETE
-- **Solution**: Implemented smart timeout with activity monitoring
-- **Result**: No more false timeouts, complex tests complete successfully
-
-### ✅ Recommendation 2: Fix Context Loading Evaluator  
-- **Status**: COMPLETE
-- **Solution**: Fixed file path extraction and multi-turn tracking
-- **Result**: Evaluator correctly detects context loading in all scenarios
-
-### ✅ Recommendation 3: Batch Test Execution
-- **Status**: COMPLETE
-- **Solution**: Created `run-tests-batch.sh` script
-- **Result**: Can run tests in controlled batches with delays
-
----
-
-## How to Use
-
-### Run All Context Loading Tests
-```bash
-cd evals/framework
-npm run eval:sdk -- --agent=openagent --pattern="context-loading/*.yaml"
-```
-
-### Run Single Test
-```bash
-npm run eval:sdk -- --agent=openagent --pattern="context-loading/ctx-simple-coding-standards.yaml"
-```
-
-### Run in Batches (Avoid API Limits)
-```bash
-./scripts/utils/run-tests-batch.sh openagent 3 10
-# Args: agent, batch_size, delay_seconds
-```
-
-### View Results Dashboard
-```bash
-cd ../results
-./serve.sh
-```
-
----
-
-## Next Steps (Optional Enhancements)
-
-1. **Add More Edge Cases**
-   - Test with missing context files
-   - Test with multiple context directories
-   - Test with file attachments
-
-2. **Performance Metrics**
-   - Track context load time vs execution time
-   - Measure API response times
-   - Monitor rate limit usage
-
-3. **Test Coverage Expansion**
-   - Add tests for other agent behaviors
-   - Test delegation scenarios
-   - Test error handling paths
-
----
-
-## Conclusion
-
-✅ **All objectives achieved**  
-✅ **100% test pass rate**  
-✅ **OpenAgent context loading verified working correctly**  
-✅ **Test infrastructure improved and reliable**  
-✅ **Documentation complete**
-
-The context loading test suite is production-ready and provides comprehensive coverage of OpenAgent's context file loading behavior across both simple and complex multi-turn scenarios.
-
----
-
-**Maintained by**: OpenCode Agents Team  
-**Last Updated**: 2025-11-26  
-**Test Framework Version**: 0.1.0
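
The smart timeout described in this summary combines two limits: an inactivity window that restarts whenever events arrive, and an absolute cap that never moves. The sketch below shows one way such logic could look in TypeScript. The names `checkTimeout` and `withSmartTimeoutSketch`, the `getLastActivityAt` callback, and the polling interval are hypothetical; the real `withSmartTimeout()` in `test-runner.ts` is not shown in this document and may be structured differently.

```typescript
type TimeoutDecision = "continue" | "inactive" | "absolute-max";

// Pure decision step: all times are in milliseconds.
function checkTimeout(
  now: number,
  startedAt: number,
  lastActivityAt: number,
  inactivityMs: number = 300_000, // 5 minutes without events
  absoluteMaxMs: number = 600_000, // 10 minute hard limit
): TimeoutDecision {
  if (now - startedAt >= absoluteMaxMs) return "absolute-max";
  if (now - lastActivityAt >= inactivityMs) return "inactive";
  return "continue";
}

// Wraps a promise and rejects it once either limit is exceeded.
// getLastActivityAt is assumed to report when the last event was seen.
function withSmartTimeoutSketch<T>(
  work: Promise<T>,
  getLastActivityAt: () => number,
  inactivityMs: number = 300_000,
  absoluteMaxMs: number = 600_000,
  pollMs: number = 1_000,
): Promise<T> {
  const startedAt = Date.now();
  return new Promise<T>((resolve, reject) => {
    const timer = setInterval(() => {
      const decision = checkTimeout(
        Date.now(),
        startedAt,
        getLastActivityAt(),
        inactivityMs,
        absoluteMaxMs,
      );
      if (decision !== "continue") {
        clearInterval(timer);
        reject(new Error(`Timed out (${decision})`));
      }
    }, pollMs);

    // Always stop polling once the wrapped work settles.
    work.then(
      (value) => {
        clearInterval(timer);
        resolve(value);
      },
      (err) => {
        clearInterval(timer);
        reject(err);
      },
    );
  });
}
```

Keeping the decision in a pure function makes the two limits easy to unit test without waiting on real timers.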

+ 0 - 41
evals/agents/opencoder/README.md

@@ -1,41 +0,0 @@
-# Opencoder Agent Tests
-
-Tests for the `opencoder` agent - a development-focused agent that executes code tasks directly.
-
-## Agent Characteristics
-
-- **Mode**: Primary development agent
-- **Behavior**: Executes tools directly without text-based approval workflow
-- **Best for**: Code implementation, bash commands, file operations
-- **Approval**: Uses tool permission system (auto-approve in tests)
-
-## Test Categories
-
-### Developer Tests (`tests/developer/`)
-- Bash command execution
-- File operations
-- Code implementation tasks
-
-### Business Tests (`tests/business/`)
-- Data analysis tasks
-- Report generation
-
-### Edge Cases (`tests/edge-case/`)
-- Error handling
-- Permission boundaries
-
-## Running Tests
-
-```bash
-cd evals/framework
-npx tsx src/sdk/run-sdk-tests.ts --agent opencoder
-```
-
-## Key Differences from OpenAgent
-
-| Feature | Opencoder | OpenAgent |
-|---------|-----------|-----------|
-| Approval | Tool permission system | Text-based + tool permission |
-| Workflow | Direct execution | Analyze→Approve→Execute→Validate |
-| Context Loading | On-demand | Mandatory before execution |
-| Best for | Simple tasks | Complex workflows |

+ 0 - 74
evals/agents/shared/README.md

@@ -1,74 +0,0 @@
-# Shared Test Cases
-
-Tests in this directory are **agent-agnostic** and can be used to test **any agent** that follows the same core rules.
-
-## Purpose
-
-Shared tests validate **universal behaviors** that all agents should follow:
-- Approval gate enforcement
-- Tool usage patterns
-- Basic workflow compliance
-- Error handling
-
-## Usage
-
-### Run Shared Tests for OpenAgent
-```bash
-npm run eval:sdk -- --pattern="shared/**/*.yaml" --agent=openagent
-```
-
-### Run Shared Tests for OpenCoder
-```bash
-npm run eval:sdk -- --pattern="shared/**/*.yaml" --agent=opencoder
-```
-
-### Override Agent in Test File
-```yaml
-# In the YAML file
-agent: openagent  # Change to opencoder, or any other agent
-```
-
-## Test Categories
-
-### `common/` - Universal Rules
-Tests that apply to **all agents**:
-- `approval-gate-basic.yaml` - Basic approval enforcement
-- `tool-usage-basic.yaml` - Basic tool selection (future)
-- `error-handling-basic.yaml` - Basic error handling (future)
-
-## Adding New Shared Tests
-
-1. Create test in `shared/tests/common/`
-2. Use generic prompts (not agent-specific)
-3. Test universal behaviors only
-4. Tag with `shared-test` and `agent-agnostic`
-5. Document which agents it applies to
-
-## Example
-
-```yaml
-id: shared-example-001
-name: Example Shared Test
-category: edge-case
-agent: openagent  # Default, can be overridden
-
-prompt: "Generic prompt that works for any agent"
-
-behavior:
-  requiresApproval: true  # Universal rule
-
-expectedViolations:
-  - rule: approval-gate
-    shouldViolate: false
-
-tags:
-  - shared-test
-  - agent-agnostic
-```
-
-## Benefits
-
-1. **Reduce Duplication** - Write once, test multiple agents
-2. **Consistency** - Same tests ensure consistent behavior
-3. **Easy Comparison** - Compare agent behaviors side-by-side
-4. **Faster Onboarding** - New agents inherit core test suite
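
The shared-test README names two ways to pick the agent: the `agent` field in the YAML file and the `--agent` flag on the command line. The sketch below illustrates one plausible resolution rule in TypeScript. The precedence (command-line value wins over the file default) is an assumption inferred from the usage examples, and `SharedTestCase`, `resolveAgent`, and `isSharedTest` are hypothetical names rather than framework APIs.

```typescript
// Hypothetical subset of a parsed shared test case.
interface SharedTestCase {
  id: string;
  agent: string; // default agent from the YAML file
  tags: string[];
}

// Assumption: a non-empty --agent value overrides the YAML default.
function resolveAgent(testCase: SharedTestCase, cliAgent?: string): string {
  const trimmed = cliAgent?.trim() ?? "";
  return trimmed !== "" ? trimmed : testCase.agent;
}

// A test counts as shared only if it carries both documented tags.
function isSharedTest(testCase: SharedTestCase): boolean {
  return (
    testCase.tags.includes("shared-test") &&
    testCase.tags.includes("agent-agnostic")
  );
}
```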

+ 0 - 195
evals/framework/scripts/README.md

@@ -1,195 +0,0 @@
-# Framework Scripts
-
-Utility scripts for debugging, testing, and development.
-
----
-
-## Directory Structure
-
-```
-scripts/
-├── debug/          # Debugging scripts for sessions and events
-├── test/           # Test scripts for framework development
-├── utils/          # Utility scripts (batch runner, etc.)
-└── README.md       # This file
-```
-
----
-
-## Debug Scripts (`debug/`)
-
-Scripts for debugging sessions, events, and agent behavior.
-
-| Script | Purpose | Usage |
-|--------|---------|-------|
-| `debug-session.mjs` | Debug session data and timeline | `node scripts/debug/debug-session.mjs <session-id>` |
-| `debug-session.ts` | TypeScript version of session debugger | `npx tsx scripts/debug/debug-session.ts <session-id>` |
-| `debug-claude-session.mjs` | Debug Claude-specific sessions | `node scripts/debug/debug-claude-session.mjs <session-id>` |
-| `inspect-session.mjs` | Inspect most recent session events | `node scripts/debug/inspect-session.mjs` |
-
-### Examples
-
-```bash
-# Debug a specific session
-node scripts/debug/debug-session.mjs ses_abc123
-
-# Inspect latest session
-node scripts/debug/inspect-session.mjs
-
-# Debug with TypeScript
-npx tsx scripts/debug/debug-session.ts ses_abc123
-```
-
----
-
-## Test Scripts (`test/`)
-
-Scripts for testing framework components during development.
-
-| Script | Purpose | Usage |
-|--------|---------|-------|
-| `test-agent-direct.ts` | Direct agent execution test | `npx tsx scripts/test/test-agent-direct.ts` |
-| `test-event-inspector.js` | Test event capture system | `node scripts/test/test-event-inspector.js` |
-| `test-session-reader.mjs` | Test session reader | `node scripts/test/test-session-reader.mjs` |
-| `test-simplified-approach.mjs` | Test simplified test approach | `node scripts/test/test-simplified-approach.mjs` |
-| `test-timeline.ts` | Test timeline builder | `npx tsx scripts/test/test-timeline.ts` |
-| `verify-timeline.ts` | Verify timeline accuracy | `npx tsx scripts/test/verify-timeline.ts` |
-
-### Examples
-
-```bash
-# Test agent execution
-npx tsx scripts/test/test-agent-direct.ts
-
-# Test event capture
-node scripts/test/test-event-inspector.js
-
-# Verify timeline
-npx tsx scripts/test/verify-timeline.ts
-```
-
----
-
-## Utility Scripts (`utils/`)
-
-General utility scripts for running tests and managing the framework.
-
-| Script | Purpose | Usage |
-|--------|---------|-------|
-| `run-tests-batch.sh` | Run tests in batches | `./scripts/utils/run-tests-batch.sh <agent> <batch-size> <delay>` |
-| `check-agent.mjs` | Check agent availability | `node scripts/utils/check-agent.mjs` |
-
-### Examples
-
-```bash
-# Run tests in batches of 3 with 10s delay
-./scripts/utils/run-tests-batch.sh openagent 3 10
-
-# Check if agent is available
-node scripts/utils/check-agent.mjs
-```
-
----
-
-## Development Workflow
-
-### Debugging a Failed Test
-
-1. Run test with debug flag:
-   ```bash
-   npm run eval:sdk -- --pattern="my-test.yaml" --debug
-   ```
-
-2. Note the session ID from output
-
-3. Inspect the session:
-   ```bash
-   node scripts/debug/inspect-session.mjs
-   # or
-   node scripts/debug/debug-session.mjs <session-id>
-   ```
-
-4. Check timeline events:
-   ```bash
-   npx tsx scripts/debug/debug-session.ts <session-id>
-   ```
-
-### Testing Framework Changes
-
-1. Make changes to framework code
-
-2. Build:
-   ```bash
-   npm run build
-   ```
-
-3. Test specific component:
-   ```bash
-   npx tsx scripts/test/test-timeline.ts
-   ```
-
-4. Run full test suite:
-   ```bash
-   npm run eval:sdk
-   ```
-
----
-
-## Script Dependencies
-
-All scripts require the framework to be built first:
-
-```bash
-npm run build
-```
-
-Some scripts use:
-- `@opencode-ai/sdk` - For SDK client
-- `tsx` - For TypeScript execution
-- Framework dist files - Built TypeScript output
-
----
-
-## Adding New Scripts
-
-### Debug Script Template
-
-```javascript
-// scripts/debug/my-debug-script.mjs
-import { SessionReader } from '../../dist/collector/session-reader.js';
-import { createOpencodeClient } from '@opencode-ai/sdk';
-
-const client = createOpencodeClient({
-  baseUrl: 'http://localhost:3721'
-});
-
-// Your debug logic here
-```
-
-### Test Script Template
-
-```typescript
-#!/usr/bin/env -S npx tsx
-// scripts/test/my-test-script.ts
-
-import { TestRunner } from '../../dist/sdk/test-runner.js';
-
-async function runTest() {
-  // Your test logic here
-}
-
-runTest().catch(console.error);
-```
-
----
-
-## Maintenance
-
-- **Keep scripts organized** - Put debug scripts in `debug/`, test scripts in `test/`
-- **Update this README** - When adding new scripts
-- **Remove obsolete scripts** - Delete scripts that are no longer needed
-- **Document usage** - Add clear usage examples
-
----
-
-**Last Updated**: 2025-11-26
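
The `run-tests-batch.sh` utility listed in this README takes an agent, a batch size, and a delay. Its shell source is not shown here, so the following is only a TypeScript sketch of the batching idea, assuming tests run one at a time within a batch and the delay applies between batches. `toBatches`, `runInBatches`, and the `runOne` callback are hypothetical names.

```typescript
// Split a list into consecutive batches of at most batchSize items.
function toBatches<T>(items: readonly T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run every item, pausing delayMs between batches (not after the last one).
async function runInBatches<T>(
  items: readonly T[],
  batchSize: number,
  delayMs: number,
  runOne: (item: T) => Promise<void>,
): Promise<void> {
  let first = true;
  for (const batch of toBatches(items, batchSize)) {
    if (!first) await sleep(delayMs);
    first = false;
    for (const item of batch) {
      await runOne(item);
    }
  }
}
```

With the documented defaults this would correspond to `runInBatches(tests, 3, 10_000, runOne)`.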

+ 0 - 279
evals/results/README.md

@@ -1,279 +0,0 @@
-# 📊 Test Results Dashboard
-
-Interactive dashboard for visualizing OpenCode agent test results.
-
-## ⚡ Quick Reference
-
-```bash
-# Run tests
-cd evals/framework && npm run eval:sdk -- --agent=opencoder
-
-# View dashboard (auto-opens browser, auto-shuts down)
-cd evals/results && ./serve.sh
-```
-
-That's it! 🎉
-
----
-
-## Quick Start
-
-1. **Run Tests:**
-   ```bash
-   cd evals/framework
-   npm run eval:sdk -- --agent=opencoder
-   npm run eval:sdk -- --agent=openagent
-   ```
-
-2. **View Dashboard:**
-   
-   **Option A: One-Command Solution (Easiest)** ⭐
-   ```bash
-   cd evals/results
-   ./serve.sh
-   ```
-   - Auto-opens browser
-   - Loads dashboard
-   - Auto-shuts down after 15 seconds
-   - Dashboard stays cached in browser!
-   
-   **Custom timeout:**
-   ```bash
-   ./serve.sh 8000 30  # Port 8000, 30 second timeout
-   ```
-   
-   **Option B: Keep Server Running**
-   ```bash
-   cd evals/results
-   python3 -m http.server 8000
-   ```
-   Press Ctrl+C to stop manually
-   
-   **Option C: Direct File Access**
-   ```bash
-   open evals/results/index.html
-   ```
-   ⚠️ Note: Some browsers block loading JSON from local files. If you see an error, use Option A or B.
-
-## Features
-
-### 📈 Overview Stats
-- **Total Tests** - Count across all agents
-- **Pass Rate** - Percentage of passing tests
-- **Failed Tests** - Number of failures
-- **Avg Duration** - Average test execution time
-
-### 📊 Trend Chart
-- Visual representation of pass rate over time
-- Shows last 30 days of test runs
-- Helps identify regressions
-
-### 🔍 Filters
-- **Agent** - Filter by openagent, opencoder, etc.
-- **Category** - Developer, business, creative, edge-case
-- **Status** - All, passed only, or failed only
-- **Time Range** - Latest, today, last 7 days, last 30 days
-
-### 🔎 Search
-- Real-time search across test IDs
-- Case-insensitive matching
-
-### 📋 Test Table
-- **Sortable Columns** - Click any header to sort
-- **Expandable Rows** - Click a row to see details
-- **Violation Details** - See error messages and severity
-
-### 🌙 Dark Mode
-- Toggle with moon/sun icon in header
-- Preference saved to localStorage
-- Easy on the eyes for long sessions
-
-### 📥 Export
-- Export filtered results to CSV
-- Includes all test metadata
-- Perfect for external analysis
-
-## File Structure
-
-```
-results/
-├── index.html              # Dashboard (open this)
-├── serve.sh                # Helper script to start HTTP server
-├── latest.json             # Most recent test run
-├── history/
-│   └── 2025-11/
-│       ├── 26-115759-opencoder.json
-│       └── 26-115850-openagent.json
-├── .gitignore              # Retention policy
-└── README.md               # This file
-```
-
-## JSON Format
-
-Each result file contains:
-
-```json
-{
-  "meta": {
-    "timestamp": "2025-11-26T11:59:36.365Z",
-    "agent": "openagent",
-    "model": "opencode/grok-code-fast",
-    "framework_version": "0.1.0",
-    "git_commit": "f872007"
-  },
-  "summary": {
-    "total": 8,
-    "passed": 6,
-    "failed": 2,
-    "duration_ms": 32450,
-    "pass_rate": 0.75
-  },
-  "by_category": {
-    "developer": { "passed": 5, "total": 6 },
-    "business": { "passed": 1, "total": 1 },
-    "edge-case": { "passed": 0, "total": 1 }
-  },
-  "tests": [
-    {
-      "id": "task-simple-001",
-      "category": "developer",
-      "passed": true,
-      "duration_ms": 4200,
-      "events": 23,
-      "approvals": 2,
-      "violations": {
-        "total": 0,
-        "errors": 0,
-        "warnings": 0
-      }
-    }
-  ]
-}
-```
-
-## Retention Policy
-
-Results are automatically managed:
-
-- ✅ **Latest Run** - Always kept (`latest.json`)
-- ✅ **Current Month** - All results committed to git
-- ✅ **Previous Month** - All results committed to git
-- ❌ **Older than 60 days** - Kept locally, not committed
-
-This keeps the repo size manageable while preserving recent history.
-
-## Tips
-
-### Quick View Workflow
-The fastest way to view results:
-```bash
-cd evals/results && ./serve.sh
-```
-- ✅ Opens browser automatically
-- ✅ Loads all data
-- ✅ Shuts down after 15 seconds
-- ✅ Dashboard stays functional (data cached)
-- ✅ No manual cleanup needed
-
-**Want to keep exploring?** Press Ctrl+C during countdown to keep server running.
-
-### Comparing Agents
-1. Set **Time Range** to "Latest Run"
-2. Set **Agent** to "All Agents"
-3. Compare pass rates and durations
-
-### Finding Flaky Tests
-1. Set **Time Range** to "Last 30 Days"
-2. Look for tests that alternate between pass/fail
-3. Check violation details for patterns
-
-### Tracking Improvements
-1. Run tests regularly (daily/weekly)
-2. Watch the trend chart for improvements
-3. Export CSV for deeper analysis
-
-### Debugging Failures
-1. Filter **Status** to "Failed Only"
-2. Click on a failed test row
-3. Review violation details
-4. Check error messages and severity
-
-## Browser Compatibility
-
-- ✅ Chrome/Edge (recommended)
-- ✅ Firefox
-- ✅ Safari
-- ⚠️ IE11 (not supported)
-
-## Performance
-
-- **Dashboard Size:** ~31KB (no dependencies except Chart.js CDN)
-- **Load Time:** < 1 second for 100 tests
-- **Memory:** Minimal (pure JavaScript, no frameworks)
-
-## How It Works
-
-### Auto-Shutdown Feature
-The `serve.sh` script:
-1. Starts HTTP server on port 8000
-2. Opens dashboard in your browser
-3. Waits 15 seconds for data to load
-4. Shuts down server automatically
-5. Dashboard continues working (data cached in browser)
-
-**Why does it still work after shutdown?**
-- The browser caches the JSON data
-- All filtering/sorting happens in JavaScript
-- No server needed after initial load
-- Refresh the page to load new data (server will need to restart)
-
-### Stopping Manually
-If you start the server manually:
-```bash
-# Find the process
-lsof -ti:8000
-
-# Kill it
-kill $(lsof -ti:8000)
-```
-
-Or just press Ctrl+C in the terminal.
-
-## Troubleshooting
-
-### Dashboard shows "No results found"
-- Run tests first: `npm run eval:sdk`
-- Check that `latest.json` exists
-- Refresh the page
-
-### Chart not displaying
-- Check browser console for errors
-- Ensure Chart.js CDN is accessible
-- Try refreshing the page
-
-### Dark mode not persisting
-- Check browser localStorage is enabled
-- Clear cache and try again
-
-## Future Enhancements
-
-Potential improvements:
-- [ ] Historical comparison (compare two runs)
-- [ ] Test duration trends per test
-- [ ] Violation type breakdown chart
-- [ ] Agent performance comparison chart
-- [ ] Auto-refresh option
-- [ ] Shareable URLs with filters
-- [ ] CI/CD badge generation
-
-## Contributing
-
-To improve the dashboard:
-
-1. Edit `index.html` (all code is in one file)
-2. Test locally by opening in browser
-3. Submit PR with description of changes
-
-## License
-
-MIT - Same as OpenCode Agents project
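
The `summary` and `by_category` blocks in the JSON format above can be derived from the `tests` array. The TypeScript sketch below shows one way to compute them. The interfaces mirror only the fields visible in the example, and treating `duration_ms` in the summary as the sum of per-test durations is an assumption; the framework may measure wall-clock time instead.

```typescript
// Subset of the per-test fields shown in the example result file.
interface TestResult {
  id: string;
  category: string;
  passed: boolean;
  duration_ms: number;
}

interface Summary {
  total: number;
  passed: number;
  failed: number;
  duration_ms: number;
  pass_rate: number; // fraction between 0 and 1
}

interface CategoryCount {
  passed: number;
  total: number;
}

function summarize(tests: readonly TestResult[]): Summary {
  const total = tests.length;
  const passed = tests.filter((t) => t.passed).length;
  // Assumption: summary duration is the sum of individual test durations.
  const duration_ms = tests.reduce((sum, t) => sum + t.duration_ms, 0);
  return {
    total,
    passed,
    failed: total - passed,
    duration_ms,
    pass_rate: total === 0 ? 0 : passed / total, // avoid NaN on empty runs
  };
}

function countByCategory(
  tests: readonly TestResult[],
): Record<string, CategoryCount> {
  const counts: Record<string, CategoryCount> = {};
  for (const t of tests) {
    const entry = counts[t.category] ?? { passed: 0, total: 0 };
    entry.total += 1;
    if (t.passed) entry.passed += 1;
    counts[t.category] = entry;
  }
  return counts;
}
```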

+ 0 - 29
evals/test_tmp/README.md

@@ -1,29 +0,0 @@
-# Test Artifacts
-
-This directory contains temporary files created during test execution.
-It should be cleaned up after tests complete.
-
-**DO NOT COMMIT FILES IN THIS DIRECTORY**
-
-## Installation
-
-To install the project dependencies, navigate to the evaluation framework directory and run:
-
-```bash
-cd evals/framework
-npm install
-```
-
-This will install all required dependencies including:
-- `@opencode-ai/sdk` - OpenCode AI SDK
-- `yaml` - YAML parser for test cases
-- `zod` - Schema validation
-- `glob` - File pattern matching
-
-### Development Dependencies
-
-For development and testing, the following tools are also installed:
-- TypeScript compiler
-- Vitest testing framework
-- ESLint for code linting
-- tsx for TypeScript execution
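
The deleted coverage notes state that cleanup of `test_tmp/` preserves only `.gitignore` and `README.md`. A minimal TypeScript sketch of that selection rule follows. `selectArtifacts` and `PRESERVED` are hypothetical names, and the actual cleanup code in `run-sdk-tests.ts` is not shown in this document.

```typescript
// Files that the cleanup is documented to keep in test_tmp/.
const PRESERVED: ReadonlySet<string> = new Set([".gitignore", "README.md"]);

// Given the directory's entry names, return the ones that should be removed.
function selectArtifacts(entries: readonly string[]): string[] {
  return entries.filter((name) => !PRESERVED.has(name));
}
```

In a Node script, the entry names could come from `fs.readdirSync` and each returned name could be passed to `fs.rmSync`; keeping the selection separate from the deletion makes the rule testable without touching the filesystem.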