4 months ago · 9675e30ca4
--- a/evals/DOCUMENTATION_CLEANUP.md
+++ b/evals/DOCUMENTATION_CLEANUP.md
@@ -1,273 +0,0 @@
 
-# Documentation Cleanup Summary
-
-**Date**: 2025-11-26  
-**Status**: ✅ Complete
-
----
-
 
-## Changes Made
-
-### Files Deleted (3)
-
-1. **`evals/framework/SESSION_STORAGE_FIX.md`** (173 lines)
-   - **Reason**: Historical fix documentation, no longer relevant
-   - **Status**: ✅ Deleted
-
-2. **`evals/TESTING_CONFIDENCE.md`** (121 lines)
-   - **Reason**: Outdated, superseded by IMPLEMENTATION_SUMMARY.md
-   - **Content**: Old test confidence assessment from before context loading fixes
-   - **Status**: ✅ Deleted
-
-3. **`evals/agents/openagent/TEST_REVIEW.md`** (325 lines)
-   - **Reason**: Outdated test review from Nov 25 (before context loading fixes)
-   - **Content**: Old test results, superseded by CONTEXT_LOADING_COVERAGE.md and IMPLEMENTATION_SUMMARY.md
-   - **Status**: ✅ Deleted
-
-### Files Renamed (1)
-
-1. **`evals/SYSTEM_REVIEW.md` → `evals/ARCHITECTURE.md`**
-   - **Reason**: More descriptive name for system architecture review
-   - **Content**: Comprehensive architecture review (456 lines)
-   - **Status**: ✅ Renamed
-
 
-### Files Created (2)
-
-1. **`evals/GETTING_STARTED.md`** (NEW - 450 lines)
-   - **Purpose**: Consolidated quick start guide
-   - **Content**: 
-     - Running tests
-     - Understanding results
-     - Creating new tests
-     - Debugging
-     - Common issues
-   - **Replaces**: Scattered information from README.md and HOW_TESTS_WORK.md
-   - **Status**: ✅ Created
-
-2. **`evals/DOCUMENTATION_CLEANUP.md`** (THIS FILE)
-   - **Purpose**: Track documentation cleanup changes
-   - **Status**: ✅ Created
-
-### Files Updated (3)
-
-1. **`evals/README.md`** (322 → 280 lines)
-   - **Changes**:
-     - More concise overview
-     - Points to GETTING_STARTED.md for details
-     - Updated with recent achievements (Nov 26)
-     - Added context loading tests section
-     - Added smart timeout system section
-     - Updated test coverage numbers
-   - **Status**: ✅ Updated
-
-2. **`evals/agents/openagent/README.md`** (85 → 350 lines)
-   - **Changes**:
-     - Comprehensive test coverage section
-     - Detailed context loading tests documentation
-     - Test structure overview
-     - Running instructions
-     - Test design examples
-     - Troubleshooting section
-   - **Status**: ✅ Updated
-
-3. **`evals/HOW_TESTS_WORK.md`** (308 lines)
-   - **Changes**: None (kept as-is for detailed technical reference)
-   - **Status**: ✅ Kept
-
 
----
-
-## Documentation Structure (After Cleanup)
-
-### Top-Level Documentation
-
-```
-evals/
-├── README.md                     # System overview (UPDATED)
-├── GETTING_STARTED.md            # Quick start guide (NEW)
-├── HOW_TESTS_WORK.md             # Detailed test execution guide
-├── ARCHITECTURE.md               # System architecture review (RENAMED)
-└── DOCUMENTATION_CLEANUP.md      # This file (NEW)
-```
-
-### Framework Documentation
-
-```
-evals/framework/
-├── README.md                     # Framework documentation
-├── SDK_EVAL_README.md            # Complete SDK guide
-├── docs/
-│   ├── architecture-overview.md  # Framework architecture
-│   └── test-design-guide.md      # Test design philosophy
-└── run-tests-batch.sh            # Batch test runner
-```
-
-### Agent Documentation
-
-```
-evals/agents/openagent/
-├── README.md                     # OpenAgent test suite (UPDATED)
-├── CONTEXT_LOADING_COVERAGE.md   # Context loading tests
-├── IMPLEMENTATION_SUMMARY.md     # Recent implementation
-└── docs/
-    └── OPENAGENT_RULES.md        # OpenAgent rules reference
-```
-
-### Results Documentation
-
-```
-evals/results/
-├── README.md                     # Results dashboard guide
-├── index.html                    # Interactive dashboard
-└── serve.sh                      # One-command server
-```
-
 
----
-
-## Documentation Flow
-
-### For New Users
-
-1. **Start**: `README.md` - System overview
-2. **Next**: `GETTING_STARTED.md` - Quick start guide
-3. **Then**: Run tests and view results
-4. **Deep Dive**: `HOW_TESTS_WORK.md` - Detailed explanations
-
-### For Test Authors
-
-1. **Start**: `GETTING_STARTED.md` - Creating tests section
-2. **Reference**: `framework/docs/test-design-guide.md` - Design philosophy
-3. **Examples**: `agents/openagent/README.md` - Test examples
-4. **Rules**: `agents/openagent/docs/OPENAGENT_RULES.md` - Agent rules
-
-### For Developers
-
-1. **Start**: `ARCHITECTURE.md` - System architecture
-2. **Framework**: `framework/SDK_EVAL_README.md` - Complete SDK guide
-3. **Implementation**: `agents/openagent/IMPLEMENTATION_SUMMARY.md` - Recent changes
-4. **Technical**: `HOW_TESTS_WORK.md` - Execution details
-
 
----
-
-## Benefits of Cleanup
-
-### Before Cleanup
-
-- ❌ 19 markdown files (excluding node_modules)
-- ❌ Outdated information (Nov 25 test reviews)
-- ❌ Duplicate content (testing confidence in multiple places)
-- ❌ Unclear entry point for new users
-- ❌ Historical fix documentation cluttering framework/
-
-### After Cleanup
-
-- ✅ 18 markdown files (3 deleted, 2 new, net -1)
-- ✅ All information current (Nov 26)
-- ✅ No duplicate content
-- ✅ Clear entry point (GETTING_STARTED.md)
-- ✅ Clean framework directory
-- ✅ Better organization
-
 
----
-
-## Documentation Quality Metrics
-
-### Coverage
-
-| Audience | Documentation | Status |
-|----------|---------------|--------|
-| New Users | GETTING_STARTED.md | ✅ Complete |
-| Test Authors | test-design-guide.md | ✅ Complete |
-| Developers | ARCHITECTURE.md | ✅ Complete |
-| OpenAgent Users | agents/openagent/README.md | ✅ Complete |
-| Results Users | results/README.md | ✅ Complete |
-
-### Accuracy
-
-| Document | Last Updated | Accuracy |
-|----------|--------------|----------|
-| README.md | 2025-11-26 | ✅ Current |
-| GETTING_STARTED.md | 2025-11-26 | ✅ Current |
-| HOW_TESTS_WORK.md | 2025-11-26 | ✅ Current |
-| ARCHITECTURE.md | 2025-11-26 | ✅ Current |
-| agents/openagent/README.md | 2025-11-26 | ✅ Current |
-| CONTEXT_LOADING_COVERAGE.md | 2025-11-26 | ✅ Current |
-| IMPLEMENTATION_SUMMARY.md | 2025-11-26 | ✅ Current |
-
-### Maintainability
-
-- ✅ Clear naming conventions
-- ✅ Logical organization
-- ✅ No duplicate content
-- ✅ Cross-references between docs
-- ✅ Easy to find information
-- ✅ Easy to update
-
 
----
-
-## Maintenance Guidelines
-
-### When to Update Documentation
-
-1. **After Major Features**
-   - Update README.md with new features
-   - Update GETTING_STARTED.md with new usage examples
-   - Create/update implementation summaries
-
-2. **After Bug Fixes**
-   - Update relevant documentation
-   - Add to troubleshooting sections if needed
-
-3. **Monthly Review**
-   - Check for outdated information
-   - Update test coverage numbers
-   - Review and consolidate if needed
-
-### What to Delete
-
-- Historical fix documentation (after 3 months)
-- Outdated test reviews (superseded by new ones)
-- Duplicate content (consolidate instead)
-- Temporary investigation notes
-
-### What to Keep
-
-- Architecture documentation
-- Test design guides
-- Getting started guides
-- Current implementation summaries
-- Troubleshooting guides
-
 
----
-
-## Next Review
-
-**Scheduled**: 2025-12-26 (1 month)
-
-**Review Checklist**:
-- [ ] Check for outdated information
-- [ ] Update test coverage numbers
-- [ ] Review new features added
-- [ ] Check for duplicate content
-- [ ] Verify all links work
-- [ ] Update "Last Updated" dates
-
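The "Verify all links work" item in the checklist above could be partly automated. The following is only a minimal sketch, not part of the removed file: `findBrokenLinks` and its inputs are hypothetical names, and it assumes the caller supplies the Markdown text and the list of files known to exist. It checks relative links only and skips URLs and in-page anchors.

```javascript
// Sketch: report relative Markdown link targets that are not in a known file list.
// `findBrokenLinks` is a hypothetical helper, not part of the evals framework.
function findBrokenLinks(markdown, knownFiles) {
  const known = new Set(knownFiles);
  // Matches inline links of the form [text](target), capturing the target.
  const linkPattern = /\[[^\]]*\]\(([^)\s]+)\)/g;
  const broken = [];
  for (const match of markdown.matchAll(linkPattern)) {
    const target = match[1].split('#')[0]; // drop any #anchor suffix
    // Skip in-page anchors and anything with a URL scheme (https:, mailto:, ...).
    if (target === '' || /^[a-z][a-z0-9+.-]*:/i.test(target)) continue;
    if (!known.has(target)) broken.push(target);
  }
  return broken;
}

const sample =
  'See [guide](GETTING_STARTED.md#debugging) and [old](TESTING_CONFIDENCE.md).';
console.log(findBrokenLinks(sample, ['README.md', 'GETTING_STARTED.md']));
// prints [ 'TESTING_CONFIDENCE.md' ]
```

Taking the file list as a plain array keeps the check independent of the filesystem, so it can be exercised on sample text before being pointed at the real documentation tree.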
															
 
----
-
-## Summary
-
-✅ **3 files deleted** (outdated/duplicate content)  
-✅ **1 file renamed** (better clarity)  
-✅ **2 files created** (better organization)  
-✅ **3 files updated** (current information)  
-✅ **Net result**: Cleaner, more organized, more maintainable documentation
-
-**Documentation is now**:
-- Current (all Nov 26, 2025)
-- Well-organized (clear structure)
-- Easy to navigate (clear entry points)
-- Comprehensive (covers all audiences)
-- Maintainable (no duplicates, clear guidelines)
-
----
-
-**Cleanup Completed**: 2025-11-26  
-**Next Review**: 2025-12-26
--- a/evals/SCRIPTS_ORGANIZATION.md
+++ b/evals/SCRIPTS_ORGANIZATION.md
@@ -1,367 +0,0 @@
 
-# Scripts Organization Summary
-
-**Date**: 2025-11-26  
-**Status**: ✅ Complete
-
----
-
 
-## Changes Made
-
-### Before Organization
-
-```
-evals/framework/
-├── check-agent.mjs
-├── debug-claude-session.mjs
-├── debug-session.mjs
-├── debug-session.ts
-├── inspect-session.mjs
-├── run-tests-batch.sh
-├── test-agent-direct.ts
-├── test-event-inspector.js
-├── test-session-reader.mjs
-├── test-simplified-approach.mjs
-├── test-timeline.ts
-├── verify-timeline.ts
-└── ... (other framework files)
-```
-
-**Issues**:
-- ❌ 12 scripts cluttering framework root
-- ❌ No clear organization
-- ❌ Hard to find specific scripts
-- ❌ Unclear which scripts are for what purpose
-
----
-
-### After Organization
-
-```
-evals/framework/
-├── scripts/
-│   ├── debug/                    # Debugging scripts (4 files)
-│   │   ├── debug-session.mjs
-│   │   ├── debug-session.ts
-│   │   ├── debug-claude-session.mjs
-│   │   └── inspect-session.mjs
-│   │
-│   ├── test/                     # Test scripts (6 files)
-│   │   ├── test-agent-direct.ts
-│   │   ├── test-event-inspector.js
-│   │   ├── test-session-reader.mjs
-│   │   ├── test-simplified-approach.mjs
-│   │   ├── test-timeline.ts
-│   │   └── verify-timeline.ts
-│   │
-│   ├── utils/                    # Utility scripts (2 files)
-│   │   ├── run-tests-batch.sh
-│   │   └── check-agent.mjs
-│   │
-│   └── README.md                 # Script documentation
-│
-└── ... (other framework files)
-```
-
-**Benefits**:
-- ✅ Clean framework root
-- ✅ Clear organization by purpose
-- ✅ Easy to find scripts
-- ✅ Comprehensive documentation
-
 
----
-
-## Script Categories
-
-### Debug Scripts (4 files)
-
-Scripts for debugging sessions, events, and agent behavior.
-
-| Script | Purpose | Lines |
-|--------|---------|-------|
-| `debug-session.mjs` | Debug session data and timeline | ~40 |
-| `debug-session.ts` | TypeScript version of session debugger | ~100 |
-| `debug-claude-session.mjs` | Debug Claude-specific sessions | ~50 |
-| `inspect-session.mjs` | Inspect most recent session events | ~80 |
-
-**Usage**:
-```bash
-node scripts/debug/inspect-session.mjs
-node scripts/debug/debug-session.mjs <session-id>
-npx tsx scripts/debug/debug-session.ts <session-id>
-```
-
----
-
-### Test Scripts (6 files)
-
-Scripts for testing framework components during development.
-
-| Script | Purpose | Lines |
-|--------|---------|-------|
-| `test-agent-direct.ts` | Direct agent execution test | ~150 |
-| `test-event-inspector.js` | Test event capture system | ~40 |
-| `test-session-reader.mjs` | Test session reader | ~60 |
-| `test-simplified-approach.mjs` | Test simplified test approach | ~100 |
-| `test-timeline.ts` | Test timeline builder | ~90 |
-| `verify-timeline.ts` | Verify timeline accuracy | ~100 |
-
-**Usage**:
-```bash
-npx tsx scripts/test/test-agent-direct.ts
-node scripts/test/test-event-inspector.js
-npx tsx scripts/test/verify-timeline.ts
-```
-
----
-
-### Utility Scripts (2 files)
-
-General utility scripts for running tests and managing the framework.
-
-| Script | Purpose | Lines |
-|--------|---------|-------|
-| `run-tests-batch.sh` | Run tests in batches | ~100 |
-| `check-agent.mjs` | Check agent availability | ~30 |
-
-**Usage**:
-```bash
-./scripts/utils/run-tests-batch.sh openagent 3 10
-node scripts/utils/check-agent.mjs
-```
-
 
----
-
-## Documentation Updates
-
-### Files Updated
-
-1. **`evals/README.md`**
-   - Updated `run-tests-batch.sh` path references
-   - Updated directory structure
-
-2. **`evals/GETTING_STARTED.md`**
-   - Updated batch execution examples
-   - Updated script paths
-
-3. **`evals/agents/openagent/README.md`**
-   - Updated batch execution examples
-   - Updated script paths
-
-4. **`evals/agents/openagent/IMPLEMENTATION_SUMMARY.md`**
-   - Updated script references
-   - Updated directory structure
-
-5. **`evals/DOCUMENTATION_CLEANUP.md`**
-   - Updated directory structure
-
-6. **`evals/framework/README.md`**
-   - Added scripts section
-   - Added quick examples
-
-### New Documentation
-
-1. **`evals/framework/scripts/README.md`** (NEW - 200 lines)
-   - Comprehensive script documentation
-   - Usage examples for all scripts
-   - Development workflow guide
-   - Script templates
-
 
----
-
-## Path Changes
-
-### Old Paths → New Paths
-
-| Old Path | New Path |
-|----------|----------|
-| `run-tests-batch.sh` | `scripts/utils/run-tests-batch.sh` |
-| `check-agent.mjs` | `scripts/utils/check-agent.mjs` |
-| `debug-session.mjs` | `scripts/debug/debug-session.mjs` |
-| `debug-session.ts` | `scripts/debug/debug-session.ts` |
-| `debug-claude-session.mjs` | `scripts/debug/debug-claude-session.mjs` |
-| `inspect-session.mjs` | `scripts/debug/inspect-session.mjs` |
-| `test-agent-direct.ts` | `scripts/test/test-agent-direct.ts` |
-| `test-event-inspector.js` | `scripts/test/test-event-inspector.js` |
-| `test-session-reader.mjs` | `scripts/test/test-session-reader.mjs` |
-| `test-simplified-approach.mjs` | `scripts/test/test-simplified-approach.mjs` |
-| `test-timeline.ts` | `scripts/test/test-timeline.ts` |
-| `verify-timeline.ts` | `scripts/test/verify-timeline.ts` |
-
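The path table above can also be expressed as a small lookup, which may help when updating references in bulk. This is a minimal sketch and not part of the removed file: `relocatedPath` and `CATEGORY_BY_SCRIPT` are hypothetical names, and the mapping is copied from the table with paths relative to `evals/framework/`.

```javascript
// Sketch: map an old framework-root script path to its new scripts/ location.
// `relocatedPath` is a hypothetical helper; the entries mirror the table above.
const CATEGORY_BY_SCRIPT = new Map([
  ['run-tests-batch.sh', 'utils'],
  ['check-agent.mjs', 'utils'],
  ['debug-session.mjs', 'debug'],
  ['debug-session.ts', 'debug'],
  ['debug-claude-session.mjs', 'debug'],
  ['inspect-session.mjs', 'debug'],
  ['test-agent-direct.ts', 'test'],
  ['test-event-inspector.js', 'test'],
  ['test-session-reader.mjs', 'test'],
  ['test-simplified-approach.mjs', 'test'],
  ['test-timeline.ts', 'test'],
  ['verify-timeline.ts', 'test'],
]);

function relocatedPath(oldPath) {
  const name = oldPath.replace(/^\.\//, ''); // tolerate a leading "./"
  const category = CATEGORY_BY_SCRIPT.get(name);
  // Paths that were not moved are returned unchanged.
  return category ? `scripts/${category}/${name}` : oldPath;
}

console.log(relocatedPath('./run-tests-batch.sh')); // scripts/utils/run-tests-batch.sh
```

A `Map` is used rather than a plain object so that names such as `constructor` cannot accidentally match inherited properties.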
															
 
----
-
-## Migration Guide
-
-### For Users
-
-If you have scripts or documentation referencing the old paths:
-
-```bash
-# Old
-./run-tests-batch.sh openagent 3 10
-
-# New
-./scripts/utils/run-tests-batch.sh openagent 3 10
-```
-
-### For Developers
-
-If you have custom scripts importing from these files:
-
-```javascript
-// Old (script located in the framework root)
-import { SessionReader } from './dist/collector/session-reader.js';
-
-// New (script located in a scripts/<category>/ subdirectory, two levels down)
-import { SessionReader } from '../../dist/collector/session-reader.js';
-```
-
 
----
-
-## Benefits
-
-### Organization
-
-- ✅ **Clear structure** - Scripts grouped by purpose
-- ✅ **Easy navigation** - Know where to find scripts
-- ✅ **Clean root** - Framework root no longer cluttered
-- ✅ **Scalable** - Easy to add new scripts
-
-### Documentation
-
-- ✅ **Comprehensive README** - All scripts documented
-- ✅ **Usage examples** - Clear examples for each script
-- ✅ **Development workflow** - Guide for using scripts
-- ✅ **Templates** - Easy to create new scripts
-
-### Maintainability
-
-- ✅ **Easier to maintain** - Clear organization
-- ✅ **Easier to find** - Logical grouping
-- ✅ **Easier to update** - Centralized documentation
-- ✅ **Easier to extend** - Clear patterns
-
 
																----
															
 
																-
															
 
																-## Statistics
															
 
																-
															
 
																-### Before
															
 
																-
															
 
																-- **Total scripts**: 12
															
 
																-- **In framework root**: 12
															
 
																-- **Organized**: 0
															
 
																-- **Documented**: Minimal
															
 
																-
															
 
																-### After
															
 
																-
															
 
																-- **Total scripts**: 12 (same)
															
 
																-- **In framework root**: 0
															
 
																-- **Organized**: 12 (100%)
															
 
																-- **Documented**: Comprehensive (200+ lines)
															
 
																-
															
 
																-### File Count
															
 
																-
															
 
																-- **Debug scripts**: 4
															
 
																-- **Test scripts**: 6
															
 
																-- **Utility scripts**: 2
															
 
																-- **Documentation**: 1 (README.md)
															
 
																-- **Total**: 13 files (12 scripts + 1 doc)
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Maintenance Guidelines
															
 
																-
															
 
																-### Adding New Scripts
															
 
																-
															
 
																-1. **Determine category**:
															
 
																-   - Debug? → `scripts/debug/`
															
 
																-   - Test? → `scripts/test/`
															
 
																-   - Utility? → `scripts/utils/`
															
 
																-
															
 
																-2. **Create script** in appropriate directory
															
 
																-
															
 
																-3. **Update `scripts/README.md`**:
															
 
																-   - Add to table
															
 
																-   - Add usage example
															
 
																-
															
 
																-4. **Test the script**:
															
 
																-   ```bash
															
 
																-   npm run build
															
 
																-   node scripts/debug/my-script.mjs
															
 
																-   ```
															
 
																-
															
 
																-### Removing Obsolete Scripts
															
 
																-
															
 
																-1. **Delete the script file**
															
 
																-
															
 
																-2. **Update `scripts/README.md`**:
															
 
																-   - Remove from table
															
 
																-   - Remove usage example
															
 
																-
															
 
																-3. **Check for references**:
															
 
																-   ```bash
															
 
																-   rg "my-script" --type md
															
 
																-   ```
															
 
																-
															
 
																-### Updating Scripts
															
 
																-
															
 
																-1. **Make changes to script**
															
 
																-
															
 
																-2. **Test changes**:
															
 
																-   ```bash
															
 
																-   npm run build
															
 
																-   node scripts/debug/my-script.mjs
															
 
																-   ```
															
 
																-
															
 
																-3. **Update documentation** if usage changed
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Next Steps
															
 
																-
															
 
																-### Immediate
															
 
																-
															
 
																-- ✅ Scripts organized
															
 
																-- ✅ Documentation updated
															
 
																-- ✅ References updated
															
 
																-- ✅ README created
															
 
																-
															
 
																-### Future Enhancements
															
 
																-
															
 
																-1. **Add more debug scripts**
															
 
																-   - Session comparison tool
															
 
																-   - Event diff tool
															
 
																-   - Performance profiler
															
 
																-
															
 
																-2. **Add more test scripts**
															
 
																-   - Integration test runner
															
 
																-   - Performance benchmarks
															
 
																-   - Stress tests
															
 
																-
															
 
																-3. **Add more utilities**
															
 
																-   - Test result analyzer
															
 
																-   - Coverage reporter
															
 
																-   - Cleanup utilities
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Summary
															
 
																-
															
 
																-✅ **12 scripts organized** into 3 categories  
															
 
																-✅ **Framework root cleaned** (0 scripts remaining)  
															
 
																-✅ **Comprehensive documentation** (200+ lines)  
															
 
																-✅ **All references updated** (6 files)  
															
 
																-✅ **Clear structure** for future additions
															
 
																-
															
 
																-**Organization is now**:
															
 
																-- Clean and organized
															
 
																-- Well-documented
															
 
																-- Easy to navigate
															
 
																-- Easy to maintain
															
 
																-- Easy to extend
															
 
																-
															
 
																----
															
 
																-
															
 
																-**Organization Completed**: 2025-11-26  
															
 
																-**Scripts Organized**: 12  
															
 
																-**Documentation Created**: 1 README (200+ lines)  
															
 
																-**Files Updated**: 6
															
--- a/evals/agents/AGENT_TESTING_GUIDE.md
+++ b/evals/agents/AGENT_TESTING_GUIDE.md
@@ -1,417 +0,0 @@
 
																-# Agent Testing Guide - Agent-Agnostic Architecture
															
 
																-
															
 
																-## Overview
															
 
																-
															
 
																-Our evaluation framework is designed to be **agent-agnostic**, making it easy to test multiple agents with the same infrastructure.
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Architecture Layers
															
 
																-
															
 
																-### **Layer 1: Framework (Agent-Agnostic)**
															
 
																-```
															
 
																-evals/framework/
															
 
																-├── src/
															
 
																-│   ├── sdk/              # Test runner (works with any agent)
															
 
																-│   ├── evaluators/       # Generic behavior checks
															
 
																-│   └── types/            # Shared types
															
 
																-```
															
 
																-
															
 
																-**Purpose:** Shared infrastructure that works with **any agent**
															
 
																-
															
 
																-**Key Components:**
															
 
																-- `TestRunner` - Executes tests for any agent
															
 
																-- `Evaluators` - Check generic behaviors (approval, context, tools)
															
 
																-- `EventStreamHandler` - Captures events from any agent
															
 
																-- `TestCaseSchema` - Universal test format
															
 
																-
															
 
																----
															
 
																-
															
 
																-### **Layer 2: Agent-Specific Tests**
															
 
																-```
															
 
																-evals/agents/
															
 
																-├── openagent/           # OpenAgent-specific tests
															
 
																-│   ├── tests/
															
 
																-│   └── docs/
															
 
																-├── opencoder/           # OpenCoder-specific tests (future)
															
 
																-│   ├── tests/
															
 
																-│   └── docs/
															
 
																-└── shared/              # Tests for ANY agent
															
 
																-    └── tests/
															
 
																-```
															
 
																-
															
 
																-**Purpose:** Organize tests by agent for easy management
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Directory Structure
															
 
																-
															
 
																-```
															
 
																-evals/
															
 
																-├── framework/                          # SHARED FRAMEWORK
															
 
																-│   ├── src/
															
 
																-│   │   ├── sdk/
															
 
																-│   │   │   ├── test-runner.ts         # Reads 'agent' field from YAML
															
 
																-│   │   │   ├── client-manager.ts      # Routes to correct agent
															
 
																-│   │   │   └── test-case-schema.ts    # Universal schema
															
 
																-│   │   └── evaluators/
															
 
																-│   │       ├── approval-gate-evaluator.ts    # Works for any agent
															
 
																-│   │       ├── context-loading-evaluator.ts  # Works for any agent
															
 
																-│   │       └── tool-usage-evaluator.ts       # Works for any agent
															
 
																-│   └── package.json
															
 
																-│
															
 
																-├── agents/
															
 
																-│   ├── openagent/                      # OPENAGENT TESTS
															
 
																-│   │   ├── tests/
															
 
																-│   │   │   ├── developer/
															
 
																-│   │   │   │   ├── task-simple-001.yaml      # agent: openagent
															
 
																-│   │   │   │   ├── ctx-code-001.yaml         # agent: openagent
															
 
																-│   │   │   │   └── ctx-docs-001.yaml         # agent: openagent
															
 
																-│   │   │   ├── business/
															
 
																-│   │   │   │   └── conv-simple-001.yaml      # agent: openagent
															
 
																-│   │   │   └── edge-case/
															
 
																-│   │   │       └── fail-stop-001.yaml        # agent: openagent
															
 
																-│   │   └── docs/
															
 
																-│   │       └── OPENAGENT_RULES.md            # OpenAgent-specific rules
															
 
																-│   │
															
 
																-│   ├── opencoder/                      # OPENCODER TESTS (future)
															
 
																-│   │   ├── tests/
															
 
																-│   │   │   ├── developer/
															
 
																-│   │   │   │   ├── refactor-code-001.yaml    # agent: opencoder
															
 
																-│   │   │   │   └── optimize-perf-001.yaml    # agent: opencoder
															
 
																-│   │   └── docs/
															
 
																-│   │       └── OPENCODER_RULES.md            # OpenCoder-specific rules
															
 
																-│   │
															
 
																-│   └── shared/                         # SHARED TESTS (any agent)
															
 
																-│       ├── tests/
															
 
																-│       │   └── common/
															
 
																-│       │       ├── approval-gate-basic.yaml  # agent: ${AGENT}
															
 
																-│       │       └── tool-usage-basic.yaml     # agent: ${AGENT}
															
 
																-│       └── README.md
															
 
																-│
															
 
																-└── README.md
															
 
																-```
															
 
																-
															
 
																----
															
 
																-
															
 
																-## How Agent Selection Works
															
 
																-
															
 
																-### **1. Test Specifies Agent**
															
 
																-
															
 
																-```yaml
															
 
																-# openagent/tests/developer/task-simple-001.yaml
															
 
																-id: task-simple-001
															
 
																-name: Simple Bash Execution
															
 
																-agent: openagent              # ← Specifies which agent to test
															
 
																-prompt: "Run npm install"
															
 
																-```
															
 
																-
															
 
																-### **2. Test Runner Routes to Agent**
															
 
																-
															
 
																-```typescript
															
 
																-// framework/src/sdk/test-runner.ts
															
 
																-async runTest(testCase: TestCase) {
															
 
																-  // Get agent from test case
															
 
																-  const agent = testCase.agent || 'openagent';
															
 
																-  
															
 
																-  // Route to specified agent
															
 
																-  const result = await this.clientManager.sendPrompt(
															
 
																-    sessionId,
															
 
																-    testCase.prompt,
															
 
																-    { agent }  // ← SDK routes to correct agent
															
 
																-  );
															
 
																-}
															
 
																-```
															
 
																-
															
 
																-### **3. Evaluators Check Generic Behaviors**
															
 
																-
															
 
																-```typescript
															
 
																-// framework/src/evaluators/approval-gate-evaluator.ts
															
 
																-export class ApprovalGateEvaluator extends BaseEvaluator {
															
 
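																-  async evaluate(timeline: TimelineEvent[]) {
															
 
																-    // Check if ANY agent asked for approval
															
 
																-    // Works for openagent, opencoder, or any future agent
															
 
																-    // Declared locally so the excerpt is self-contained
															
 
																-    const violations: { type: string; severity: string; message: string }[] = [];
															
 
																-    
															
 
																-    const approvalRequested = timeline.some(event => 
															
 
																-      event.type === 'approval_request'
															
 
																-    );
															
 
																-    
															
 
																-    if (!approvalRequested) {
															
 
																-      violations.push({
															
 
																-        type: 'approval-gate-missing',
															
 
																-        severity: 'error',
															
 
																-        message: 'Agent executed without requesting approval'
															
 
																-      });
															
 
																-    }
															
 
																-    
															
 
																-    // Return shape is illustrative; the real BaseEvaluator contract may differ
															
 
																-    return { violations };
															
 
																-  }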
															
 
																-}
															
 
																-```
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Running Tests Per Agent
															
 
																-
															
 
																-### **Run All Tests for Specific Agent**
															
 
																-
															
 
																-```bash
															
 
																-# Run ALL OpenAgent tests
															
 
																-npm run eval:sdk -- --pattern="openagent/**/*.yaml"
															
 
																-
															
 
																-# Run ALL OpenCoder tests
															
 
																-npm run eval:sdk -- --pattern="opencoder/**/*.yaml"
															
 
																-```
															
 
																-
															
 
																-### **Run Specific Category**
															
 
																-
															
 
																-```bash
															
 
																-# Run OpenAgent developer tests
															
 
																-npm run eval:sdk -- --pattern="openagent/developer/*.yaml"
															
 
																-
															
 
																-# Run OpenCoder developer tests
															
 
																-npm run eval:sdk -- --pattern="opencoder/developer/*.yaml"
															
 
																-```
															
 
																-
															
 
																-### **Run Shared Tests for Different Agents**
															
 
																-
															
 
																-```bash
															
 
																-# Run shared tests for OpenAgent
															
 
																-npm run eval:sdk -- --pattern="shared/**/*.yaml" --agent=openagent
															
 
																-
															
 
																-# Run shared tests for OpenCoder
															
 
																-npm run eval:sdk -- --pattern="shared/**/*.yaml" --agent=opencoder
															
 
																-```
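One way to read the `--agent` flag above is as an override for the `agent` field of a shared test. The sketch below assumes that precedence (CLI flag first, then the test's `agent` field, then the `'openagent'` default seen in the test runner excerpt). `parseAgentFlag` and `resolveAgent` are hypothetical helpers for illustration, not framework APIs.

```typescript
// Minimal stand-in for a parsed test case; only the fields used here.
interface TestCaseLike {
  id: string;
  agent?: string;
}

// Extract the value of `--agent=<name>` from an argv-style list, if present.
function parseAgentFlag(argv: string[]): string | undefined {
  const prefix = "--agent=";
  const flag = argv.find((arg) => arg.startsWith(prefix));
  const value = flag?.slice(prefix.length);
  return value ? value : undefined; // treat an empty value as "not provided"
}

// Assumed precedence: CLI override > test case field > default.
function resolveAgent(testCase: TestCaseLike, cliAgent?: string): string {
  return cliAgent ?? testCase.agent ?? "openagent";
}

const shared: TestCaseLike = { id: "shared-create-file-001", agent: "openagent" };
const argv = ["--pattern=shared/**/*.yaml", "--agent=opencoder"];

console.log(resolveAgent(shared, parseAgentFlag(argv))); // opencoder
console.log(resolveAgent({ id: "no-agent-field" })); // openagent
```

Under this reading, the same shared YAML file can be run against several agents without editing it.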
															
 
																-
															
 
																-### **Run Single Test**
															
 
																-
															
 
																-```bash
															
 
																-# Run specific test
															
 
																-npx tsx src/sdk/show-test-details.ts openagent/developer/task-simple-001.yaml
															
 
																-```
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Adding a New Agent
															
 
																-
															
 
																-### **Step 1: Create Agent Directory**
															
 
																-
															
 
																-```bash
															
 
																-mkdir -p evals/agents/my-new-agent/tests/{developer,business,edge-case}
															
 
																-mkdir -p evals/agents/my-new-agent/docs
															
 
																-```
															
 
																-
															
 
																-### **Step 2: Create Agent Rules Document**
															
 
																-
															
 
																-```bash
															
 
																-# Document agent-specific rules
															
 
																-touch evals/agents/my-new-agent/docs/MY_NEW_AGENT_RULES.md
															
 
																-```
															
 
																-
															
 
																-### **Step 3: Copy Shared Tests**
															
 
																-
															
 
																-```bash
															
 
																-# Copy shared tests as starting point
															
 
																-cp evals/agents/shared/tests/common/*.yaml \
															
 
																-   evals/agents/my-new-agent/tests/developer/
															
 
																-
															
 
																-# Update agent field (GNU sed syntax; BSD/macOS sed needs: sed -i '' ...)
															
 
																-sed -i 's/agent: openagent/agent: my-new-agent/g' \
															
 
																-  evals/agents/my-new-agent/tests/developer/*.yaml
															
 
																-```
															
 
																-
															
 
																-### **Step 4: Add Agent-Specific Tests**
															
 
																-
															
 
																-```yaml
															
 
																-# my-new-agent/tests/developer/custom-test-001.yaml
															
 
																-id: custom-test-001
															
 
																-name: My New Agent Custom Test
															
 
																-agent: my-new-agent           # ← Your new agent
															
 
																-prompt: "Agent-specific prompt"
															
 
																-
															
 
																-behavior:
															
 
																-  mustUseTools: [bash]
															
 
																-  requiresApproval: true
															
 
																-
															
 
																-expectedViolations:
															
 
																-  - rule: approval-gate
															
 
																-    shouldViolate: false
															
 
																-```
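To make the `expectedViolations` block concrete, here is a rough sketch of how such expectations could be compared with the violations an evaluator reports. The matching rule (a violation `type` that starts with the expected `rule`, such as `approval-gate-missing` for `approval-gate`) and every name in the block are assumptions for illustration, not the framework's actual logic.

```typescript
// Hypothetical shapes mirroring the YAML fields and the evaluator excerpts.
interface ExpectedViolation {
  rule: string;
  shouldViolate: boolean;
}

interface Violation {
  type: string;
  message: string;
}

// Return one human-readable failure per expectation that was not met.
function checkExpectations(expected: ExpectedViolation[], actual: Violation[]): string[] {
  const failures: string[] = [];
  for (const exp of expected) {
    // Assumed matching rule: violation type begins with the rule name.
    const violated = actual.some((v) => v.type.startsWith(exp.rule));
    if (violated !== exp.shouldViolate) {
      const wanted = exp.shouldViolate ? "a violation" : "no violation";
      const got = violated ? "one was reported" : "none was reported";
      failures.push(`${exp.rule}: expected ${wanted}, but ${got}`);
    }
  }
  return failures;
}

const expected: ExpectedViolation[] = [{ rule: "approval-gate", shouldViolate: false }];

// Agent asked for approval: no violations reported, so the expectation holds.
const cleanRun = checkExpectations(expected, []);

// Agent skipped approval: the evaluator reports a violation, so the test fails.
const badRun = checkExpectations(expected, [
  { type: "approval-gate-missing", message: "Agent did not request approval" },
]);

console.log(cleanRun.length, badRun.length);
```

With `shouldViolate: false`, the test passes only when no matching violation is reported.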
															
 
																-
															
 
																-### **Step 5: Run Tests**
															
 
																-
															
 
																-```bash
															
 
																-npm run eval:sdk -- --pattern="my-new-agent/**/*.yaml"
															
 
																-```
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Test Organization Best Practices
															
 
																-
															
 
																-### **1. Agent-Specific Tests**
															
 
																-Put in `agents/{agent}/tests/`
															
 
																-
															
 
																-**When to use:**
															
 
																-- Tests specific to an agent's unique features
															
 
																-- Tests for agent-specific rules
															
 
																-- Tests that won't work for other agents
															
 
																-
															
 
																-**Example:**
															
 
																-```yaml
															
 
																-# openagent/tests/developer/ctx-code-001.yaml
															
 
																-# OpenAgent-specific: Tests context loading from openagent.md
															
 
																-agent: openagent
															
 
																-behavior:
															
 
																-  requiresContext: true  # OpenAgent-specific rule
															
 
																-```
															
 
																-
															
 
																-### **2. Shared Tests**
															
 
																-Put in `agents/shared/tests/common/`
															
 
																-
															
 
																-**When to use:**
															
 
																-- Tests that work for ANY agent
															
 
																-- Tests for universal rules (approval, tool usage)
															
 
																-- Tests you want to run across multiple agents
															
 
																-
															
 
																-**Example:**
															
 
																-```yaml
															
 
																-# shared/tests/common/approval-gate-basic.yaml
															
 
																-# Works for ANY agent
															
 
																-agent: openagent  # Default, can be overridden
															
 
																-behavior:
															
 
																-  requiresApproval: true  # Universal rule
															
 
																-```
															
 
																-
															
 
																-### **3. Category Organization**
															
 
																-
															
 
																-```
															
 
																-tests/
															
 
																-├── developer/      # Developer workflow tests
															
 
																-├── business/       # Business/analysis tests
															
 
																-├── creative/       # Content creation tests
															
 
																-└── edge-case/      # Edge cases and error handling
															
 
																-```
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Evaluator Design (Agent-Agnostic)
															
 
																-
															
 
																-### **Good: Generic Behavior Check**
															
 
																-
															
 
																-```typescript
															
 
																-// ✅ Works for any agent
															
 
																-export class ApprovalGateEvaluator extends BaseEvaluator {
															
 
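																-  async evaluate(timeline: TimelineEvent[]) {
															
 
																-    // Declared locally so the excerpt is self-contained
															
 
																-    const violations: { type: string; message: string }[] = [];
															
 
																-    
															
 
																-    // Check generic behavior: did agent ask for approval?
															
 
																-    const hasApproval = timeline.some(e => e.type === 'approval_request');
															
 
																-    
															
 
																-    if (!hasApproval) {
															
 
																-      violations.push({
															
 
																-        type: 'approval-gate-missing',
															
 
																-        message: 'Agent did not request approval'
															
 
																-      });
															
 
																-    }
															
 
																-    
															
 
																-    // Return shape is illustrative; the real BaseEvaluator contract may differ
															
 
																-    return { violations };
															
 
																-  }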
															
 
																-}
															
 
																-```
															
 
																-
															
 
																-### **Bad: Agent-Specific Logic**
															
 
																-
															
 
																-```typescript
															
 
																-// ❌ Hardcoded to specific agent
															
 
																-export class OpenAgentSpecificEvaluator extends BaseEvaluator {
															
 
																-  async evaluate(timeline: TimelineEvent[]) {
															
 
																-    // Don't do this - ties evaluator to specific agent
															
 
																-    if (sessionInfo.agent === 'openagent') {
															
 
																-      // OpenAgent-specific checks
															
 
																-    }
															
 
																-  }
															
 
																-}
															
 
																-```
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Benefits of Agent-Agnostic Design
															
 
																-
															
 
																-### **1. Easy to Add New Agents**
															
 
																-- Copy shared tests
															
 
																-- Update `agent` field
															
 
																-- Add agent-specific tests
															
 
																-- Run tests
															
 
																-
															
 
																-### **2. Consistent Behavior Across Agents**
															
 
																-- Same evaluators check all agents
															
 
																-- Same test format for all agents
															
 
																-- Easy to compare agent behaviors
															
 
																-
															
 
																-### **3. Reduced Duplication**
															
 
																-- Shared tests written once
															
 
																-- Evaluators work for all agents
															
 
																-- Framework code reused
															
 
																-
															
 
																-### **4. Easy Maintenance**
															
 
																-- Update evaluator once, affects all agents
															
 
																-- Update shared test once, affects all agents
															
 
																-- Clear separation of concerns
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Example: Testing Two Agents
															
 
																-
															
 
																-### **OpenAgent Test**
															
 
																-```yaml
															
 
																-# openagent/tests/developer/create-file.yaml
															
 
																-id: openagent-create-file-001
															
 
																-agent: openagent
															
 
																-prompt: "Create hello.ts"
															
 
																-
															
 
																-behavior:
															
 
																-  requiresContext: true  # OpenAgent loads code.md
															
 
																-```
															
 
																-
															
 
																-### **OpenCoder Test**
															
 
																-```yaml
															
 
																-# opencoder/tests/developer/create-file.yaml
															
 
																-id: opencoder-create-file-001
															
 
																-agent: opencoder
															
 
																-prompt: "Create hello.ts"
															
 
																-
															
 
																-behavior:
															
 
																-  requiresContext: false  # OpenCoder might not need context
															
 
																-```
															
 
																-
															
 
																-### **Shared Test (Works for Both)**
															
 
																-```yaml
															
 
																-# shared/tests/common/create-file.yaml
															
 
																-id: shared-create-file-001
															
 
																-agent: openagent  # Default
															
 
																-prompt: "Create hello.ts"
															
 
																-
															
 
																-behavior:
															
 
																-  requiresApproval: true  # Both agents should ask
															
 
																-```
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Summary
															
 
																-
															
 
																-**Framework Layer:**
															
 
																-- ✅ Agent-agnostic test runner
															
 
																-- ✅ Generic evaluators
															
 
																-- ✅ Universal test schema
															
 
																-
															
 
																-**Agent Layer:**
															
 
																-- ✅ Agent-specific tests in `agents/{agent}/`
															
 
																-- ✅ Shared tests in `agents/shared/`
															
 
																-- ✅ Agent-specific rules in `docs/`
															
 
																-
															
 
																-**Benefits:**
															
 
																-- ✅ Easy to add new agents
															
 
																-- ✅ Consistent behavior validation
															
 
																-- ✅ Reduced duplication
															
 
																-- ✅ Clear organization
															
 
																-
															
 
																-**To test a new agent:**
															
 
																-1. Create directory: `agents/my-agent/`
															
 
																-2. Copy shared tests
															
 
																-3. Update `agent` field
															
 
																-4. Add agent-specific tests
															
 
																-5. Run: `npm run eval:sdk -- --pattern="my-agent/**/*.yaml"`
															
--- a/evals/agents/openagent/CONTEXT_LOADING_COVERAGE.md
+++ b/evals/agents/openagent/CONTEXT_LOADING_COVERAGE.md
@@ -1,298 +0,0 @@
 
																-# Context Loading Test Coverage
															
 
																-
															
 
																-## Overview
															
 
																-
															
 
																-This document describes the context loading tests created to verify that OpenAgent correctly loads context files before it responds to user queries or executes tasks.
															
 
																-
															
 
																-**Test Location**: `evals/agents/openagent/tests/context-loading/`
															
 
																-
															
 
																-**Total Tests**: 5 (3 simple, 2 complex multi-turn)
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Test Results Summary
															
 
																-
															
 
																-**Run Date**: 2025-11-26  
															
 
																-**Pass Rate**: 3/5 (60%)  
															
 
																-**Total Duration**: 430 seconds (~7 minutes)
															
 
																-
															
 
																-| Test ID | Type | Status | Duration | Notes |
															
 
																-|---------|------|--------|----------|-------|
															
 
																-| ctx-simple-testing-approach | Simple | ✅ PASS | 35s | Loaded testing docs correctly |
															
 
																-| ctx-simple-documentation-format | Simple | ✅ PASS | 19s | Loaded docs.md correctly |
															
 
																-| ctx-simple-coding-standards | Simple | ✅ PASS | 20s | Loaded code.md correctly |
															
 
																-| ctx-multi-standards-to-docs | Complex | ❌ FAIL | 109s | No context loaded before execution |
															
 
																-| ctx-multi-error-handling-to-tests | Complex | ❌ FAIL | 246s | Timeout on prompt 4 |
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Test Descriptions
															
 
																-
															
 
																-### Simple Tests (Read-Only)
															
 
																-
															
 
																-#### 1. `ctx-simple-coding-standards.yaml`
															
 
																-**Prompt**: "What are our coding standards for this project?"
															
 
																-
															
 
																-**Expected Behavior**:
															
 
																-- Load `code.md` or `standards.md` before responding
															
 
																-- Reference project-specific standards
															
 
																-
															
 
																-**Result**: ✅ **PASSED**
															
 
																-- Agent loaded `.opencode/context/core/standards/code.md`
															
 
																-- 1 read operation performed
															
 
																-- No violations detected
															
 
																-
															
 
																----
															
 
																-
															
 
																-#### 2. `ctx-simple-documentation-format.yaml`
															
 
																-**Prompt**: "What format should I use for documentation in this project?"
															
 
																-
															
 
																-**Expected Behavior**:
															
 
																-- Load `docs.md` or `documentation.md` before responding
															
 
																-- Reference project-specific documentation standards
															
 
																-
															
 
																-**Result**: ✅ **PASSED**
															
 
																-- Agent loaded `.opencode/context/core/standards/docs.md`
															
 
																-- 1 read operation performed
															
 
																-- No violations detected
															
 
																-
															
 
																----
															
 
																-
															
 
																-#### 3. `ctx-simple-testing-approach.yaml`
															
 
																-**Prompt**: "What's our testing strategy for this project?"
															
 
																-
															
 
																-**Expected Behavior**:
															
 
																-- Load `tests.md` or `testing.md` before responding
															
 
																-- Reference project-specific testing standards
															
 
																-
															
 
																-**Result**: ✅ **PASSED**
															
 
																-- Agent loaded multiple testing-related files:
															
 
																-  - `evals/HOW_TESTS_WORK.md`
															
 
																-  - `evals/README.md`
															
 
																-  - `evals/TESTING_CONFIDENCE.md`
															
 
																-  - `evals/agents/AGENT_TESTING_GUIDE.md`
															
 
																-- 4 read operations performed
															
 
																-- No violations detected
															
 
																-
															
 
																----
															
 
																-
															
 
																-### Complex Tests (Multi-Turn with File Creation)
															
 
																-
															
 
																-#### 4. `ctx-multi-standards-to-docs.yaml`
															
 
																-**Scenario**: Standards question → Documentation request → Format question
															
 
																-
															
 
																-**Turn 1**: "What are our coding standards?"
															
 
																-- Expected: Load `standards.md` or `code.md`
															
 
																-
															
 
																-**Turn 2**: "Can you create documentation about these standards in evals/test_tmp/coding-standards-doc.md?"
															
 
																-- Expected: Load `docs.md` (documentation format)
															
 
																-- Expected: Write file to `evals/test_tmp/`
															
 
																-
															
 
																-**Turn 3**: "What will the documentation structure look like?"
															
 
																-- Expected: Reference both standards and docs context
															
 
																-
															
 
																-**Result**: ❌ **FAILED**
															
 
																-- Agent loaded context files correctly:
															
 
																-  - `.opencode/context/core/standards/code.md` (2x)
															
 
																-  - `.opencode/context/core/standards/docs.md` (1x)
															
 
																-- Agent wrote file successfully
															
 
																-- **Violation**: "No context loaded before execution" (warning)
															
 
																-- **Issue**: Context loading evaluator flagged timing issue
															
 
																-
															
 
																-**Files Created**: `evals/test_tmp/coding-standards-doc.md` (cleaned up after test)
															
 
																-
															
 
																----
															
 
																-
															
 
																-#### 5. `ctx-multi-error-handling-to-tests.yaml`
															
 
																-**Scenario**: Error handling question → Test request → Coverage policy
															
 
																-
															
 
																-**Turn 1**: "How should we handle errors in this project?"
															
 
																-- Expected: Load `standards.md` or `processes.md`
															
 
																-
															
 
																-**Turn 2**: "Can you write tests for error handling in evals/test_tmp/error-handling.test.ts?"
															
 
																-- Expected: Load `tests.md` (testing standards)
															
 
																-- Expected: Write test file to `evals/test_tmp/`
															
 
																-
															
 
																-**Turn 3**: "What's our test coverage policy?"
															
 
																-- Expected: Reference test-related context
															
 
																-
															
 
																-**Result**: ❌ **FAILED**
															
 
																-- **Error**: "Prompt 4 execution timed out"
															
 
																-- Test exceeded 180-second timeout
															
 
																-- Likely due to complex multi-turn conversation with file creation
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Cleanup Verification
															
 
																-
															
 
																-✅ **Cleanup System Working Correctly**
															
 
																-
															
 
																-**Before Tests**:
															
 
																-- Cleaned up 1 file from previous runs
															
 
																-
															
 
																-**After Tests**:
															
 
																-- Cleaned up 2 files created during tests
															
 
																-- `test_tmp/` contains only:
															
 
																-  - `.gitignore`
															
 
																-  - `README.md`
															
 
																-
															
 
																-**Cleanup Logic**: `evals/framework/src/sdk/run-sdk-tests.ts`
															
 
																-- Runs before test execution
															
 
																-- Runs after test execution
															
 
																-- Preserves only `.gitignore` and `README.md`
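The preservation rule above can be expressed as a small filter. This is a sketch under assumptions, not the code in `run-sdk-tests.ts`: `entriesToDelete` is a hypothetical name, and it works on a list of entry names rather than touching the file system.

```typescript
// Entries the document says are kept in test_tmp/.
const PRESERVED = new Set([".gitignore", "README.md"]);

// Hypothetical pure helper: given the names found in test_tmp/,
// return the ones a cleanup pass would remove.
function entriesToDelete(entries: string[]): string[] {
  return entries.filter((name) => !PRESERVED.has(name));
}

// Example: the two files created by the complex tests plus the preserved ones.
const leftover = entriesToDelete([
  ".gitignore",
  "README.md",
  "coding-standards-doc.md",
  "error-handling.test.ts",
]);
console.log(leftover);
```

Keeping the decision separate from the deletion makes the rule easy to check without creating real files.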
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Key Findings
															
 
																-
															
 
																-### ✅ Positive Results
															
 
																-
															
 
																-1. **Simple Context Loading Works**: All 3 simple tests passed
															
 
																-   - Agent correctly identifies and loads relevant context files
															
 
																-   - Agent reads context BEFORE responding
															
 
																-   - No violations in simple scenarios
															
 
																-
															
 
																-2. **Cleanup System Reliable**: 
															
 
																-   - Files created during tests are properly cleaned up
															
 
																-   - No test artifacts left in project root
															
 
																-   - `test_tmp/` directory isolation working
															
 
																-
															
 
																-3. **Context File Discovery**:
															
 
																-   - Agent successfully finds context files in `.opencode/context/core/standards/`
															
 
																-   - Agent loads multiple relevant files when appropriate
															
 
																-
															
 
																-### ⚠️ Issues Identified
															
 
																-
															
 
																-1. **Multi-Turn Context Loading**: 
															
 
																-   - Complex multi-turn tests show timing issues
															
 
																-   - Context loading evaluator flagging warnings even when files are loaded
															
 
																-   - May need to adjust evaluator logic for multi-turn scenarios
															
 
																-
															
 
																-2. **Timeout on Complex Tests**:
															
 
																-   - 180-second timeout insufficient for some multi-turn tests
															
 
																-   - Test 5 timed out on prompt 4
															
 
																-   - May need to increase timeout or simplify test scenarios
															
 
																-
															
 
																-3. **False Positive Warning**:
															
 
																-   - Test 4 loaded context correctly but still got "no-context-loaded" warning
															
 
																-   - Evaluator may not be detecting context loads in multi-turn conversations
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Recommendations
															
 
																-
															
 
																-### Immediate Actions
															
 
																-
															
 
																-1. **Increase Timeout for Complex Tests**
															
 
																-   - Change from 180s to 300s (5 minutes)
															
 
																-   - Add timeout configuration per test
															
 
																-
															
 
																-2. **Fix Context Loading Evaluator**
															
 
																-   - Review timing detection logic for multi-turn tests
															
 
																-   - Ensure evaluator tracks context loads across all prompts
															
 
																-
															
 
																-3. **Simplify Complex Tests**
															
 
																-   - Reduce number of turns in multi-turn tests
															
 
																-   - Focus on specific context loading scenarios
															
 
																-
															
 
																-### Future Enhancements
															
 
																-
															
 
																-1. **Add More Edge Cases**
															
 
																-   - Test context loading with missing files
															
 
																-   - Test context loading with multiple context directories
															
 
																-   - Test context loading with file attachments
															
 
																-
															
 
																-2. **Add Performance Metrics**
															
 
																-   - Track time between context load and execution
															
 
																-   - Measure context file read performance
															
 
																-   - Monitor API rate limits
															
 
																-
															
 
																-3. **Batch Test Execution**
															
 
																-   - Run tests in smaller batches to avoid API timeouts
															
 
																-   - Add retry logic for transient failures
															
 
																-   - Implement test result caching
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Running These Tests
															
 
																-
															
 
																-### Run All Context Loading Tests
															
 
																-```bash
															
 
																-cd evals/framework
															
 
																-npm run eval:sdk -- --agent=openagent --pattern="context-loading/*.yaml"
															
 
																-```
															
 
																-
															
 
																-### Run Individual Test
															
 
																-```bash
															
 
																-npm run eval:sdk -- --agent=openagent --pattern="context-loading/ctx-simple-coding-standards.yaml"
															
 
																-```
															
 
																-
															
 
																-### Run with Debug Output
															
 
																-```bash
															
 
																-npm run eval:sdk -- --agent=openagent --pattern="context-loading/*.yaml" --debug
															
 
																-```
															
 
																-
															
 
																-### View Results Dashboard
															
 
																-```bash
															
 
																-cd ../results
															
 
																-./serve.sh
															
 
																-```
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Test File Structure
															
 
																-
															
 
																-Each test follows this structure:
															
 
																-
															
 
																-```yaml
															
 
																-id: test-id
															
 
																-name: "Test Name"
															
 
																-description: |
															
 
																-  Detailed description of what the test validates
															
 
																-  
															
 
																-category: developer
															
 
																-agent: openagent
															
 
																-model: anthropic/claude-sonnet-4-5
															
 
																-
															
 
																-# Single prompt OR multi-turn prompts
															
 
																-prompt: "Single prompt text"
															
 
																-# OR
															
 
																-prompts:
															
 
																-  - text: "First prompt"
															
 
																-    expectContext: true
															
 
																-    contextFile: "standards.md"
															
 
																-  - text: "approve"
															
 
																-    delayMs: 2000
															
 
																-
															
 
																-# Expected behavior
															
 
																-behavior:
															
 
																-  mustUseTools: [read, write]
															
 
																-  requiresContext: true
															
 
																-  minToolCalls: 1
															
 
																-
															
 
																-# Expected violations
															
 
																-expectedViolations:
															
 
																-  - rule: context-loading
															
 
																-    shouldViolate: false
															
 
																-    severity: error
															
 
																-
															
 
																-# Approval strategy
															
 
																-approvalStrategy:
															
 
																-  type: auto-approve
															
 
																-
															
 
																-timeout: 60000
															
 
																-
															
 
																-tags:
															
 
																-  - context-loading
															
 
																-  - simple-test
															
 
																-```
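For readers who work with these files from code, the "single prompt OR multi-turn prompts" rule could be modelled roughly as follows. The field names come from the YAML above, but the `TestCase` and `PromptTurn` types and the `normalizePrompts` helper are hypothetical, and whether the framework actually rejects a file that sets both fields is an assumption.

```typescript
// Hypothetical types mirroring a subset of the YAML fields shown above.
interface PromptTurn {
  text: string;
  expectContext?: boolean;
  contextFile?: string;
  delayMs?: number;
}

interface TestCase {
  id: string;
  name: string;
  agent: string;
  prompt?: string;        // single-prompt form
  prompts?: PromptTurn[]; // multi-turn form
  timeout?: number;       // milliseconds
}

// Assumption: exactly one of `prompt` / `prompts` must be set.
// Returns a uniform list of turns either way.
function normalizePrompts(test: TestCase): PromptTurn[] {
  const hasSingle = typeof test.prompt === "string";
  const hasMulti = Array.isArray(test.prompts) && test.prompts.length > 0;
  if (hasSingle === hasMulti) {
    throw new Error(`test ${test.id}: set exactly one of prompt or prompts`);
  }
  return hasSingle ? [{ text: test.prompt as string }] : (test.prompts as PromptTurn[]);
}

const singleTurns = normalizePrompts({
  id: "test-id",
  name: "Test Name",
  agent: "openagent",
  prompt: "Single prompt text",
});
console.log(singleTurns);
```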
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Maintenance
															
 
																-
															
 
																-**Last Updated**: 2025-11-26  
															
 
																-**Test Framework Version**: 0.1.0  
															
 
																-**OpenAgent Version**: Latest  
															
 
																-
															
 
																-**Next Review**: After fixing context loading evaluator timing logic
															
--- a/evals/agents/openagent/IMPLEMENTATION_SUMMARY.md
+++ b/evals/agents/openagent/IMPLEMENTATION_SUMMARY.md
@@ -1,256 +0,0 @@
 
																-# Context Loading Tests - Implementation Summary
															
 
																-
															
 
																-**Date**: 2025-11-26  
															
 
																-**Status**: ✅ **COMPLETE - ALL TESTS PASSING (5/5)**
															
 
																-
															
 
																----
															
 
																-
															
 
																-## What We Built
															
 
																-
															
 
																-### 1. **5 Context Loading Tests** ✅
															
 
																-Created comprehensive test suite to verify OpenAgent loads context files correctly:
															
 
																-
															
 
																-**Simple Tests (3)** - Single prompt, read-only
															
 
																-- `ctx-simple-coding-standards.yaml` - Coding standards query
															
 
																-- `ctx-simple-documentation-format.yaml` - Documentation format query  
															
 
																-- `ctx-simple-testing-approach.yaml` - Testing strategy query
															
 
																-
															
 
																-**Complex Tests (2)** - Multi-turn with file creation
															
 
																-- `ctx-multi-standards-to-docs.yaml` - Standards → Documentation creation
															
 
																-- `ctx-multi-error-handling-to-tests.yaml` - Error handling → Test creation
															
 
																-
															
 
																-### 2. **Smart Timeout System** ✅
															
 
																-Implemented intelligent timeout handling for multi-turn tests:
															
 
																-- **Activity monitoring**: Checks if events are still streaming
															
 
																-- **Base timeout**: 300s (5 minutes) of inactivity triggers timeout
															
 
																-- **Absolute max**: 600s (10 minutes) hard limit
															
 
																-- **Prevents false timeouts**: Extends timeout while agent is active
															
 
																-
															
 
																-**Code**: `evals/framework/src/sdk/test-runner.ts` - `withSmartTimeout()` method
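The two limits described above combine into a simple decision rule. The sketch below shows only that rule, under assumptions: `TimeoutState`, `TimeoutVerdict`, and `checkTimeout` are hypothetical names, and the real `withSmartTimeout()` method may be structured quite differently.

```typescript
// Hypothetical state tracked while a prompt is running.
interface TimeoutState {
  startedAt: number;      // ms timestamp when the prompt was sent
  lastActivityAt: number; // ms timestamp of the most recent streamed event
}

type TimeoutVerdict = "ok" | "inactive" | "absolute-max";

// Decide whether a running prompt should be timed out at time `now`.
// Defaults follow the document: 300s of inactivity, 600s hard limit.
function checkTimeout(
  state: TimeoutState,
  now: number,
  inactivityMs: number = 300000,
  absoluteMaxMs: number = 600000,
): TimeoutVerdict {
  // The hard limit applies no matter how active the agent is.
  if (now - state.startedAt >= absoluteMaxMs) return "absolute-max";
  // Otherwise only a long silence triggers a timeout.
  if (now - state.lastActivityAt >= inactivityMs) return "inactive";
  return "ok";
}

// An agent that last streamed an event at 290s is still fine at 400s...
const stillActive = checkTimeout({ startedAt: 0, lastActivityAt: 290000 }, 400000);
// ...a silent agent times out at 300s...
const wentQuiet = checkTimeout({ startedAt: 0, lastActivityAt: 0 }, 300000);
// ...and even an active agent is stopped at the 600s hard limit.
const hitCeiling = checkTimeout({ startedAt: 0, lastActivityAt: 590000 }, 600000);
console.log(stillActive, wentQuiet, hitCeiling);
```

In a runner, a rule like this would typically be evaluated on a timer while `lastActivityAt` is refreshed on each streamed event.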
															
 
																-
															
 
																-### 3. **Fixed Context Loading Evaluator** ✅
															
 
																-Corrected evaluator to properly detect context files in multi-turn sessions:
															
 
																-
															
 
																-**Issues Fixed**:
															
 
																-- ❌ **Before**: File paths extracted from wrong location (`tool.data.input.filePath`)
															
 
																-- ✅ **After**: Correctly extracts from `tool.data.state.input.filePath`
															
 
																-- ❌ **Before**: Only checked context before FIRST execution
															
 
																-- ✅ **After**: Checks context for ALL executions requiring it
															
 
																-- ❌ **Before**: False positives on multi-turn tests
															
 
																-- ✅ **After**: Properly tracks context across multiple prompts
															
 
																-
															
 
																-**Code**: `evals/framework/src/evaluators/context-loading-evaluator.ts`
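The "check every execution" fix can be pictured with a simplified model. Everything here is an assumption made for illustration: the `ToolEvent` shape, the set of tools treated as executions, and the `.opencode/context/` path test are not taken from `context-loading-evaluator.ts`.

```typescript
// Hypothetical, simplified event shape; the framework's real types may differ.
interface ToolInput { filePath?: string; path?: string }
interface ToolEvent {
  tool: string;      // e.g. "read", "write", "bash"
  timestamp: number; // ms
  data?: { state?: { input?: ToolInput }; input?: ToolInput };
}

// Assumption: these tools count as "executions" that require context first.
const EXECUTION_TOOLS = new Set(["write", "edit", "bash"]);

// Prefer the newer `state.input` location, then fall back to the old one.
function extractFilePath(event: ToolEvent): string | undefined {
  return (
    event.data?.state?.input?.filePath ||
    event.data?.state?.input?.path ||
    event.data?.input?.filePath ||
    event.data?.input?.path
  );
}

// Assumption: a context load is a read of a file under .opencode/context/.
function isContextRead(event: ToolEvent): boolean {
  const filePath = extractFilePath(event);
  return event.tool === "read" && filePath !== undefined && filePath.includes(".opencode/context/");
}

// Return every execution (not just the first) with no context read before it.
function executionsMissingContext(events: ToolEvent[]): ToolEvent[] {
  const sorted = [...events].sort((a, b) => a.timestamp - b.timestamp);
  const missing: ToolEvent[] = [];
  let contextSeen = false;
  for (const event of sorted) {
    if (isContextRead(event)) contextSeen = true;
    else if (EXECUTION_TOOLS.has(event.tool) && !contextSeen) missing.push(event);
  }
  return missing;
}

const contextFirst: ToolEvent[] = [
  { tool: "read", timestamp: 1000, data: { state: { input: { filePath: ".opencode/context/core/standards/code.md" } } } },
  { tool: "write", timestamp: 45000, data: { state: { input: { filePath: "evals/test_tmp/coding-standards-doc.md" } } } },
];
const writeFirst: ToolEvent[] = [
  { tool: "write", timestamp: 1000, data: { state: { input: { filePath: "evals/test_tmp/coding-standards-doc.md" } } } },
  { tool: "read", timestamp: 2000, data: { state: { input: { filePath: ".opencode/context/core/standards/code.md" } } } },
];
console.log(executionsMissingContext(contextFirst).length, executionsMissingContext(writeFirst).length);
```

Because events from all prompts in a session go into one timeline, a context read in an earlier turn still counts for a write in a later turn in this model.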
															
 
																-
															
 
																-### 4. **Batch Test Runner** ✅
															
 
																-Created helper script for running tests in controlled batches:
															
 
																-- Configurable batch size (default: 3 tests)
															
 
																-- Configurable delay between batches (default: 10s)
															
 
																-- Prevents API rate limits
															
 
																-- Better resource management
															
 
																-
															
 
																-**Script**: `evals/framework/scripts/utils/run-tests-batch.sh`
															
 
																-
															
 
																-**Usage**:
															
 
																-```bash
															
 
																-cd evals/framework
															
 
																-./scripts/utils/run-tests-batch.sh openagent 3 10
															
 
																-```
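The batching idea behind the script can be sketched as a chunking step. This is not a port of `run-tests-batch.sh`: `toBatches` is a hypothetical helper, and the delay between batches is only described in a comment.

```typescript
// Hypothetical helper: split a list of test files into batches of `batchSize`.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// The five context loading tests, split with the default batch size of 3.
const batches = toBatches(
  [
    "ctx-simple-coding-standards.yaml",
    "ctx-simple-documentation-format.yaml",
    "ctx-simple-testing-approach.yaml",
    "ctx-multi-standards-to-docs.yaml",
    "ctx-multi-error-handling-to-tests.yaml",
  ],
  3,
);
// A runner would execute each batch in turn and wait (10s by default)
// before starting the next one.
console.log(batches.map((batch) => batch.length));
```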
															
 
																-
															
 
																-### 5. **Cleanup System Verified** ✅
															
 
																-Confirmed automatic cleanup working correctly:
															
 
																-- Cleans `test_tmp/` before tests
															
 
																-- Cleans `test_tmp/` after tests
															
 
																-- Preserves only `.gitignore` and `README.md`
															
 
																-- No test artifacts left behind
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Test Results
															
 
																-
															
 
																-### Final Run: 100% Pass Rate 🎉
															
 
																-
															
 
																-| Test | Type | Duration | Status | Context Files Loaded |
															
 
																-|------|------|----------|--------|---------------------|
															
 
																-| ctx-simple-testing-approach | Simple | 38s | ✅ PASS | 4 files (README, HOW_TESTS_WORK, etc.) |
															
 
																-| ctx-simple-documentation-format | Simple | 26s | ✅ PASS | docs.md |
															
 
																-| ctx-simple-coding-standards | Simple | 21s | ✅ PASS | code.md |
															
 
																-| ctx-multi-standards-to-docs | Complex | 116s | ✅ PASS | code.md, docs.md (44s before execution) |
															
 
																-| ctx-multi-error-handling-to-tests | Complex | 148s | ✅ PASS | code.md, tests.md (58s before execution) |
															
 
																-
															
 
																-**Total Duration**: 349 seconds (~6 minutes)  
															
 
																-**Pass Rate**: 5/5 (100%)  
															
 
																-**Violations**: 0
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Key Findings
															
 
																-
															
 
																-### ✅ **OpenAgent Context Loading Works Correctly**
															
 
																-
															
 
																-1. **Simple queries**: Agent loads appropriate context files before responding
															
 
																-2. **Multi-turn conversations**: Agent loads context for each execution phase
															
 
																-3. **File creation**: Agent loads both standards AND format context before writing
															
 
																-4. **Timing**: Context loaded 44-58 seconds before execution (plenty of time)
															
 
																-
															
 
																-### ✅ **Test Infrastructure is Solid**
															
 
																-
															
 
																-1. **Same session tracking**: Multi-turn tests use single session (verified)
															
 
																-2. **Smart timeout**: Prevents false timeouts while catching real hangs
															
 
																-3. **Cleanup**: No test artifacts left behind
															
 
																-4. **Evaluators**: Accurately detect context loading behavior
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Technical Details
															
 
																-
															
 
																-### Session Tracking (Multi-Turn)
															
 
																-```typescript
															
 
																-// Single session created once
															
 
																-const session = await this.client.createSession({ title: testCase.name });
															
 
																-sessionId = session.id;
															
 
																-
															
 
																-// All prompts use SAME session
															
 
																-for (let i = 0; i < testCase.prompts.length; i++) {
															
 
																-  await this.client.sendPrompt(sessionId, { text: msg.text, ... });
															
 
																-}
															
 
																-```
															
 
																-
															
 
																-### Smart Timeout Logic
															
 
																-```typescript
															
 
																-// Base timeout: 300s of inactivity
															
 
																-// Max timeout: 600s absolute
															
 
																-await this.withSmartTimeout(
															
 
																-  promptPromise,
															
 
																-  300000,  // 5 min activity timeout
															
 
																-  600000,  // 10 min absolute max
															
 
																-  `Prompt ${i + 1} execution timed out`
															
 
																-);
															
 
																-```
															
 
																-
															
 
																-### Context File Detection
															
 
																-```typescript
															
 
																-// Fixed file path extraction
															
 
																-const filePath = tool.data?.state?.input?.filePath ||  // ✅ NEW
															
 
																-                tool.data?.state?.input?.path ||
															
 
																-                tool.data?.input?.filePath ||          // Old fallback
															
 
																-                tool.data?.input?.path;
															
 
																-```
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Files Modified
															
 
																-
															
 
																-### New Files Created
															
 
																-```
															
 
																-evals/agents/openagent/tests/context-loading/
															
 
																-├── ctx-simple-coding-standards.yaml
															
 
																-├── ctx-simple-documentation-format.yaml
															
 
																-├── ctx-simple-testing-approach.yaml
															
 
																-├── ctx-multi-standards-to-docs.yaml
															
 
																-└── ctx-multi-error-handling-to-tests.yaml
															
 
																-
															
 
																-evals/agents/openagent/
															
 
																-├── CONTEXT_LOADING_COVERAGE.md
															
 
																-└── IMPLEMENTATION_SUMMARY.md (this file)
															
 
																-
															
 
																-evals/framework/
															
 
																-└── scripts/
															
 
																-```
															
 
																-
															
 
																-### Files Modified
															
 
																-```
															
 
																-evals/framework/src/sdk/test-runner.ts
															
 
																-  - Added withSmartTimeout() method
															
 
																-  - Updated multi-turn test execution to use smart timeout
															
 
																-
															
 
																-evals/framework/src/evaluators/context-loading-evaluator.ts
															
 
																-  - Fixed file path extraction (tool.data.state.input.filePath)
															
 
																-  - Added multi-turn execution checking
															
 
																-  - Improved violation detection
															
 
																-
															
 
																-evals/agents/openagent/tests/context-loading/*.yaml
															
 
																-  - Increased timeout from 180s to 300s for complex tests
															
 
																-```
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Recommendations Completed
															
 
																-
															
 
																-### ✅ Recommendation 1: Fix Timeout Issue
															
 
																-- **Status**: COMPLETE
															
 
																-- **Solution**: Implemented smart timeout with activity monitoring
															
 
																-- **Result**: No more false timeouts, complex tests complete successfully
															
 
																-
															
 
																-### ✅ Recommendation 2: Fix Context Loading Evaluator  
															
 
																-- **Status**: COMPLETE
															
 
																-- **Solution**: Fixed file path extraction and multi-turn tracking
															
 
																-- **Result**: Evaluator correctly detects context loading in all scenarios
															
 
																-
															
 
																-### ✅ Recommendation 3: Batch Test Execution
															
 
																-- **Status**: COMPLETE
															
 
																-- **Solution**: Created `run-tests-batch.sh` script
															
 
																-- **Result**: Can run tests in controlled batches with delays
															
 
																-
															
 
																----
															
 
																-
															
 
																-## How to Use
															
 
																-
															
 
																-### Run All Context Loading Tests
															
 
																-```bash
															
 
																-cd evals/framework
															
 
																-npm run eval:sdk -- --agent=openagent --pattern="context-loading/*.yaml"
															
 
																-```
															
 
																-
															
 
																-### Run Single Test
															
 
																-```bash
															
 
																-npm run eval:sdk -- --agent=openagent --pattern="context-loading/ctx-simple-coding-standards.yaml"
															
 
																-```
															
 
																-
															
 
																-### Run in Batches (Avoid API Limits)
															
 
																-```bash
															
 
																-./scripts/utils/run-tests-batch.sh openagent 3 10
															
 
																-# Args: agent, batch_size, delay_seconds
															
 
																-```
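The batch-and-delay idea behind these arguments can be sketched in TypeScript. This is only an illustration under assumptions: `chunk`, `sleep`, and `runInBatches` are hypothetical names, and the real `run-tests-batch.sh` may work differently.

```typescript
// Hypothetical sketch; not a port of run-tests-batch.sh.

// Split a list into consecutive groups of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run each batch in turn, pausing between batches (not after the last one).
async function runInBatches<T>(
  items: T[],
  batchSize: number,
  delaySeconds: number,
  runOne: (item: T) => Promise<void>,
): Promise<void> {
  const batches = chunk(items, batchSize);
  for (let i = 0; i < batches.length; i++) {
    for (const item of batches[i]) {
      await runOne(item);
    }
    if (i < batches.length - 1) {
      await sleep(delaySeconds * 1000);
    }
  }
}
```

With a batch size of 3 and a delay of 10, seven test files would run as groups of 3, 3, and 1 with a 10-second pause after the first two groups.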
															
 
																-
															
 
																-### View Results Dashboard
															
 
																-```bash
															
 
																-cd ../results
															
 
																-./serve.sh
															
 
																-```
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Next Steps (Optional Enhancements)
															
 
																-
															
 
																-1. **Add More Edge Cases**
															
 
																-   - Test with missing context files
															
 
																-   - Test with multiple context directories
															
 
																-   - Test with file attachments
															
 
																-
															
 
																-2. **Performance Metrics**
															
 
																-   - Track context load time vs execution time
															
 
																-   - Measure API response times
															
 
																-   - Monitor rate limit usage
															
 
																-
															
 
																-3. **Test Coverage Expansion**
															
 
																-   - Add tests for other agent behaviors
															
 
																-   - Test delegation scenarios
															
 
																-   - Test error handling paths
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Conclusion
															
 
																-
															
 
																-✅ **All objectives achieved**  
															
 
																-✅ **100% test pass rate**  
															
 
																-✅ **OpenAgent context loading verified working correctly**  
															
 
																-✅ **Test infrastructure improved and reliable**  
															
 
																-✅ **Documentation complete**
															
 
																-
															
 
																-The context loading test suite covers OpenAgent's context file loading behavior in both simple and complex multi-turn scenarios, and all of its tests currently pass.
															
 
																-
															
 
																----
															
 
																-
															
 
																-**Maintained by**: OpenCode Agents Team  
															
 
																-**Last Updated**: 2025-11-26  
															
 
																-**Test Framework Version**: 0.1.0
															
--- a/evals/agents/opencoder/README.md
+++ b/evals/agents/opencoder/README.md
@@ -1,41 +0,0 @@
 
																-# Opencoder Agent Tests
															
 
																-
															
 
																-Tests for the `opencoder` agent - a development-focused agent that executes code tasks directly.
															
 
																-
															
 
																-## Agent Characteristics
															
 
																-
															
 
																-- **Mode**: Primary development agent
															
 
																-- **Behavior**: Executes tools directly without text-based approval workflow
															
 
																-- **Best for**: Code implementation, bash commands, file operations
															
 
																-- **Approval**: Uses tool permission system (auto-approve in tests)
															
 
																-
															
 
																-## Test Categories
															
 
																-
															
 
																-### Developer Tests (`tests/developer/`)
															
 
																-- Bash command execution
															
 
																-- File operations
															
 
																-- Code implementation tasks
															
 
																-
															
 
																-### Business Tests (`tests/business/`)
															
 
																-- Data analysis tasks
															
 
																-- Report generation
															
 
																-
															
 
																-### Edge Cases (`tests/edge-case/`)
															
 
																-- Error handling
															
 
																-- Permission boundaries
															
 
																-
															
 
																-## Running Tests
															
 
																-
															
 
																-```bash
															
 
																-cd evals/framework
															
 
																-npx tsx src/sdk/run-sdk-tests.ts --agent opencoder
															
 
																-```
															
 
																-
															
 
																-## Key Differences from OpenAgent
															
 
																-
															
 
																-| Feature | Opencoder | OpenAgent |
															
 
																-|---------|-----------|-----------|
															
 
																-| Approval | Tool permission system | Text-based + tool permission |
															
 
																-| Workflow | Direct execution | Analyze→Approve→Execute→Validate |
															
 
																-| Context Loading | On-demand | Mandatory before execution |
															
 
																-| Best for | Simple tasks | Complex workflows |
															
--- a/evals/agents/shared/README.md
+++ b/evals/agents/shared/README.md
@@ -1,74 +0,0 @@
 
																-# Shared Test Cases
															
 
																-
															
 
																-Tests in this directory are **agent-agnostic** and can be used to test **any agent** that follows the same core rules.
															
 
																-
															
 
																-## Purpose
															
 
																-
															
 
																-Shared tests validate **universal behaviors** that all agents should follow:
															
 
																-- Approval gate enforcement
															
 
																-- Tool usage patterns
															
 
																-- Basic workflow compliance
															
 
																-- Error handling
															
 
																-
															
 
																-## Usage
															
 
																-
															
 
																-### Run Shared Tests for OpenAgent
															
 
																-```bash
															
 
																-npm run eval:sdk -- --pattern="shared/**/*.yaml" --agent=openagent
															
 
																-```
															
 
																-
															
 
																-### Run Shared Tests for OpenCoder
															
 
																-```bash
															
 
																-npm run eval:sdk -- --pattern="shared/**/*.yaml" --agent=opencoder
															
 
																-```
															
 
																-
															
 
																-### Override Agent in Test File
															
 
																-```yaml
															
 
																-# In the YAML file
															
 
																-agent: openagent  # Change to opencoder, or any other agent
															
 
																-```
															
 
																-
															
 
																-## Test Categories
															
 
																-
															
 
																-### `common/` - Universal Rules
															
 
																-Tests that apply to **all agents**:
															
 
																-- `approval-gate-basic.yaml` - Basic approval enforcement
															
 
																-- `tool-usage-basic.yaml` - Basic tool selection (future)
															
 
																-- `error-handling-basic.yaml` - Basic error handling (future)
															
 
																-
															
 
																-## Adding New Shared Tests
															
 
																-
															
 
																-1. Create test in `shared/tests/common/`
															
 
																-2. Use generic prompts (not agent-specific)
															
 
																-3. Test universal behaviors only
															
 
																-4. Tag with `shared-test` and `agent-agnostic`
															
 
																-5. Document which agents it applies to
															
 
																-
															
 
																-## Example
															
 
																-
															
 
																-```yaml
															
 
																-id: shared-example-001
															
 
																-name: Example Shared Test
															
 
																-category: edge-case
															
 
																-agent: openagent  # Default, can be overridden
															
 
																-
															
 
																-prompt: "Generic prompt that works for any agent"
															
 
																-
															
 
																-behavior:
															
 
																-  requiresApproval: true  # Universal rule
															
 
																-
															
 
																-expectedViolations:
															
 
																-  - rule: approval-gate
															
 
																-    shouldViolate: false
															
 
																-
															
 
																-tags:
															
 
																-  - shared-test
															
 
																-  - agent-agnostic
															
 
																-```
															
 
																-
															
 
																-## Benefits
															
 
																-
															
 
																-1. **Reduce Duplication** - Write once, test multiple agents
															
 
																-2. **Consistency** - Same tests ensure consistent behavior
															
 
																-3. **Easy Comparison** - Compare agent behaviors side-by-side
															
 
																-4. **Faster Onboarding** - New agents inherit core test suite
															
--- a/evals/framework/scripts/README.md
+++ b/evals/framework/scripts/README.md
@@ -1,195 +0,0 @@
 
																-# Framework Scripts
															
 
																-
															
 
																-Utility scripts for debugging, testing, and development.
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Directory Structure
															
 
																-
															
 
																-```
															
 
																-scripts/
															
 
																-├── debug/          # Debugging scripts for sessions and events
															
 
																-├── test/           # Test scripts for framework development
															
 
																-├── utils/          # Utility scripts (batch runner, etc.)
															
 
																-└── README.md       # This file
															
 
																-```
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Debug Scripts (`debug/`)
															
 
																-
															
 
																-Scripts for debugging sessions, events, and agent behavior.
															
 
																-
															
 
																-| Script | Purpose | Usage |
															
 
																-|--------|---------|-------|
															
 
																-| `debug-session.mjs` | Debug session data and timeline | `node scripts/debug/debug-session.mjs <session-id>` |
															
 
																-| `debug-session.ts` | TypeScript version of session debugger | `npx tsx scripts/debug/debug-session.ts <session-id>` |
															
 
																-| `debug-claude-session.mjs` | Debug Claude-specific sessions | `node scripts/debug/debug-claude-session.mjs <session-id>` |
															
 
																-| `inspect-session.mjs` | Inspect most recent session events | `node scripts/debug/inspect-session.mjs` |
															
 
																-
															
 
																-### Examples
															
 
																-
															
 
																-```bash
															
 
																-# Debug a specific session
															
 
																-node scripts/debug/debug-session.mjs ses_abc123
															
 
																-
															
 
																-# Inspect latest session
															
 
																-node scripts/debug/inspect-session.mjs
															
 
																-
															
 
																-# Debug with TypeScript
															
 
																-npx tsx scripts/debug/debug-session.ts ses_abc123
															
 
																-```
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Test Scripts (`test/`)
															
 
																-
															
 
																-Scripts for testing framework components during development.
															
 
																-
															
 
																-| Script | Purpose | Usage |
															
 
																-|--------|---------|-------|
															
 
																-| `test-agent-direct.ts` | Direct agent execution test | `npx tsx scripts/test/test-agent-direct.ts` |
															
 
																-| `test-event-inspector.js` | Test event capture system | `node scripts/test/test-event-inspector.js` |
															
 
																-| `test-session-reader.mjs` | Test session reader | `node scripts/test/test-session-reader.mjs` |
															
 
																-| `test-simplified-approach.mjs` | Test simplified test approach | `node scripts/test/test-simplified-approach.mjs` |
															
 
																-| `test-timeline.ts` | Test timeline builder | `npx tsx scripts/test/test-timeline.ts` |
															
 
																-| `verify-timeline.ts` | Verify timeline accuracy | `npx tsx scripts/test/verify-timeline.ts` |
															
 
																-
															
 
																-### Examples
															
 
																-
															
 
																-```bash
															
 
																-# Test agent execution
															
 
																-npx tsx scripts/test/test-agent-direct.ts
															
 
																-
															
 
																-# Test event capture
															
 
																-node scripts/test/test-event-inspector.js
															
 
																-
															
 
																-# Verify timeline
															
 
																-npx tsx scripts/test/verify-timeline.ts
															
 
																-```
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Utility Scripts (`utils/`)
															
 
																-
															
 
																-General utility scripts for running tests and managing the framework.
															
 
																-
															
 
																-| Script | Purpose | Usage |
															
 
																-|--------|---------|-------|
															
 
																-| `run-tests-batch.sh` | Run tests in batches | `./scripts/utils/run-tests-batch.sh <agent> <batch-size> <delay>` |
															
 
																-| `check-agent.mjs` | Check agent availability | `node scripts/utils/check-agent.mjs` |
															
 
																-
															
 
																-### Examples
															
 
																-
															
 
																-```bash
															
 
																-# Run tests in batches of 3 with 10s delay
															
 
																-./scripts/utils/run-tests-batch.sh openagent 3 10
															
 
																-
															
 
																-# Check if agent is available
															
 
																-node scripts/utils/check-agent.mjs
															
 
																-```
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Development Workflow
															
 
																-
															
 
																-### Debugging a Failed Test
															
 
																-
															
 
																-1. Run test with debug flag:
															
 
																-   ```bash
															
 
																-   npm run eval:sdk -- --pattern="my-test.yaml" --debug
															
 
																-   ```
															
 
																-
															
 
																-2. Note the session ID from output
															
 
																-
															
 
																-3. Inspect the session:
															
 
																-   ```bash
															
 
																-   node scripts/debug/inspect-session.mjs
															
 
																-   # or
															
 
																-   node scripts/debug/debug-session.mjs <session-id>
															
 
																-   ```
															
 
																-
															
 
																-4. Check timeline events:
															
 
																-   ```bash
															
 
																-   npx tsx scripts/debug/debug-session.ts <session-id>
															
 
																-   ```
															
 
																-
															
 
																-### Testing Framework Changes
															
 
																-
															
 
																-1. Make changes to framework code
															
 
																-
															
 
																-2. Build:
															
 
																-   ```bash
															
 
																-   npm run build
															
 
																-   ```
															
 
																-
															
 
																-3. Test specific component:
															
 
																-   ```bash
															
 
																-   npx tsx scripts/test/test-timeline.ts
															
 
																-   ```
															
 
																-
															
 
																-4. Run full test suite:
															
 
																-   ```bash
															
 
																-   npm run eval:sdk
															
 
																-   ```
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Script Dependencies
															
 
																-
															
 
																-All scripts require the framework to be built first:
															
 
																-
															
 
																-```bash
															
 
																-npm run build
															
 
																-```
															
 
																-
															
 
																-Some scripts use:
															
 
																-- `@opencode-ai/sdk` - For SDK client
															
 
																-- `tsx` - For TypeScript execution
															
 
																-- Framework dist files - Built TypeScript output
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Adding New Scripts
															
 
																-
															
 
																-### Debug Script Template
															
 
																-
															
 
																-```javascript
															
 
																-// scripts/debug/my-debug-script.mjs
															
 
																-import { SessionReader } from '../../dist/collector/session-reader.js';
															
 
																-import { createOpencodeClient } from '@opencode-ai/sdk';
															
 
																-
															
 
																-const client = createOpencodeClient({
															
 
																-  baseUrl: 'http://localhost:3721'
															
 
																-});
															
 
																-
															
 
																-// Your debug logic here
															
 
																-```
															
 
																-
															
 
																-### Test Script Template
															
 
																-
															
 
																-```typescript
															
 
																-#!/usr/bin/env -S npx tsx
															
 
																-// scripts/test/my-test-script.ts
															
 
																-
															
 
																-import { TestRunner } from '../../dist/sdk/test-runner.js';
															
 
																-
															
 
																-async function runTest() {
															
 
																-  // Your test logic here
															
 
																-}
															
 
																-
															
 
																-runTest().catch(console.error);
															
 
																-```
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Maintenance
															
 
																-
															
 
																-- **Keep scripts organized** - Put debug scripts in `debug/`, test scripts in `test/`, and utility scripts in `utils/`
															
 
																-- **Update this README** - When adding new scripts
															
 
																-- **Remove obsolete scripts** - Delete scripts that are no longer needed
															
 
																-- **Document usage** - Add clear usage examples
															
 
																-
															
 
																----
															
 
																-
															
 
																-**Last Updated**: 2025-11-26
															
--- a/evals/results/README.md
+++ b/evals/results/README.md
@@ -1,279 +0,0 @@
 
																-# 📊 Test Results Dashboard
															
 
																-
															
 
																-Interactive dashboard for visualizing OpenCode agent test results.
															
 
																-
															
 
																-## ⚡ Quick Reference
															
 
																-
															
 
																-```bash
															
 
																-# Run tests
															
 
																-cd evals/framework && npm run eval:sdk -- --agent=opencoder
															
 
																-
															
 
																-# View dashboard (auto-opens browser, auto-shuts down)
															
 
																-cd evals/results && ./serve.sh
															
 
																-```
															
 
																-
															
 
																-That's it! 🎉
															
 
																-
															
 
																----
															
 
																-
															
 
																-## Quick Start
															
 
																-
															
 
																-1. **Run Tests:**
															
 
																-   ```bash
															
 
																-   cd evals/framework
															
 
																-   npm run eval:sdk -- --agent=opencoder
															
 
																-   npm run eval:sdk -- --agent=openagent
															
 
																-   ```
															
 
																-
															
 
																-2. **View Dashboard:**
															
 
																-   
															
 
																-   **Option A: One-Command Solution (Easiest)** ⭐
															
 
																-   ```bash
															
 
																-   cd evals/results
															
 
																-   ./serve.sh
															
 
																-   ```
															
 
																-   - Auto-opens browser
															
 
																-   - Loads dashboard
															
 
																-   - Auto-shuts down after 15 seconds
															
 
																-   - Dashboard stays cached in browser!
															
 
																-   
															
 
																-   **Custom timeout:**
															
 
																-   ```bash
															
 
																-   ./serve.sh 8000 30  # Port 8000, 30 second timeout
															
 
																-   ```
															
 
																-   
															
 
																-   **Option B: Keep Server Running**
															
 
																-   ```bash
															
 
																-   cd evals/results
															
 
																-   python3 -m http.server 8000
															
 
																-   ```
															
 
																-   Press Ctrl+C to stop manually
															
 
																-   
															
 
																-   **Option C: Direct File Access**
															
 
																-   ```bash
															
 
																-   open evals/results/index.html
															
 
																-   ```
															
 
																-   ⚠️ Note: Some browsers block loading JSON from local files. If you see an error, use Option A or B.
															
 
																-
															
 
																-## Features
															
 
																-
															
 
																-### 📈 Overview Stats
															
 
																-- **Total Tests** - Count across all agents
															
 
																-- **Pass Rate** - Percentage of passing tests
															
 
																-- **Failed Tests** - Number of failures
															
 
																-- **Avg Duration** - Average test execution time
															
 
																-
															
 
																-### 📊 Trend Chart
															
 
																-- Visual representation of pass rate over time
															
 
																-- Shows last 30 days of test runs
															
 
																-- Helps identify regressions
															
 
																-
															
 
																-### 🔍 Filters
															
 
																-- **Agent** - Filter by openagent, opencoder, etc.
															
 
																-- **Category** - Developer, business, creative, edge-case
															
 
																-- **Status** - All, passed only, or failed only
															
 
																-- **Time Range** - Latest, today, last 7 days, last 30 days
															
 
																-
															
 
																-### 🔎 Search
															
 
																-- Real-time search across test IDs
															
 
																-- Case-insensitive matching
															
 
																-
															
 
																-### 📋 Test Table
															
 
																-- **Sortable Columns** - Click any header to sort
															
 
																-- **Expandable Rows** - Click a row to see details
															
 
																-- **Violation Details** - See error messages and severity
															
 
																-
															
 
																-### 🌙 Dark Mode
															
 
																-- Toggle with moon/sun icon in header
															
 
																-- Preference saved to localStorage
															
 
																-- Easy on the eyes for long sessions
															
 
																-
															
 
																-### 📥 Export
															
 
																-- Export filtered results to CSV
															
 
																-- Includes all test metadata
															
 
																-- Perfect for external analysis
															
 
																-
															
 
																-## File Structure
															
 
																-
															
 
																-```
															
 
																-results/
															
 
																-├── index.html              # Dashboard (open this)
															
 
																-├── serve.sh                # Helper script to start HTTP server
															
 
																-├── latest.json             # Most recent test run
															
 
																-├── history/
															
 
																-│   └── 2025-11/
															
 
																-│       ├── 26-115759-opencoder.json
															
 
																-│       └── 26-115850-openagent.json
															
 
																-├── .gitignore              # Retention policy
															
 
																-└── README.md               # This file
															
 
																-```
															
 
																-
															
 
																-## JSON Format
															
 
																-
															
 
																-Each result file contains:
															
 
																-
															
 
																-```json
															
 
																-{
															
 
																-  "meta": {
															
 
																-    "timestamp": "2025-11-26T11:59:36.365Z",
															
 
																-    "agent": "openagent",
															
 
																-    "model": "opencode/grok-code-fast",
															
 
																-    "framework_version": "0.1.0",
															
 
																-    "git_commit": "f872007"
															
 
																-  },
															
 
																-  "summary": {
															
 
																-    "total": 8,
															
 
																-    "passed": 6,
															
 
																-    "failed": 2,
															
 
																-    "duration_ms": 32450,
															
 
																-    "pass_rate": 0.75
															
 
																-  },
															
 
																-  "by_category": {
															
 
																-    "developer": { "passed": 5, "total": 6 },
															
 
																-    "business": { "passed": 1, "total": 1 },
															
 
																-    "edge-case": { "passed": 0, "total": 1 }
															
 
																-  },
															
 
																-  "tests": [
															
 
																-    {
															
 
																-      "id": "task-simple-001",
															
 
																-      "category": "developer",
															
 
																-      "passed": true,
															
 
																-      "duration_ms": 4200,
															
 
																-      "events": 23,
															
 
																-      "approvals": 2,
															
 
																-      "violations": {
															
 
																-        "total": 0,
															
 
																-        "errors": 0,
															
 
																-        "warnings": 0
															
 
																-      }
															
 
																-    }
															
 
																-  ]
															
 
																-}
															
 
																-```
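To make the relationship between `tests`, `summary`, and `by_category` concrete, here is a small sketch that recomputes the aggregate fields from the per-test entries. The field names follow the JSON example; the helper names `summarize` and `byCategory` are hypothetical, and the framework may compute these values differently.

```typescript
// Field names mirror the JSON example; the helpers themselves are hypothetical.
interface TestResult {
  id: string;
  category: string;
  passed: boolean;
  duration_ms: number;
}

interface Summary {
  total: number;
  passed: number;
  failed: number;
  pass_rate: number;
}

// Recompute the summary counts from the per-test entries.
function summarize(tests: TestResult[]): Summary {
  const total = tests.length;
  const passed = tests.filter((t) => t.passed).length;
  return {
    total,
    passed,
    failed: total - passed,
    pass_rate: total === 0 ? 0 : passed / total,
  };
}

// Group pass counts by category, matching the shape of `by_category`.
function byCategory(
  tests: TestResult[],
): Record<string, { passed: number; total: number }> {
  const out: Record<string, { passed: number; total: number }> = {};
  for (const t of tests) {
    const entry = out[t.category] ?? { passed: 0, total: 0 };
    entry.total += 1;
    if (t.passed) entry.passed += 1;
    out[t.category] = entry;
  }
  return out;
}
```

A check like this can be useful when hand-editing or merging result files, since the dashboard presumably trusts the stored `summary` values.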
															
 
																-
															
 
																-## Retention Policy
															
 
																-
															
 
																-Results are automatically managed:
															
 
																-
															
 
																-- ✅ **Latest Run** - Always kept (`latest.json`)
															
 
																-- ✅ **Current Month** - All results committed to git
															
 
																-- ✅ **Previous Month** - All results committed to git
															
 
																-- ❌ **Older than 60 days** - Kept locally, not committed
															
 
																-
															
 
																-This keeps the repo size manageable while preserving recent history.
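One way to read this policy is by calendar month: a `history/YYYY-MM` directory is committed if it is the current or the previous month. The sketch below encodes that reading. It is an assumption, not the repository's actual `.gitignore` logic, and `isInCommittedWindow` is a hypothetical name.

```typescript
// Hypothetical helper: does a history/YYYY-MM directory fall in the committed window?
// Assumes the calendar-month reading of the policy (current + previous month).
function isInCommittedWindow(monthDir: string, now: Date): boolean {
  const match = /^(\d{4})-(\d{2})$/.exec(monthDir);
  if (!match) return false;

  // Convert both dates to a running month count so year boundaries work.
  const dirIndex = Number(match[1]) * 12 + (Number(match[2]) - 1);
  const nowIndex = now.getUTCFullYear() * 12 + now.getUTCMonth();

  const ageInMonths = nowIndex - dirIndex;
  return ageInMonths === 0 || ageInMonths === 1;
}
```

Note that "current plus previous month" and "60 days" are not exactly the same window, so a stricter day-based rule would need the full date from the file name.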
															
 
																-
															
 
																-## Tips
															
 
																-
															
 
																-### Quick View Workflow
															
 
																-The fastest way to view results:
															
 
																-```bash
															
 
																-cd evals/results && ./serve.sh
															
 
																-```
															
 
																-- ✅ Opens browser automatically
															
 
																-- ✅ Loads all data
															
 
																-- ✅ Shuts down after 15 seconds
															
 
																-- ✅ Dashboard stays functional (data cached)
															
 
																-- ✅ No manual cleanup needed
															
 
																-
															
 
																-**Want to keep exploring?** Press Ctrl+C during countdown to keep server running.
															
 
																-
															
 
																-### Comparing Agents
															
 
																-1. Set **Time Range** to "Latest Run"
															
 
																-2. Set **Agent** to "All Agents"
															
 
																-3. Compare pass rates and durations
															
 
																-
															
 
																-### Finding Flaky Tests
															
 
																-1. Set **Time Range** to "Last 30 Days"
															
 
																-2. Look for tests that alternate between pass/fail
															
 
																-3. Check violation details for patterns
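
The "alternates between pass/fail" check in step 2 can also be done on exported data. The following is a rough sketch under assumptions: `countFlips` and `isFlaky` are hypothetical helpers (not part of the dashboard), and the input is assumed to be one test's outcomes in chronological order.

```typescript
// Hypothetical helpers for illustration; not part of the dashboard code.
// `history` lists one test's outcomes in chronological order (true = pass).
function countFlips(history: boolean[]): number {
  let flips = 0;
  for (let i = 1; i < history.length; i++) {
    // A flip is any change of outcome between two consecutive runs.
    if (history[i] !== history[i - 1]) flips++;
  }
  return flips;
}

// Assumption: two or more flips suggests flakiness. A single flip is more
// likely a genuine regression or a fix than an intermittent failure.
function isFlaky(history: boolean[], minFlips: number = 2): boolean {
  return countFlips(history) >= minFlips;
}
```

The `minFlips` threshold is a judgment call; raising it reduces false positives on tests that were fixed and later broke again for an unrelated reason.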
															
 
																-
															
 
																-### Tracking Improvements
															
 
																-1. Run tests regularly (daily/weekly)
															
 
																-2. Watch the trend chart for improvements
															
 
																-3. Export CSV for deeper analysis
															
 
																-
															
 
																-### Debugging Failures
															
 
																-1. Filter **Status** to "Failed Only"
															
 
																-2. Click on a failed test row
															
 
																-3. Review violation details
															
 
																-4. Check error messages and severity
															
 
																-
															
 
																-## Browser Compatibility
															
 
																-
															
 
																-- ✅ Chrome/Edge (recommended)
															
 
																-- ✅ Firefox
															
 
																-- ✅ Safari
															
 
																-- ⚠️ IE11 (not supported)
															
 
																-
															
 
																-## Performance
															
 
																-
															
 
																-- **Dashboard Size:** ~31KB (no dependencies except Chart.js CDN)
															
 
																-- **Load Time:** < 1 second for 100 tests
															
 
																-- **Memory:** Minimal (pure JavaScript, no frameworks)
															
 
																-
															
 
																-## How It Works
															
 
																-
															
 
																-### Auto-Shutdown Feature
															
 
																-The `serve.sh` script:
															
 
																-1. Starts HTTP server on port 8000
															
 
																-2. Opens dashboard in your browser
															
 
																-3. Waits 15 seconds for data to load
															
 
																-4. Shuts down server automatically
															
 
																-5. Dashboard continues working (data cached in browser)
															
 
																-
															
 
																-**Why does it still work after shutdown?**
															
 
																-- The browser caches the JSON data
															
 
																-- All filtering/sorting happens in JavaScript
															
 
																-- No server needed after initial load
															
 
																-- To load new data, restart the server first, then refresh the page
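
To illustrate why no server is needed after the initial load, here is a minimal sketch of filtering and sorting over results already held in memory. The `TestResult` and `Filter` shapes and the `applyFilter` function are assumptions made for illustration; the dashboard's real field names in `index.html` may differ.

```typescript
// Hypothetical result shape; the real JSON schema may use different fields.
interface TestResult {
  id: string;
  agent: string;
  passed: boolean;
  durationMs: number;
}

// Hypothetical filter state; an omitted field means "do not filter on it".
interface Filter {
  agent?: string;
  status?: "passed" | "failed";
}

// Works purely on the in-memory array, so it keeps working once the
// HTTP server that delivered the data has shut down.
function applyFilter(results: TestResult[], filter: Filter): TestResult[] {
  return results
    .filter((r) => filter.agent === undefined || r.agent === filter.agent)
    .filter(
      (r) =>
        filter.status === undefined ||
        r.passed === (filter.status === "passed"),
    )
    // filter() returns a new array, so sorting here leaves `results` intact.
    .sort((a, b) => b.durationMs - a.durationMs);
}
```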
															
 
																-
															
 
																-### Stopping Manually
															
 
																-If you started the server manually, stop it by killing the process listening on port 8000:
															
 
																-```bash
															
 
																-# Find the process
															
 
																-lsof -ti:8000
															
 
																-
															
 
																-# Kill it
															
 
																-kill $(lsof -ti:8000)
															
 
																-```
															
 
																-
															
 
																-Or just press Ctrl+C in the terminal.
															
 
																-
															
 
																-## Troubleshooting
															
 
																-
															
 
																-### Dashboard shows "No results found"
															
 
																-- Run tests first: `npm run eval:sdk`
															
 
																-- Check that `latest.json` exists
															
 
																-- Refresh the page
															
 
																-
															
 
																-### Chart not displaying
															
 
																-- Check browser console for errors
															
 
																-- Ensure Chart.js CDN is accessible
															
 
																-- Try refreshing the page
															
 
																-
															
 
																-### Dark mode not persisting
															
 
																-- Check browser localStorage is enabled
															
 
																-- Clear cache and try again
															
 
																-
															
 
																-## Future Enhancements
															
 
																-
															
 
																-Potential improvements:
															
 
																-- [ ] Historical comparison (compare two runs)
															
 
																-- [ ] Test duration trends per test
															
 
																-- [ ] Violation type breakdown chart
															
 
																-- [ ] Agent performance comparison chart
															
 
																-- [ ] Auto-refresh option
															
 
																-- [ ] Shareable URLs with filters
															
 
																-- [ ] CI/CD badge generation
															
 
																-
															
 
																-## Contributing
															
 
																-
															
 
																-To improve the dashboard:
															
 
																-
															
 
																-1. Edit `index.html` (all code is in one file)
															
 
																-2. Test locally by opening it in a browser
															
 
																-3. Submit PR with description of changes
															
 
																-
															
 
																-## License
															
 
																-
															
 
																-MIT, same as the OpenCode Agents project
															
--- a/evals/test_tmp/README.md
+++ b/evals/test_tmp/README.md
@@ -1,29 +0,0 @@
 
																-# Test Artifacts
															
 
																-
															
 
																-This directory contains temporary files created during test execution.
															
 
																-It should be cleaned up after tests complete.
															
 
																-
															
 
																-**DO NOT COMMIT FILES IN THIS DIRECTORY**
															
 
																-
															
 
																-## Installation
															
 
																-
															
 
																-To install the project dependencies, navigate to the evaluation framework directory and run:
															
 
																-
															
 
																-```bash
															
 
																-cd evals/framework
															
 
																-npm install
															
 
																-```
															
 
																-
															
 
																-This will install all required dependencies including:
															
 
																-- `@opencode-ai/sdk` - OpenCode AI SDK
															
 
																-- `yaml` - YAML parser for test cases
															
 
																-- `zod` - Schema validation
															
 
																-- `glob` - File pattern matching
															
 
																-
															
 
																-### Development Dependencies
															
 
																-
															
 
																-For development and testing, the following tools are also installed:
															
 
																-- TypeScript compiler
															
 
																-- Vitest testing framework
															
 
																-- ESLint for code linting
															
 
																-- tsx for TypeScript execution