Core framework for evaluating agent behavior. For user documentation, see ../README.md.
```
framework/
├── src/
│   ├── sdk/                          # Test execution
│   │   ├── test-runner.ts            # Main orchestrator
│   │   ├── test-executor.ts          # Executes individual tests
│   │   ├── client-manager.ts
│   │   └── event-stream-handler.ts
│   ├── evaluators/                   # Rule validators
│   │   ├── base-evaluator.ts
│   │   ├── approval-gate-evaluator.ts
│   │   ├── context-loading-evaluator.ts
│   │   ├── execution-balance-evaluator.ts
│   │   ├── tool-usage-evaluator.ts
│   │   ├── behavior-evaluator.ts
│   │   ├── delegation-evaluator.ts
│   │   ├── stop-on-failure-evaluator.ts
│   │   └── performance-metrics-evaluator.ts  # NEW
│   ├── logging/                      # Multi-agent logging (NEW)
│   │   ├── types.ts
│   │   ├── session-tracker.ts
│   │   ├── logger.ts
│   │   ├── formatters.ts
│   │   ├── index.ts
│   │   └── __tests__/                # 37 unit tests
│   ├── collector/                    # Session data
│   │   ├── session-reader.ts
│   │   └── timeline-builder.ts
│   └── types/
│       └── index.ts
└── package.json
```
- **approval-gate-evaluator**: Checks that approval is requested before risky operations (`bash`, `write`, `edit`, `task`).
- **context-loading-evaluator**: Verifies context files are loaded before acting on tasks. NEW: supports explicit context file specification via `behavior.expectedContextFiles` in the test YAML.
- **execution-balance-evaluator**: Ensures read operations happen before write operations.
- **tool-usage-evaluator**: Validates that dedicated tools are used instead of bash antipatterns.
- **behavior-evaluator**: Checks that expected tools are used and forbidden tools are avoided.
- **delegation-evaluator**: Validates that complex tasks are delegated to subagents.
- **stop-on-failure-evaluator**: Ensures the agent stops on errors instead of auto-fixing.
- **performance-metrics-evaluator**: Collects performance data for analysis. Always passes; used for metrics collection only.
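As a rough sketch, a test case opting into explicit context checking might look like the following. Only `behavior.expectedContextFiles` is documented above; every other field name and value here is an illustrative assumption, not the framework's actual schema.

```yaml
# Hypothetical test case; field names other than
# behavior.expectedContextFiles are assumptions.
name: context-loading-example
agent: openagent
prompt: "Update the build script"
behavior:
  expectedContextFiles:
    - AGENTS.md
    - scripts/build.sh
```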
The framework now includes comprehensive multi-agent logging that tracks delegation hierarchies in real-time.
Enable verbose delegation output with the `--debug` flag:

```bash
# Non-verbose mode (default) - shows child agent completion
npm run eval:sdk -- --agent=openagent --pattern="**/test.yaml"

# Verbose mode (debug) - shows full delegation hierarchy
npm run eval:sdk -- --agent=openagent --pattern="**/test.yaml" --debug
```
Non-verbose output:

```
Running tests...
  ✓ Child agent completed (OpenAgent, 2.9s)
Running evaluator: approval-gate...
```

Verbose output:

```
┌────────────────────────────────────────────────────────────┐
│ 🎯 PARENT: OpenAgent (ses_xxx...)                          │
└────────────────────────────────────────────────────────────┘
🔧 TOOL: task
├─ subagent: simple-responder
└─ Creating child session...
┌────────────────────────────────────────────────────────────┐
│ 🎯 CHILD: simple-responder (ses_yyy...)                    │
│ Parent: ses_xxx...                                         │
│ Depth: 1                                                   │
└────────────────────────────────────────────────────────────┘
🤖 Agent: AWESOME TESTING
✅ CHILD COMPLETE (2.9s)
✅ PARENT COMPLETE (20.9s)
```
See src/logging/README.md for API documentation.
Create `src/evaluators/my-evaluator.ts`:

```typescript
import { BaseEvaluator } from './base-evaluator.js';
import { TimelineEvent, SessionInfo, EvaluationResult } from '../types/index.js';

export class MyEvaluator extends BaseEvaluator {
  name = 'my-evaluator';
  description = 'What this evaluator checks';

  async evaluate(timeline: TimelineEvent[], sessionInfo: SessionInfo): Promise<EvaluationResult> {
    const checks = [];
    const violations = [];
    const evidence = [];

    // Your evaluation logic here
    const toolCalls = this.getToolCalls(timeline);

    // Example check
    const passed = toolCalls.length > 0;
    checks.push({
      name: 'has-tool-calls',
      passed,
      weight: 100,
      evidence: [this.createEvidence('tool-count', `Found ${toolCalls.length} tool calls`, {})]
    });

    if (!passed) {
      violations.push(this.createViolation(
        'no-tool-calls',
        'error',
        'No tool calls found',
        Date.now(),
        {}
      ));
    }

    return this.buildResult(this.name, checks, violations, evidence, {});
  }
}
```
Register it in `test-runner.ts`:

```typescript
import { MyEvaluator } from '../evaluators/my-evaluator.js';

// In setupEvaluators():
this.evaluatorRunner = new EvaluatorRunner({
  evaluators: [
    // ... existing evaluators
    new MyEvaluator(),
  ],
});
```
Add the rule to `test-case-schema.ts`:

```typescript
export const ExpectedViolationSchema = z.object({
  rule: z.enum([
    // ... existing rules
    'my-evaluator',
  ]),
  // ...
});
```
```bash
# Install
npm install

# Build
npm run build

# Run tests
npm test

# Run SDK tests
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml"
```
```typescript
interface TimelineEvent {
  timestamp: number;
  type: 'user_message' | 'assistant_message' | 'tool_call' | 'text';
  data: any;
}

interface EvaluationResult {
  evaluator: string;
  passed: boolean;
  score: number;
  violations: Violation[];
  evidence: Evidence[];
  checks: Check[];
}

interface Violation {
  type: string;
  severity: 'error' | 'warning' | 'info';
  message: string;
  timestamp: number;
  evidence?: any;
}
```
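For illustration, a passing result under these interfaces might be assembled as below. The `Evidence` and `Check` shapes are not defined in this document, so the stand-ins here are assumptions inferred from the fields used in the evaluator example above.

```typescript
// Assumed stand-ins for Evidence and Check (not defined in this README).
interface Evidence { type: string; description: string; data: unknown; }
interface Check { name: string; passed: boolean; weight: number; evidence: Evidence[]; }
interface Violation {
  type: string;
  severity: 'error' | 'warning' | 'info';
  message: string;
  timestamp: number;
  evidence?: any;
}
interface EvaluationResult {
  evaluator: string;
  passed: boolean;
  score: number;
  violations: Violation[];
  evidence: Evidence[];
  checks: Check[];
}

// A result with one passing check and no violations.
const result: EvaluationResult = {
  evaluator: 'my-evaluator',
  passed: true,
  score: 100,
  violations: [],
  evidence: [{ type: 'tool-count', description: 'Found 3 tool calls', data: {} }],
  checks: [{ name: 'has-tool-calls', passed: true, weight: 100, evidence: [] }],
};

console.log(result.passed, result.score);
```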
```typescript
// Get all tool calls
const toolCalls = this.getToolCalls(timeline);

// Get specific tool calls
const bashCalls = this.getToolCallsByName(timeline, 'bash');

// Get assistant messages
const messages = this.getAssistantMessages(timeline);

// Get read tools (read, glob, grep, list)
const reads = this.getReadTools(timeline);

// Get execution tools (bash, write, edit, task)
const executions = this.getExecutionTools(timeline);

// Create violation
this.createViolation(type, severity, message, timestamp, evidence);

// Create evidence
this.createEvidence(type, description, data, timestamp?);

// Build result
this.buildResult(name, checks, violations, evidence, metadata);
```