# Roadmap
Sprint timeline for Agents-eval. See architecture.md for technical decisions (ADRs).
| Sprint | Status | Goal | Reference |
|---|---|---|---|
| Sprint 1 | Delivered | Three-tiered evaluation framework | Sprint 1 |
| Sprint 2 | Delivered | Eval wiring, trace capture, Logfire+Phoenix, Streamlit dashboard | PRD Sprint 2 |
| Sprint 3 | Delivered | Plugin architecture, GUI wiring, test alignment, optional weave, trace quality | PRD Sprint 3 |
| Sprint 4 | Delivered | Operational resilience, Claude Code baseline comparison (solo + teams) | PRD Sprint 4 |
| Sprint 5 | Delivered | Runtime fixes, GUI enhancements, architecture improvements, code quality review | PRD Sprint 5 |
| Sprint 6 | Delivered | Benchmarking infrastructure, CC baseline completion, security hardening, test quality | PRD Sprint 6 |
| Sprint 7 | Delivered | Documentation, examples, test refactoring, GUI improvements, unified providers, CC engine | PRD Sprint 7 |
| Sprint 8 | Delivered | Tool bug fix, API key/model cleanup, CC engine consolidation, graph alignment, dead code removal, report generation, judge settings UX, GUI a11y/UX | PRD Sprint 8 |
| Sprint 9 | Delivered | Correctness & security hardening — dead code deletion, format string sanitization, PDF size guard, API key env cleanup, security hardening, judge accuracy, AgentConfig typing, type safety fixes, test suite quality sweep | PRD Sprint 9 |
| Sprint 10 | Substantially Delivered | CC evaluation pipeline parity (STORY-010: main() CC/MAS branch, extract_cc_review_text, cc_result_to_graph_trace, engine_type, GUI CC execution, reference reviews, process group kill); graph viz polish (STORY-011); inspect.getsource removal (STORY-015). STORY-012/013/014 not started. | PRD Sprint 10 |
| Sprint 11 | Delivered | Observability, UX polish, test quality: end-of-run artifact summary (ArtifactRegistry), GUI sidebar tabs, CC engine empty query fix (build_cc_query), CC JSONL stream persistence, search tool HTTP resilience, sub-agent validation JSON parsing fix, query persistence fix, assert isinstance→behavioral replacements, conftest consolidation, dispatch registry refactor, config model consolidation, examples modernization (8 total) | PRD Sprint 11 |
| Sprint 12 | Delivered | CC teams mode fixes (stream event parsing, cc_teams flag passthrough, engine_type fix), scoring system fixes (Tier 3 empty-trace skip, composite trace awareness, time_taken timestamps, semantic score dedup, continuous task_success), per-run output directories (RunContext consolidation) | PRD Sprint 12 |
| Sprint 13 | Delivered | GUI audit remediation & theming — accessibility fixes (ARIA live regions, landmarks, keyboard traps, graph alt text), theming system (3 curated themes, selector widget, graph color integration), UX improvements (onboarding, validation placement, report caching, navigation consistency, string consolidation, type-aware output rendering) | PRD Sprint 13 |
## Backlog — Candidate Evaluation Metrics
Unscheduled metrics identified from production frameworks and research. No sprint assigned.
| Metric | Source | Current Gap | Impact |
|---|---|---|---|
| `fix_rate` | SWE-EVO [2512.18470] | Binary task success only | High |
| `evaluator_consensus` | TEAM-PHI (Agents4Science) | Single LLM judge | High |
| `delegation_depth` | HDO (Agents4Science) | No hierarchy verification | High |
| `handoff_quality` | Arize Multi-Agent | No inter-agent transition metric | High |
| `rubric_alignment` | [2512.23707] | No self-grading assessment | High |
| `coordination_topology` | Evolutionary Boids (Agents4Science) | No breadth-vs-depth measure | Medium |
| `path_convergence` | Arize Phoenix | No path-efficiency measure | Medium |
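As a rough illustration of one backlog entry, `evaluator_consensus` could be computed as the agreement rate among several judge scores instead of trusting a single LLM judge. This is a hypothetical sketch, not the planned implementation; the function name, tolerance parameter, and median-agreement definition are all assumptions.

```python
def evaluator_consensus(scores: list[float], tolerance: float = 0.1) -> float:
    """Fraction of judge scores within `tolerance` of the median score.

    Hypothetical sketch of the backlog `evaluator_consensus` metric:
    run multiple LLM judges and report how strongly they agree, so a
    single judge's bias becomes visible in the evaluation output.
    """
    if not scores:
        return 0.0
    ordered = sorted(scores)
    median = ordered[len(ordered) // 2]  # simple middle element, no interpolation
    agreeing = sum(1 for s in scores if abs(s - median) <= tolerance)
    return agreeing / len(scores)
```

A consensus near 1.0 means the judges broadly agree; a low value flags a run where a single-judge score would be unreliable.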
## Backlog — Known Issues
- Delegation Tool Retry Exhaustion: `delegate_synthesis` exceeds PydanticAI's max retry count of 3. The model repeatedly passes incorrect arguments (structured data instead of a plain-text query, or invented parameter names like `report` instead of `query`), exhausting retries without a successful call. Blocks reliable sweep execution for the `synthesiser` composition. Potential mitigations: increase the retry limit, add argument coercion at the tool boundary, simplify the delegation tool signature.
- Provider Token Limit Exceeded: Cumulative token count exceeds the provider-configured `total_tokens_limit` during multi-agent runs, aborting execution. Example: Cerebras `gpt-oss-120b` exceeded its 60,000-token limit (actual: 66,165 tokens). Multi-agent compositions are particularly susceptible since each sub-agent delegation adds to the cumulative count. Potential mitigations: dynamic per-agent token budgeting, context summarization between delegation steps, provider-aware limit configuration in `PROVIDER_REGISTRY`.
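The "argument coercion at the tool boundary" mitigation for the retry-exhaustion issue could look roughly like the sketch below. The helper name, the alias list, and the flattening rule are all assumptions for illustration, not part of the codebase; the idea is to repair the two observed failure modes (wrong parameter name, structured payload) before validation so a retry is not wasted on a recoverable mistake.

```python
from typing import Any


def coerce_delegation_args(args: dict[str, Any]) -> dict[str, Any]:
    """Normalize delegation-tool arguments before schema validation.

    Hypothetical pre-validation shim: maps invented parameter names
    (e.g. 'report') onto the expected 'query' field and flattens
    structured data into a plain-text query string.
    """
    out = dict(args)
    # Map known wrong parameter names onto the expected one.
    for alias in ("report", "text", "content"):
        if "query" not in out and alias in out:
            out["query"] = out.pop(alias)
    # Flatten structured payloads into plain text.
    if isinstance(out.get("query"), (dict, list)):
        out["query"] = str(out["query"])
    return out
```

Wired in front of the tool's validator, this would absorb the two argument mistakes seen in sweeps instead of burning one of the three retries on each.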
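For the token-limit issue, "dynamic per-agent token budgeting" could be sketched as a simple even split of the remaining provider budget across the sub-agents still to run. This is an assumed shape, not the project's design; the function name and parameters are illustrative.

```python
def per_agent_token_budget(total_limit: int, used: int, pending_agents: int) -> int:
    """Split the remaining provider token budget across pending sub-agents.

    Hypothetical sketch: with a provider cap such as Cerebras' 60,000
    tokens and several delegations still queued, give each sub-agent an
    equal share of what remains instead of letting the first delegation
    consume the whole allowance and abort the run.
    """
    remaining = max(total_limit - used, 0)  # never return a negative budget
    return remaining // max(pending_agents, 1)
```

A budget of 0 would signal that the run should summarize context or stop delegating rather than trigger the provider-side abort.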