
# Roadmap

Sprint timeline for Agents-eval. See architecture.md for technical decisions (ADRs).

| Sprint | Status | Goal | Reference |
| --- | --- | --- | --- |
| Sprint 1 | Delivered | Three-tiered evaluation framework | Sprint 1 |
| Sprint 2 | Delivered | Eval wiring, trace capture, Logfire+Phoenix, Streamlit dashboard | PRD Sprint 2 |
| Sprint 3 | Delivered | Plugin architecture, GUI wiring, test alignment, optional weave, trace quality | PRD Sprint 3 |
| Sprint 4 | Delivered | Operational resilience, Claude Code baseline comparison (solo + teams) | PRD Sprint 4 |
| Sprint 5 | Delivered | Runtime fixes, GUI enhancements, architecture improvements, code quality review | PRD Sprint 5 |
| Sprint 6 | Delivered | Benchmarking infrastructure, CC baseline completion, security hardening, test quality | PRD Sprint 6 |
| Sprint 7 | Delivered | Documentation, examples, test refactoring, GUI improvements, unified providers, CC engine | PRD Sprint 7 |
| Sprint 8 | Delivered | Tool bug fix, API key/model cleanup, CC engine consolidation, graph alignment, dead code removal, report generation, judge settings UX, GUI a11y/UX | PRD Sprint 8 |
| Sprint 9 | Delivered | Correctness & security hardening — dead code deletion, format string sanitization, PDF size guard, API key env cleanup, security hardening, judge accuracy, AgentConfig typing, type safety fixes, test suite quality sweep | PRD Sprint 9 |
| Sprint 10 | Substantially Delivered | CC evaluation pipeline parity (STORY-010: main() CC/MAS branch, extract_cc_review_text, cc_result_to_graph_trace, engine_type, GUI CC execution, reference reviews, process group kill); graph viz polish (STORY-011); inspect.getsource removal (STORY-015). STORY-012/013/014 not started. | PRD Sprint 10 |
| Sprint 11 | Delivered | Observability, UX polish, test quality: end-of-run artifact summary (ArtifactRegistry), GUI sidebar tabs, CC engine empty query fix (build_cc_query), CC JSONL stream persistence, search tool HTTP resilience, sub-agent validation JSON parsing fix, query persistence fix, assert isinstance→behavioral replacements, conftest consolidation, dispatch registry refactor, config model consolidation, examples modernization (8 total) | PRD Sprint 11 |
| Sprint 12 | Delivered | CC teams mode fixes (stream event parsing, cc_teams flag passthrough, engine_type fix), scoring system fixes (Tier 3 empty-trace skip, composite trace awareness, time_taken timestamps, semantic score dedup, continuous task_success), per-run output directories (RunContext consolidation) | PRD Sprint 12 |
| Sprint 13 | Delivered | GUI audit remediation & theming — accessibility fixes (ARIA live regions, landmarks, keyboard traps, graph alt text), theming system (3 curated themes, selector widget, graph color integration), UX improvements (onboarding, validation placement, report caching, navigation consistency, string consolidation, type-aware output rendering) | PRD Sprint 13 |

## Backlog — Candidate Evaluation Metrics

Unscheduled metrics identified from production frameworks and research. No sprint assigned.

| Metric | Source | Current Gap | Impact |
| --- | --- | --- | --- |
| fix_rate | SWE-EVO [2512.18470] | Binary task success only | High |
| evaluator_consensus | TEAM-PHI (Agents4Science) | Single LLM judge | High |
| delegation_depth | HDO (Agents4Science) | No hierarchy verification | High |
| handoff_quality | Arize Multi-Agent | No inter-agent transition | High |
| rubric_alignment | [2512.23707] | No self-grading assessment | High |
| coordination_topology | Evolutionary Boids (Agents4Science) | No breadth vs depth | Medium |
| path_convergence | Arize Phoenix | No path efficiency | Medium |

## Backlog — Known Issues

- **Delegation Tool Retry Exhaustion**: delegate_synthesis exceeds PydanticAI's max retry count of 3. The model repeatedly passes incorrect arguments (structured data instead of a plain-text query, or invented parameter names like report instead of query), exhausting retries without a successful call. Blocks reliable sweep execution for the synthesiser composition. Potential mitigations: increase the retry limit, add argument coercion at the tool boundary, simplify the delegation tool signature.
- **Provider Token Limit Exceeded**: Cumulative token count exceeds the provider-configured total_tokens_limit during multi-agent runs, aborting execution. Example: Cerebras gpt-oss-120b exceeded its 60,000-token limit (actual: 66,165 tokens). Multi-agent compositions are particularly susceptible since each sub-agent delegation adds to the cumulative count. Potential mitigations: dynamic per-agent token budgeting, context summarization between delegation steps, provider-aware limit configuration in PROVIDER_REGISTRY.
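The argument-coercion mitigation for the delegation issue could be sketched as a small normalization step in front of the tool. This is a hypothetical sketch, not the project's implementation: the function name `coerce_delegation_args` and the alias list are assumptions, and the real tool boundary would sit inside the PydanticAI tool definition.

```python
from typing import Any


def coerce_delegation_args(raw: dict[str, Any]) -> dict[str, str]:
    """Normalize model-produced arguments before they reach delegate_synthesis.

    Hypothetical mitigation sketch: maps invented parameter names
    (e.g. `report`) onto the expected `query` field and flattens
    structured payloads into plain text instead of raising a retry.
    """
    # Assumed aliases the model has been observed to invent.
    aliases = ("query", "report", "request", "task")
    for key in aliases:
        if key in raw:
            value = raw[key]
            # Flatten structured data (dicts/lists) into a plain-text query.
            if not isinstance(value, str):
                value = str(value)
            return {"query": value}
    # Fall back to stringifying the whole payload rather than failing the call.
    return {"query": str(raw)}
```

Coercing at the boundary trades strictness for availability: malformed calls still reach the sub-agent as best-effort text instead of burning one of the three retries.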
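For the token-limit issue, a provider-aware budget check before each delegation step could look like the following. This is a minimal sketch under stated assumptions: `PROVIDER_LIMITS` and `check_token_budget` are hypothetical names (the real limits would live in PROVIDER_REGISTRY), and per-run token accounting is assumed to exist.

```python
# Hypothetical limit table; in the project, limits would be read from
# PROVIDER_REGISTRY rather than hard-coded here.
PROVIDER_LIMITS: dict[str, int] = {"cerebras:gpt-oss-120b": 60_000}


def check_token_budget(provider_model: str, used: int, upcoming: int) -> bool:
    """Return True if the next delegation step fits within the provider limit.

    `used` is the cumulative token count for the run so far; `upcoming` is an
    estimate for the next sub-agent call. Unknown providers pass unchecked.
    """
    limit = PROVIDER_LIMITS.get(provider_model)
    if limit is None:
        return True  # no configured limit for this provider
    return used + upcoming <= limit
```

A gate like this would let the orchestrator trigger context summarization (or skip a delegation) before the provider aborts the run, rather than discovering the overflow at 66,165 tokens as in the Cerebras example above.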