---
title: "Product Requirements Document: Agents-eval Sprint 2"
version: 3.4.0
created: 2025-09-01
updated: 2026-02-12
---

# PRD Sprint 2 Ralph
## Project Overview
Agents-eval evaluates multi-agent AI systems using the PeerRead dataset for scientific paper review assessment. The system generates reviews via a 4-agent delegation pipeline (Manager → Researcher → Analyst → Synthesizer) and evaluates them through a three-tier engine: Tier 1 (traditional text metrics), Tier 2 (LLM-as-Judge), and Tier 3 (graph analysis).
Sprint 2 focuses on connecting generation and evaluation: capturing real agent execution graphs, running evaluation automatically after review generation, and producing a comparative summary of graph-based coordination metrics vs conventional text similarity metrics. All evaluation tiers are fully implemented (157 tests); the gap is wiring them into the generation flow with real trace data.
## Functional Requirements
### Sprint 2: Graph vs Text Evaluation Pipeline
#### Feature 1: Migrate EvaluationConfig to Pydantic Settings
Description: Replace JSON-based `EvaluationConfig` (`config_eval.json`) with `JudgeSettings(BaseSettings)` using a `JUDGE_` env prefix. Defaults live in code, overridable via `.env` or env vars. Follows the same pattern as the existing `CommonSettings` (`EVAL_` prefix).
Acceptance Criteria:
- `JudgeSettings(BaseSettings)` with `JUDGE_` env prefix replaces `EvaluationConfig`
- Typed defaults in code: tier weights, timeouts, model selection, enabled tiers
- `EvaluationPipeline` uses `JudgeSettings` instead of loading `config_eval.json`
- Existing evaluation tests pass with settings-based config
- Timeout fields use bounded validators (`gt=0`, `le=300`)
- Time tracking pattern standardized across all tiers
- Existing test fixtures updated: pipeline uses `JudgeSettings`, JSON fixtures removed
- `make validate` passes
Technical Requirements:
- Create `src/app/evals/settings.py` with `JudgeSettings(BaseSettings)` (`model_config` with `JUDGE_` prefix, `.env` file)
- Defaults mirror current `config_eval.json` values (`tier1_max_seconds=1.0`, `tier2_max_seconds=10.0`, etc.)
- Update `EvaluationPipeline.__init__()` to accept `JudgeSettings` instead of `config_path`
- Keep `config_eval.json` temporarily, but it is no longer loaded at runtime
- Reuse the pattern from `src/app/common/settings.py`
Files:
- `src/app/evals/settings.py` (new — `JudgeSettings`)
- `src/app/evals/evaluation_config.py` (deprecate, replace usages)
- `src/app/evals/evaluation_pipeline.py` (use `JudgeSettings`)
- `src/app/evals/composite_scorer.py` (use `JudgeSettings` for weights)
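A minimal sketch of the settings class, assuming pydantic-settings v2; the two timeout defaults are taken from `config_eval.json` above, while the weight, model, and tier-toggle field names are illustrative placeholders:

```python
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict


class JudgeSettings(BaseSettings):
    """Evaluation configuration, overridable via JUDGE_-prefixed env vars or .env."""

    model_config = SettingsConfigDict(env_prefix="JUDGE_", env_file=".env", extra="ignore")

    # Per-tier timeouts in seconds, bounded per the acceptance criteria.
    tier1_max_seconds: float = Field(default=1.0, gt=0, le=300)
    tier2_max_seconds: float = Field(default=10.0, gt=0, le=300)

    # Composite tier weights (illustrative names and defaults).
    tier1_weight: float = 0.3
    tier2_weight: float = 0.4
    tier3_weight: float = 0.3

    # Judge model selection and tier toggles (illustrative).
    judge_model: str = "gpt-4o-mini"
    enabled_tiers: tuple[int, ...] = (1, 2, 3)
```

With this in place, `JUDGE_TIER2_MAX_SECONDS=20` in `.env` overrides the Tier 2 timeout without touching code, which is the behaviour the acceptance criteria describe.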
#### Feature 2: Wire Evaluation After Review Generation
Description: Connect `run_manager()` output to `EvaluationPipeline.evaluate_comprehensive()`. Add a `--skip-eval` CLI flag.
Acceptance Criteria:
- After `run_manager()` completes, `EvaluationPipeline` runs automatically
- `--skip-eval` CLI flag disables evaluation
- Graceful skip when no ground-truth reviews are available
- `make validate` passes
Technical Requirements:
- Import `EvaluationPipeline` in `app.py`, call after line 134
- Pipeline uses `JudgeSettings` from Feature 1
- Add `--skip-eval` to `parse_args()` in `run_cli.py`
Files:
- `src/app/app.py`
- `src/run_cli.py`
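A sketch of the wiring under these requirements; the `evaluate_comprehensive()` keyword arguments, the `ground_truth_reviews` lookup, and the `overall_score` attribute are assumptions rather than confirmed APIs:

```python
# run_cli.py: hypothetical argparse addition in parse_args()
parser.add_argument(
    "--skip-eval",
    action="store_true",
    help="Skip the evaluation pipeline after review generation.",
)

# app.py: after run_manager() completes
if not args.skip_eval:
    pipeline = EvaluationPipeline(JudgeSettings())  # settings from Feature 1
    if ground_truth_reviews:  # graceful skip when no ground truth is available
        result = pipeline.evaluate_comprehensive(
            generated_review=review_text,
            reference_reviews=ground_truth_reviews,
        )
        logger.info("Composite evaluation score: %.3f", result.overall_score)
    else:
        logger.info("No ground-truth reviews found; skipping evaluation.")
```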
#### Feature 3: Capture GraphTraceData During MAS Execution
Description: Wire `TraceCollector` into agent orchestration so `GraphTraceData` is populated from real agent runs.
Acceptance Criteria:
- Agent-to-agent delegations logged via `trace_collector.log_agent_interaction()`
- Tool calls logged via `trace_collector.log_tool_call()`
- Timing data captured for each delegation step
- `GraphTraceData` passed to `evaluate_comprehensive()` with real data
- `GraphTraceData` constructed via `model_validate()` instead of manual `.get()` extraction
- `make validate` passes
Technical Requirements:
- Initialize `TraceCollector` in `run_manager()` or `setup_agent_env()`
- Instrument delegation calls in `agent_system.py`
- Pass populated `GraphTraceData` from `app.py` to the pipeline
Files:
- `src/app/agents/agent_system.py`
- `src/app/agents/orchestration.py`
- `src/app/app.py`
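A sketch of the delegation instrumentation, assuming the two `TraceCollector` methods named above; the argument names, the `to_dict()` export helper, and the surrounding delegation-tool shape are illustrative, not the project's actual signatures:

```python
import time

# agent_system.py: inside a delegation tool (hypothetical wrapper)
async def delegate_research(ctx, query: str) -> str:
    start = time.perf_counter()
    result = await researcher_agent.run(query)  # existing delegation call
    ctx.deps.trace_collector.log_agent_interaction(  # assumed keyword arguments
        source="manager",
        target="researcher",
        duration_seconds=time.perf_counter() - start,
    )
    return result.output

# app.py: build the Tier 3 input from the collected trace via model_validate()
trace_data = GraphTraceData.model_validate(trace_collector.to_dict())
```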
#### Feature 4: Graph vs Text Metric Comparison Output
Description: Log comparative summary showing Tier 1 (text) vs Tier 3 (graph) scores after evaluation.
Acceptance Criteria:
- Log shows Tier 1 overall score vs Tier 3 overall score
- Individual graph metrics displayed (`path_convergence`, `tool_selection_accuracy`, `communication_overhead`, `coordination_centrality`, `task_distribution_balance`)
- Individual text metrics displayed (`cosine_score`, `jaccard_score`, `semantic_score`)
- Composite score shows per-tier contribution
- `make validate` passes
Files:
- `src/app/app.py`
- `src/app/evals/evaluation_pipeline.py` (optional enhancement)
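A sketch of the comparison logging; the `tier1`/`tier3` result attributes and the `overall_score` fields are assumed names for illustration, not the confirmed `CompositeResult` layout:

```python
import logging

logger = logging.getLogger(__name__)

TEXT_METRICS = ("cosine_score", "jaccard_score", "semantic_score")
GRAPH_METRICS = (
    "path_convergence",
    "tool_selection_accuracy",
    "communication_overhead",
    "coordination_centrality",
    "task_distribution_balance",
)


def log_metric_comparison(result) -> None:
    """Log Tier 1 (text) vs Tier 3 (graph) scores side by side."""
    tier1, tier3 = result.tier1, result.tier3  # assumed CompositeResult attributes
    logger.info(
        "Tier 1 (text) overall: %.3f | Tier 3 (graph) overall: %.3f",
        tier1.overall_score,
        tier3.overall_score,
    )
    for name in TEXT_METRICS:
        logger.info("  text/%s = %.3f", name, getattr(tier1, name))
    for name in GRAPH_METRICS:
        logger.info("  graph/%s = %.3f", name, getattr(tier3, name))
```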
#### Feature 4b: Migrate Opik to Logfire + Phoenix Local Tracing
Description: Replace the Opik tracing integration (11 Docker containers, ~155s startup) with the Logfire SDK + Arize Phoenix. `logfire.instrument_pydantic_ai()` auto-instruments all PydanticAI agents natively, eliminating the manual `OpikInstrumentationManager`, `@track` decorators, and `get_opik_decorator()` wrappers. Phoenix receives traces via OTLP and provides a local web UI — all via pip install with zero Docker dependencies.
Acceptance Criteria:
- `pyproject.toml` replaces `opik>=1.8.0` with `arize-phoenix` and `openinference-instrumentation-pydantic-ai`
- `JudgeSettings` replaces `opik_*` fields with `logfire_enabled`, `logfire_send_to_cloud`, `phoenix_endpoint`, `logfire_service_name`
- `LogfireConfig` replaces `OpikConfig` in `load_configs.py`
- `logfire_instrumentation.py` replaces `opik_instrumentation.py` using `logfire.instrument_pydantic_ai()` auto-instrumentation
- `agent_system.py` removes manual `@opik_decorator` wrappers from delegation tools
- `evaluation_pipeline.py` removes the Opik import block and the `_apply_opik_decorator()` / `_record_opik_metadata()` methods
- `CommonSettings.enable_opik` renamed to `enable_logfire`
- Makefile adds `start_phoenix`, `stop_phoenix`, `status_phoenix` targets (Opik targets kept as legacy)
- `.env.example` replaces `OPIK_*` vars with `JUDGE_PHOENIX_*` / `JUDGE_LOGFIRE_*` vars
- `make validate` passes
Technical Requirements:
- Keep `docker-compose.opik.yaml` as optional legacy (not deleted)
- Keep `TraceCollector` (`trace_processors.py`) unchanged — independent local SQLite/JSONL system
- Logfire auto-instrumentation replaces all manual decorator wiring
- Graceful degradation when Phoenix is not running
Files:
- `pyproject.toml`
- `src/app/evals/settings.py`
- `src/app/utils/load_configs.py`
- `src/app/agents/opik_instrumentation.py` (delete)
- `src/app/agents/logfire_instrumentation.py` (new)
- `src/app/agents/agent_system.py`
- `src/app/evals/evaluation_pipeline.py`
- `src/app/common/settings.py`
- `.env.example`
- `Makefile`
- `tests/evals/test_judge_settings.py`
- `tests/common/test_common_settings.py`
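A minimal sketch of the replacement module, assuming spans are routed to Phoenix through the standard OpenTelemetry endpoint variable; `logfire.configure()` and `logfire.instrument_pydantic_ai()` are documented Logfire SDK calls, while the `settings` field names simply mirror the `JudgeSettings` fields listed above:

```python
# logfire_instrumentation.py: minimal sketch
import os

import logfire


def setup_tracing(settings) -> None:
    """Configure Logfire locally and route spans to a local Phoenix instance."""
    if not settings.logfire_enabled:
        return
    # Point the OTel exporter at the local Phoenix collector (e.g. http://localhost:6006).
    os.environ.setdefault("OTEL_EXPORTER_OTLP_ENDPOINT", settings.phoenix_endpoint)
    logfire.configure(
        service_name=settings.logfire_service_name,
        send_to_logfire=settings.logfire_send_to_cloud,
    )
    # Auto-instrument every PydanticAI agent; replaces the manual Opik decorator wiring.
    logfire.instrument_pydantic_ai()
```

Because export happens on a background OTel thread, a stopped Phoenix instance should surface as dropped spans and warnings rather than errors in the agent path, which matches the graceful-degradation requirement.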
#### Feature 4c: Streamlit Evaluation Dashboard + Agent Graph Visualization
Description: Add two new Streamlit pages: an Evaluation Results dashboard displaying Tier 1/2/3 scores with graph vs text metric comparison, and an Agent Graph page rendering the NetworkX delegation graph interactively via Pyvis. Phoenix (localhost:6006) is cross-linked from the sidebar for deep trace inspection.
Acceptance Criteria:
- “Evaluation Results” page displays Tier 1/2/3 scores from `CompositeResult`
- Bar chart compares graph metrics vs text metrics (Tier 1 vs Tier 3)
- Individual metric scores displayed in table format
- “Agent Graph” page renders `export_trace_to_networkx()` output as an interactive Pyvis graph
- Agent nodes and tool nodes visually distinguished (color/shape)
- Sidebar includes Phoenix link with status indicator
- Pages render gracefully with empty/mock data when evaluation hasn’t run
- `pyvis` added to the gui dependency group in `pyproject.toml`
- `make validate` passes
Technical Requirements:
- Use `graph_analysis.export_trace_to_networkx()` (line 426) for graph data
- Use `CompositeResult` / `Tier1Result` / `Tier3Result` models for evaluation data
- Pyvis `Network.from_nx(graph)` → HTML → `st.components.v1.html()`
- Cross-link to Phoenix at `http://localhost:6006` (not embed)
- Follow existing GUI patterns in `src/gui/`
Files:
- `src/gui/pages/evaluation.py` (new)
- `src/gui/pages/agent_graph.py` (new)
- `src/gui/config/config.py` (add pages to `PAGES` list)
- `src/gui/components/sidebar.py` (add Phoenix link)
- `src/run_gui.py` (route new pages)
- `pyproject.toml` (add `pyvis` to gui group)
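A sketch of the Agent Graph page under these requirements; the import path and the `node_type` node attribute are assumptions for illustration:

```python
# src/gui/pages/agent_graph.py: minimal sketch
import streamlit as st
import streamlit.components.v1 as components
from pyvis.network import Network

from app.evals.graph_analysis import export_trace_to_networkx  # assumed import path


def render_agent_graph(trace_data) -> None:
    graph = export_trace_to_networkx(trace_data)
    if graph.number_of_nodes() == 0:
        st.info("No trace data yet. Run a review generation first.")
        return
    net = Network(height="600px", directed=True)
    net.from_nx(graph)
    # Distinguish agent vs tool nodes by color (node_type attribute is assumed).
    for node in net.nodes:
        node["color"] = "#4c78a8" if node.get("node_type") == "agent" else "#f58518"
    # Pyvis renders to HTML; embed it via st.components.v1.html rather than writing a file.
    components.html(net.generate_html(), height=620, scrolling=True)
```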
## Non-Functional Requirements
- Maintainability:
  - Use modular design patterns for easy updates and maintenance.
  - Implement logging and error handling for debugging and monitoring.
- Performance:
  - Ensure low latency in evaluation pipeline execution.
  - Optimize for memory usage during graph analysis.
- Documentation:
  - Comprehensive documentation for setup, usage, and testing.
  - Docstrings for all new functions and classes (Google-style format).
## Out of Scope
- A2A protocol migration (PydanticAI stays)
- Agent system restructuring (`src/app/agents/` unchanged except trace instrumentation)
- Streamlit UI redesign (existing UI stays as-is)
- pytest-bdd / Gherkin scenarios (use pytest + hypothesis instead)
- HuggingFace `datasets` library (use GitHub API downloader instead)
- Google Gemini SDK (`google-genai`) — use OpenAI-spec compatible providers only
- VCR-based network mocking (use `@patch` for unit tests)
- Browser-based E2E tests (Playwright/Selenium deferred)
- CC-style evaluation baselines (deferred)
- E2E integration tests and multi-channel deployment (deferred)
## Notes for Ralph Loop
Story Breakdown - Sprint 2 (6 stories total):
- Feature 1 (Settings Migration) → STORY-001: Migrate `EvaluationConfig` to `JudgeSettings` pydantic-settings
- Feature 2 (Wire Evaluation) → STORY-002: Wire `evaluate_comprehensive` after `run_manager` (depends: STORY-001)
- Feature 3 (Trace Capture) → STORY-003: Capture `GraphTraceData` during MAS execution (depends: STORY-002)
- Feature 4 (Comparison Output) → STORY-004: Add graph vs text metric comparison logging (depends: STORY-003)
- Feature 4b (Opik → Logfire Migration) → STORY-005: Migrate Opik to Logfire + Phoenix local tracing (depends: STORY-001)
- Feature 4c (Streamlit Dashboard) → STORY-006: Streamlit evaluation dashboard + agent graph visualization (depends: STORY-005)