---
title: "Product Requirements Document: Agents-eval Sprint 2"
version: 3.4.0
created: 2025-09-01
updated: 2026-02-12
---

# PRD Sprint 2 Ralph
## Project Overview
Agents-eval evaluates multi-agent AI systems using the PeerRead dataset for scientific paper review assessment. The system generates reviews via a 4-agent delegation pipeline (Manager → Researcher → Analyst → Synthesizer) and evaluates them through a three-tier engine: Tier 1 (traditional text metrics), Tier 2 (LLM-as-Judge), and Tier 3 (graph analysis).
Sprint 2 focuses on connecting generation and evaluation: capturing real agent execution graphs, running evaluation automatically after review generation, and producing a comparative summary of graph-based coordination metrics vs conventional text similarity metrics. All evaluation tiers are fully implemented (157 tests); the gap is wiring them into the generation flow with real trace data.
## Functional Requirements
### Sprint 2: Graph vs Text Evaluation Pipeline
#### Feature 1: Migrate EvaluationConfig to Pydantic Settings
Description: Replace JSON-based `EvaluationConfig` (`config_eval.json`) with `JudgeSettings(BaseSettings)` using a `JUDGE_` env prefix. Defaults live in code, overridable via `.env` or env vars. Follows the same pattern as the existing `CommonSettings` (`EVAL_` prefix).
Acceptance Criteria:
- `JudgeSettings(BaseSettings)` with `JUDGE_` env prefix replaces `EvaluationConfig`
- Typed defaults in code: tier weights, timeouts, model selection, enabled tiers
- `EvaluationPipeline` uses `JudgeSettings` instead of loading `config_eval.json`
- Existing evaluation tests pass with settings-based config
- Timeout fields use bounded validators (`gt=0`, `le=300`)
- Time tracking pattern standardized across all tiers
- Existing test fixtures updated: pipeline uses `JudgeSettings`, JSON fixtures removed
- `make validate` passes
Technical Requirements:
- Create `src/app/evals/settings.py` with `JudgeSettings(BaseSettings)` (`model_config` with `JUDGE_` prefix, `.env` file)
- Defaults mirror current `config_eval.json` values (`tier1_max_seconds=1.0`, `tier2_max_seconds=10.0`, etc.)
- Update `EvaluationPipeline.__init__()` to accept `JudgeSettings` instead of `config_path`
- Keep `config_eval.json` temporarily, but it is no longer loaded at runtime
- Reuse the pattern from `src/app/common/settings.py`
Files:
- `src/app/evals/settings.py` (new — `JudgeSettings`)
- `src/app/evals/evaluation_config.py` (deprecate, replace usages)
- `src/app/evals/evaluation_pipeline.py` (use `JudgeSettings`)
- `src/app/evals/composite_scorer.py` (use `JudgeSettings` for weights)
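A minimal sketch of the settings class, assuming pydantic-settings v2; the two timeout defaults are taken from `config_eval.json` above, while the weight, model, and tier-toggle field names are illustrative placeholders:

```python
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict


class JudgeSettings(BaseSettings):
    """Evaluation configuration, overridable via JUDGE_-prefixed env vars or .env."""

    model_config = SettingsConfigDict(env_prefix="JUDGE_", env_file=".env", extra="ignore")

    # Per-tier timeouts in seconds, bounded per the acceptance criteria.
    tier1_max_seconds: float = Field(default=1.0, gt=0, le=300)
    tier2_max_seconds: float = Field(default=10.0, gt=0, le=300)

    # Composite tier weights (illustrative names and defaults).
    tier1_weight: float = 0.3
    tier2_weight: float = 0.4
    tier3_weight: float = 0.3

    # Judge model selection and tier toggles (illustrative).
    judge_model: str = "gpt-4o-mini"
    enabled_tiers: tuple[int, ...] = (1, 2, 3)
```

With this in place, `JUDGE_TIER2_MAX_SECONDS=20` in `.env` overrides the Tier 2 timeout without touching code, which is the behaviour the acceptance criteria describe.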
#### Feature 2: Wire Evaluation After Review Generation
Description: Connect `run_manager()` output to `EvaluationPipeline.evaluate_comprehensive()`. Add a `--skip-eval` CLI flag.
Acceptance Criteria:
- After `run_manager()` completes, `EvaluationPipeline` runs automatically
- `--skip-eval` CLI flag disables evaluation
- Graceful skip when no ground-truth reviews are available
- `make validate` passes
Technical Requirements:
- Import `EvaluationPipeline` in `app.py`, call after line 134
- Pipeline uses `JudgeSettings` from Feature 1
- Add `--skip-eval` to `parse_args()` in `run_cli.py`
Files:
- `src/app/app.py`
- `src/run_cli.py`
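A sketch of the wiring under these requirements; the `evaluate_comprehensive()` keyword arguments, the `ground_truth_reviews` lookup, and the `overall_score` attribute are assumptions rather than confirmed APIs:

```python
# run_cli.py: hypothetical argparse addition in parse_args()
parser.add_argument(
    "--skip-eval",
    action="store_true",
    help="Skip the evaluation pipeline after review generation.",
)

# app.py: after run_manager() completes
if not args.skip_eval:
    pipeline = EvaluationPipeline(JudgeSettings())  # settings from Feature 1
    if ground_truth_reviews:  # graceful skip when no ground truth is available
        result = pipeline.evaluate_comprehensive(
            generated_review=review_text,
            reference_reviews=ground_truth_reviews,
        )
        logger.info("Composite evaluation score: %.3f", result.overall_score)
    else:
        logger.info("No ground-truth reviews found; skipping evaluation.")
```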
#### Feature 3: Capture GraphTraceData During MAS Execution
Description: Wire `TraceCollector` into agent orchestration so `GraphTraceData` is populated from real agent runs.
Acceptance Criteria:
- Agent-to-agent delegations logged via `trace_collector.log_agent_interaction()`
- Tool calls logged via `trace_collector.log_tool_call()`
- Timing data captured for each delegation step
- `GraphTraceData` passed to `evaluate_comprehensive()` with real data
- `GraphTraceData` constructed via `model_validate()` instead of manual `.get()` extraction
- `make validate` passes
Technical Requirements:
- Initialize `TraceCollector` in `run_manager()` or `setup_agent_env()`
- Instrument delegation calls in `agent_system.py`
- Pass populated `GraphTraceData` from `app.py` to the pipeline
Files:
- `src/app/agents/agent_system.py`
- `src/app/agents/orchestration.py`
- `src/app/app.py`
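A sketch of the delegation instrumentation, assuming the two `TraceCollector` methods named above; the argument names, the `to_dict()` export helper, and the surrounding delegation-tool shape are illustrative, not the project's actual signatures:

```python
import time

# agent_system.py: inside a delegation tool (hypothetical wrapper)
async def delegate_research(ctx, query: str) -> str:
    start = time.perf_counter()
    result = await researcher_agent.run(query)  # existing delegation call
    ctx.deps.trace_collector.log_agent_interaction(  # assumed keyword arguments
        source="manager",
        target="researcher",
        duration_seconds=time.perf_counter() - start,
    )
    return result.output

# app.py: build the Tier 3 input from the collected trace via model_validate()
trace_data = GraphTraceData.model_validate(trace_collector.to_dict())
```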
#### Feature 4: Graph vs Text Metric Comparison Output
Description: Log comparative summary showing Tier 1 (text) vs Tier 3 (graph) scores after evaluation.
Acceptance Criteria:
- Log shows Tier 1 overall score vs Tier 3 overall score
- Individual graph metrics displayed (`path_convergence`, `tool_selection_accuracy`, `communication_overhead`, `coordination_centrality`, `task_distribution_balance`)
- Individual text metrics displayed (`cosine_score`, `jaccard_score`, `semantic_score`)
- Composite score shows per-tier contribution
- `make validate` passes
Files:
- `src/app/app.py`
- `src/app/evals/evaluation_pipeline.py` (optional enhancement)
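A sketch of the comparison logging; the `tier1`/`tier3` result attributes and the `overall_score` fields are assumed names for illustration, not the confirmed `CompositeResult` layout:

```python
import logging

logger = logging.getLogger(__name__)

TEXT_METRICS = ("cosine_score", "jaccard_score", "semantic_score")
GRAPH_METRICS = (
    "path_convergence",
    "tool_selection_accuracy",
    "communication_overhead",
    "coordination_centrality",
    "task_distribution_balance",
)


def log_metric_comparison(result) -> None:
    """Log Tier 1 (text) vs Tier 3 (graph) scores side by side."""
    tier1, tier3 = result.tier1, result.tier3  # assumed CompositeResult attributes
    logger.info(
        "Tier 1 (text) overall: %.3f | Tier 3 (graph) overall: %.3f",
        tier1.overall_score,
        tier3.overall_score,
    )
    for name in TEXT_METRICS:
        logger.info("  text/%s = %.3f", name, getattr(tier1, name))
    for name in GRAPH_METRICS:
        logger.info("  graph/%s = %.3f", name, getattr(tier3, name))
```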
#### Feature 4b: Migrate Opik to Logfire + Phoenix Local Tracing
Description: Replace the Opik tracing integration (11 Docker containers, ~155s startup) with the Logfire SDK + Arize Phoenix. `logfire.instrument_pydantic_ai()` auto-instruments all PydanticAI agents natively, eliminating the manual `OpikInstrumentationManager`, `@track` decorators, and `get_opik_decorator()` wrappers. Phoenix receives traces via OTLP and provides a local web UI — all via pip install with zero Docker dependencies.
Acceptance Criteria:
- `pyproject.toml` replaces `opik>=1.8.0` with `arize-phoenix` and `openinference-instrumentation-pydantic-ai`
- `JudgeSettings` replaces `opik_*` fields with `logfire_enabled`, `logfire_send_to_cloud`, `phoenix_endpoint`, `logfire_service_name`
- `LogfireConfig` replaces `OpikConfig` in `load_configs.py`
- `logfire_instrumentation.py` replaces `opik_instrumentation.py` using `logfire.instrument_pydantic_ai()` auto-instrumentation
- `agent_system.py` removes manual `@opik_decorator` wrappers from delegation tools
- `evaluation_pipeline.py` removes the Opik import block and the `_apply_opik_decorator()` / `_record_opik_metadata()` methods
- `CommonSettings.enable_opik` renamed to `enable_logfire`
- Makefile adds `start_phoenix`, `stop_phoenix`, `status_phoenix` targets (Opik targets kept as legacy)
- `.env.example` replaces `OPIK_*` vars with `JUDGE_PHOENIX_*` / `JUDGE_LOGFIRE_*` vars
- `make validate` passes
Technical Requirements:
- Keep `docker-compose.opik.yaml` as optional legacy (not deleted)
- Keep `TraceCollector` (`trace_processors.py`) unchanged — independent local SQLite/JSONL system
- Logfire auto-instrumentation replaces all manual decorator wiring
- Graceful degradation when Phoenix is not running
Files:
- `pyproject.toml`
- `src/app/evals/settings.py`
- `src/app/utils/load_configs.py`
- `src/app/agents/opik_instrumentation.py` (delete)
- `src/app/agents/logfire_instrumentation.py` (new)
- `src/app/agents/agent_system.py`
- `src/app/evals/evaluation_pipeline.py`
- `src/app/common/settings.py`
- `.env.example`
- `Makefile`
- `tests/evals/test_judge_settings.py`
- `tests/common/test_common_settings.py`
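A minimal sketch of the replacement module, assuming spans are routed to Phoenix through the standard OpenTelemetry endpoint variable; `logfire.configure()` and `logfire.instrument_pydantic_ai()` are documented Logfire SDK calls, while the `settings` field names simply mirror the `JudgeSettings` fields listed above:

```python
# logfire_instrumentation.py: minimal sketch
import os

import logfire


def setup_tracing(settings) -> None:
    """Configure Logfire locally and route spans to a local Phoenix instance."""
    if not settings.logfire_enabled:
        return
    # Point the OTel exporter at the local Phoenix collector (e.g. http://localhost:6006).
    os.environ.setdefault("OTEL_EXPORTER_OTLP_ENDPOINT", settings.phoenix_endpoint)
    logfire.configure(
        service_name=settings.logfire_service_name,
        send_to_logfire=settings.logfire_send_to_cloud,
    )
    # Auto-instrument every PydanticAI agent; replaces the manual Opik decorator wiring.
    logfire.instrument_pydantic_ai()
```

Because export happens on a background OTel thread, a stopped Phoenix instance should surface as dropped spans and warnings rather than errors in the agent path, which matches the graceful-degradation requirement.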
#### Feature 4c: Streamlit Evaluation Dashboard + Agent Graph Visualization
Description: Add two new Streamlit pages: an Evaluation Results dashboard displaying Tier 1/2/3 scores with graph vs text metric comparison, and an Agent Graph page rendering the NetworkX delegation graph interactively via Pyvis. Phoenix (localhost:6006) is cross-linked from the sidebar for deep trace inspection.
Acceptance Criteria:
- “Evaluation Results” page displays Tier 1/2/3 scores from `CompositeResult`
- Bar chart compares graph metrics vs text metrics (Tier 1 vs Tier 3)
- Individual metric scores displayed in table format
- “Agent Graph” page renders `export_trace_to_networkx()` output as an interactive Pyvis graph
- Agent nodes and tool nodes visually distinguished (color/shape)
- Sidebar includes Phoenix link with status indicator
- Pages render gracefully with empty/mock data when evaluation hasn’t run
- `pyvis` added to the gui dependency group in `pyproject.toml`
- `make validate` passes
Technical Requirements:
- Use `graph_analysis.export_trace_to_networkx()` (line 426) for graph data
- Use `CompositeResult` / `Tier1Result` / `Tier3Result` models for evaluation data
- Pyvis `Network.from_nx(graph)` → HTML → `st.components.v1.html()`
- Cross-link to Phoenix at `http://localhost:6006` (not embed)
- Follow existing GUI patterns in `src/gui/`
Files:
- `src/gui/pages/evaluation.py` (new)
- `src/gui/pages/agent_graph.py` (new)
- `src/gui/config/config.py` (add pages to `PAGES` list)
- `src/gui/components/sidebar.py` (add Phoenix link)
- `src/run_gui.py` (route new pages)
- `pyproject.toml` (add `pyvis` to gui group)
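A sketch of the Agent Graph page under these requirements; the import path and the `node_type` node attribute are assumptions for illustration:

```python
# src/gui/pages/agent_graph.py: minimal sketch
import streamlit as st
import streamlit.components.v1 as components
from pyvis.network import Network

from app.evals.graph_analysis import export_trace_to_networkx  # assumed import path


def render_agent_graph(trace_data) -> None:
    graph = export_trace_to_networkx(trace_data)
    if graph.number_of_nodes() == 0:
        st.info("No trace data yet. Run a review generation first.")
        return
    net = Network(height="600px", directed=True)
    net.from_nx(graph)
    # Distinguish agent vs tool nodes by color (node_type attribute is assumed).
    for node in net.nodes:
        node["color"] = "#4c78a8" if node.get("node_type") == "agent" else "#f58518"
    # Pyvis renders to HTML; embed it via st.components.v1.html rather than writing a file.
    components.html(net.generate_html(), height=620, scrolling=True)
```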
## Non-Functional Requirements
- Maintainability:
  - Use modular design patterns for easy updates and maintenance.
  - Implement logging and error handling for debugging and monitoring.
- Performance:
  - Ensure low latency in evaluation pipeline execution.
  - Optimize for memory usage during graph analysis.
- Documentation:
  - Comprehensive documentation for setup, usage, and testing.
  - Docstrings for all new functions and classes (Google-style format).
## Out of Scope
- A2A protocol migration (PydanticAI stays)
- Agent system restructuring (`src/app/agents/` unchanged except trace instrumentation)
- Streamlit UI redesign (existing UI stays as-is)
- pytest-bdd / Gherkin scenarios (use pytest + hypothesis instead)
- HuggingFace `datasets` library (use GitHub API downloader instead)
- Google Gemini SDK (`google-genai`) — use OpenAI-spec compatible providers only
- VCR-based network mocking (use `@patch` for unit tests)
- Browser-based E2E tests (Playwright/Selenium deferred)
- CC-style evaluation baselines (deferred)
- E2E integration tests and multi-channel deployment (deferred)
## Notes for Ralph Loop
Story Breakdown - Sprint 2 (6 stories total):
- Feature 1 (Settings Migration) → STORY-001: Migrate `EvaluationConfig` to `JudgeSettings` pydantic-settings
- Feature 2 (Wire Evaluation) → STORY-002: Wire `evaluate_comprehensive` after `run_manager` (depends: STORY-001)
- Feature 3 (Trace Capture) → STORY-003: Capture `GraphTraceData` during MAS execution (depends: STORY-002)
- Feature 4 (Comparison Output) → STORY-004: Add graph vs text metric comparison logging (depends: STORY-003)
- Feature 4b (Opik → Logfire Migration) → STORY-005: Migrate Opik to Logfire + Phoenix local tracing (depends: STORY-001)
- Feature 4c (Streamlit Dashboard) → STORY-006: Streamlit evaluation dashboard + agent graph visualization (depends: STORY-005)