# PRD Sprint 4: Ralph

title: Product Requirements Document: Agents-eval Sprint 4
version: 2.0.0
created: 2026-02-15
updated: 2026-02-15
## Project Overview
Agents-eval evaluates multi-agent AI systems using the PeerRead dataset for scientific paper review assessment. The system generates reviews via a 4-agent delegation pipeline (Manager -> Researcher -> Analyst -> Synthesizer) and evaluates them through a three-tier engine: Tier 1 (traditional text metrics), Tier 2 (LLM-as-Judge), and Tier 3 (graph analysis).
Sprint 3 is complete: plugin architecture, GUI wiring, test alignment, optional weave, and trace quality fixes are all shipped.
Sprint 4 has two goals:
- Operational resilience – graceful degradation for Logfire trace export failures, thread-safe Tier 3 timeout handling, Tier 2 judge fallback validation, and completing test infrastructure alignment.
- CC baseline comparison – compare Claude Code against the PydanticAI MAS in two modes: solo (single CC instance, no orchestration) and teams (CC Agent Teams with delegation). Both modes run with full internal tool, plugin, and MCP access – the same capabilities available to the PydanticAI agents. Artifacts from both modes are parsed into `GraphTraceData` and evaluated through the same three-tier pipeline, enabling a three-way comparison: PydanticAI MAS vs CC solo vs CC teams.
## Functional Requirements

### Sprint 4: Operational Resilience & CC Baseline Comparison

#### Feature 1: Graceful Logfire Trace Export Failures
Description: Suppress noisy exception stack traces when Logfire/OTLP trace export fails due to connection errors (e.g., Opik service not running on `localhost:6006`). Currently, both span and metrics export print full `ConnectionRefusedError` stack traces to stderr multiple times during execution and at shutdown, cluttering logs during normal operation when tracing is unavailable. Affects both CLI (`make run_cli`) and GUI (`make run_gui`) equally.
Acceptance Criteria:

- Logfire initialization catches connection errors and logs a single warning message
- Failed span exports do not print stack traces to stderr during agent runs
- Failed metrics exports do not print stack traces to stderr at shutdown
- When the OTLP endpoint is unreachable, log one warning at initialization (not per-export)
- App continues normal operation when the Logfire endpoint is unavailable (both CLI and GUI)
- When the Opik service is running, traces and metrics export successfully (no regression)
- Suppression works for both `/v1/traces` (spans) and `/v1/metrics` (metrics) endpoints
- Tests: Hypothesis property tests for retry/backoff behavior bounds
- Tests: inline-snapshot for warning message format
- `make validate` passes
- CHANGELOG.md updated
Technical Requirements:

- Add a connection check in `LogfireInstrumentationManager._initialize_logfire()` (`src/app/agents/logfire_instrumentation.py:50-71`), as sketched below
- Catch `requests.exceptions.ConnectionError` during initialization
- Set `self.config.enabled = False` when the OTLP endpoint is unreachable
- Log a single warning: "Logfire tracing unavailable: {endpoint} unreachable (spans and metrics export disabled)"
- Configure the OTLP span exporter with retry backoff to minimize per-span error noise
- Configure the OTLP metrics exporter with retry backoff to minimize per-metric error noise
- Ensure the existing `try/except` at lines 69-71 handles initialization failures
- Suppress OpenTelemetry SDK export errors when the endpoint connection fails (both span and metrics exporters)
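A minimal sketch of the initialization guard, assuming a `requests`-based reachability probe; the `_endpoint_reachable` helper, the probe timeout, and the `config.endpoint` attribute are illustrative assumptions, not the existing API:

```python
import logging

import requests

logger = logging.getLogger(__name__)


def _endpoint_reachable(endpoint: str, timeout: float = 2.0) -> bool:
    """Probe the OTLP endpoint once (hypothetical helper)."""
    try:
        requests.head(endpoint, timeout=timeout)
        return True
    except requests.exceptions.ConnectionError:
        return False


class LogfireInstrumentationManager:
    def _initialize_logfire(self) -> None:
        if not _endpoint_reachable(self.config.endpoint):
            # Disable tracing and warn exactly once instead of per-export.
            self.config.enabled = False
            logger.warning(
                "Logfire tracing unavailable: %s unreachable "
                "(spans and metrics export disabled)",
                self.config.endpoint,
            )
            return
        # ... existing Logfire/OTLP exporter setup continues here ...
```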
Files:

- `src/app/agents/logfire_instrumentation.py`
- `tests/agents/test_logfire_instrumentation.py` (new)
#### Feature 2: Thread-Safe Graph Analysis Timeout Handling
Description: Replace Python signal-based timeouts in Tier 3 graph analysis with thread-safe alternatives. Currently, `_with_timeout()` fails with "signal only works in main thread" when called from Streamlit (non-main thread), causing the `path_convergence` metric to return a 0.0 fallback.
Acceptance Criteria:

- Graph analysis timeout handling works in both main and non-main threads
- `path_convergence` calculation succeeds in the Streamlit GUI (no signal error)
- CLI evaluation continues to work with timeouts (no regression)
- Timeout mechanism uses `concurrent.futures.ThreadPoolExecutor` with a timeout parameter
- Graceful fallback when a timeout occurs (return 0.3, log warning)
- Tests: Hypothesis property tests for timeout bounds (0.0 <= fallback <= 0.5)
- Tests: inline-snapshot for timeout error result structure
- `make validate` passes
- CHANGELOG.md updated
Technical Requirements:

- Replace the signal-based `_with_timeout()` in `src/app/judge/graph_analysis.py:348`
- Implement a thread-safe timeout using `concurrent.futures.ThreadPoolExecutor`:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def _with_timeout(func, *args, timeout=5.0):
    # Avoid the `with ThreadPoolExecutor(...)` form here: its implicit
    # shutdown(wait=True) would block until func finishes even after
    # the timeout fires.
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(func, *args)
        return future.result(timeout=timeout)  # raises TimeoutError on expiry
    finally:
        executor.shutdown(wait=False)
```
- Update the `_calculate_path_convergence()` exception handler (line 342) to catch `concurrent.futures.TimeoutError` (sketched below)
- Maintain existing fallback values: disconnected graph -> 0.2, timeout -> 0.3
- Preserve debug logging for timeout events
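A minimal sketch of the updated exception handler, assuming the fallback values above; `_path_convergence_impl` is a hypothetical name for the wrapped computation:

```python
import logging
from concurrent.futures import TimeoutError

logger = logging.getLogger(__name__)

def _calculate_path_convergence(graph) -> float:
    try:
        return _with_timeout(_path_convergence_impl, graph, timeout=5.0)
    except TimeoutError:
        # Thread-safe timeout fired; keep the documented 0.3 fallback.
        logger.debug("path_convergence timed out, returning fallback 0.3")
        return 0.3
```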
Files:

- `src/app/judge/graph_analysis.py`
- `tests/evals/test_graph_analysis.py` (update timeout tests)
#### Feature 3: Tier 2 Judge Provider Fallback Validation
Description: End-to-end validation that judge provider fallback works correctly. This is a testing and documentation task to confirm that the existing implementation handles missing API keys gracefully.
Acceptance Criteria:

- Integration test: Run evaluation with `tier2_provider=openai` and no `OPENAI_API_KEY` set
- Verify fallback to `tier2_fallback_provider` occurs (check logs)
- Verify Tier 2 metrics use neutral fallback scores (0.5) when all providers are unavailable
- Verify composite score redistributes weights when Tier 2 is skipped
- Verify `Tier2Result` includes a fallback metadata flag
- Update `docs/best-practices/troubleshooting.md` with Tier 2 auth failure guidance
- Tests: inline-snapshot for `Tier2Result` with fallback metadata
- `make validate` passes
- CHANGELOG.md updated
Technical Requirements:

- Create an integration test in `tests/evals/test_llm_evaluation_managers_integration.py` (see the sketch below)
- Test scenarios:
  1. Valid primary provider -> Tier 2 succeeds
  2. Invalid primary + valid fallback -> fallback succeeds
  3. Both providers unavailable -> neutral scores, Tier 2 skipped
- Add a troubleshooting section to `docs/best-practices/troubleshooting.md`:
  - Symptom: "status_code: 401, model_name: gpt-4o-mini"
  - Cause: Missing OPENAI_API_KEY when tier2_provider=openai
  - Solution: Set a valid API key or configure tier2_fallback_provider
- Document expected behavior when Tier 2 is skipped (weight redistribution)
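A minimal sketch of scenario 3, assuming a pytest `monkeypatch` fixture; `run_tier2_evaluation` and the `Tier2Result` field names (`used_fallback`, `scores`) are hypothetical stand-ins for the real entry point and model:

```python
def test_tier2_neutral_fallback_when_no_providers(monkeypatch):
    # Arrange: remove credentials for both primary and fallback providers.
    monkeypatch.delenv("OPENAI_API_KEY", raising=False)
    monkeypatch.delenv("ANTHROPIC_API_KEY", raising=False)

    # Act: run_tier2_evaluation is a hypothetical entry point name.
    result = run_tier2_evaluation(review_text="...", tier2_provider="openai")

    # Assert: neutral scores plus fallback metadata, per the criteria above.
    assert result.used_fallback is True  # hypothetical field name
    assert all(score == 0.5 for score in result.scores.values())
```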
Files:

- `tests/evals/test_llm_evaluation_managers_integration.py` (new)
- `docs/best-practices/troubleshooting.md` (new)
#### Feature 4: Complete Test Suite Alignment
Description: Refactor the remaining test suite to use hypothesis (property-based testing) and inline-snapshot (regression testing), completing the test infrastructure alignment. No production code changes. Covers integration tests, benchmarks, GUI tests, and data utilities not yet converted. Explicitly excludes BDD/Gherkin (pytest-bdd).
Acceptance Criteria:

- Property-based tests using `@given` for data validation (PeerRead dataset schemas, model invariants)
- Property-based tests for integration test invariants (API responses, file I/O operations)
- Property-based tests for GUI state management (session state updates, widget interactions)
- Snapshot tests using `snapshot()` for integration test outputs (trace data, evaluation results)
- Snapshot tests for GUI page rendering outputs (Streamlit component structures)
- Snapshot tests for benchmark result structures
- Remove low-value tests (trivial assertions, field existence checks per testing-strategy.md)
- All existing test coverage maintained or improved
- `make validate` passes
- CHANGELOG.md updated
- Add `from hypothesis import given, strategies as st` imports to relevant test files
- Add `from inline_snapshot import snapshot` imports to relevant test files
- Convert data validation tests to property tests with invariants (schemas always valid)
- Convert integration test outputs to snapshot tests
- Document usage patterns in test files for future reference
- NO pytest-bdd, NO Gherkin, NO BDD methodology (use TDD with hypothesis for properties)
Technical Requirements:

- Apply hypothesis for property-based testing to:
  - Data validation: PeerRead dataset schemas, model serialization
  - Integration tests: API responses, trace data outputs
  - GUI tests: session state updates, widget value bounds
- Apply inline-snapshot for regression testing to:
  - Integration test outputs: evaluation pipeline results, trace data structures
  - GUI rendering: Streamlit page component outputs
  - Benchmark results: performance metric structures
- Remove trivial tests per testing-strategy.md guidelines:
  - Field existence checks (Pydantic models already validate)
  - Simple getter/setter tests
  - Tests that duplicate type checker validation
- Maintain coverage thresholds (no reduction in coverage percentage)
- Document patterns for future test authoring (see the example below)
Priority Test Areas (from testing-strategy.md):
- CRITICAL: Data validation (PeerRead dataset schemas, trace data formats)
- CRITICAL: Integration test invariants (end-to-end evaluation flows)
- HIGH: GUI state management (session state persistence, provider selection)
- HIGH: Serialization (integration test result structures)
- MEDIUM: Benchmark output validation (performance metric consistency)
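A minimal sketch of the two patterns side by side, assuming a hypothetical `ReviewRecord` Pydantic model; the strategies, field names, and snapshot values are illustrative, not actual project schemas:

```python
from hypothesis import given, strategies as st
from inline_snapshot import snapshot
from pydantic import BaseModel

class ReviewRecord(BaseModel):  # hypothetical stand-in for a PeerRead schema
    paper_id: str
    score: float

@given(paper_id=st.text(min_size=1), score=st.floats(min_value=0.0, max_value=1.0))
def test_review_record_roundtrip(paper_id: str, score: float):
    # Property: serialization round-trips for any valid input.
    record = ReviewRecord(paper_id=paper_id, score=score)
    assert ReviewRecord.model_validate(record.model_dump()) == record

def test_review_record_structure():
    # Snapshot: inline-snapshot records the expected structure in place.
    record = ReviewRecord(paper_id="acl-001", score=0.5)
    assert record.model_dump() == snapshot({"paper_id": "acl-001", "score": 0.5})
```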
Files:

- `tests/app/test_evaluation_wiring.py` (snapshot for evaluation outputs)
- `tests/benchmarks/test_performance_baselines.py` (snapshot for benchmark results)
- `tests/data_utils/test_datasets_peerread.py` (property tests for schemas)
- `tests/evals/test_opik_metrics.py` (property tests for metric bounds)
- `tests/integration/test_enhanced_peerread_integration.py` (snapshot for integration outputs)
- `tests/integration/test_opik_integration.py` (snapshot for trace outputs)
- `tests/integration/test_peerread_integration.py` (property tests + snapshots)
- `tests/integration/test_peerread_real_dataset_validation.py` (property tests for real data)
- `tests/metrics/test_metrics_output_similarity.py` (property tests for similarity bounds)
- `tests/test_gui/test_agent_graph_page.py` (snapshot for GUI components)
- `tests/test_gui/test_evaluation_page.py` (snapshot for GUI outputs)
- `tests/test_gui/test_sidebar_phoenix.py` (snapshot for sidebar structure)
#### Feature 5: CC Trace Adapter
Description: Parse Claude Code artifacts into `GraphTraceData` format in two modes so CC runs can be evaluated through the same three-tier pipeline used for PydanticAI MAS runs. Both modes assume CC has full internal tool, plugin, and MCP access (the same capabilities as the PydanticAI agents).
- Solo mode: Parse a CC session export directory containing conversation history and tool-call logs from a single CC instance (no orchestration). Produces a single-agent `GraphTraceData` with `tool_calls` and `timing_data` but minimal `agent_interactions` and no `coordination_events`.
- Teams mode: Parse CC Agent Teams artifacts (`~/.claude/teams/`, `~/.claude/tasks/`) from a multi-agent CC run with delegation. Produces a multi-agent `GraphTraceData` with full `agent_interactions`, `tool_calls`, `timing_data`, and `coordination_events`.
Acceptance Criteria:

- Output `GraphTraceData` instance passes existing Tier 3 graph analysis without modification in both modes
- Auto-detect mode from directory structure (presence of `config.json` with a `members` array indicates teams; otherwise solo)
- Graceful error handling when CC artifact directories are missing or malformed
- Tests: Hypothesis property tests for data mapping invariants (all fields populated, timestamps ordered) in both modes
- Tests: inline-snapshot for `GraphTraceData` output structure from sample CC artifacts (one solo, one teams)
- `make validate` passes
- CHANGELOG.md updated
##### 5.1 Teams Mode
Acceptance Criteria:

- Adapter reads CC team config from `config.json` and extracts `execution_id` from the team name
- Adapter parses `inboxes/*.json` messages into the `agent_interactions` list
- Adapter parses `tasks/*.json` completions into the `tool_calls` list (task completions as proxy)
- Adapter derives `timing_data` from first/last timestamps across all artifacts
- Adapter extracts `coordination_events` from task assignments and blocked-by relationships
##### 5.2 Solo Mode
Acceptance Criteria:

- Adapter reads the CC session export directory and extracts `execution_id` from session metadata
- Adapter parses tool-call entries from session logs into the `tool_calls` list
- Adapter derives `timing_data` from session start/end timestamps
- `agent_interactions` is empty or contains only user-agent exchanges
- `coordination_events` is empty (single agent, no delegation)
Technical Requirements:

- Create a `CCTraceAdapter` class that accepts a CC artifacts directory path and auto-detects mode (see the sketch below)
- Teams mode data mapping from CC artifacts to `GraphTraceData`:

| GraphTraceData field | CC source | Mapping |
|---|---|---|
| `execution_id` | `config.json` team name | Direct |
| `agent_interactions` | `inboxes/*.json` messages | `{"from": sender, "to": recipient, "type": msg_type, "timestamp": ts}` |
| `tool_calls` | `tasks/*.json` completions | `{"agent_id": owner, "tool_name": subject, "success": completed, "duration": derived}` |
| `timing_data` | First/last timestamps | `{"start_time": min, "end_time": max, "total_duration": delta}` |
| `coordination_events` | `tasks/*.json` assignments + blocks | `{"coordination_type": "task_delegation", "manager_agent": lead, "target_agents": [owner]}` |

- Solo mode data mapping:

| GraphTraceData field | CC source | Mapping |
|---|---|---|
| `execution_id` | Session directory name or metadata | Direct |
| `agent_interactions` | None (single agent) | Empty list |
| `tool_calls` | Session tool-call log entries | `{"agent_id": "cc-solo", "tool_name": tool_name, "success": bool, "duration": derived}` |
| `timing_data` | Session start/end timestamps | `{"start_time": min, "end_time": max, "total_duration": delta}` |
| `coordination_events` | None (single agent) | Empty list |

- Post-hoc parsing of CC artifacts (not live OTel) – CC Agent Teams do not store tool-level traces, so task completions serve as a proxy for `tool_calls` in teams mode
- Validate parsed data against the existing `GraphTraceData` Pydantic model
- Return an empty/default `GraphTraceData` when the artifacts directory is invalid (log warning, do not raise)
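A minimal sketch of the adapter skeleton and its mode auto-detection, assuming the directory layout above; the `_parse_teams`/`_parse_solo` helpers and a default-constructible `GraphTraceData` are illustrative assumptions:

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

class CCTraceAdapter:
    def __init__(self, artifacts_dir: str | Path) -> None:
        self.artifacts_dir = Path(artifacts_dir)

    def detect_mode(self) -> str:
        # Teams artifacts carry a config.json with a "members" array.
        config_path = self.artifacts_dir / "config.json"
        if config_path.is_file():
            try:
                config = json.loads(config_path.read_text())
                if isinstance(config.get("members"), list):
                    return "teams"
            except (json.JSONDecodeError, OSError):
                logger.warning("Malformed config.json in %s", self.artifacts_dir)
        return "solo"

    def parse(self) -> "GraphTraceData":
        if not self.artifacts_dir.is_dir():
            # Per the requirements: warn and return defaults, never raise.
            logger.warning("CC artifacts directory missing: %s", self.artifacts_dir)
            return GraphTraceData()  # assumes defaultable fields
        mode = self.detect_mode()
        return self._parse_teams() if mode == "teams" else self._parse_solo()
```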
Files:

- `src/app/judge/cc_trace_adapter.py` (new)
- `tests/judge/test_cc_trace_adapter.py` (new)
#### Feature 6: Baseline Comparison Engine
Description: New `BaselineComparison` Pydantic model and comparison logic to diff `CompositeResult` instances across three systems: PydanticAI MAS, CC solo (no orchestration), and CC teams (with orchestration). The pairwise `compare()` function diffs any two `CompositeResult` instances; a `compare_all()` convenience function produces all three pairwise comparisons at once. Reuses the existing `CompositeResult` model and `CompositeScorer.extract_metric_values()`.
Acceptance Criteria:

- `BaselineComparison` Pydantic model with fields: `label_a`, `label_b`, `result_a`, `result_b`, `metric_deltas`, `tier_deltas`, `summary`
- `compare(result_a, result_b, label_a, label_b)` accepts two `CompositeResult` instances and returns a `BaselineComparison`
- `compare_all(pydantic_result, cc_solo_result, cc_teams_result)` returns a list of 3 `BaselineComparison` (PydanticAI vs CC-solo, PydanticAI vs CC-teams, CC-solo vs CC-teams)
- `compare_all()` accepts `None` for any result and skips comparisons involving that result
- `metric_deltas` contains a per-metric delta for all 6 composite metrics
- `tier_deltas` contains tier-level score differences (Tier 1, Tier 2, Tier 3)
- `summary` is a human-readable comparison string (e.g., "PydanticAI scored +0.12 higher on technical_accuracy vs CC-solo")
- Handles missing tiers gracefully (one system has Tier 2, the other does not)
- Tests: Hypothesis property tests for delta symmetry (swap inputs -> negated deltas)
- Tests: inline-snapshot for `BaselineComparison` model dump structure
- Tests: inline-snapshot for `compare_all()` output with one None result
- `make validate` passes
- CHANGELOG.md updated
Technical Requirements:

- Create the `BaselineComparison` Pydantic model:
  - `label_a: str` – human label for the first system (e.g., "PydanticAI MAS")
  - `label_b: str` – human label for the second system (e.g., "CC-solo")
  - `result_a: CompositeResult` – first system evaluation
  - `result_b: CompositeResult` – second system evaluation
  - `metric_deltas: dict[str, float]` – per-metric delta (6 composite metrics)
  - `tier_deltas: dict[str, float]` – tier-level score differences
  - `summary: str` – human-readable comparison
- Create the `compare(result_a: CompositeResult, result_b: CompositeResult, label_a: str, label_b: str) -> BaselineComparison` function
- Create the `compare_all(pydantic_result: CompositeResult | None, cc_solo_result: CompositeResult | None, cc_teams_result: CompositeResult | None) -> list[BaselineComparison]` convenience function
- Reuse `CompositeScorer.extract_metric_values()` (`src/app/judge/composite_scorer.py:164`) to extract per-metric values from each result
- Compute deltas as `value_a - value_b` for each metric
- Generate a summary string listing metrics where the delta exceeds a 0.05 threshold (see the sketch below)
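A minimal sketch of the model and the pairwise `compare()` helper, following the field list above; the `extract_metric_values()` call signature is an assumption, `tier_deltas` is left empty, and `Any` stands in for `CompositeResult` to keep the snippet self-contained:

```python
from typing import Any

from pydantic import BaseModel

class BaselineComparison(BaseModel):
    label_a: str
    label_b: str
    result_a: Any  # CompositeResult in the real implementation
    result_b: Any  # CompositeResult in the real implementation
    metric_deltas: dict[str, float]
    tier_deltas: dict[str, float]
    summary: str

def compare(result_a, result_b, label_a: str, label_b: str) -> BaselineComparison:
    # extract_metric_values() signature is assumed; see composite_scorer.py:164.
    values_a = CompositeScorer.extract_metric_values(result_a)
    values_b = CompositeScorer.extract_metric_values(result_b)
    deltas = {name: values_a[name] - values_b[name] for name in values_a}
    # Summary lists only metrics whose delta exceeds the 0.05 threshold.
    notable = [f"{label_a} scored {d:+.2f} on {name} vs {label_b}"
               for name, d in deltas.items() if abs(d) > 0.05]
    return BaselineComparison(
        label_a=label_a, label_b=label_b,
        result_a=result_a, result_b=result_b,
        metric_deltas=deltas,
        tier_deltas={},  # tier-level diffing omitted in this sketch
        summary="; ".join(notable) or "No metric delta exceeded 0.05",
    )
```

Delta symmetry follows directly from `value_a - value_b`: swapping the inputs negates every entry of `metric_deltas`, which is the property the Hypothesis tests assert.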
Files:

- `src/app/judge/baseline_comparison.py` (new)
- `src/app/data_models/evaluation_models.py` (add `BaselineComparison` model)
- `tests/judge/test_baseline_comparison.py` (new)
#### Feature 7: CLI & GUI Baseline Integration
Description: Wire the CC trace adapter and baseline comparison engine into the existing CLI and GUI so users can run side-by-side evaluations. Supports two CC baseline modes: solo (single CC instance, no orchestration) and teams (CC Agent Teams with delegation). Both modes assume CC had full internal tool, plugin, and MCP access during the run being evaluated.
Acceptance Criteria:

- CLI: `--cc-solo-dir PATH` flag accepts a path to a CC solo session export directory
- CLI: `--cc-teams-dir PATH` flag accepts a path to a CC Agent Teams artifacts directory
- CLI: Both flags can be provided together for a three-way comparison (PydanticAI vs CC-solo vs CC-teams)
- CLI: Adapter auto-detects mode per directory; flags override auto-detection
- CLI: Baseline comparison(s) printed to console after standard evaluation output
- GUI: Baseline comparison view on the evaluation results page (side-by-side metrics display)
- GUI: Separate directory inputs for CC solo and CC teams artifacts
- GUI: Three-way comparison table when both CC baselines are provided
- Both CLI and GUI skip baseline comparison when no CC artifacts are provided (no regression)
- Tests: inline-snapshot for CLI output with single baseline and three-way comparison
- Tests: Hypothesis property tests for GUI state management with baseline data
- `make validate` passes
- CHANGELOG.md updated
Technical Requirements:

- CLI: Add `--cc-solo-dir` and `--cc-teams-dir` arguments to the CLI entry point (see the sketch below)
- CLI: For each provided directory, call `CCTraceAdapter(path).parse()` to get CC `GraphTraceData`, then run it through the `evaluate_comprehensive()` pipeline
- CLI: Call `compare_all()` with available results (pass `None` for missing baselines) and print each `BaselineComparison.summary`
- GUI: Add a baseline section to the evaluation results page using existing Streamlit patterns
- GUI: Display `metric_deltas` as a side-by-side bar chart and `summary` as text for each pairwise comparison
- All traces go through the same evaluation pipeline (`evaluate_comprehensive()`)
- Reuse existing GUI evaluation page patterns (`src/gui/pages/evaluation.py`)
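A minimal sketch of the CLI wiring, assuming an argparse-based entry point (the actual framework in `src/app/app.py` may differ); the `add_baseline_args`/`run_baselines` helper names are illustrative, while `CCTraceAdapter`, `evaluate_comprehensive`, and `compare_all` follow the requirements above:

```python
import argparse

def add_baseline_args(parser: argparse.ArgumentParser) -> None:
    parser.add_argument("--cc-solo-dir", type=str, default=None,
                        help="Path to a CC solo session export directory")
    parser.add_argument("--cc-teams-dir", type=str, default=None,
                        help="Path to a CC Agent Teams artifacts directory")

def run_baselines(args: argparse.Namespace, pydantic_result) -> None:
    # Parse and evaluate whichever CC baselines were provided.
    cc_solo = cc_teams = None
    if args.cc_solo_dir:
        cc_solo = evaluate_comprehensive(CCTraceAdapter(args.cc_solo_dir).parse())
    if args.cc_teams_dir:
        cc_teams = evaluate_comprehensive(CCTraceAdapter(args.cc_teams_dir).parse())
    # compare_all() skips comparisons whose result is None.
    for comparison in compare_all(pydantic_result, cc_solo, cc_teams):
        print(comparison.summary)
```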
Files:

- `src/app/app.py` (add `--cc-solo-dir` and `--cc-teams-dir` CLI flags)
- `src/gui/pages/evaluation.py` (add baseline comparison view)
- `tests/app/test_cli_baseline.py` (new)
- `tests/test_gui/test_evaluation_baseline.py` (new)
## Non-Functional Requirements
- Maintainability:
  - Use modular design patterns for easy updates and maintenance.
  - Implement logging and error handling for debugging and monitoring.
  - Graceful degradation when external services are unavailable.
- Performance:
  - Timeout mechanisms must not introduce significant latency overhead.
  - Thread-safe implementations should minimize thread pool creation overhead.
  - CC trace adapter must parse typical team artifacts (< 50 files) in under 2 seconds.
- Documentation:
  - Comprehensive troubleshooting guide for common operational issues.
  - Docstrings for all new functions and classes (Google style format).
- Testing:
  - All new features must include tests per `docs/best-practices/testing-strategy.md`
  - Use Hypothesis (`@given`) for property-based tests: timeout bounds, retry behavior, score fallbacks, data mapping invariants, delta symmetry
  - Use inline-snapshot (`snapshot()`) for regression tests: warning messages, error result structures, trace adapter output, comparison model dumps
  - Use pytest for standard unit/integration tests with Arrange-Act-Assert structure
  - Tool selection: pytest for logic, Hypothesis for properties, inline-snapshot for structure
## Out of Scope
- Opik service auto-start on GUI launch (user must manually run `make start_opik`)
- Custom OTLP exporter implementation (use standard OpenTelemetry libraries)
- Tier 3 graph analysis performance optimization (timeout mechanism only)
- Alternative tracing backends (Phoenix/Logfire only)
- Persistent retry queues for failed trace exports (in-memory only)
- Gemini provider compatibility (`agent_system.py:610` FIXME – deferred to future sprint)
- HuggingFace provider implementation (deferred to future sprint)
- Streaming with Pydantic model outputs (`agent_system.py:522` – deferred to future sprint)
- CC OpenTelemetry live telemetry (post-hoc artifact parsing only)
- OTel Collector Docker deployment for CC traces
- CC native span creation or instrumentation
- A2A (Agent-to-Agent) protocol integration
- Provisioning CC tool/plugin/MCP access (assumed pre-configured by the user before the CC run)
## Notes for Ralph Loop
Story Breakdown - Sprint 4 (7 stories total):
- Feature 1 (Logfire Export) → STORY-001: Graceful Logfire trace export failures
- Feature 2 (Graph Timeout) → STORY-002: Thread-safe graph analysis timeout handling
- Feature 3 (Judge Fallback Validation) → STORY-003: Tier 2 judge provider fallback validation
- Feature 4 (Complete Test Alignment) → STORY-004: Complete test suite alignment with hypothesis and inline-snapshot
- Feature 5 (CC Trace Adapter) → STORY-005: CC trace adapter for solo and teams artifacts
- Feature 6 (Baseline Comparison) → STORY-006: Baseline comparison engine for CompositeResult diffing
- Feature 7 (CLI & GUI Baseline) → STORY-007: CLI and GUI baseline integration (depends: STORY-002, STORY-005, STORY-006)