PRD Sprint4 Ralph


title: Product Requirements Document: Agents-eval Sprint 4
version: 2.0.0
created: 2026-02-15
updated: 2026-02-15


Project Overview

Agents-eval evaluates multi-agent AI systems using the PeerRead dataset for scientific paper review assessment. The system generates reviews via a 4-agent delegation pipeline (Manager -> Researcher -> Analyst -> Synthesizer) and evaluates them through a three-tier engine: Tier 1 (traditional text metrics), Tier 2 (LLM-as-Judge), and Tier 3 (graph analysis).

Sprint 3 is complete: plugin architecture, GUI wiring, test alignment, optional weave, and trace quality fixes are all shipped.

Sprint 4 has two goals:

  1. Operational resilience – graceful degradation for Logfire trace export failures, thread-safe Tier 3 timeout handling, Tier 2 judge fallback validation, and completing test infrastructure alignment.
  2. CC baseline comparison – compare Claude Code against the PydanticAI MAS in two modes: solo (single CC instance, no orchestration) and teams (CC Agent Teams with delegation). Both modes run with full internal tool, plugin, and MCP access – the same capabilities available to the PydanticAI agents. Artifacts from both modes are parsed into GraphTraceData and evaluated through the same three-tier pipeline, enabling a three-way comparison: PydanticAI MAS vs CC solo vs CC teams.

Functional Requirements

Sprint 4: Operational Resilience & CC Baseline Comparison

Feature 1: Graceful Logfire Trace Export Failures

Description: Suppress noisy exception stack traces when Logfire/OTLP trace export fails due to connection errors (e.g., Opik service not running on localhost:6006). Currently, both span and metrics exports print full ConnectionRefusedError stack traces to stderr multiple times during execution and at shutdown, cluttering logs during normal operation when tracing is unavailable. Affects both the CLI (make run_cli) and the GUI (make run_gui) equally.

Acceptance Criteria:

  • Logfire initialization catches connection errors and logs a single warning message
  • Failed span exports do not print stack traces to stderr during agent runs
  • Failed metrics exports do not print stack traces to stderr at shutdown
  • When OTLP endpoint is unreachable, log one warning at initialization (not per-export)
  • App continues normal operation when Logfire endpoint unavailable (both CLI and GUI)
  • When Opik service is running, traces and metrics export successfully (no regression)
  • Suppression works for both /v1/traces/v1/traces (spans) and /v1/traces/v1/metrics (metrics) endpoints
  • Tests: Hypothesis property tests for retry/backoff behavior bounds
  • Tests: inline-snapshot for warning message format
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • Add connection check in LogfireInstrumentationManager._initialize_logfire() (src/app/agents/logfire_instrumentation.py:50-71)
  • Catch requests.exceptions.ConnectionError during initialization
  • Set self.config.enabled = False when OTLP endpoint unreachable
  • Log single warning: “Logfire tracing unavailable: {endpoint} unreachable (spans and metrics export disabled)”
  • Configure OTLP span exporter with retry backoff to minimize per-span error noise
  • Configure OTLP metrics exporter with retry backoff to minimize per-metric error noise
  • Ensure existing try/except at line 69-71 handles initialization failures
  • Suppress OpenTelemetry SDK export errors when endpoint connection fails (both span and metrics exporters)
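
A minimal sketch of the initialization-time check described above, assuming a requests-based reachability probe; check_otlp_endpoint and the wiring comment are illustrative, not the actual LogfireInstrumentationManager implementation:

import logging

import requests

logger = logging.getLogger(__name__)


def check_otlp_endpoint(endpoint: str, timeout_s: float = 2.0) -> bool:
    """Return True when the OTLP endpoint accepts connections."""
    try:
        requests.get(endpoint, timeout=timeout_s)
    except requests.exceptions.ConnectionError:
        logger.warning(
            "Logfire tracing unavailable: %s unreachable "
            "(spans and metrics export disabled)",
            endpoint,
        )
        return False
    return True


# Inside _initialize_logfire() (hypothetical wiring; attribute names assumed):
#     if not check_otlp_endpoint(self.config.endpoint):
#         self.config.enabled = False
#         return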

Files:

  • src/app/agents/logfire_instrumentation.py
  • tests/agents/test_logfire_instrumentation.py (new)

Feature 2: Thread-Safe Graph Analysis Timeout Handling

Description: Replace Python signal-based timeouts in Tier 3 graph analysis with thread-safe alternatives. Currently, _with_timeout() fails with “signal only works in main thread” when called from Streamlit (non-main thread), causing path_convergence metric to return 0.0 fallback.

Acceptance Criteria:

  • Graph analysis timeout handling works in both main and non-main threads
  • path_convergence calculation succeeds in Streamlit GUI (no signal error)
  • CLI evaluation continues to work with timeouts (no regression)
  • Timeout mechanism uses concurrent.futures.ThreadPoolExecutor with timeout parameter
  • Graceful fallback when timeout occurs (return 0.3, log warning)
  • Tests: Hypothesis property tests for timeout bounds (0.0 <= fallback <= 0.5)
  • Tests: inline-snapshot for timeout error result structure
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • Replace signal-based _with_timeout() in src/app/judge/graph_analysis.py:348
  • Implement thread-safe timeout using concurrent.futures.ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def _with_timeout(func, *args, timeout=5.0):
    # Run func in a worker thread so the timeout also works off the main thread
    # (the old signal-based approach only works in the main thread).
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(func, *args)
        # Raises concurrent.futures.TimeoutError when func does not finish in time.
        # Note: on context exit the executor still waits for the worker to finish.
        return future.result(timeout=timeout)
  • Update _calculate_path_convergence() exception handler (line 342) to catch concurrent.futures.TimeoutError
  • Maintain existing fallback values: disconnected graph -> 0.2, timeout -> 0.3
  • Preserve debug logging for timeout events
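
For illustration, the caller-side fallback (building on the _with_timeout sketch above) might look like this; compute_convergence is a stand-in for the real metric computation, not the actual function name:

import logging
from concurrent.futures import TimeoutError as FuturesTimeoutError

logger = logging.getLogger(__name__)


def path_convergence_with_fallback(graph, compute_convergence, timeout: float = 5.0) -> float:
    """Run the metric with a thread-safe timeout; return the 0.3 fallback on timeout."""
    try:
        return _with_timeout(compute_convergence, graph, timeout=timeout)
    except FuturesTimeoutError:
        logger.debug("path_convergence timed out after %.1fs; falling back to 0.3", timeout)
        return 0.3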

Files:

  • src/app/judge/graph_analysis.py
  • tests/evals/test_graph_analysis.py (update timeout tests)

Feature 3: Tier 2 Judge Provider Fallback Validation

Description: End-to-end validation that judge provider fallback works correctly. This is a testing and documentation task to confirm the existing implementation handles missing API keys gracefully.

Acceptance Criteria:

  • Integration test: Run evaluation with tier2_provider=openai and no OPENAI_API_KEY set
  • Verify fallback to tier2_fallback_provider occurs (check logs)
  • Verify Tier 2 metrics use neutral fallback scores (0.5) when all providers unavailable
  • Verify composite score redistributes weights when Tier 2 is skipped
  • Verify Tier2Result includes fallback metadata flag
  • Update docs/best-practices/troubleshooting.md with Tier 2 auth failure guidance
  • Tests: inline-snapshot for Tier2Result with fallback metadata
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • Create integration test in tests/evals/test_llm_evaluation_managers_integration.py
  • Test scenarios:
    1. Valid primary provider -> Tier 2 succeeds
    2. Invalid primary + valid fallback -> fallback succeeds
    3. Both providers unavailable -> neutral scores, Tier 2 skipped
  • Add troubleshooting section to docs/best-practices/troubleshooting.md:
  • Symptom: “status_code: 401, model_name: gpt-4o-mini”
  • Cause: Missing OPENAI_API_KEY when tier2_provider=openai
  • Solution: Set valid API key or configure tier2_fallback_provider
  • Document expected behavior when Tier 2 is skipped (weight redistribution)
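
To make the documented weight-redistribution behavior concrete, a small illustration follows; the tier names, weights, and proportional-redistribution scheme are placeholders, not the project's actual configuration:

def redistribute_weights(weights: dict[str, float], skipped: str) -> dict[str, float]:
    """Spread the skipped tier's weight proportionally across the remaining tiers (assumed scheme)."""
    remaining = {tier: weight for tier, weight in weights.items() if tier != skipped}
    total = sum(remaining.values())
    return {tier: weight / total for tier, weight in remaining.items()}


print(redistribute_weights({"tier1": 0.3, "tier2": 0.4, "tier3": 0.3}, skipped="tier2"))
# {'tier1': 0.5, 'tier3': 0.5}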

Files:

  • tests/evals/test_llm_evaluation_managers_integration.py (new)
  • docs/best-practices/troubleshooting.md (new)

Feature 4: Complete Test Suite Alignment

Description: Refactor the remaining test suite to use hypothesis (property-based testing) and inline-snapshot (regression testing), completing the test infrastructure alignment. No production code changes. Covers integration tests, benchmarks, GUI tests, and data utilities not yet converted. Explicitly excludes BDD/Gherkin (pytest-bdd).

Acceptance Criteria:

  • Property-based tests using @given for data validation (PeerRead dataset schemas, model invariants)
  • Property-based tests for integration test invariants (API responses, file I/O operations)
  • Property-based tests for GUI state management (session state updates, widget interactions)
  • Snapshot tests using snapshot() for integration test outputs (trace data, evaluation results)
  • Snapshot tests for GUI page rendering outputs (Streamlit component structures)
  • Snapshot tests for benchmark result structures
  • Remove low-value tests (trivial assertions, field existence checks per testing-strategy.md)
  • All existing test coverage maintained or improved
  • make validate passes
  • CHANGELOG.md updated
  • Add from hypothesis import given, strategies as st imports to relevant test files
  • Add from inline_snapshot import snapshot imports to relevant test files
  • Convert data validation tests to property tests with invariants (schemas always valid)
  • Convert integration test outputs to snapshot tests
  • Document usage patterns in test files for future reference
  • NO pytest-bdd, NO Gherkin, NO BDD methodology (use TDD with hypothesis for properties)

Technical Requirements:

  • Apply hypothesis for property-based testing to:
  • Data validation: PeerRead dataset schemas, model serialization
  • Integration tests: API responses, trace data outputs
  • GUI tests: Session state updates, widget value bounds
  • Apply inline-snapshot for regression testing to:
  • Integration test outputs: evaluation pipeline results, trace data structures
  • GUI rendering: Streamlit page component outputs
  • Benchmark results: performance metric structures
  • Remove trivial tests per testing-strategy.md guidelines:
  • Field existence checks (Pydantic models already validate)
  • Simple getter/setter tests
  • Tests that duplicate type checker validation
  • Maintain coverage thresholds (no reduction in coverage percentage)
  • Document patterns for future test authoring
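
As a pattern reference for the conversions above (concrete tests will target real project models), a property test and an inline snapshot test typically look like this; clamp_score is an illustrative helper, not project code:

from hypothesis import given, strategies as st
from inline_snapshot import snapshot


def clamp_score(value: float) -> float:
    """Illustrative helper: clamp a metric score into [0.0, 1.0]."""
    return min(max(value, 0.0), 1.0)


@given(st.floats(allow_nan=False, allow_infinity=False))
def test_clamped_score_stays_in_bounds(value):
    # Property: the invariant holds for any finite float input.
    assert 0.0 <= clamp_score(value) <= 1.0


def test_result_structure_snapshot():
    # Regression: inline-snapshot can fill in the expected value on first run.
    result = {"metric": "path_convergence", "score": clamp_score(1.7)}
    assert result == snapshot({"metric": "path_convergence", "score": 1.0})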

Priority Test Areas (from testing-strategy.md):

  • CRITICAL: Data validation (PeerRead dataset schemas, trace data formats)
  • CRITICAL: Integration test invariants (end-to-end evaluation flows)
  • HIGH: GUI state management (session state persistence, provider selection)
  • HIGH: Serialization (integration test result structures)
  • MEDIUM: Benchmark output validation (performance metric consistency)

Files:

  • tests/app/test_evaluation_wiring.py (snapshot for evaluation outputs)
  • tests/benchmarks/test_performance_baselines.py (snapshot for benchmark results)
  • tests/data_utils/test_datasets_peerread.py (property tests for schemas)
  • tests/evals/test_opik_metrics.py (property tests for metric bounds)
  • tests/integration/test_enhanced_peerread_integration.py (snapshot for integration outputs)
  • tests/integration/test_opik_integration.py (snapshot for trace outputs)
  • tests/integration/test_peerread_integration.py (property tests + snapshots)
  • tests/integration/test_peerread_real_dataset_validation.py (property tests for real data)
  • tests/metrics/test_metrics_output_similarity.py (property tests for similarity bounds)
  • tests/test_gui/test_agent_graph_page.py (snapshot for GUI components)
  • tests/test_gui/test_evaluation_page.py (snapshot for GUI outputs)
  • tests/test_gui/test_sidebar_phoenix.py (snapshot for sidebar structure)

Feature 5: CC Trace Adapter

Description: Parse Claude Code artifacts into GraphTraceData format in two modes so CC runs can be evaluated through the same three-tier pipeline used for PydanticAI MAS runs. Both modes assume CC has full internal tool, plugin, and MCP access (the same capabilities as the PydanticAI agents).

  • Solo mode: Parse a CC session export directory containing conversation history and tool-call logs from a single CC instance (no orchestration). Produces a single-agent GraphTraceData with tool_calls and timing_data but minimal agent_interactions and no coordination_events.
  • Teams mode: Parse CC Agent Teams artifacts (~/.claude/teams/, ~/.claude/tasks/) from a multi-agent CC run with delegation. Produces a multi-agent GraphTraceData with full agent_interactions, tool_calls, timing_data, and coordination_events.

Acceptance Criteria:

  • Output GraphTraceData instance passes existing Tier 3 graph analysis without modification in both modes
  • Auto-detect mode from directory structure (presence of config.json with members array indicates teams; otherwise solo)
  • Graceful error handling when CC artifact directories are missing or malformed
  • Tests: Hypothesis property tests for data mapping invariants (all fields populated, timestamps ordered) in both modes
  • Tests: inline-snapshot for GraphTraceData output structure from sample CC artifacts (one solo, one teams)
  • make validate passes
  • CHANGELOG.md updated

5.1 Teams Mode

Acceptance Criteria:

  • Adapter reads CC team config from config.json and extracts execution_id from team name
  • Adapter parses inboxes/*.json messages into agent_interactions list
  • Adapter parses tasks/*.json completions into tool_calls list (task completions as proxy)
  • Adapter derives timing_data from first/last timestamps across all artifacts
  • Adapter extracts coordination_events from task assignments and blocked-by relationships

5.2 Solo Mode

Acceptance Criteria:

  • Adapter reads CC session export directory and extracts execution_id from session metadata
  • Adapter parses tool-call entries from session logs into tool_calls list
  • Adapter derives timing_data from session start/end timestamps
  • agent_interactions is empty or contains only user-agent exchanges
  • coordination_events is empty (single agent, no delegation)

Technical Requirements:

  • Create CCTraceAdapter class that accepts a CC artifacts directory path and auto-detects mode
  • Teams mode data mapping from CC artifacts to GraphTraceData:
| GraphTraceData field | CC source | Mapping |
| --- | --- | --- |
| execution_id | config.json team name | Direct |
| agent_interactions | inboxes/*.json messages | {"from": sender, "to": recipient, "type": msg_type, "timestamp": ts} |
| tool_calls | tasks/*.json completions | {"agent_id": owner, "tool_name": subject, "success": completed, "duration": derived} |
| timing_data | First/last timestamps | {"start_time": min, "end_time": max, "total_duration": delta} |
| coordination_events | tasks/*.json assignments + blocks | {"coordination_type": "task_delegation", "manager_agent": lead, "target_agents": [owner]} |
  • Solo mode data mapping:
| GraphTraceData field | CC source | Mapping |
| --- | --- | --- |
| execution_id | Session directory name or metadata | Direct |
| agent_interactions | None (single agent) | Empty list |
| tool_calls | Session tool-call log entries | {"agent_id": "cc-solo", "tool_name": tool_name, "success": bool, "duration": derived} |
| timing_data | Session start/end timestamps | {"start_time": min, "end_time": max, "total_duration": delta} |
| coordination_events | None (single agent) | Empty list |
  • Post-hoc parsing of CC artifacts (not live OTel) – CC Agent Teams do not store tool-level traces, so task completions serve as proxy for tool_calls in teams mode
  • Validate parsed data against existing GraphTraceData Pydantic model
  • Return empty/default GraphTraceData when artifacts directory is invalid (log warning, do not raise)
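
A minimal sketch of the mode auto-detection rule described above (config.json with a members array implies teams, otherwise solo); CCTraceAdapter itself is not shown and the detection details are assumptions:

import json
from pathlib import Path


def detect_mode(artifacts_dir: Path) -> str:
    """Return 'teams' when a team config with a members array exists, else 'solo'."""
    config_path = artifacts_dir / "config.json"
    if config_path.is_file():
        try:
            config = json.loads(config_path.read_text())
        except (json.JSONDecodeError, OSError):
            return "solo"
        if isinstance(config, dict) and isinstance(config.get("members"), list):
            return "teams"
    return "solo"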

Files:

  • src/app/judge/cc_trace_adapter.py (new)
  • tests/judge/test_cc_trace_adapter.py (new)

Feature 6: Baseline Comparison Engine

Description: New BaselineComparison Pydantic model and comparison logic to diff CompositeResult instances across three systems: PydanticAI MAS, CC solo (no orchestration), and CC teams (with orchestration). The pairwise compare() function diffs any two CompositeResult instances; a compare_all() convenience function produces all three pairwise comparisons at once. Reuses existing CompositeResult model and CompositeScorer.extract_metric_values().

Acceptance Criteria:

  • BaselineComparison Pydantic model with fields: label_a, label_b, result_a, result_b, metric_deltas, tier_deltas, summary
  • compare(result_a, result_b, label_a, label_b) accepts two CompositeResult instances and returns BaselineComparison
  • compare_all(pydantic_result, cc_solo_result, cc_teams_result) returns list of 3 BaselineComparison (PydanticAI vs CC-solo, PydanticAI vs CC-teams, CC-solo vs CC-teams)
  • compare_all() accepts None for any result and skips comparisons involving that result
  • metric_deltas contains per-metric delta for all 6 composite metrics
  • tier_deltas contains tier-level score differences (Tier 1, Tier 2, Tier 3)
  • summary is a human-readable comparison string (e.g., “PydanticAI scored +0.12 higher on technical_accuracy vs CC-solo”)
  • Handles missing tiers gracefully (one system has Tier 2, other does not)
  • Tests: Hypothesis property tests for delta symmetry (swap inputs -> negated deltas)
  • Tests: inline-snapshot for BaselineComparison model dump structure
  • Tests: inline-snapshot for compare_all() output with one None result
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • Create BaselineComparison Pydantic model:
  • label_a: str – human label for first system (e.g., “PydanticAI MAS”)
  • label_b: str – human label for second system (e.g., “CC-solo”)
  • result_a: CompositeResult – first system evaluation
  • result_b: CompositeResult – second system evaluation
  • metric_deltas: dict[str, float] – per-metric delta (6 composite metrics)
  • tier_deltas: dict[str, float] – tier-level score differences
  • summary: str – human-readable comparison
  • Create compare(result_a: CompositeResult, result_b: CompositeResult, label_a: str, label_b: str) -> BaselineComparison function
  • Create compare_all(pydantic_result: CompositeResult | None, cc_solo_result: CompositeResult | None, cc_teams_result: CompositeResult | None) -> list[BaselineComparison] convenience function
  • Reuse CompositeScorer.extract_metric_values() (src/app/judge/composite_scorer.py:164) to extract per-metric values from each result
  • Compute deltas as value_a - value_b for each metric
  • Generate summary string listing metrics where delta exceeds 0.05 threshold
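
A minimal sketch of the delta and summary computation, assuming metric values have already been extracted into plain dicts (e.g., via CompositeScorer.extract_metric_values()); applying the 0.05 threshold to the delta magnitude is an assumption:

def compute_metric_deltas(values_a: dict[str, float], values_b: dict[str, float]) -> dict[str, float]:
    # Delta convention from the requirements above: value_a - value_b per metric.
    return {name: values_a[name] - values_b[name] for name in values_a if name in values_b}


def build_summary(label_a: str, label_b: str, deltas: dict[str, float], threshold: float = 0.05) -> str:
    notable = [
        f"{label_a} scored {delta:+.2f} {'higher' if delta > 0 else 'lower'} on {metric} vs {label_b}"
        for metric, delta in deltas.items()
        if abs(delta) > threshold  # assumption: threshold applies to |delta|
    ]
    return "; ".join(notable) if notable else f"No metric differs by more than {threshold:.2f}"


deltas = compute_metric_deltas({"technical_accuracy": 0.82}, {"technical_accuracy": 0.70})
print(build_summary("PydanticAI MAS", "CC-solo", deltas))
# PydanticAI MAS scored +0.12 higher on technical_accuracy vs CC-solo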

Files:

  • src/app/judge/baseline_comparison.py (new)
  • src/app/data_models/evaluation_models.py (add BaselineComparison model)
  • tests/judge/test_baseline_comparison.py (new)

Feature 7: CLI & GUI Baseline Integration

Description: Wire the CC trace adapter and baseline comparison engine into the existing CLI and GUI so users can run side-by-side evaluations. Supports two CC baseline modes: solo (single CC instance, no orchestration) and teams (CC Agent Teams with delegation). Both modes assume CC had full internal tool, plugin, and MCP access during the run being evaluated.

Acceptance Criteria:

  • CLI: --cc-solo-dir PATH flag accepts path to CC solo session export directory
  • CLI: --cc-teams-dir PATH flag accepts path to CC Agent Teams artifacts directory
  • CLI: Both flags can be provided together for three-way comparison (PydanticAI vs CC-solo vs CC-teams)
  • CLI: Adapter auto-detects mode per directory; flags override auto-detection
  • CLI: Baseline comparison(s) printed to console after standard evaluation output
  • GUI: Baseline comparison view on evaluation results page (side-by-side metrics display)
  • GUI: Separate directory inputs for CC solo and CC teams artifacts
  • GUI: Three-way comparison table when both CC baselines are provided
  • Both CLI and GUI skip baseline comparison when no CC artifacts provided (no regression)
  • Tests: inline-snapshot for CLI output with single baseline and three-way comparison
  • Tests: Hypothesis property tests for GUI state management with baseline data
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • CLI: Add --cc-solo-dir and --cc-teams-dir arguments to CLI entry point
  • CLI: For each provided directory, call CCTraceAdapter(path).parse() to get CC GraphTraceData, then run through evaluate_comprehensive() pipeline
  • CLI: Call compare_all() with available results (pass None for missing baselines) and print each BaselineComparison.summary
  • GUI: Add baseline section to evaluation results page using existing Streamlit patterns
  • GUI: Display metric_deltas as side-by-side bar chart and summary as text for each pairwise comparison
  • All traces go through the same evaluation pipeline (evaluate_comprehensive())
  • Reuse existing GUI evaluation page patterns (src/gui/pages/evaluation.py)
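
A rough sketch of the CLI wiring, assuming an argparse-style entry point; the actual parser in src/app/app.py and the downstream call signatures may differ:

import argparse


def add_baseline_flags(parser: argparse.ArgumentParser) -> None:
    # Both flags are optional and may be combined for a three-way comparison.
    parser.add_argument("--cc-solo-dir", default=None,
                        help="Path to a CC solo session export directory")
    parser.add_argument("--cc-teams-dir", default=None,
                        help="Path to a CC Agent Teams artifacts directory")


# Downstream flow (names assumed from Features 5 and 6):
#     cc_solo = evaluate_comprehensive(CCTraceAdapter(args.cc_solo_dir).parse()) if args.cc_solo_dir else None
#     cc_teams = evaluate_comprehensive(CCTraceAdapter(args.cc_teams_dir).parse()) if args.cc_teams_dir else None
#     for comparison in compare_all(pydantic_result, cc_solo, cc_teams):
#         print(comparison.summary)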

Files:

  • src/app/app.py (add --cc-solo-dir and --cc-teams-dir CLI flags)
  • src/gui/pages/evaluation.py (add baseline comparison view)
  • tests/app/test_cli_baseline.py (new)
  • tests/test_gui/test_evaluation_baseline.py (new)

Non-Functional Requirements

  • Maintainability:
  • Use modular design patterns for easy updates and maintenance.
  • Implement logging and error handling for debugging and monitoring.
  • Graceful degradation when external services unavailable.
  • Performance:
  • Timeout mechanisms must not introduce significant latency overhead.
  • Thread-safe implementations should minimize thread pool creation overhead.
  • CC trace adapter must parse typical team artifacts (< 50 files) in under 2 seconds.
  • Documentation:
  • Comprehensive troubleshooting guide for common operational issues.
  • Docstrings for all new functions and classes (Google style format).
  • Testing:
  • All new features must include tests per docs/best-practices/testing-strategy.md
  • Use Hypothesis (@given) for property-based tests: timeout bounds, retry behavior, score fallbacks, data mapping invariants, delta symmetry
  • Use inline-snapshot (snapshot()) for regression tests: warning messages, error result structures, trace adapter output, comparison model dumps
  • Use pytest for standard unit/integration tests with Arrange-Act-Assert structure
  • Tool selection: pytest for logic, Hypothesis for properties, inline-snapshot for structure

Out of Scope

  • Opik service auto-start on GUI launch (user must manually run make start_opik)
  • Custom OTLP exporter implementation (use standard OpenTelemetry libraries)
  • Tier 3 graph analysis performance optimization (timeout mechanism only)
  • Alternative tracing backends (Phoenix/Logfire only)
  • Persistent retry queues for failed trace exports (in-memory only)
  • Gemini provider compatibility (agent_system.py:610 FIXME – deferred to future sprint)
  • HuggingFace provider implementation (deferred to future sprint)
  • Streaming with Pydantic model outputs (agent_system.py:522 – deferred to future sprint)
  • CC OpenTelemetry live telemetry (post-hoc artifact parsing only)
  • OTel Collector Docker deployment for CC traces
  • CC native span creation or instrumentation
  • A2A (Agent-to-Agent) protocol integration
  • Provisioning CC tool/plugin/MCP access (assumed pre-configured by the user before the CC run)

Notes for Ralph Loop

Story Breakdown - Sprint 4 (7 stories total):

  • Feature 1 (Logfire Export) → STORY-001: Graceful Logfire trace export failures
  • Feature 2 (Graph Timeout) → STORY-002: Thread-safe graph analysis timeout handling
  • Feature 3 (Judge Fallback Validation) → STORY-003: Tier 2 judge provider fallback validation
  • Feature 4 (Complete Test Alignment) → STORY-004: Complete test suite alignment with hypothesis and inline-snapshot
  • Feature 5 (CC Trace Adapter) → STORY-005: CC trace adapter for solo and teams artifacts
  • Feature 6 (Baseline Comparison) → STORY-006: Baseline comparison engine for CompositeResult diffing
  • Feature 7 (CLI & GUI Baseline) → STORY-007: CLI and GUI baseline integration (depends: STORY-002, STORY-005, STORY-006)