# PRD Sprint 3 Ralph

title: Product Requirements Document: Agents-eval Sprint 3
version: 3.9.0
created: 2026-02-15
updated: 2026-02-15
## Project Overview
Agents-eval evaluates multi-agent AI systems using the PeerRead dataset for scientific paper review assessment. The system generates reviews via a 4-agent delegation pipeline (Manager → Researcher → Analyst → Synthesizer) and evaluates them through a three-tier engine: Tier 1 (traditional text metrics), Tier 2 (LLM-as-Judge), and Tier 3 (graph analysis).
Sprint 3 adds judge provider fallback for Tier 2 evaluation, restructures the evaluation pipeline into a plugin architecture (EvaluatorPlugin + PluginRegistry → JudgeAgent), adds model-aware content truncation for provider rate limits, introduces a standalone CC OTel observability plugin, aligns the test suite with the documented testing strategy (hypothesis for property-based tests, inline-snapshot for regression tests), wires the Streamlit GUI to display actual pydantic-settings defaults, makes weave observability optional, fixes trace data quality for Tier 3 graph analysis, and adds GUI controls for provider and sub-agent configuration. All Sprint 2 features (settings migration, eval wiring, trace capture, graph-vs-text comparison, Logfire+Phoenix tracing, Streamlit dashboard) are prerequisites.
## Functional Requirements

### Sprint 3: Plugin Architecture and Infrastructure
#### Feature 5: Model-Aware Content Truncation

**Description:** Implement token-limit-aware content truncation to prevent 413 errors when paper content exceeds provider rate limits (e.g., the GitHub Models free tier enforces an 8,000-token request limit for gpt-4.1, even though the model natively supports 1M tokens).

**Acceptance Criteria:**

- `CommonSettings` includes per-provider `max_content_length` defaults
- `generate_paper_review_content_from_template` truncates `paper_content_for_template` to `max_content_length` before formatting into the template
- Truncation preserves the abstract (always included) and truncates the body with a `[TRUNCATED]` marker
- A warning is logged when truncation occurs, including original vs. truncated size
- `make validate` passes

**Technical Requirements:**

- Add per-provider `max_content_length` to `CommonSettings`
- Truncation logic in `generate_paper_review_content_from_template`
- Preserve the abstract section; truncate body content

**Files:**

- `src/app/agents/peerread_tools.py`
- `src/app/common/settings.py`
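A minimal sketch of the truncation rule in the acceptance criteria: keep the abstract intact, cut the body, and append the `[TRUNCATED]` marker with a warning. The helper name `truncate_paper_content` and the character-based budget are illustrative assumptions; the real implementation lives in `generate_paper_review_content_from_template` and may count tokens rather than characters.

```python
# Hypothetical sketch; names and the character-based budget are
# assumptions, not the project's actual API.
import logging

logger = logging.getLogger(__name__)

TRUNCATION_MARKER = "\n[TRUNCATED]"


def truncate_paper_content(abstract: str, body: str, max_content_length: int) -> str:
    """Return abstract plus as much body as fits within max_content_length."""
    full = f"{abstract}\n\n{body}"
    if len(full) <= max_content_length:
        return full
    # The abstract is always preserved; only the body is cut.
    budget = max_content_length - len(abstract) - len(TRUNCATION_MARKER) - 2
    truncated = f"{abstract}\n\n{body[: max(budget, 0)]}{TRUNCATION_MARKER}"
    # Warn with original vs. truncated size, per the acceptance criteria.
    logger.warning("Truncated paper content: %d -> %d chars", len(full), len(truncated))
    return truncated
```

A token-aware variant would swap `len()` for the provider tokenizer's count while keeping the same preserve-abstract rule.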
#### Feature 6: Judge Provider Fallback

**Description:** Make the Tier 2 LLM-as-Judge evaluation provider configurable and resilient. It is currently hardcoded to openai/gpt-4o-mini, causing 401 errors when no OPENAI_API_KEY is set. The judge should validate API key availability at startup and fall back to an available provider or skip Tier 2 gracefully.

**Acceptance Criteria:**

- Judge provider validates API key availability before attempting evaluation
- When the configured provider's API key is missing, falls back to `tier2_fallback_provider`/`tier2_fallback_model`
- When no valid judge provider is available, Tier 2 is skipped with a warning (not scored 0.0)
- Composite score adjusts weights when Tier 2 is skipped (redistribute to Tier 1 + Tier 3)
- `JudgeSettings.tier2_provider` and `tier2_model` overridable via `JUDGE_TIER2_PROVIDER`/`JUDGE_TIER2_MODEL` env vars (already exists; ensure it works end-to-end)
- Fallback heuristic scores capped at 0.5 (neutral) when LLM assessment fails due to auth/provider errors
- `Tier2Result` includes a metadata flag indicating whether fallback was used
- `CompositeScorer` logs a warning when using fallback-derived scores
- Tests: Hypothesis property tests for fallback score bounds (0.0 ≤ fallback ≤ 0.5)
- Tests: inline-snapshot for `Tier2Result` structure with fallback metadata
- `make validate` passes

**Technical Requirements:**

- Add an API key availability check in `LLMJudgeEngine` initialization
- Implement the provider fallback chain: configured → fallback → skip
- Update `CompositeScorer` to handle missing Tier 2 (weight redistribution)
- Log a clear warning when Tier 2 is skipped due to a missing provider
- Fix `_fallback_planning_check()` in `llm_evaluation_managers.py:356-357` — cap fallback scores at 0.5 instead of 1.0 for the "optimal range"
- Distinguish auth failures (401) from timeouts in fallback scoring

**Files:**

- `src/app/evals/llm_evaluation_managers.py` (provider validation + fallback)
- `src/app/evals/composite_scorer.py` (weight redistribution when tier skipped)
- `src/app/evals/settings.py` (ensure fallback settings work)
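The weight-redistribution requirement can be sketched as a pure function: when a tier is skipped, drop its weight and renormalize the remaining tiers so the composite still sums to 1.0. The weight values and tier names below are illustrative; the real defaults live in `JudgeSettings`.

```python
# Illustrative sketch of composite-weight redistribution when a tier is
# skipped; tier names and weights are assumptions, not project defaults.
def redistribute_weights(weights: dict[str, float], skipped: set[str]) -> dict[str, float]:
    """Drop skipped tiers and renormalize remaining weights to sum to 1.0."""
    remaining = {tier: w for tier, w in weights.items() if tier not in skipped}
    total = sum(remaining.values())
    return {tier: w / total for tier, w in remaining.items()}
```

With Tier 2 skipped, its weight is redistributed proportionally, so Tier 1 and Tier 3 keep their relative importance to each other.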
#### Feature 7: EvaluatorPlugin Base and Registry

**Description:** Create an `EvaluatorPlugin` ABC and `PluginRegistry` for typed, tier-ordered plugin execution.

**Acceptance Criteria:**

- `EvaluatorPlugin` ABC with `name`/`tier`/`evaluate`/`get_context_for_next_tier`
- `PluginRegistry` for registration and tier-ordered execution
- Typed Pydantic models at all plugin boundaries
- Structured error results from plugins

**Technical Requirements:**

- ABC defines the plugin interface: `name`, `tier`, `evaluate()`, `get_context_for_next_tier()`
- Registry manages plugin lifecycle and tier-ordered execution
- All data contracts use Pydantic models

**Files:**

- `src/app/judge/plugins/base.py`
- `src/app/judge/plugins/__init__.py`
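A sketch of the ABC and registry, assuming the method names from the acceptance criteria. Plain dicts stand in for the Pydantic models the real boundaries require; exact signatures are an assumption.

```python
# Sketch only: plain dicts stand in for the project's Pydantic models.
from abc import ABC, abstractmethod
from typing import Any


class EvaluatorPlugin(ABC):
    """Interface from the acceptance criteria: name, tier, evaluate, context."""

    name: str
    tier: int

    @abstractmethod
    def evaluate(self, payload: dict[str, Any], context: dict[str, Any]) -> dict[str, Any]:
        """Run this tier's evaluation and return a structured result."""

    def get_context_for_next_tier(self) -> dict[str, Any]:
        """Optional context handed to the next tier; empty by default."""
        return {}


class PluginRegistry:
    """Holds registered plugins and yields them in tier order."""

    def __init__(self) -> None:
        self._plugins: list[EvaluatorPlugin] = []

    def register(self, plugin: EvaluatorPlugin) -> None:
        self._plugins.append(plugin)

    def in_tier_order(self) -> list[EvaluatorPlugin]:
        return sorted(self._plugins, key=lambda p: p.tier)
```

Registration order never matters; execution order is always derived from `tier`, which keeps the "explicit tier execution order" criterion in one place.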
#### Feature 8: TraditionalMetricsPlugin Wrapper

**Description:** Wrap the existing TraditionalMetricsEngine as an EvaluatorPlugin.

**Acceptance Criteria:**

- TraditionalMetricsPlugin wrapping the existing engine
- All existing Tier 1 engine tests pass unchanged
- Per-plugin configurable timeout

**Technical Requirements:**

- Adapter pattern: delegate to the existing `TraditionalMetricsEngine`
- Expose via the `EvaluatorPlugin` interface
- Configurable timeout from `JudgeSettings`

**Files:**

- `src/app/judge/plugins/traditional.py`
#### Feature 9: LLMJudgePlugin Wrapper

**Description:** Wrap the existing LLMJudgeEngine as an EvaluatorPlugin with opt-in Tier 1 context enrichment.

**Acceptance Criteria:**

- LLMJudgePlugin with opt-in Tier 1 context enrichment
- All existing Tier 2 engine tests pass unchanged
- Per-plugin configurable timeout

**Technical Requirements:**

- Adapter pattern: delegate to the existing `LLMJudgeEngine`
- Accept optional Tier 1 context via `get_context_for_next_tier()`
- Configurable timeout from `JudgeSettings`

**Files:**

- `src/app/judge/plugins/llm_judge.py`
#### Feature 10: GraphEvaluatorPlugin Wrapper

**Description:** Wrap the existing GraphAnalysisEngine as an EvaluatorPlugin.

**Acceptance Criteria:**

- GraphEvaluatorPlugin wrapping the existing engine
- All existing Tier 3 engine tests pass unchanged
- Per-plugin configurable timeout

**Technical Requirements:**

- Adapter pattern: delegate to the existing `GraphAnalysisEngine`
- Expose via the `EvaluatorPlugin` interface
- Configurable timeout from `JudgeSettings`

**Files:**

- `src/app/judge/plugins/graph_metrics.py`
#### Feature 11: Plugin-Driven Pipeline

**Description:** Replace EvaluationPipeline with a JudgeAgent that uses PluginRegistry for tier-ordered plugin execution.

**Acceptance Criteria:**

- JudgeAgent replaces EvaluationPipeline using PluginRegistry
- Explicit tier execution order in code
- Context flows Tier 1 → Tier 2 → Tier 3
- TraceStore with thread-safe storage
- Graceful degradation preserved
- Re-export shim for EvaluationPipeline

**Technical Requirements:**

- `JudgeAgent` orchestrates plugins via `PluginRegistry`
- Tier context passed forward via `get_context_for_next_tier()`
- `TraceStore` provides thread-safe trace storage
- Backward-compatible `EvaluationPipeline` re-export shim

**Files:**

- `src/app/judge/agent.py`
- `src/app/judge/trace_store.py`
- `src/app/judge/composite_scorer.py`
- `src/app/judge/performance_monitor.py`
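The orchestration loop can be sketched as follows: run plugins in tier order, merge each plugin's forwarded context into the next tier's input, and convert a plugin failure into a structured error result instead of aborting the run (graceful degradation). The function name and result shapes are assumptions; plugins are assumed to follow the Feature 7 interface.

```python
# Sketch of tier-ordered execution with forward-flowing context;
# `run_tiers` is a hypothetical name for what JudgeAgent would do.
def run_tiers(plugins: list, payload: dict) -> dict[str, dict]:
    results: dict[str, dict] = {}
    context: dict = {}
    for plugin in sorted(plugins, key=lambda p: p.tier):  # explicit tier order
        try:
            results[plugin.name] = plugin.evaluate(payload, context)
            # Context flows Tier 1 -> Tier 2 -> Tier 3.
            context = {**context, **plugin.get_context_for_next_tier()}
        except Exception as exc:  # graceful degradation, not a crash
            results[plugin.name] = {"error": str(exc), "tier": plugin.tier}
    return results
```

A failed tier contributes an error result that the composite scorer can treat as "skipped", which dovetails with the Feature 6 weight redistribution.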
#### Feature 12: Migration Cleanup

**Description:** Remove backward-compatibility shims, update all imports, delete the deprecated JSON config.

**Acceptance Criteria:**

- All imports use `judge.`, `common.` paths
- No re-export shims remain
- `config/config_eval.json` removed
- Remove or implement the commented-out `error_handling_context()` FIXME notes in `agent_system.py` (lines 443, 514, 583)
- Delete duplicate `src/app/agents/peerread_tools.py` (canonical: `src/app/tools/peerread_tools.py`, imported at `agent_system.py:63`)
- CHANGELOG.md updated
- `make validate` passes, no dead code

**Technical Requirements:**

- Update all source and test imports from `evals.` to `judge.` paths
- Remove the re-export shim from Feature 11
- Delete deprecated `config/config_eval.json`
- Resolve the `error_handling_context()` FIXMEs: either implement it as a context manager or delete the comments (the current try/except at line 520 is adequate)

**Files:**

- `CHANGELOG.md`
- `src/app/agents/agent_system.py`
#### Feature 13: CC OTel Observability Plugin

**Description:** Standalone CC telemetry plugin using an OTel → Logfire + Phoenix pipeline. Enables CC session tracing alongside PydanticAI's Logfire auto-instrumentation.

**Acceptance Criteria:**

- `src/app/cc_otel/` module with config + enable/disable API
- `CCOtelConfig` with env var export
- OTel traces routed to Phoenix via the OTLP endpoint
- Separate from the existing `logfire_instrumentation.py`
- Graceful degradation when OTel is unavailable
- `make validate` passes

**Technical Requirements:**

- Standalone module at `src/app/cc_otel/`
- `CCOtelConfig` using the pydantic-settings pattern
- OTLP exporter sends to the Phoenix endpoint
- Independent from `logfire_instrumentation.py` (no coupling)

**Files:**

- `src/app/cc_otel/__init__.py`
- `src/app/cc_otel/config.py`
- `Makefile` (phoenix targets)
#### Feature 14: Wire GUI to Actual Settings

**Description:** Connect the Streamlit GUI to load and display actual default values from the CommonSettings and JudgeSettings pydantic-settings classes. Remove the hardcoded PROMPTS_DEFAULT fallback and load prompts directly from ChatConfig. Follows the DRY principle (single source of truth) and the KISS principle (simple display, no persistence).

**Acceptance Criteria:**

- Settings page displays `CommonSettings` fields (log_level, enable_logfire, max_content_length)
- Settings page displays key `JudgeSettings` fields (tier timeouts, composite thresholds, enabled tiers)
- Prompts page loads from `ChatConfig.prompts` without a hardcoded fallback
- GUI instantiates `CommonSettings()` and `JudgeSettings()` on startup
- Displayed values match actual pydantic-settings defaults
- Remove hardcoded `PROMPTS_DEFAULT` from `gui/config/config.py`
- `make validate` passes
- CHANGELOG.md updated

**Technical Requirements:**

- Instantiate `CommonSettings()` and `JudgeSettings()` in `src/run_gui.py`
- Pass settings instances to `render_settings()`
- Update `render_settings()` to display CommonSettings and key JudgeSettings fields
- Update `render_prompts()` to use `ChatConfig.prompts` directly (remove the fallback)
- Delete the `PROMPTS_DEFAULT` constant from `gui/config/config.py`
- Read-only display (no save functionality, per the YAGNI principle)
- Use Streamlit expanders to organize settings by category

**Key Settings to Display (JudgeSettings):**

- Tiers: `tiers_enabled`, `tier1_max_seconds`, `tier2_max_seconds`, `tier3_max_seconds`
- Composite: `composite_accept_threshold`, `composite_weak_accept_threshold`
- Tier 2: `tier2_provider`, `tier2_model`, `tier2_cost_budget_usd`
- Observability: `trace_collection`, `logfire_enabled`, `phoenix_endpoint`

**Out of Scope (per YAGNI):**

- Saving edited settings back to the .env file (read-only display only)
- Full CRUD for all 50+ JudgeSettings fields (show the key 12-15 fields only)
- Settings validation/editing (display actual values only)

**Files:**

- `src/run_gui.py` (instantiate CommonSettings, JudgeSettings)
- `src/gui/pages/settings.py` (render CommonSettings, key JudgeSettings)
- `src/gui/pages/prompts.py` (remove hardcoded fallback)
- `src/gui/config/config.py` (delete PROMPTS_DEFAULT)
#### Feature 15: Test Infrastructure Alignment

**Description:** Refactor existing tests to use hypothesis (property-based testing) and inline-snapshot (regression testing), aligning the test suite with the documented testing-strategy.md practices. No production code changes. Explicitly excludes BDD/Gherkin (pytest-bdd).

**Acceptance Criteria:**

- Property-based tests using `@given` for math formulas (score bounds, composite calculations)
- Property-based tests for input validation (arbitrary text handling)
- Property-based tests for serialization (model dumps always valid)
- Snapshot tests using `snapshot()` for Pydantic `.model_dump()` outputs
- Snapshot tests for complex nested result structures
- Snapshot tests for GraphTraceData transformations
- Remove low-value tests (trivial assertions, field existence checks, per testing-strategy.md)
- All existing test coverage maintained or improved
- `make validate` passes
- CHANGELOG.md updated

**Technical Requirements:**

- Add `from hypothesis import given, strategies as st` imports
- Add `from inline_snapshot import snapshot` imports
- Convert score calculation tests to property tests with invariants (0.0 ≤ score ≤ 1.0)
- Convert model serialization tests to snapshot tests
- Document usage patterns in test files for future reference
- NO pytest-bdd, NO Gherkin, NO BDD methodology (use TDD with hypothesis for properties)

**Priority Test Areas (from testing-strategy.md):**

- CRITICAL: Math formulas (composite scoring, normalization bounds)
- CRITICAL: Loop termination (evaluation pipeline timeouts)
- HIGH: Input validation (arbitrary paper/review text)
- HIGH: Serialization (Tier 1/2/3 result model dumps)
- MEDIUM: Invariants (tier weight sums, score aggregation)

**Files:**

- `tests/evals/test_composite_scorer.py` (score bounds properties)
- `tests/evals/test_traditional_metrics.py` (similarity score properties)
- `tests/data_models/test_peerread_models_serialization.py` (snapshot tests)
- `tests/evals/test_evaluation_pipeline.py` (result structure snapshots)
- `tests/evals/test_llm_evaluation_managers.py` (fallback property tests)
- `tests/evals/test_graph_analysis.py` (graph metric properties)
- Other test files as needed (~10-15 files total)
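The score-bounds invariant the Hypothesis tests assert can be sketched with stdlib `random` so it runs anywhere; in the suite itself this becomes a `@given(st.floats(allow_nan=False))` test. `clamp_score` is an illustrative stand-in for the project's normalization helpers, not the real function.

```python
# Stdlib sketch of the 0.0 <= score <= 1.0 invariant; in the test suite
# this is written as a Hypothesis @given property. `clamp_score` is a
# hypothetical stand-in for the real normalization helper.
import random


def clamp_score(raw: float) -> float:
    """Normalize any raw metric into the [0.0, 1.0] score range."""
    return min(max(raw, 0.0), 1.0)


def check_score_bounds(trials: int = 1000) -> bool:
    """Randomized check of the invariant Hypothesis would exhaustively shrink."""
    rng = random.Random(42)
    for _ in range(trials):
        raw = rng.uniform(-1e6, 1e6)
        if not 0.0 <= clamp_score(raw) <= 1.0:
            return False
    return True
```

The advantage of the Hypothesis form is shrinking: when the invariant fails, `@given` reports a minimal counterexample instead of a random one.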
#### Feature 16: Optional Weave Integration

**Description:** Make the weave dependency optional. Only import/init when WANDB_API_KEY is configured. Eliminates warning noise for users who don't use Weights & Biases.

**Acceptance Criteria:**

- `weave` moved from required to an optional dependency group in `pyproject.toml`
- `login.py` conditionally imports weave only when `WANDB_API_KEY` is present
- `app.py` provides a no-op `@op()` decorator fallback when weave is unavailable
- No warning messages emitted when `WANDB_API_KEY` is not set
- Existing weave tracing works unchanged when `WANDB_API_KEY` is set
- Tests use Hypothesis for import guard property tests (weave present vs. absent)
- `make validate` passes
- CHANGELOG.md updated

**Technical Requirements:**

- Move `weave>=0.52.28` to an optional group in `pyproject.toml`
- `try/except ImportError` guard in `app.py`: `op = lambda: lambda f: f`
- Conditional import in `login.py` — only import weave inside the `if is_api_key:` block

**Files:**

- `pyproject.toml`
- `src/app/utils/login.py`
- `src/app/app.py`
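The import guard from the technical requirements can be sketched as follows: when weave is not installed, `op` degrades to the no-op decorator factory from the requirement (`op = lambda: lambda f: f`), so `@op()`-decorated functions run unchanged. Whether `from weave import op` is the exact import the project uses is an assumption; the guard shape is what matters.

```python
# Import-guard sketch; assumes weave exposes `op` at top level.
try:
    from weave import op  # real tracing decorator when weave is installed
except ImportError:
    op = lambda: lambda f: f  # noqa: E731 -- no-op @op() fallback per the PRD


@op()
def scored(x: int) -> int:
    """Runs identically with or without weave installed."""
    return x + 1
```

The Hypothesis import-guard tests in the acceptance criteria would exercise both branches by simulating weave being present and absent.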
#### Feature 17: Trace Data Quality & Manager Tool Tracing

**Description:** Fix trace data transformation bugs, add trace logging to the PeerRead tools, initialize Logfire instrumentation, and improve trace storage logging.

**Acceptance Criteria:**

- Fix: `_process_events()` includes `agent_id` in tool_call dicts (`trace_processors.py:268-269`)
- Fix: `_parse_trace_events()` includes `agent_id` in tool_call dicts (`trace_processors.py:376-377`)
- Tier 3 graph analysis succeeds with `--include-researcher` traces (no "missing agent_id" error)
- PeerRead tools log trace events via `trace_collector.log_tool_call()` (all 6 tools)
- `initialize_logfire_instrumentation_from_settings()` called at startup when `logfire_enabled=True`
- `_store_trace()` logs the full storage path (JSONL + SQLite) at least once per execution
- Manager-only runs produce non-empty trace data
- Tests: Hypothesis property tests for trace event schema invariants (agent_id always present)
- Tests: inline-snapshot for GraphTraceData transformation output structure
- `make validate` passes
- CHANGELOG.md updated

**Technical Requirements:**

- In `_process_events()` line 269: add `"agent_id": event.agent_id` to the tool_call dict
- In `_parse_trace_events()` line 377: add `"agent_id": agent_id` to the tool_call dict
- Add `trace_collector.log_tool_call()` to the 6 PeerRead tools in `src/app/tools/peerread_tools.py`, following the delegation tool pattern (`time.perf_counter()` timing, success/failure)
- Call `initialize_logfire_instrumentation_from_settings()` in `src/app/app.py` after settings load
- Extend the log message at `trace_processors.py:352-358` to include `self.storage_path`
- Use `JudgeSettings.logfire_enabled` as the authoritative setting for Logfire initialization (not `CommonSettings.enable_logfire`)

**Files:**

- `src/app/evals/trace_processors.py` (agent_id fix + path logging)
- `src/app/tools/peerread_tools.py` (trace logging for 6 tools)
- `src/app/app.py` (Logfire init call)

**Note:** `src/app/agents/peerread_tools.py` appears to be a duplicate of `src/app/tools/peerread_tools.py`. The canonical import is `app.tools.peerread_tools` (used at `agent_system.py:63`).
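A sketch of the delegation-tool tracing pattern the PeerRead tools should follow: time the call with `time.perf_counter()`, log success or failure, and always record `agent_id` (which is what the Tier 3 fix above depends on). The `TraceCollector` class and `log_tool_call` signature here are assumptions for illustration; the real collector lives in the evals trace module.

```python
# Hypothetical sketch of the tracing pattern; the collector class and
# log_tool_call signature are assumptions, not the project's real API.
import time


class TraceCollector:
    def __init__(self) -> None:
        self.events: list[dict] = []

    def log_tool_call(self, agent_id: str, tool_name: str, duration_s: float, success: bool) -> None:
        # agent_id is always present -- the schema invariant Tier 3 relies on.
        self.events.append(
            {"agent_id": agent_id, "tool_name": tool_name,
             "duration_s": duration_s, "success": success}
        )


def traced_tool(collector: TraceCollector, agent_id: str, tool_name: str, fn, *args):
    """Wrap a tool call with perf_counter timing and success/failure logging."""
    start = time.perf_counter()
    try:
        result = fn(*args)
        collector.log_tool_call(agent_id, tool_name, time.perf_counter() - start, True)
        return result
    except Exception:
        collector.log_tool_call(agent_id, tool_name, time.perf_counter() - start, False)
        raise
```

Applying this wrapper to each of the 6 PeerRead tools guarantees that even manager-only runs emit non-empty trace data.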
#### Feature 18: GUI Agent & Provider Configuration

**Description:** Expose provider selection and sub-agent toggles in the Streamlit GUI with session state persistence. Currently CLI-only (`--chat-provider`, `--include-researcher`/`analyst`/`synthesiser`).

**Acceptance Criteria:**

- Settings page displays a provider selectbox with all providers from `PROVIDER_REGISTRY`
- Settings page displays checkboxes for include_researcher, include_analyst, include_synthesiser
- Selections persist across page navigation via `st.session_state`
- Run App page passes all flags to `main()` from session state
- Default provider matches `CHAT_DEFAULT_PROVIDER`
- Tests: inline-snapshot for session state defaults structure
- `make validate` passes
- CHANGELOG.md updated

**Technical Requirements:**

- Settings page: provider selectbox keyed to `st.session_state`, agent checkboxes
- Run App page: read from session state, pass to `main(chat_provider=..., include_researcher=..., ...)`
- `run_gui.py`: initialize session state defaults on startup
- Import `PROVIDER_REGISTRY` from `app.data_models.app_models` for the provider list

**Files:**

- `src/gui/pages/settings.py`
- `src/gui/pages/run_app.py`
- `src/run_gui.py`
## Non-Functional Requirements

- **Maintainability:**
  - Use modular design patterns for easy updates and maintenance.
  - Implement logging and error handling for debugging and monitoring.
- **Performance:**
  - Ensure low latency in evaluation pipeline execution.
  - Optimize for memory usage during graph analysis.
- **Documentation:**
  - Comprehensive documentation for setup, usage, and testing.
  - Docstrings for all new functions and classes (Google style format).
- **Testing:**
  - All new features must include tests per `docs/best-practices/testing-strategy.md`
  - Use Hypothesis (`@given`) for property-based tests: score bounds, input validation, serialization invariants
  - Use inline-snapshot (`snapshot()`) for regression tests: Pydantic model dumps, complex result structures
  - Use pytest for standard unit/integration tests with Arrange-Act-Assert structure
  - Tool selection: pytest for logic, Hypothesis for properties, inline-snapshot for structure
  - NO pytest-bdd / Gherkin (already in Out of Scope)
## Out of Scope

- A2A protocol migration (PydanticAI stays)
- Agent system restructuring (`src/app/agents/` unchanged except trace instrumentation)
- Streaming with Pydantic model outputs (`agent_system.py:522` NotImplementedError — PydanticAI supports `stream_struct()`/`agent.iter()` but integration is deferred)
- Gemini provider compatibility (`agent_system.py:610` FIXME — `ModelRequest` iteration and `MALFORMED_FUNCTION_CALL` literal errors)
- HuggingFace provider implementation (falls through to the generic OpenAI-compatible path; no dedicated handling needed yet)
- pytest-bdd / Gherkin scenarios (use pytest + hypothesis instead)
- HuggingFace `datasets` library (use the GitHub API downloader instead)
- Google Gemini SDK (`google-genai`) — use OpenAI-spec-compatible providers only
- VCR-based network mocking (use `@patch` for unit tests)
- Browser-based E2E tests (Playwright/Selenium deferred)
- CC-style evaluation baselines (deferred beyond Sprint 3)
- E2E integration tests and multi-channel deployment (deferred beyond Sprint 3)
## Notes for Ralph Loop

**Story Breakdown - Sprint 3 (14 stories total):**
- Feature 5 (Content Truncation) → STORY-001: Model-aware content truncation
- Feature 6 (Judge Fallback) → STORY-002: Judge provider fallback for Tier 2
- Feature 7 (Plugin Base) → STORY-003: EvaluatorPlugin base and registry
- Feature 8 (Traditional Adapter) → STORY-004: TraditionalMetricsPlugin wrapper (depends: STORY-003)
- Feature 9 (LLM Judge Adapter) → STORY-005: LLMJudgePlugin wrapper (depends: STORY-003)
- Feature 10 (Graph Adapter) → STORY-006: GraphEvaluatorPlugin wrapper (depends: STORY-003)
- Feature 11 (Plugin Pipeline) → STORY-007: JudgeAgent replaces EvaluationPipeline (depends: STORY-004, STORY-005, STORY-006)
- Feature 12 (Migration Cleanup) → STORY-008: Remove shims and update imports (depends: STORY-007)
- Feature 13 (CC OTel) → STORY-009: CC OTel observability plugin (depends: STORY-007)
- Feature 14 (GUI Settings Wiring) → STORY-010: Wire GUI to actual settings (depends: STORY-008)
- Feature 15 (Test Refactoring) → STORY-011: Test infrastructure alignment (depends: STORY-008)
- Feature 16 (Optional Weave) → STORY-012: Make weave dependency optional
- Feature 17 (Trace Quality) → STORY-013: Trace data quality fixes + manager tool tracing
- Feature 18 (GUI Config) → STORY-014: GUI agent & provider configuration (depends: STORY-010)