PRD Sprint5 Ralph

title: Product Requirements Document: Agents-eval Sprint 5 version: 1.0.0 created: 2026-02-15 updated: 2026-02-15

Project Overview¶

Agents-eval evaluates multi-agent AI systems using the PeerRead dataset for scientific paper review assessment. The system generates reviews via a 4-agent delegation pipeline (Manager -> Researcher -> Analyst -> Synthesizer) and evaluates them through a three-tier engine: Tier 1 (traditional text metrics), Tier 2 (LLM-as-Judge), and Tier 3 (graph analysis).

Sprint 4 delivered operational resilience (Logfire, graph timeouts, test alignment) and Claude Code baseline comparison infrastructure.

Sprint 5 addresses runtime bugs, GUI enhancements, architectural improvements, code quality review, and technical debt cleanup across 17 stories:

Core Runtime Fixes (Features 1-4): Judge provider fallback, token limit configurability, single-agent score fairness, PeerRead validation resilience
GUI Enhancements (Features 6-9): Background execution, debug log panel, results tab wiring, editable settings
Architecture & Reliability (Features 12-13, 16): OTLP endpoint fix, graph analysis accuracy, PeerRead tool delegation for multi-agent coordination
Code Quality & Cleanup (Features 10-11, 14-15, 17): Security review (OWASP MAESTRO), test suite audit, wandb import guard, API key debug logging, duplicate AppEnv deletion
Documentation (Feature 5): Comprehensive updates reflecting all Sprint 5 changes

Development Methodology¶

All implementation stories MUST follow these practices. Ralph Loop enforces this order.

TDD Workflow (Mandatory for Features 1-4, 6-9, 12-17)¶

RED: Write failing tests first using testing-python skill. Tests define expected behavior before any implementation code exists.
GREEN: Implement minimal code to pass tests using implementing-python skill. No extra functionality.
REFACTOR: Clean up while keeping tests green. Run make validate before marking complete.

Test Tool Selection (per `docs/best-practices/testing-strategy.md`)¶

Tool	Use for	NOT for
pytest	Core logic, unit tests, known edge cases (primary TDD tool)	Random inputs
Hypothesis	Property invariants, bounds, all-input guarantees	Snapshots, known cases
inline-snapshot	Regression, model dumps, complex structures	TDD red-green, ranges

Decision rule: If the test wouldn’t catch a real bug, don’t write it. Test behavior, not implementation.

Core Principles (per `.claude/rules/core-principles.md`)¶

KISS: Simplest solution that passes tests. Clear > clever.
DRY: Reuse existing patterns (pydantic-settings, select_available_provider(), session state). Don’t rebuild.
YAGNI: Implement only what acceptance criteria require. No speculative features.

Skills Usage¶

Story type	Skills to invoke
Implementation (1-4, 6-9, 12-17)	`testing-python` (RED) → `implementing-python` (GREEN)
Review (10)	`reviewing-code` + `securing-mas`
Audit (11)	`testing-python` (for rewrites)
Documentation (5)	None (manual edits)

Functional Requirements¶

Feature 1: Wire Tier 2 Judge Provider Fallback¶

Description: The LLMJudgeEngine.select_available_provider() method exists but is never called. When tier2_provider=openai and no OPENAI_API_KEY is set, all three Tier 2 metrics fail with 401 and fall back to neutral 0.5 scores. Wire the existing fallback chain so the judge validates API key availability before attempting LLM calls, and add a tier2_provider=auto mode that inherits the agent system’s active provider.

Acceptance Criteria:

Technical Requirements:

In LLMJudgeEngine.__init__() (llm_evaluation_managers.py:35): store env_config parameter
In LLMJudgeEngine.create_judge_agent() (llm_evaluation_managers.py:110): call select_available_provider(self.env_config) to resolve provider/model before creating agent
If select_available_provider() returns None, skip Tier 2 evaluation (return neutral fallback)
Add tier2_provider: str = Field(default="auto") option to JudgeSettings – when "auto", use the chat_provider passed from the agent system
Update EvaluationPipeline.__init__() to accept optional chat_provider: str | None and pass it to LLMJudgeEngine
Update _run_evaluation_if_enabled() in app.py to forward chat_provider from the agent run
Update create_simple_model() in models.py to support Cerebras provider (reuse existing create_llm_model() provider logic)

Files:

src/app/judge/llm_evaluation_managers.py
src/app/judge/evaluation_pipeline.py
src/app/judge/settings.py
src/app/app.py
src/app/llms/models.py
tests/judge/test_llm_evaluation_managers.py (update)

Feature 2: Configurable Agent Token Limits¶

Description: The Cerebras provider has usage_limits: 60000 in config_chat.json, but the gpt-oss-120b model consumed 75,954 tokens (74,714 input + 1,240 output) during a GUI run, causing UsageLimitExceeded. The high input token count was amplified by PeerRead tool returning 23 papers plus structured output validation retries. Add CLI and GUI overrides for total_tokens_limit so users can adjust without editing config_chat.json.

Acceptance Criteria:

CLI: --token-limit N flag overrides usage_limits from config_chat.json
GUI: Token limit input field in settings sidebar (pre-populated from config_chat.json)
When flag/field is not set, existing config_chat.json value is used (no regression)
AGENT_TOKEN_LIMIT environment variable override (lowest priority after CLI/GUI)
Validation: minimum 1000, maximum 1000000
Tests: Hypothesis property tests for limit bounds and override priority
Tests: inline-snapshot for CLI help text
make validate passes
CHANGELOG.md updated

Technical Requirements:

Add --token-limit argument to CLI entry point in src/app/main.py
Pass override through main() in app.py to setup_agent_env() in agent_system.py
In setup_agent_env() (agent_system.py:633-638): use CLI override if provided, else config_chat.json value, else env var
Add token limit input to GUI settings sidebar in src/gui/pages/settings.py
Pass GUI value through to render_app() in src/gui/pages/run_app.py

Files:

src/app/main.py
src/app/app.py
src/app/agents/agent_system.py
src/gui/pages/settings.py
src/gui/pages/run_app.py
tests/app/test_cli_token_limit.py (new)

Feature 3: Single-Agent Composite Score Weight Redistribution¶

Description: The composite scorer uses equal weights (0.167 each) across 6 metrics. For single-agent runs (no multi-agent delegation), coordination_quality is structurally 0.0 (mapped from coordination_centrality in Tier 3 graph analysis), causing a guaranteed 0.167 deduction. The scorer should detect single-agent runs and redistribute coordination_quality weight to the remaining 5 metrics.

Acceptance Criteria:

Technical Requirements:

Add is_single_agent(execution_trace: GraphTraceData | None) -> bool helper in composite_scorer.py
In CompositeScorer.calculate_composite_score() (composite_scorer.py:167): check agent count and adjust weights
Adjusted weights for single-agent: each of the 5 remaining metrics gets 1.0 / 5.0 = 0.200
Add single_agent_mode: bool = False field to CompositeResult model in evaluation_models.py
Set flag in evaluate_composite() before returning result

Files:

src/app/judge/composite_scorer.py
src/app/data_models/evaluation_models.py
tests/judge/test_composite_scorer.py (update)

Feature 4: PeerRead Dataset Validation Resilience¶

Description: Six papers (304-308, 330) fail validation with KeyError: 'IMPACT' at datasets_peerread.py:724 because they lack the IMPACT field. These papers are silently skipped, reducing dataset coverage. The IMPACT field should be treated as optional with a sensible default instead of causing validation failure.

Acceptance Criteria:

Papers with missing IMPACT field are validated successfully with IMPACT defaulting to None or "UNKNOWN"
Papers with missing other optional fields (histories, comments) also handled gracefully
Existing papers with valid IMPACT field are unaffected (no regression)
Log debug message when optional field is missing (not warning)
Tests: Hypothesis property tests for paper validation with arbitrary missing optional fields
Tests: inline-snapshot for validated paper with missing IMPACT
make validate passes
CHANGELOG.md updated

Technical Requirements:

In _validate_papers() (datasets_peerread.py:~700-727): use paper_data.get("IMPACT", None) instead of direct key access
Update PeerReadPaper Pydantic model to make IMPACT field Optional[str] = None
Review other fields in validation loop for similar missing-key risks (histories, comments, SUBSTANCE, APPROPRIATENESS)
Use .get() with defaults for all optional paper metadata fields

Files:

src/app/data_utils/datasets_peerread.py
src/app/data_models/peerread_models.py (update PeerReadPaper model)
tests/data_utils/test_datasets_peerread.py (update)

Feature 5: Documentation and Diagram Updates¶

Description: Update project documentation and architecture diagrams to reflect Sprint 5 changes: judge provider fallback, configurable token limits, single-agent score redistribution, and PeerRead validation resilience. Add Sprint 5 to the roadmap and update architecture decision records where applicable.

Acceptance Criteria:

README.md: Version badge updated, Sprint 5 referenced in status section
docs/roadmap.md: Sprint 5 row added to roadmap table with status “Active” and link to PRD-Sprint5-Ralph.md
docs/architecture.md: Composite Scoring section updated to document single-agent weight redistribution behavior
docs/architecture.md: Tier 2 LLM-as-Judge section updated to document provider fallback chain and auto mode
docs/architecture.md: Implementation Status section updated with Sprint 5 entry
docs/arch_vis/: Update relevant PlantUML diagrams if evaluation pipeline flow changed (e.g., provider selection step in Tier 2)
CHANGELOG.md updated
Stale “Opik integration” docstrings in graph_analysis.py (lines 423, 506) updated to reference Phoenix
No broken internal links introduced

Technical Requirements:

Update docs/roadmap.md roadmap table: add Sprint 5 row, update Sprint 4 status to “Delivered”
Update docs/architecture.md Composite Scoring System section (line ~180-200) to note weight redistribution for single-agent runs
Update docs/architecture.md LLM-as-a-Judge section (line ~124-139) to document tier2_provider=auto and fallback chain
Update docs/architecture.md Development Timeline section (line ~297-304) with Sprint 5 entry
Review docs/arch_vis/mas-enhanced-workflow.plantuml and docs/arch_vis/metrics-eval-sweep.plantuml for accuracy against new evaluation flow
Update README.md version badge if project version incremented

Files:

README.md
docs/roadmap.md
docs/architecture.md
docs/arch_vis/mas-enhanced-workflow.plantuml (if applicable)
docs/arch_vis/metrics-eval-sweep.plantuml (if applicable)
CHANGELOG.md

Description: When a user navigates away from the App tab during query execution, the run aborts because Streamlit re-runs the script on page change. The execution result is also lost because it is not persisted to session state. The app should run queries in the background and persist results so users can navigate freely and return to see completed output.

Acceptance Criteria:

Technical Requirements:

Use st.session_state to persist execution state: running, result, error, execution_id
Run main() via st.fragment (Streamlit 1.33+) or threading.Thread with callback that writes result to session state on completion. Note: st.session_state is not thread-safe – use st.fragment if available, otherwise synchronize writes
In render_app() (run_app.py:108): check session state for existing result before starting new execution
Add st.spinner("Running evaluation...") or st.status() container around execution
Store CompositeResult in st.session_state["evaluation_result"] for cross-tab access

Files:

src/gui/pages/run_app.py
src/run_gui.py
tests/test_gui/test_run_app.py (update)

Feature 7: Debug Log Panel in App Tab¶

Description: Add an expandable panel in the App tab that displays real-time pipeline log output (evaluation metrics, tier results, errors) that currently only appears in the terminal. Users should see the same diagnostic information visible in the CLI without needing terminal access.

Acceptance Criteria:

Technical Requirements:

Add a custom loguru sink that captures log records to a list during execution
Store captured logs in st.session_state["debug_logs"]
In render_app(): add st.expander("Debug Log", expanded=False) after result display
Inside expander, render each log entry using st.text() or st.code() with level formatting
Sink should filter to app.* module logs only (exclude Streamlit internals)
Clear log buffer at start of each new execution

Files:

src/gui/pages/run_app.py
src/gui/utils/log_capture.py (new)
tests/test_gui/test_log_capture.py (new)

Feature 8: Wire Evaluation Results and Agent Graph Tabs to Real Data¶

Description: The “Evaluation Results” and “Agent Graph” tabs have full rendering implementations but are called with None data from run_gui.py (lines 100, 103). After a query execution in the App tab, both pages should display actual results from the completed run instead of showing placeholder messages.

Acceptance Criteria:

Technical Requirements:

Store CompositeResult in st.session_state["evaluation_result"] after evaluation pipeline completes (from Feature 6)
Build nx.DiGraph from GraphTraceData and store in st.session_state["agent_graph"]
In run_gui.py: pass st.session_state.get("evaluation_result") to render_evaluation() instead of None
In run_gui.py: pass st.session_state.get("agent_graph") to render_agent_graph() instead of None
Graph construction: reuse existing GraphTraceData → nx.DiGraph conversion from graph_analysis.py

Files:

src/run_gui.py
src/gui/pages/run_app.py
src/gui/pages/evaluation.py (minor – verify data flow)
src/gui/pages/agent_graph.py (minor – verify data flow)
tests/test_gui/test_evaluation_page.py (update)
tests/test_gui/test_agent_graph_page.py (update)

Feature 9: Editable Settings Page¶

Description: The Settings page (src/gui/pages/settings.py) currently displays JudgeSettings and AppEnv values as read-only text. Only the “Agent Configuration” section (provider selector, agent toggles) in the App tab sidebar is interactive. All displayed settings should be editable via the GUI and applied to the current session.

Acceptance Criteria:

Technical Requirements:

Replace st.text() / st.json() display-only rendering with appropriate input widgets:
st.multiselect() for tiers_enabled
st.selectbox() for tier2_provider (options from known providers list)
st.text_input() for model names
st.number_input() for timeouts and thresholds (with min/max from Field constraints)
st.toggle() for boolean fields (logfire_enabled, trace_collection)
Store modified settings in st.session_state["judge_settings"] as JudgeSettings instance
In run_app.py: use st.session_state.get("judge_settings", JudgeSettings()) when creating EvaluationPipeline
Add “Reset to Defaults” button that clears settings from session state

Files:

src/gui/pages/settings.py
src/gui/pages/run_app.py (read settings from session state)
tests/test_gui/test_settings_page.py (update)

Feature 10: Code Quality and Security Review¶

Description: Comprehensive code quality and security audit of the entire codebase using the reviewing-code and securing-mas Claude Code skills, supported by Context7 MCP for up-to-date library documentation and Exa MCP for security advisory lookups. The review applies the OWASP MAESTRO 7-layer security framework documented in docs/best-practices/mas-security.md and produces actionable findings with fix recommendations.

Acceptance Criteria:

Technical Requirements:

Use reviewing-code skill for quality review: code complexity, duplication, naming, error handling, docstring completeness
Use securing-mas skill for security review: apply each MAESTRO layer checklist against relevant modules
Use Context7 MCP (mcp__plugin_context7_context7__query-docs) to verify security patterns for:
/pydantic/pydantic-ai – agent security, tool validation
/pydantic/logfire – trace data sensitivity
/websites/streamlit_io – session state security, XSS prevention
Use Exa MCP (mcp__exa__web_search_exa) to search for CVEs and security advisories for project dependencies
Claude Code agent orchestration: spawn parallel review agents per MAESTRO layer for efficiency
Priority modules for review:
src/app/agents/agent_system.py – agent orchestration, prompt handling
src/app/llms/providers.py – API key management
src/app/llms/models.py – model creation, provider routing
src/app/judge/llm_evaluation_managers.py – LLM judge calls
src/app/judge/evaluation_pipeline.py – pipeline orchestration
src/gui/pages/run_app.py – user input handling
src/app/data_utils/datasets_peerread.py – external data ingestion

Files:

docs/reviews/sprint5-code-review.md (new)
src/app/ (all modules subject to review – fixes applied in place)

Ralph Loop Note: This is a review task, not an implementation task. Ralph should invoke the reviewing-code and securing-mas skills to produce findings, fix critical/high issues in code, and write the review document. The primary deliverable is docs/reviews/sprint5-code-review.md.

Feature 11: Test Suite Audit and Behavioral Refactoring¶

Description: Systematic audit of all 56 test files against the testing strategy (docs/best-practices/testing-strategy.md). Tests that only verify implementation details (field existence, type checks, default values, import availability) are deleted or replaced with behavioral tests. Tests that verify actual behavior (business logic, error handling, integration contracts) are kept and improved. The goal is a leaner, higher-signal test suite where every test catches real bugs.

Acceptance Criteria:

Every test file in tests/ audited against testing strategy criteria
Tests that only verify implementation details identified and removed (see anti-patterns below)
Tests that verify actual behavior kept and improved where needed
No reduction in behavioral coverage – only implementation-detail tests removed
Remaining tests use appropriate tooling: pytest for logic, Hypothesis for properties, inline-snapshot for structure
Audit findings documented in docs/reviews/sprint5-test-audit.md with per-file decisions (keep/delete/refactor)
make validate passes after refactoring
make test_all passes with no regressions in behavioral coverage
CHANGELOG.md updated

Technical Requirements:

Anti-patterns to remove (from docs/best-practices/testing-strategy.md):
Import/existence tests: test_module_exists(), test_class_importable()
Field existence tests: test_model_has_field_x(), hasattr() checks
Default constant tests: assert DEFAULT_VALUE == 300
Type-only checks: assert isinstance(result, dict) (pyright handles this)
Over-granular tests: 8 separate tests for one Pydantic model’s fields
Library behavior tests: testing Pydantic validation, os.environ reads, framework internals
Trivial assertions: x is not None, callable(func)
Behavioral tests to keep/improve:
Business logic: composite scoring calculations, metric extraction, weight redistribution
Error handling: provider fallback chains, timeout behavior, missing data graceful degradation
Integration contracts: evaluation pipeline end-to-end, trace data flow, GUI session state
Edge cases with real impact: empty inputs, boundary values, concurrent access
Per-directory audit priorities:
tests/evals/ (14 files) – HIGH: core evaluation logic, likely contains both good behavioral tests and implementation-detail tests
tests/judge/ (10 files) – HIGH: judge pipeline, plugin system
tests/integration/ (5 files) – MEDIUM: integration contracts, may have over-mocked tests
tests/test_gui/ (5 files) – MEDIUM: GUI behavior, may test Streamlit internals
tests/agents/ (3 files) – MEDIUM: agent wiring
tests/data_models/, tests/data_utils/, tests/common/, tests/utils/ – LOW: data validation, likely candidates for cleanup
tests/cc_otel/ (2 files) – DELETE: Phoenix replaced Opik, cc_otel tests are obsolete
test_migration_cleanup.py – DELETE: Sprint 4 migration complete, cleanup file no longer needed
Decision rule: If a test wouldn’t catch a real bug introduced by a code change, remove it.

Files:

tests/ (all 56 test files subject to audit)
docs/reviews/sprint5-test-audit.md (new – per-file audit decisions)

Ralph Loop Note: This is an audit-and-refactor task. Ralph should read each test file, apply the decision rule against the anti-pattern list, delete or rewrite failing tests, and document per-file decisions in docs/reviews/sprint5-test-audit.md. Run make test_all after each batch of changes.

Feature 12: Fix OTLP Endpoint Double-Path Bug¶

Description: The Logfire instrumentation sets OTEL_EXPORTER_OTLP_ENDPOINT to http://localhost:6006/v1/traces (logfire_instrumentation.py:59). Per the OTEL spec, the SDK auto-appends signal-specific paths to this base endpoint, producing http://localhost:6006/v1/traces/v1/traces for spans and http://localhost:6006/v1/traces/v1/metrics for metrics – both return HTTP 405 from Phoenix. All trace export silently fails despite the agent instrumentation working correctly.

Acceptance Criteria:

Traces from agent runs appear in the Phoenix UI at http://localhost:6006
No HTTP 405 errors in logs for /v1/traces/v1/traces or /v1/traces/v1/metrics paths
OTEL_EXPORTER_OTLP_ENDPOINT set to base URL only (http://localhost:6006), not the signal-specific path
Existing PHOENIX_ENDPOINT env var and phoenix_endpoint config field continue to work
Tests: pytest unit test for endpoint construction logic (base URL without signal path)
Tests: inline-snapshot for the constructed OTLP endpoint value
make validate passes
CHANGELOG.md updated

Technical Requirements:

In LogfireInstrumentation._configure_phoenix() (logfire_instrumentation.py:59): set OTEL_EXPORTER_OTLP_ENDPOINT to self.config.phoenix_endpoint (base URL) instead of appending /v1/traces
Update connectivity check at line 65: requests.head() should probe the base URL (e.g., http://localhost:6006), not the double-path URL
Alternative: use OTEL_EXPORTER_OTLP_TRACES_ENDPOINT (signal-specific variable, not auto-appended by SDK) if base endpoint conflicts with other signal exporters
Remove any 405 suppression workarounds that are no longer needed after the fix

Files:

src/app/agents/logfire_instrumentation.py
tests/agents/test_logfire_instrumentation.py (update)

Feature 13: Fix Tier 3 Graph Analysis Tool Accuracy and Dead Metric¶

Description: Two issues in graph_analysis.py affect Tier 3 scoring accuracy. First, add_node at line 171 overwrites success_rate each time a tool is called, so only the last call’s outcome survives – if a tool succeeds 9 times and fails once (last), success_rate=0.0. The same overwrite applies to add_edge at line 173 for agent-tool edge weights. Second, communication_overhead is computed and stored in Tier3Result but never included in overall_score (lines 392-397), making it a dead metric that inflates the model without contributing to scoring.

Acceptance Criteria:

Tool success_rate accumulates across all calls (e.g., 9/10 successes = 0.9), not just the last call
Agent-tool edge weight accumulates or averages across repeated calls, not overwritten
communication_overhead either contributes to overall_score or is removed from Tier3Result
If communication_overhead is included in scoring, weights are rebalanced to sum to 1.0
Existing multi-tool and single-tool scenarios produce correct tool_selection_accuracy
Tests: Hypothesis property tests for tool accuracy with repeated calls (success_rate in [0.0, 1.0])
Tests: inline-snapshot for Tier3Result with known tool call sequences
make validate passes
CHANGELOG.md updated

Technical Requirements:

In analyze_tool_usage() (graph_analysis.py:165-173): track call count and success count per tool node, compute success_rate as successes / total_calls after the loop
For edge weights: accumulate call count per agent-tool pair, average or sum weights
For communication_overhead: decide include-or-remove. If removed, delete from Tier3Result model in evaluation_models.py and any downstream references. If included, rebalance self.weights to sum to 1.0 across 5 metrics.

Files:

src/app/judge/graph_analysis.py
src/app/data_models/evaluation_models.py (if communication_overhead removed)
tests/judge/test_graph_analysis.py (update)

Feature 14: Guard Wandb Import and Disable Crash Telemetry¶

Description: login.py:9 has an unconditional from wandb import login as wandb_login at module level. If the optional wandb package is not installed, the entire login.py module fails to import, breaking the application. Additionally, wandb sends crash telemetry to Sentry by default with no opt-out. The weave import at line 44 is already guarded inside the function body – the wandb import should follow the same pattern.

Acceptance Criteria:

Application starts successfully when wandb is not installed (no ImportError)
When wandb is installed and WANDB_API_KEY is set, login and weave init work as before
When wandb is not installed, login() skips wandb/weave initialization with a debug log
WANDB_ERROR_REPORTING defaults to "false" (respects user override if already set)
Dead agentops commented code removed from login.py: commented import at line 7 (# from agentops import init as agentops_init) and commented code block at lines 30-37
Tests: pytest unit test for login with wandb unavailable (mock ImportError)
Tests: inline-snapshot for log output when wandb is missing
make validate passes
CHANGELOG.md updated

Technical Requirements:

Replace unconditional from wandb import login as wandb_login (line 9) with guarded import:

try:
    from wandb import login as wandb_login
    _wandb_available = True
except ImportError:
    _wandb_available = False

Add os.environ.setdefault("WANDB_ERROR_REPORTING", "false") before the wandb import
In login(): guard wandb_login() and weave_init() calls with if _wandb_available
Log logger.debug("wandb not installed, skipping wandb/weave initialization") when unavailable

Files:

src/app/utils/login.py
tests/utils/test_login.py (new or update)

Feature 15: Debug Logging for Empty API Keys¶

Description: When get_api_key() returns False for a provider whose key exists in .env but resolves to empty string at runtime, there is no diagnostic log. This makes transient .env loading issues (CWD mismatch, env var unset between runs) hard to diagnose. Add a debug log when a key is expected (provider registered with env_key) but the value is empty.

Acceptance Criteria:

get_api_key() logs a debug message when a registered provider’s key resolves to empty string
Debug message includes the env_key name (e.g., GITHUB_API_KEY) for diagnosis
No log emitted for providers without API keys (e.g., Ollama)
No log emitted when key is correctly loaded
Tests: pytest unit test for empty-key debug log scenario
make validate passes
CHANGELOG.md updated

Technical Requirements:

In get_api_key() (providers.py:42): add logger.debug(f"API key '{provider_metadata.env_key}' is empty for provider '{provider}'") in the else branch
No new settings fields, no behavioral change – debug logging only

Files:

src/app/llms/providers.py
tests/llms/test_providers.py (update)

Feature 16: Move PeerRead Tools from Manager to Researcher Agent¶

Description: The manager agent receives both delegation tools (researcher(), analyst(), synthesiser()) and PeerRead tools (get_peerread_paper, generate_paper_review_content_from_template, save_structured_review) via add_peerread_tools_to_manager() at agent_system.py:411. Sub-agents get minimal tools: researcher has only duckduckgo_search_tool(), analyst and synthesiser have none. Models take the path of least resistance – the manager uses PeerRead tools directly instead of delegating, resulting in zero multi-agent coordination. Moving PeerRead tools to the researcher enforces separation of concerns: manager coordinates, researcher executes.

Acceptance Criteria:

Technical Requirements:

Rename add_peerread_tools_to_manager() to add_peerread_tools_to_researcher() in peerread_tools.py
In _create_manager() (agent_system.py:410-411): remove add_peerread_tools_to_manager(manager) call
In researcher agent creation block (agent_system.py:~370-380): add PeerRead tools alongside duckduckgo_search_tool()
Update researcher system prompt to include PeerRead tool usage instructions (currently only on manager prompt)
If researcher is None (disabled), fall back to adding PeerRead tools to manager to preserve single-agent operation

Files:

src/app/agents/agent_system.py
src/app/tools/peerread_tools.py
tests/agents/test_agent_system.py (update)

Feature 17: Delete Duplicate AppEnv and Dead Code in load_settings.py¶

Description: src/app/utils/load_settings.py contains a duplicate AppEnv class (lines 22-49) that diverges from the canonical AppEnv in src/app/data_models/app_models.py (lines 219-249). The duplicate is missing ANTHROPIC_API_KEY, CEREBRAS_API_KEY, OPENAI_API_KEY, and uses LOGFIRE_TOKEN instead of LOGFIRE_API_KEY. It also eagerly instantiates chat_config = AppEnv() at module level (line 52), which runs on import. Only one consumer exists: datasets_peerread.py:23. The duplicate class and module-level instance should be deleted; load_config() can remain since it loads ChatConfig from JSON.

Acceptance Criteria:

Duplicate AppEnv class removed from load_settings.py
Module-level chat_config = AppEnv() instance removed from load_settings.py
datasets_peerread.py import updated to use canonical AppEnv from app.data_models.app_models
load_config() function retained in load_settings.py ONLY if still used for JSON config loading; if load_config() is unused (grep/search confirms no consumers), delete entire load_settings.py module
No import errors or runtime failures after removal
Tests: pytest unit test verifying single AppEnv source of truth
make validate passes
CHANGELOG.md updated

Technical Requirements:

Delete class AppEnv (lines 22-49) and chat_config = AppEnv() (line 52) from load_settings.py
In datasets_peerread.py:23: change from app.utils.load_settings import chat_config to from app.data_models.app_models import AppEnv and instantiate where needed (or pass as parameter)
Verify no other files import from load_settings.AppEnv (grep confirmed: only datasets_peerread.py)

Files:

src/app/utils/load_settings.py
src/app/data_utils/datasets_peerread.py
tests/data_utils/test_datasets_peerread.py (update if affected)

Non-Functional Requirements¶

Maintainability:
Use existing configuration patterns (pydantic-settings, env prefix overrides)
No new dependencies required
Performance:
Provider selection adds negligible overhead (single API key check per evaluation)
Weight redistribution is a simple arithmetic operation
GUI background execution must not block Streamlit’s event loop
Backward Compatibility:
All fixes are additive; existing JUDGE_TIER2_PROVIDER=openai and config_chat.json values continue working
Default behavior changes only where current defaults produce errors (401 auth failures, silent paper skipping)
CLI behavior unchanged by GUI features
Testing (per docs/best-practices/testing-strategy.md):
TDD mandatory: RED (failing test) → GREEN (minimal implementation) → REFACTOR for all implementation stories
Use pytest as primary TDD tool for unit tests with Arrange-Act-Assert structure
Use Hypothesis (@given) for property-based tests: provider selection invariants, weight sum invariants, token limit bounds, paper validation with missing fields, settings value bounds
Use inline-snapshot (snapshot()) for regression tests: log messages, model dumps, CLI help text, GUI component structures
Decision rule: Test behavior, not implementation. If a test wouldn’t catch a real bug, don’t write it.

Out of Scope¶

Tier 1 reference comparison fix (all-1.0 self-comparison scores – requires ground-truth review integration, separate feature)
Automatic PeerRead dataset download when splits are missing (existing error message with instructions is sufficient)
Custom composite weight configuration via CLI/GUI (equal-weight is the defined scoring model)
Tier 2 cost tracking or budget enforcement (existing tier2_cost_budget_usd is informational only)
Cerebras-specific prompt optimization for structured output validation retries
Real-time log streaming in GUI (logs displayed after execution completes)
Persisting modified settings to .env or config_chat.json (session-only changes)
Adding new tests for untested modules (audit scope is refactoring existing tests only)
Penetration testing or runtime exploit validation (static review only)

Notes for Ralph Loop¶

Story Priority Tiers for Ralph¶

Ralph should prioritize stories in the following order to maximize value delivery:

P0 (Quick Wins - 1-line fixes): STORY-012 (OTLP endpoint), STORY-014 (wandb import guard), STORY-015 (API key debug log), STORY-017 (duplicate AppEnv)
P1 (Core Bugs): STORY-001 (judge provider fallback), STORY-003 (single-agent score fairness), STORY-004 (PeerRead validation), STORY-013 (graph analysis accuracy)
P2 (GUI Features): STORY-002 (token limits), STORY-006 (background execution), STORY-007 (debug log panel), STORY-008 (wire results tabs), STORY-009 (editable settings)
P3 (Architecture): STORY-016 (PeerRead tool delegation)
P4 (Meta-Tasks): STORY-010 (code review), STORY-011 (test audit)
P5 (Documentation - Blocked): STORY-005 (depends on all other stories)

Story Breakdown - Sprint 5 (17 stories total):

Feature 1 (Judge Provider Fallback) → STORY-001: Wire Tier 2 judge provider fallback and auto-inherit agent provider
Feature 2 (Token Limits) → STORY-002: Configurable agent token limits via CLI, GUI, and env var
Feature 3 (Score Fairness) → STORY-003: Single-agent composite score weight redistribution
Feature 4 (PeerRead Validation) → STORY-004: PeerRead dataset validation resilience for optional fields
Feature 5 (Documentation Updates) → STORY-005: Update documentation and diagrams for Sprint 5 (depends: STORY-001, STORY-002, STORY-003, STORY-004, STORY-006, STORY-007, STORY-008, STORY-009, STORY-010, STORY-011, STORY-012, STORY-013, STORY-014, STORY-015, STORY-016, STORY-017)
Feature 6 (Background Execution) → STORY-006: Background query execution with tab navigation resilience
Feature 7 (Debug Log Panel) → STORY-007: Debug log panel in App tab (depends: STORY-006)
Feature 8 (Wire Results Tabs) → STORY-008: Wire Evaluation Results and Agent Graph tabs to real data (depends: STORY-006)
Feature 9 (Editable Settings) → STORY-009: Editable settings page with session-scoped persistence (depends: STORY-001)
Feature 10 (Code Review) → STORY-010: Code quality and OWASP MAESTRO security review
Feature 11 (Test Audit) → STORY-011: Test suite audit and behavioral refactoring
Feature 12 (OTLP Endpoint Fix) → STORY-012: Fix OTLP endpoint double-path bug in Logfire instrumentation
Feature 13 (Graph Analysis Fixes) → STORY-013: Fix Tier 3 tool accuracy overwrite and dead communication_overhead metric
Feature 14 (Wandb Import Guard) → STORY-014: Guard wandb import and disable crash telemetry default
Feature 15 (API Key Debug Log) → STORY-015: Debug logging for empty API keys in provider resolution
Feature 16 (PeerRead Tool Delegation) → STORY-016: Move PeerRead tools from manager to researcher agent
Feature 17 (Delete Duplicate AppEnv) → STORY-017: Delete duplicate AppEnv class and dead code in load_settings.py

PRD Sprint5 Ralph

Project Overview¶

Development Methodology¶

TDD Workflow (Mandatory for Features 1-4, 6-9, 12-17)¶

Test Tool Selection (per docs/best-practices/testing-strategy.md)¶

Core Principles (per .claude/rules/core-principles.md)¶

Skills Usage¶

Functional Requirements¶

Feature 1: Wire Tier 2 Judge Provider Fallback¶

Feature 2: Configurable Agent Token Limits¶

Feature 3: Single-Agent Composite Score Weight Redistribution¶

Feature 4: PeerRead Dataset Validation Resilience¶

Feature 5: Documentation and Diagram Updates¶

Feature 6: Background Execution with Tab Navigation¶

Feature 7: Debug Log Panel in App Tab¶

Feature 8: Wire Evaluation Results and Agent Graph Tabs to Real Data¶

Feature 9: Editable Settings Page¶

Feature 10: Code Quality and Security Review¶

Feature 11: Test Suite Audit and Behavioral Refactoring¶

Feature 12: Fix OTLP Endpoint Double-Path Bug¶

Feature 13: Fix Tier 3 Graph Analysis Tool Accuracy and Dead Metric¶

Feature 14: Guard Wandb Import and Disable Crash Telemetry¶

Feature 15: Debug Logging for Empty API Keys¶

Feature 16: Move PeerRead Tools from Manager to Researcher Agent¶

Feature 17: Delete Duplicate AppEnv and Dead Code in load_settings.py¶

Non-Functional Requirements¶

Out of Scope¶

Notes for Ralph Loop¶

Story Priority Tiers for Ralph¶

Test Tool Selection (per `docs/best-practices/testing-strategy.md`)¶

Core Principles (per `.claude/rules/core-principles.md`)¶