PRD Sprint5 Ralph
title: Product Requirements Document: Agents-eval Sprint 5 version: 1.0.0 created: 2026-02-15 updated: 2026-02-15
Project Overview¶
Agents-eval evaluates multi-agent AI systems using the PeerRead dataset for scientific paper review assessment. The system generates reviews via a 4-agent delegation pipeline (Manager -> Researcher -> Analyst -> Synthesizer) and evaluates them through a three-tier engine: Tier 1 (traditional text metrics), Tier 2 (LLM-as-Judge), and Tier 3 (graph analysis).
Sprint 4 delivered operational resilience (Logfire, graph timeouts, test alignment) and Claude Code baseline comparison infrastructure.
Sprint 5 addresses runtime bugs, GUI enhancements, architectural improvements, code quality review, and technical debt cleanup across 17 stories:
- Core Runtime Fixes (Features 1-4): Judge provider fallback, token limit configurability, single-agent score fairness, PeerRead validation resilience
- GUI Enhancements (Features 6-9): Background execution, debug log panel, results tab wiring, editable settings
- Architecture & Reliability (Features 12-13, 16): OTLP endpoint fix, graph analysis accuracy, PeerRead tool delegation for multi-agent coordination
- Code Quality & Cleanup (Features 10-11, 14-15, 17): Security review (OWASP MAESTRO), test suite audit, wandb import guard, API key debug logging, duplicate AppEnv deletion
- Documentation (Feature 5): Comprehensive updates reflecting all Sprint 5 changes
Development Methodology¶
All implementation stories MUST follow these practices. Ralph Loop enforces this order.
TDD Workflow (Mandatory for Features 1-4, 6-9, 12-17)¶
- RED: Write failing tests first using
testing-pythonskill. Tests define expected behavior before any implementation code exists. - GREEN: Implement minimal code to pass tests using
implementing-pythonskill. No extra functionality. - REFACTOR: Clean up while keeping tests green. Run
make validatebefore marking complete.
Test Tool Selection (per docs/best-practices/testing-strategy.md)¶
| Tool | Use for | NOT for |
|---|---|---|
| pytest | Core logic, unit tests, known edge cases (primary TDD tool) | Random inputs |
| Hypothesis | Property invariants, bounds, all-input guarantees | Snapshots, known cases |
| inline-snapshot | Regression, model dumps, complex structures | TDD red-green, ranges |
Decision rule: If the test wouldn’t catch a real bug, don’t write it. Test behavior, not implementation.
Core Principles (per .claude/rules/core-principles.md)¶
- KISS: Simplest solution that passes tests. Clear > clever.
- DRY: Reuse existing patterns (
pydantic-settings,select_available_provider(), session state). Don’t rebuild. - YAGNI: Implement only what acceptance criteria require. No speculative features.
Skills Usage¶
| Story type | Skills to invoke |
|---|---|
| Implementation (1-4, 6-9, 12-17) | testing-python (RED) → implementing-python (GREEN) |
| Review (10) | reviewing-code + securing-mas |
| Audit (11) | testing-python (for rewrites) |
| Documentation (5) | None (manual edits) |
Functional Requirements¶
Feature 1: Wire Tier 2 Judge Provider Fallback¶
Description: The LLMJudgeEngine.select_available_provider() method exists but is never called. When tier2_provider=openai and no OPENAI_API_KEY is set, all three Tier 2 metrics fail with 401 and fall back to neutral 0.5 scores. Wire the existing fallback chain so the judge validates API key availability before attempting LLM calls, and add a tier2_provider=auto mode that inherits the agent system’s active provider.
Acceptance Criteria:
-
LLMJudgeEnginecallsselect_available_provider()before creating judge agents - When primary provider API key is missing, fallback provider is used automatically
- When both providers are unavailable, Tier 2 is skipped with a single warning (no 401 errors, no neutral 0.5 fallback scores)
- When Tier 2 is skipped, its 3 metrics (
technical_accuracy,constructiveness,planning_rationality) are excluded from composite scoring and their weights redistributed to Tier 1 and Tier 3 metrics - Compound redistribution: When both Tier 2 skipped AND single-agent mode (STORY-003), composite scorer must handle both conditions (skip 3 Tier 2 metrics + skip
coordination_quality, redistribute remaining weights) - New
tier2_provider=automode inherits the agent system’s activechat_provider -
EvaluationPipelineaccepts optionalchat_providerparameter to pass through to judge -
_run_evaluation_if_enabled()inapp.pypasseschat_providerto the pipeline - Existing
JUDGE_TIER2_PROVIDERenv var override continues to work - Tests: Hypothesis property tests for provider selection invariants (fallback only when primary unavailable)
- Tests: inline-snapshot for log messages during fallback
-
make validatepasses - CHANGELOG.md updated
Technical Requirements:
- In
LLMJudgeEngine.__init__()(llm_evaluation_managers.py:35): storeenv_configparameter - In
LLMJudgeEngine.create_judge_agent()(llm_evaluation_managers.py:110): callselect_available_provider(self.env_config)to resolve provider/model before creating agent - If
select_available_provider()returnsNone, skip Tier 2 evaluation (return neutral fallback) - Add
tier2_provider: str = Field(default="auto")option toJudgeSettings– when"auto", use thechat_providerpassed from the agent system - Update
EvaluationPipeline.__init__()to accept optionalchat_provider: str | Noneand pass it toLLMJudgeEngine - Update
_run_evaluation_if_enabled()inapp.pyto forwardchat_providerfrom the agent run - Update
create_simple_model()inmodels.pyto support Cerebras provider (reuse existingcreate_llm_model()provider logic)
Files:
src/app/judge/llm_evaluation_managers.pysrc/app/judge/evaluation_pipeline.pysrc/app/judge/settings.pysrc/app/app.pysrc/app/llms/models.pytests/judge/test_llm_evaluation_managers.py(update)
Feature 2: Configurable Agent Token Limits¶
Description: The Cerebras provider has usage_limits: 60000 in config_chat.json, but the gpt-oss-120b model consumed 75,954 tokens (74,714 input + 1,240 output) during a GUI run, causing UsageLimitExceeded. The high input token count was amplified by PeerRead tool returning 23 papers plus structured output validation retries. Add CLI and GUI overrides for total_tokens_limit so users can adjust without editing config_chat.json.
Acceptance Criteria:
- CLI:
--token-limit Nflag overridesusage_limitsfromconfig_chat.json - GUI: Token limit input field in settings sidebar (pre-populated from
config_chat.json) - When flag/field is not set, existing
config_chat.jsonvalue is used (no regression) -
AGENT_TOKEN_LIMITenvironment variable override (lowest priority after CLI/GUI) - Validation: minimum 1000, maximum 1000000
- Tests: Hypothesis property tests for limit bounds and override priority
- Tests: inline-snapshot for CLI help text
-
make validatepasses - CHANGELOG.md updated
Technical Requirements:
- Add
--token-limitargument to CLI entry point insrc/app/main.py - Pass override through
main()inapp.pytosetup_agent_env()inagent_system.py - In
setup_agent_env()(agent_system.py:633-638): use CLI override if provided, elseconfig_chat.jsonvalue, else env var - Add token limit input to GUI settings sidebar in
src/gui/pages/settings.py - Pass GUI value through to
render_app()insrc/gui/pages/run_app.py
Files:
src/app/main.pysrc/app/app.pysrc/app/agents/agent_system.pysrc/gui/pages/settings.pysrc/gui/pages/run_app.pytests/app/test_cli_token_limit.py(new)
Feature 3: Single-Agent Composite Score Weight Redistribution¶
Description: The composite scorer uses equal weights (0.167 each) across 6 metrics. For single-agent runs (no multi-agent delegation), coordination_quality is structurally 0.0 (mapped from coordination_centrality in Tier 3 graph analysis), causing a guaranteed 0.167 deduction. The scorer should detect single-agent runs and redistribute coordination_quality weight to the remaining 5 metrics.
Acceptance Criteria:
- Detect single-agent runs from
GraphTraceData(0 or 1 unique agent IDs, emptycoordination_events) - When single-agent detected, redistribute
coordination_qualityweight (0.167) equally across remaining 5 metrics - Multi-agent runs continue using all 6 metrics with equal weights (no regression)
-
CompositeResultincludessingle_agent_mode: boolflag for transparency - Compound redistribution: When both Tier 2 skipped (STORY-001) AND single-agent mode, composite scorer must handle both conditions (see STORY-001 for interaction)
- Log message when weight redistribution occurs
- Tests: Hypothesis property tests for weight sum invariant (always sums to ~1.0)
- Tests: inline-snapshot for metric weights in single-agent vs multi-agent mode
-
make validatepasses - CHANGELOG.md updated
Technical Requirements:
- Add
is_single_agent(execution_trace: GraphTraceData | None) -> boolhelper incomposite_scorer.py - In
CompositeScorer.calculate_composite_score()(composite_scorer.py:167): check agent count and adjust weights - Adjusted weights for single-agent: each of the 5 remaining metrics gets
1.0 / 5.0 = 0.200 - Add
single_agent_mode: bool = Falsefield toCompositeResultmodel inevaluation_models.py - Set flag in
evaluate_composite()before returning result
Files:
src/app/judge/composite_scorer.pysrc/app/data_models/evaluation_models.pytests/judge/test_composite_scorer.py(update)
Feature 4: PeerRead Dataset Validation Resilience¶
Description: Six papers (304-308, 330) fail validation with KeyError: 'IMPACT' at datasets_peerread.py:724 because they lack the IMPACT field. These papers are silently skipped, reducing dataset coverage. The IMPACT field should be treated as optional with a sensible default instead of causing validation failure.
Acceptance Criteria:
- Papers with missing
IMPACTfield are validated successfully withIMPACTdefaulting toNoneor"UNKNOWN" - Papers with missing other optional fields (
histories,comments) also handled gracefully - Existing papers with valid
IMPACTfield are unaffected (no regression) - Log debug message when optional field is missing (not warning)
- Tests: Hypothesis property tests for paper validation with arbitrary missing optional fields
- Tests: inline-snapshot for validated paper with missing IMPACT
-
make validatepasses - CHANGELOG.md updated
Technical Requirements:
- In
_validate_papers()(datasets_peerread.py:~700-727): usepaper_data.get("IMPACT", None)instead of direct key access - Update
PeerReadPaperPydantic model to makeIMPACTfieldOptional[str] = None - Review other fields in validation loop for similar missing-key risks (
histories,comments,SUBSTANCE,APPROPRIATENESS) - Use
.get()with defaults for all optional paper metadata fields
Files:
src/app/data_utils/datasets_peerread.pysrc/app/data_models/peerread_models.py(update PeerReadPaper model)tests/data_utils/test_datasets_peerread.py(update)
Feature 5: Documentation and Diagram Updates¶
Description: Update project documentation and architecture diagrams to reflect Sprint 5 changes: judge provider fallback, configurable token limits, single-agent score redistribution, and PeerRead validation resilience. Add Sprint 5 to the roadmap and update architecture decision records where applicable.
Acceptance Criteria:
-
README.md: Version badge updated, Sprint 5 referenced in status section -
docs/roadmap.md: Sprint 5 row added to roadmap table with status “Active” and link toPRD-Sprint5-Ralph.md -
docs/architecture.md: Composite Scoring section updated to document single-agent weight redistribution behavior -
docs/architecture.md: Tier 2 LLM-as-Judge section updated to document provider fallback chain andautomode -
docs/architecture.md: Implementation Status section updated with Sprint 5 entry -
docs/arch_vis/: Update relevant PlantUML diagrams if evaluation pipeline flow changed (e.g., provider selection step in Tier 2) - CHANGELOG.md updated
- Stale “Opik integration” docstrings in
graph_analysis.py(lines 423, 506) updated to reference Phoenix - No broken internal links introduced
Technical Requirements:
- Update
docs/roadmap.mdroadmap table: add Sprint 5 row, update Sprint 4 status to “Delivered” - Update
docs/architecture.mdComposite Scoring System section (line ~180-200) to note weight redistribution for single-agent runs - Update
docs/architecture.mdLLM-as-a-Judge section (line ~124-139) to documenttier2_provider=autoand fallback chain - Update
docs/architecture.mdDevelopment Timeline section (line ~297-304) with Sprint 5 entry - Review
docs/arch_vis/mas-enhanced-workflow.plantumlanddocs/arch_vis/metrics-eval-sweep.plantumlfor accuracy against new evaluation flow - Update
README.mdversion badge if project version incremented
Files:
README.mddocs/roadmap.mddocs/architecture.mddocs/arch_vis/mas-enhanced-workflow.plantuml(if applicable)docs/arch_vis/metrics-eval-sweep.plantuml(if applicable)CHANGELOG.md
Feature 6: Background Execution with Tab Navigation¶
Description: When a user navigates away from the App tab during query execution, the run aborts because Streamlit re-runs the script on page change. The execution result is also lost because it is not persisted to session state. The app should run queries in the background and persist results so users can navigate freely and return to see completed output.
Acceptance Criteria:
- Query execution continues when user navigates to another tab (Settings, Evaluation Results, etc.)
- User can return to App tab and see the result after execution completes
- A progress indicator (spinner or status) shows while execution is in progress
- If execution is in progress when returning to App tab, spinner is displayed
- Execution result (
CompositeResult, agent output) stored inst.session_state - Error state stored in session state and displayed when user returns
- Tests: pytest unit tests for session state transitions (idle → running → completed/error)
- Tests: inline-snapshot for session state keys after execution
-
make validatepasses - CHANGELOG.md updated
Technical Requirements:
- Use
st.session_stateto persist execution state:running,result,error,execution_id - Run
main()viast.fragment(Streamlit 1.33+) orthreading.Threadwith callback that writes result to session state on completion. Note:st.session_stateis not thread-safe – usest.fragmentif available, otherwise synchronize writes - In
render_app()(run_app.py:108): check session state for existing result before starting new execution - Add
st.spinner("Running evaluation...")orst.status()container around execution - Store
CompositeResultinst.session_state["evaluation_result"]for cross-tab access
Files:
src/gui/pages/run_app.pysrc/run_gui.pytests/test_gui/test_run_app.py(update)
Feature 7: Debug Log Panel in App Tab¶
Description: Add an expandable panel in the App tab that displays real-time pipeline log output (evaluation metrics, tier results, errors) that currently only appears in the terminal. Users should see the same diagnostic information visible in the CLI without needing terminal access.
Acceptance Criteria:
- Expandable/collapsible “Debug Log” section at the bottom of the App tab
- Captures loguru output from
app.*modules during execution - Displays log entries with timestamp, level, and message (formatted, not raw)
- Log panel updates after execution completes (not required to be real-time streaming)
- Collapsed by default to keep UI clean
- Log entries color-coded by level: INFO (default), WARNING (yellow), ERROR (red)
- Tests: pytest unit tests for log capture sink (filters
app.*modules, clears buffer) - Tests: inline-snapshot for log panel HTML structure
-
make validatepasses - CHANGELOG.md updated
Technical Requirements:
- Add a custom loguru sink that captures log records to a list during execution
- Store captured logs in
st.session_state["debug_logs"] - In
render_app(): addst.expander("Debug Log", expanded=False)after result display - Inside expander, render each log entry using
st.text()orst.code()with level formatting - Sink should filter to
app.*module logs only (exclude Streamlit internals) - Clear log buffer at start of each new execution
Files:
src/gui/pages/run_app.pysrc/gui/utils/log_capture.py(new)tests/test_gui/test_log_capture.py(new)
Feature 8: Wire Evaluation Results and Agent Graph Tabs to Real Data¶
Description: The “Evaluation Results” and “Agent Graph” tabs have full rendering implementations but are called with None data from run_gui.py (lines 100, 103). After a query execution in the App tab, both pages should display actual results from the completed run instead of showing placeholder messages.
Acceptance Criteria:
- After App tab execution completes, navigating to “Evaluation Results” displays the actual
CompositeResult - After App tab execution completes, navigating to “Agent Graph” displays the actual interaction graph from
GraphTraceData - Both pages show informational message when no execution has been run yet (existing behavior preserved)
- Evaluation Results page displays: composite score, tier scores, metric comparison chart, recommendation
- Agent Graph page displays: interactive Pyvis network with agent and tool nodes
- Data persists across tab navigation within the same session
- Tests: Hypothesis property tests for session state data integrity across page switches
- Tests: inline-snapshot for evaluation page render with real CompositeResult
-
make validatepasses - CHANGELOG.md updated
Technical Requirements:
- Store
CompositeResultinst.session_state["evaluation_result"]after evaluation pipeline completes (from Feature 6) - Build
nx.DiGraphfromGraphTraceDataand store inst.session_state["agent_graph"] - In
run_gui.py: passst.session_state.get("evaluation_result")torender_evaluation()instead ofNone - In
run_gui.py: passst.session_state.get("agent_graph")torender_agent_graph()instead ofNone - Graph construction: reuse existing
GraphTraceData→nx.DiGraphconversion fromgraph_analysis.py
Files:
src/run_gui.pysrc/gui/pages/run_app.pysrc/gui/pages/evaluation.py(minor – verify data flow)src/gui/pages/agent_graph.py(minor – verify data flow)tests/test_gui/test_evaluation_page.py(update)tests/test_gui/test_agent_graph_page.py(update)
Feature 9: Editable Settings Page¶
Description: The Settings page (src/gui/pages/settings.py) currently displays JudgeSettings and AppEnv values as read-only text. Only the “Agent Configuration” section (provider selector, agent toggles) in the App tab sidebar is interactive. All displayed settings should be editable via the GUI and applied to the current session.
Acceptance Criteria:
-
JudgeSettingsfields editable:tiers_enabled,tier2_provider,tier2_model,tier2_fallback_provider,tier2_fallback_model, timeout values -
JudgeSettingscomposite thresholds editable:composite_accept_threshold,composite_weak_accept_threshold,composite_weak_reject_threshold - Observability settings editable:
logfire_enabled,phoenix_endpoint,trace_collection - Changed settings applied to the current session (stored in
st.session_state) - “Reset to Defaults” button restores original
JudgeSettings()defaults - Settings changes take effect on next App tab execution (no restart required)
- Input validation matches pydantic field constraints (e.g.,
gt=0, le=300for timeouts) - Tests: Hypothesis property tests for settings value bounds
- Tests: inline-snapshot for settings page widget structure
-
make validatepasses - CHANGELOG.md updated
Technical Requirements:
- Replace
st.text()/st.json()display-only rendering with appropriate input widgets: st.multiselect()fortiers_enabledst.selectbox()fortier2_provider(options from known providers list)st.text_input()for model namesst.number_input()for timeouts and thresholds (with min/max from Field constraints)st.toggle()for boolean fields (logfire_enabled,trace_collection)- Store modified settings in
st.session_state["judge_settings"]asJudgeSettingsinstance - In
run_app.py: usest.session_state.get("judge_settings", JudgeSettings())when creatingEvaluationPipeline - Add “Reset to Defaults” button that clears settings from session state
Files:
src/gui/pages/settings.pysrc/gui/pages/run_app.py(read settings from session state)tests/test_gui/test_settings_page.py(update)
Feature 10: Code Quality and Security Review¶
Description: Comprehensive code quality and security audit of the entire codebase using the reviewing-code and securing-mas Claude Code skills, supported by Context7 MCP for up-to-date library documentation and Exa MCP for security advisory lookups. The review applies the OWASP MAESTRO 7-layer security framework documented in docs/best-practices/mas-security.md and produces actionable findings with fix recommendations.
Acceptance Criteria:
- Code quality review completed using
reviewing-codeskill across allsrc/app/modules - Security review completed using
securing-masskill applying OWASP MAESTRO 7-layer framework - MAESTRO Layer 1 (Model): Prompt injection risks assessed in agent system prompts and tool outputs
- MAESTRO Layer 2 (Agent Logic): Input validation and type safety verified across agent factories, evaluation managers
- MAESTRO Layer 3 (Integration): External service failure handling reviewed (LLM providers, PeerRead API, OTLP export)
- MAESTRO Layer 4 (Monitoring): Log injection risks and sensitive data in traces checked
- MAESTRO Layer 5 (Execution): Resource exhaustion risks reviewed (token limits, timeouts, thread pools)
- MAESTRO Layer 6 (Environment): Secret management verified (API keys in
.env, no hardcoded credentials) - MAESTRO Layer 7 (Orchestration): Agent delegation and tool registration security reviewed
- Context7 MCP used to verify current best practices for PydanticAI, Logfire, and Streamlit security patterns
- Exa MCP used to check for known CVEs in project dependencies
- Review findings documented in
docs/reviews/sprint5-code-review.md - Critical and high findings fixed in code; medium/low findings documented as future work
-
make validatepasses - CHANGELOG.md updated
Technical Requirements:
- Use
reviewing-codeskill for quality review: code complexity, duplication, naming, error handling, docstring completeness - Use
securing-masskill for security review: apply each MAESTRO layer checklist against relevant modules - Use Context7 MCP (
mcp__plugin_context7_context7__query-docs) to verify security patterns for: /pydantic/pydantic-ai– agent security, tool validation/pydantic/logfire– trace data sensitivity/websites/streamlit_io– session state security, XSS prevention- Use Exa MCP (
mcp__exa__web_search_exa) to search for CVEs and security advisories for project dependencies - Claude Code agent orchestration: spawn parallel review agents per MAESTRO layer for efficiency
- Priority modules for review:
src/app/agents/agent_system.py– agent orchestration, prompt handlingsrc/app/llms/providers.py– API key managementsrc/app/llms/models.py– model creation, provider routingsrc/app/judge/llm_evaluation_managers.py– LLM judge callssrc/app/judge/evaluation_pipeline.py– pipeline orchestrationsrc/gui/pages/run_app.py– user input handlingsrc/app/data_utils/datasets_peerread.py– external data ingestion
Files:
docs/reviews/sprint5-code-review.md(new)src/app/(all modules subject to review – fixes applied in place)
Ralph Loop Note: This is a review task, not an implementation task. Ralph should invoke the reviewing-code and securing-mas skills to produce findings, fix critical/high issues in code, and write the review document. The primary deliverable is docs/reviews/sprint5-code-review.md.
Feature 11: Test Suite Audit and Behavioral Refactoring¶
Description: Systematic audit of all 56 test files against the testing strategy (docs/best-practices/testing-strategy.md). Tests that only verify implementation details (field existence, type checks, default values, import availability) are deleted or replaced with behavioral tests. Tests that verify actual behavior (business logic, error handling, integration contracts) are kept and improved. The goal is a leaner, higher-signal test suite where every test catches real bugs.
Acceptance Criteria:
- Every test file in
tests/audited against testing strategy criteria - Tests that only verify implementation details identified and removed (see anti-patterns below)
- Tests that verify actual behavior kept and improved where needed
- No reduction in behavioral coverage – only implementation-detail tests removed
- Remaining tests use appropriate tooling: pytest for logic, Hypothesis for properties, inline-snapshot for structure
- Audit findings documented in
docs/reviews/sprint5-test-audit.mdwith per-file decisions (keep/delete/refactor) -
make validatepasses after refactoring -
make test_allpasses with no regressions in behavioral coverage - CHANGELOG.md updated
Technical Requirements:
- Anti-patterns to remove (from
docs/best-practices/testing-strategy.md): - Import/existence tests:
test_module_exists(),test_class_importable() - Field existence tests:
test_model_has_field_x(),hasattr()checks - Default constant tests:
assert DEFAULT_VALUE == 300 - Type-only checks:
assert isinstance(result, dict)(pyright handles this) - Over-granular tests: 8 separate tests for one Pydantic model’s fields
- Library behavior tests: testing Pydantic validation,
os.environreads, framework internals -
Trivial assertions:
x is not None,callable(func) -
Behavioral tests to keep/improve:
- Business logic: composite scoring calculations, metric extraction, weight redistribution
- Error handling: provider fallback chains, timeout behavior, missing data graceful degradation
- Integration contracts: evaluation pipeline end-to-end, trace data flow, GUI session state
-
Edge cases with real impact: empty inputs, boundary values, concurrent access
-
Per-directory audit priorities:
tests/evals/(14 files) – HIGH: core evaluation logic, likely contains both good behavioral tests and implementation-detail teststests/judge/(10 files) – HIGH: judge pipeline, plugin systemtests/integration/(5 files) – MEDIUM: integration contracts, may have over-mocked teststests/test_gui/(5 files) – MEDIUM: GUI behavior, may test Streamlit internalstests/agents/(3 files) – MEDIUM: agent wiringtests/data_models/,tests/data_utils/,tests/common/,tests/utils/– LOW: data validation, likely candidates for cleanuptests/cc_otel/(2 files) – DELETE: Phoenix replaced Opik, cc_otel tests are obsolete-
test_migration_cleanup.py– DELETE: Sprint 4 migration complete, cleanup file no longer needed -
Decision rule: If a test wouldn’t catch a real bug introduced by a code change, remove it.
Files:
tests/(all 56 test files subject to audit)docs/reviews/sprint5-test-audit.md(new – per-file audit decisions)
Ralph Loop Note: This is an audit-and-refactor task. Ralph should read each test file, apply the decision rule against the anti-pattern list, delete or rewrite failing tests, and document per-file decisions in docs/reviews/sprint5-test-audit.md. Run make test_all after each batch of changes.
Feature 12: Fix OTLP Endpoint Double-Path Bug¶
Description: The Logfire instrumentation sets OTEL_EXPORTER_OTLP_ENDPOINT to http://localhost:6006/v1/traces (logfire_instrumentation.py:59). Per the OTEL spec, the SDK auto-appends signal-specific paths to this base endpoint, producing http://localhost:6006/v1/traces/v1/traces for spans and http://localhost:6006/v1/traces/v1/metrics for metrics – both return HTTP 405 from Phoenix. All trace export silently fails despite the agent instrumentation working correctly.
Acceptance Criteria:
- Traces from agent runs appear in the Phoenix UI at
http://localhost:6006 - No HTTP 405 errors in logs for
/v1/traces/v1/tracesor/v1/traces/v1/metricspaths -
OTEL_EXPORTER_OTLP_ENDPOINTset to base URL only (http://localhost:6006), not the signal-specific path - Existing
PHOENIX_ENDPOINTenv var andphoenix_endpointconfig field continue to work - Tests: pytest unit test for endpoint construction logic (base URL without signal path)
- Tests: inline-snapshot for the constructed OTLP endpoint value
-
make validatepasses - CHANGELOG.md updated
Technical Requirements:
- In
LogfireInstrumentation._configure_phoenix()(logfire_instrumentation.py:59): setOTEL_EXPORTER_OTLP_ENDPOINTtoself.config.phoenix_endpoint(base URL) instead of appending/v1/traces - Update connectivity check at line 65:
requests.head()should probe the base URL (e.g.,http://localhost:6006), not the double-path URL - Alternative: use
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT(signal-specific variable, not auto-appended by SDK) if base endpoint conflicts with other signal exporters - Remove any 405 suppression workarounds that are no longer needed after the fix
Files:
src/app/agents/logfire_instrumentation.pytests/agents/test_logfire_instrumentation.py(update)
Feature 13: Fix Tier 3 Graph Analysis Tool Accuracy and Dead Metric¶
Description: Two issues in graph_analysis.py affect Tier 3 scoring accuracy. First, add_node at line 171 overwrites success_rate each time a tool is called, so only the last call’s outcome survives – if a tool succeeds 9 times and fails once (last), success_rate=0.0. The same overwrite applies to add_edge at line 173 for agent-tool edge weights. Second, communication_overhead is computed and stored in Tier3Result but never included in overall_score (lines 392-397), making it a dead metric that inflates the model without contributing to scoring.
Acceptance Criteria:
- Tool
success_rateaccumulates across all calls (e.g., 9/10 successes = 0.9), not just the last call - Agent-tool edge
weightaccumulates or averages across repeated calls, not overwritten -
communication_overheadeither contributes tooverall_scoreor is removed fromTier3Result - If
communication_overheadis included in scoring, weights are rebalanced to sum to 1.0 - Existing multi-tool and single-tool scenarios produce correct
tool_selection_accuracy - Tests: Hypothesis property tests for tool accuracy with repeated calls (success_rate in [0.0, 1.0])
- Tests: inline-snapshot for Tier3Result with known tool call sequences
-
make validatepasses - CHANGELOG.md updated
Technical Requirements:
- In
analyze_tool_usage()(graph_analysis.py:165-173): track call count and success count per tool node, computesuccess_rateassuccesses / total_callsafter the loop - For edge weights: accumulate call count per agent-tool pair, average or sum weights
- For
communication_overhead: decide include-or-remove. If removed, delete fromTier3Resultmodel inevaluation_models.pyand any downstream references. If included, rebalanceself.weightsto sum to 1.0 across 5 metrics.
Files:
src/app/judge/graph_analysis.pysrc/app/data_models/evaluation_models.py(ifcommunication_overheadremoved)tests/judge/test_graph_analysis.py(update)
Feature 14: Guard Wandb Import and Disable Crash Telemetry¶
Description: login.py:9 has an unconditional from wandb import login as wandb_login at module level. If the optional wandb package is not installed, the entire login.py module fails to import, breaking the application. Additionally, wandb sends crash telemetry to Sentry by default with no opt-out. The weave import at line 44 is already guarded inside the function body – the wandb import should follow the same pattern.
Acceptance Criteria:
- Application starts successfully when
wandbis not installed (noImportError) - When
wandbis installed andWANDB_API_KEYis set, login and weave init work as before - When
wandbis not installed,login()skips wandb/weave initialization with a debug log -
WANDB_ERROR_REPORTINGdefaults to"false"(respects user override if already set) - Dead agentops commented code removed from
login.py: commented import at line 7 (# from agentops import init as agentops_init) and commented code block at lines 30-37 - Tests: pytest unit test for login with wandb unavailable (mock ImportError)
- Tests: inline-snapshot for log output when wandb is missing
-
make validatepasses - CHANGELOG.md updated
Technical Requirements:
- Replace unconditional
from wandb import login as wandb_login(line 9) with guarded import:
try:
from wandb import login as wandb_login
_wandb_available = True
except ImportError:
_wandb_available = False
- Add
os.environ.setdefault("WANDB_ERROR_REPORTING", "false")before the wandb import - In
login(): guardwandb_login()andweave_init()calls withif _wandb_available - Log
logger.debug("wandb not installed, skipping wandb/weave initialization")when unavailable
Files:
src/app/utils/login.pytests/utils/test_login.py(new or update)
Feature 15: Debug Logging for Empty API Keys¶
Description: When get_api_key() returns False for a provider whose key exists in .env but resolves to empty string at runtime, there is no diagnostic log. This makes transient .env loading issues (CWD mismatch, env var unset between runs) hard to diagnose. Add a debug log when a key is expected (provider registered with env_key) but the value is empty.
Acceptance Criteria:
-
get_api_key()logs a debug message when a registered provider’s key resolves to empty string - Debug message includes the
env_keyname (e.g.,GITHUB_API_KEY) for diagnosis - No log emitted for providers without API keys (e.g., Ollama)
- No log emitted when key is correctly loaded
- Tests: pytest unit test for empty-key debug log scenario
-
make validatepasses - CHANGELOG.md updated
Technical Requirements:
- In
get_api_key()(providers.py:42): addlogger.debug(f"API key '{provider_metadata.env_key}' is empty for provider '{provider}'")in the else branch - No new settings fields, no behavioral change – debug logging only
Files:
src/app/llms/providers.pytests/llms/test_providers.py(update)
Feature 16: Move PeerRead Tools from Manager to Researcher Agent¶
Description: The manager agent receives both delegation tools (researcher(), analyst(), synthesiser()) and PeerRead tools (get_peerread_paper, generate_paper_review_content_from_template, save_structured_review) via add_peerread_tools_to_manager() at agent_system.py:411. Sub-agents get minimal tools: researcher has only duckduckgo_search_tool(), analyst and synthesiser have none. Models take the path of least resistance – the manager uses PeerRead tools directly instead of delegating, resulting in zero multi-agent coordination. Moving PeerRead tools to the researcher enforces separation of concerns: manager coordinates, researcher executes.
Acceptance Criteria:
- PeerRead tools (
get_peerread_paper,generate_paper_review_content_from_template,save_structured_review) registered on the researcher agent, not the manager - Manager agent retains only delegation tools (
researcher(),analyst(),synthesiser()) - Researcher agent has PeerRead tools plus
duckduckgo_search_tool() - Manager delegates to researcher for PeerRead operations (verified via
GraphTraceDatashowing delegation events) - Tier 3 graph analysis produces non-zero
coordination_centralityandcommunication_overheadin multi-agent runs - Single-agent fallback still works if researcher is disabled via agent toggles
- Existing CLI and GUI behavior produces correct review output (no regression in review quality)
- Tests: pytest unit test for tool registration (researcher has PeerRead tools, manager does not)
- Tests: Hypothesis property tests for delegation invariant (manager never calls PeerRead tools directly)
-
make validatepasses - CHANGELOG.md updated
Technical Requirements:
- Rename
add_peerread_tools_to_manager()toadd_peerread_tools_to_researcher()inpeerread_tools.py - In
_create_manager()(agent_system.py:410-411): removeadd_peerread_tools_to_manager(manager)call - In researcher agent creation block (
agent_system.py:~370-380): add PeerRead tools alongsideduckduckgo_search_tool() - Update researcher system prompt to include PeerRead tool usage instructions (currently only on manager prompt)
- If researcher is
None(disabled), fall back to adding PeerRead tools to manager to preserve single-agent operation
Files:
src/app/agents/agent_system.pysrc/app/tools/peerread_tools.pytests/agents/test_agent_system.py(update)
Feature 17: Delete Duplicate AppEnv and Dead Code in load_settings.py¶
Description: src/app/utils/load_settings.py contains a duplicate AppEnv class (lines 22-49) that diverges from the canonical AppEnv in src/app/data_models/app_models.py (lines 219-249). The duplicate is missing ANTHROPIC_API_KEY, CEREBRAS_API_KEY, OPENAI_API_KEY, and uses LOGFIRE_TOKEN instead of LOGFIRE_API_KEY. It also eagerly instantiates chat_config = AppEnv() at module level (line 52), which runs on import. Only one consumer exists: datasets_peerread.py:23. The duplicate class and module-level instance should be deleted; load_config() can remain since it loads ChatConfig from JSON.
Acceptance Criteria:
- Duplicate
AppEnvclass removed fromload_settings.py - Module-level
chat_config = AppEnv()instance removed fromload_settings.py -
datasets_peerread.pyimport updated to use canonicalAppEnvfromapp.data_models.app_models -
load_config()function retained inload_settings.pyONLY if still used for JSON config loading; ifload_config()is unused (grep/search confirms no consumers), delete entireload_settings.pymodule - No import errors or runtime failures after removal
- Tests: pytest unit test verifying single
AppEnvsource of truth -
make validatepasses - CHANGELOG.md updated
Technical Requirements:
- Delete
class AppEnv(lines 22-49) andchat_config = AppEnv()(line 52) fromload_settings.py - In
datasets_peerread.py:23: changefrom app.utils.load_settings import chat_configtofrom app.data_models.app_models import AppEnvand instantiate where needed (or pass as parameter) - Verify no other files import from
load_settings.AppEnv(grep confirmed: onlydatasets_peerread.py)
Files:
src/app/utils/load_settings.pysrc/app/data_utils/datasets_peerread.pytests/data_utils/test_datasets_peerread.py(update if affected)
Non-Functional Requirements¶
- Maintainability:
- Use existing configuration patterns (pydantic-settings, env prefix overrides)
- No new dependencies required
- Performance:
- Provider selection adds negligible overhead (single API key check per evaluation)
- Weight redistribution is a simple arithmetic operation
- GUI background execution must not block Streamlit’s event loop
- Backward Compatibility:
- All fixes are additive; existing
JUDGE_TIER2_PROVIDER=openaiandconfig_chat.jsonvalues continue working - Default behavior changes only where current defaults produce errors (401 auth failures, silent paper skipping)
- CLI behavior unchanged by GUI features
- Testing (per
docs/best-practices/testing-strategy.md): - TDD mandatory: RED (failing test) → GREEN (minimal implementation) → REFACTOR for all implementation stories
- Use pytest as primary TDD tool for unit tests with Arrange-Act-Assert structure
- Use Hypothesis (
@given) for property-based tests: provider selection invariants, weight sum invariants, token limit bounds, paper validation with missing fields, settings value bounds - Use inline-snapshot (
snapshot()) for regression tests: log messages, model dumps, CLI help text, GUI component structures - Decision rule: Test behavior, not implementation. If a test wouldn’t catch a real bug, don’t write it.
Out of Scope¶
- Tier 1 reference comparison fix (all-1.0 self-comparison scores – requires ground-truth review integration, separate feature)
- Automatic PeerRead dataset download when splits are missing (existing error message with instructions is sufficient)
- Custom composite weight configuration via CLI/GUI (equal-weight is the defined scoring model)
- Tier 2 cost tracking or budget enforcement (existing
tier2_cost_budget_usdis informational only) - Cerebras-specific prompt optimization for structured output validation retries
- Real-time log streaming in GUI (logs displayed after execution completes)
- Persisting modified settings to
.envorconfig_chat.json(session-only changes) - Adding new tests for untested modules (audit scope is refactoring existing tests only)
- Penetration testing or runtime exploit validation (static review only)
Notes for Ralph Loop¶
Story Priority Tiers for Ralph¶
Ralph should prioritize stories in the following order to maximize value delivery:
- P0 (Quick Wins - 1-line fixes): STORY-012 (OTLP endpoint), STORY-014 (wandb import guard), STORY-015 (API key debug log), STORY-017 (duplicate AppEnv)
- P1 (Core Bugs): STORY-001 (judge provider fallback), STORY-003 (single-agent score fairness), STORY-004 (PeerRead validation), STORY-013 (graph analysis accuracy)
- P2 (GUI Features): STORY-002 (token limits), STORY-006 (background execution), STORY-007 (debug log panel), STORY-008 (wire results tabs), STORY-009 (editable settings)
- P3 (Architecture): STORY-016 (PeerRead tool delegation)
- P4 (Meta-Tasks): STORY-010 (code review), STORY-011 (test audit)
- P5 (Documentation - Blocked): STORY-005 (depends on all other stories)
Story Breakdown - Sprint 5 (17 stories total):
- Feature 1 (Judge Provider Fallback) → STORY-001: Wire Tier 2 judge provider fallback and auto-inherit agent provider
- Feature 2 (Token Limits) → STORY-002: Configurable agent token limits via CLI, GUI, and env var
- Feature 3 (Score Fairness) → STORY-003: Single-agent composite score weight redistribution
- Feature 4 (PeerRead Validation) → STORY-004: PeerRead dataset validation resilience for optional fields
- Feature 5 (Documentation Updates) → STORY-005: Update documentation and diagrams for Sprint 5 (depends: STORY-001, STORY-002, STORY-003, STORY-004, STORY-006, STORY-007, STORY-008, STORY-009, STORY-010, STORY-011, STORY-012, STORY-013, STORY-014, STORY-015, STORY-016, STORY-017)
- Feature 6 (Background Execution) → STORY-006: Background query execution with tab navigation resilience
- Feature 7 (Debug Log Panel) → STORY-007: Debug log panel in App tab (depends: STORY-006)
- Feature 8 (Wire Results Tabs) → STORY-008: Wire Evaluation Results and Agent Graph tabs to real data (depends: STORY-006)
- Feature 9 (Editable Settings) → STORY-009: Editable settings page with session-scoped persistence (depends: STORY-001)
- Feature 10 (Code Review) → STORY-010: Code quality and OWASP MAESTRO security review
- Feature 11 (Test Audit) → STORY-011: Test suite audit and behavioral refactoring
- Feature 12 (OTLP Endpoint Fix) → STORY-012: Fix OTLP endpoint double-path bug in Logfire instrumentation
- Feature 13 (Graph Analysis Fixes) → STORY-013: Fix Tier 3 tool accuracy overwrite and dead communication_overhead metric
- Feature 14 (Wandb Import Guard) → STORY-014: Guard wandb import and disable crash telemetry default
- Feature 15 (API Key Debug Log) → STORY-015: Debug logging for empty API keys in provider resolution
- Feature 16 (PeerRead Tool Delegation) → STORY-016: Move PeerRead tools from manager to researcher agent
- Feature 17 (Delete Duplicate AppEnv) → STORY-017: Delete duplicate AppEnv class and dead code in load_settings.py