
Product Requirements Document - Agents-eval Sprint 10

Project Overview

Agents-eval evaluates multi-agent AI systems using the PeerRead dataset. The system generates scientific paper reviews via a 4-agent delegation pipeline (Manager -> Researcher -> Analyst -> Synthesizer) and evaluates them through three tiers: traditional metrics, LLM-as-Judge, and graph analysis.

Sprint 10 goal: E2E parity between CLI and GUI for all execution modes. The CC engine (solo and teams) is currently broken in both interfaces: the CLI drops the engine argument before main(), so MAS runs anyway after CC, and the GUI's app.main() ignores the engine parameter and always runs MAS. Graphs must build and visualize for all modes, including CC. Provider coverage is expanded with 7 new inference providers, judge settings UX is improved, deprecated PydanticAI APIs are migrated, and inspect.getsource test anti-patterns are replaced.

Current State

| Mode | CLI | GUI | Gap |
|---|---|---|---|
| Free-text query (MAS) | Works | Works | None |
| Paper review (MAS) | Works (--paper-id) | Works (dropdown) | None |
| CC solo | Broken — args.pop("engine") removes it before main(), MAS always runs after CC | Radio exists, but app.main() ignores engine — runs MAS | Broken (both) |
| CC teams | Broken — same CLI bug as solo | No toggle exists | Broken + Missing |
| Graph visualization | N/A (CLI) | Works for MAS; CC produces no graph data | Partial |
| CC evaluation | Pipeline is engine-agnostic (plain strings), but CC review text discarded (_RESULT_KEYS omits "result") | Not wired | No path |
| Reference reviews | reference_reviews=None for ALL modes (MAS included) — Tier 1 scores against empty | Same | Bug (all modes) |

Development Methodology

All implementation stories MUST follow these practices. Ralph Loop enforces this order.

TDD Workflow (Mandatory for all features)

  1. RED: Write failing tests first using testing-python skill. Tests define expected behavior before any implementation code exists.
  2. GREEN: Implement minimal code to pass tests using implementing-python skill. No extra functionality.
  3. REFACTOR: Clean up while keeping tests green. Run make validate before marking complete.

Test Tool Selection

| Tool | Use for | NOT for |
|---|---|---|
| pytest | Core logic, unit tests, known edge cases (primary TDD tool) | Random inputs |
| Hypothesis | Property invariants, bounds, all-input guarantees | Snapshots, known cases |
| inline-snapshot | Regression, model dumps, complex structures | TDD red-green, ranges |

Decision rule: If the test wouldn’t catch a real bug, don’t write it. Test behavior, not implementation.

Mandatory Practices

  • Mock external dependencies (HTTP, LLM providers, file systems, subprocess) using @patch. Never call real APIs in unit tests (see the sketch after this list).
  • Test behavior, not implementation – test observable outcomes (return values, side effects, error messages), not internal structure.
  • Google-style docstrings for every new file, function, class, and method.
  • # Reason: comments for non-obvious logic.
  • make validate MUST pass before any story is marked complete. No exceptions.
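
A minimal sketch of this mocking practice; fetch_review and its module path are hypothetical stand-ins for any function wrapping an external HTTP call:

```python
from unittest.mock import patch

from app.data_utils.review_client import fetch_review  # hypothetical import


@patch("app.data_utils.review_client.httpx.get")
def test_fetch_review_parses_payload(mock_get):
    # Mock the external dependency; no real API is called.
    mock_get.return_value.json.return_value = {"paper_id": "42", "comments": "Solid work."}

    review = fetch_review("42")

    # Assert observable behavior (the return value), not internal structure.
    assert review["comments"] == "Solid work."
    mock_get.assert_called_once()
```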

Skills Usage

| Story type | Skills to invoke |
|---|---|
| Implementation (all features) | testing-python (RED) -> implementing-python (GREEN) |
| Codebase research | researching-codebase (before non-trivial implementation) |
| Design phase | researching-codebase -> designing-backend |

Functional Requirements

Feature 1: Connect All Execution Modes to the Same Three-Tier Evaluation Pipeline

Description: All execution modes (MAS, CC solo, CC teams) must produce comparable evaluation results through the same evaluate_comprehensive() call. The evaluation pipeline interface is already engine-agnostic — all three tiers operate on plain strings and dicts, not MAS types:

evaluate_comprehensive(
    paper: str,                    # Tier 1 + Tier 2
    review: str,                   # Tier 1 + Tier 2
    execution_trace: GraphTraceData | dict | None,  # Tier 2 (dict) + Tier 3 (GraphTraceData)
    reference_reviews: list[str] | None,            # Tier 1
) -> CompositeResult

No GeneratedReview wrapping needed — the pipeline already accepts plain strings. The work is building an adapter layer that translates each mode's output into these 4 parameters (a sketch follows the bug table below), plus fixing 4 bugs discovered during analysis:

| Bug | Location | Impact |
|---|---|---|
| CLI engine pop | run_cli.py:107 — args.pop("engine") removes engine before main() | MAS always runs after CC |
| main() ignores engine | app.py:228 — logs value, unconditionally calls _run_agent_execution() | CC engine has no effect |
| CC review text discarded | cc_engine.py:85 — _RESULT_KEYS omits "result" | CC response text lost |
| Reference reviews never loaded | evaluation_runner.py:154 — reference_reviews=None for ALL modes | Tier 1 scores empty strings |
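
A minimal sketch of that adapter, assuming the helper names introduced in the Technical Requirements below (extract_cc_review_text, cc_result_to_graph_trace); load_reference_reviews is a hypothetical stand-in for the PeerRead lookup described there:

```python
# Sketch only: not the final implementation. extract_cc_review_text() and
# cc_result_to_graph_trace() are the helpers specified in the Technical
# Requirements; load_reference_reviews() is a hypothetical PeerRead lookup.
def evaluate_cc_run(cc_result, paper_text: str, paper_id: str | None):
    review_text = extract_cc_review_text(cc_result)   # from CCResult.output_data["result"]
    trace = cc_result_to_graph_trace(cc_result)       # GraphTraceData (minimal for CC solo)
    references = load_reference_reviews(paper_id) if paper_id else None

    return evaluate_comprehensive(
        paper=paper_text,
        review=review_text,
        execution_trace=trace,
        reference_reviews=references,
    )
```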

Acceptance Criteria:

  • AC1: evaluate_comprehensive() is the sole evaluation entry point for MAS, CC solo, and CC teams — no mode-specific evaluation logic exists outside it
  • AC2: CC solo and CC teams produce non-empty review text passed to the pipeline (extracted from CCResult.output_data["result"])
  • AC3: All modes load reference_reviews from PeerRead when paper_id is set — Tier 1 scores against actual ground truth, not empty strings
  • AC4: CC solo produces a GraphTraceData (minimal or from CCTraceAdapter); composite scorer detects single_agent_mode=True and redistributes coordination_quality weight
  • AC5: CC teams produces a GraphTraceData with agent_interactions mapped from team_artifacts Task events
  • AC6: run_cc_teams uses process group kill (os.killpg) after timeout — not just proc.kill() — to clean up teammate child processes
  • AC7: CompositeResult.engine_type is set to "mas", "cc_solo", or "cc_teams" for all results
  • AC8: CLI --engine=cc does NOT run the MAS pipeline — _run_agent_execution() is not called
  • AC9: GUI “Claude Code” radio invokes CC engine, not MAS; a “CC Teams” checkbox appears when CC is selected
  • AC10: For the same paper_id, MAS and CC Tier 1 scores use identical reference_reviews (same ground truth)
  • AC11: All existing MAS tests continue to pass; new tests cover the CC path (solo and teams)
  • AC12: make validate passes with no regressions

Technical Requirements:

  • Capture CC review text: Add "result" to _RESULT_KEYS in cc_engine.py:85 so cc_result.output_data["result"] contains the review text. Add extract_cc_review_text(cc_result) -> str helper
  • Build GraphTraceData from CC artifacts: Add cc_result_to_graph_trace(cc_result) -> GraphTraceData that maps team_artifacts Task/TeamCreate events to agent_interactions, tool_calls, and coordination_events. CC solo: minimal GraphTraceData(execution_id=cc_result.execution_id) with empty lists — CompositeScorer._detect_single_agent_mode() already redistributes coordination_quality weight. CC teams: Task.owner -> delegation interactions, completed tasks -> tool_calls, TeamCreate -> coordination_events
  • Load reference reviews for all modes: In evaluation_runner.py, before evaluate_comprehensive(), load from PeerRead: paper.reviews[*].comments when paper_id is set. This fixes the existing bug for ALL modes (MAS included)
  • Add engine_type to CompositeResult: engine_type: str = Field(default="mas") — enables downstream consumers to know the source engine. Backward-compatible default
  • Wire main() to branch on engine: Add cc_result: CCResult | None = None param. When engine == "cc": skip _run_agent_execution() entirely, extract review text via extract_cc_review_text(), build GraphTraceData via cc_result_to_graph_trace(), load paper content + reference reviews from PeerRead, call evaluate_comprehensive() with same 4 parameters as MAS, build nx.DiGraph via build_interaction_graph()
  • Fix CLI wiring: Pass engine and cc_result explicitly to main(): run(main(**args, engine=engine, cc_result=cc_result)). Remove pattern where CC runs first then MAS runs anyway
  • Fix GUI wiring: In _execute_query_background(), add CC branch that calls run_cc_solo() / run_cc_teams() before calling main() with cc_result. Add CC teams checkbox visible when engine is CC
  • Fix run_cc_teams timeout: Use start_new_session=True + os.killpg(os.getpgid(proc.pid), signal.SIGTERM) then proc.kill() to clean up teammate child processes (see the sketch after this list)
  • Mock subprocess.run and subprocess.Popen in tests — never call real claude CLI
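
A minimal sketch of the process-group timeout handling for run_cc_teams, using only the standard library; the command and how it is built are assumed, and integration into the engine is left to the story:

```python
import os
import signal
import subprocess


def run_with_group_timeout(cmd: list[str], timeout_seconds: float) -> int:
    """Run cmd in its own session; on timeout, kill the whole process group.

    Sketch only: launching with start_new_session=True puts teammate child
    processes in the same process group, so os.killpg() reaches them too.
    """
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)  # signal teammates as well
        try:
            return proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()  # last resort for the direct child
            return proc.wait()
```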

Comparability Matrix:

| Metric | Tier | MAS vs CC | Rationale |
|---|---|---|---|
| output_similarity | 1 | Comparable | Same review text vs same references |
| task_success | 1 | Comparable | Same threshold on same similarity scores |
| time_taken | 1 | Comparable | Wall-clock time for both |
| technical_accuracy | 2 | Comparable | LLM judges review text quality |
| constructiveness | 2 | Comparable | LLM judges review text quality |
| planning_rationality | 2 | Partial | CC has sparse trace -> less signal |
| coordination_centrality | 3 | Not comparable | MAS: rich delegation graph; CC solo: empty; CC teams: flat |
| tool_selection_accuracy | 3 | Partial | CC teams: task completions as proxy |
| path_convergence | 3 | Not comparable | Structurally different graphs |
| task_distribution_balance | 3 | Partial | CC teams has distribution data |

Tier 1 + Tier 2 (review quality) are directly comparable. Tier 3 (graph/coordination) is structurally different but still computed — the composite scorer handles this via single_agent_mode weight redistribution for CC solo.

Files:

  • src/app/engines/cc_engine.py (edit – add "result" to _RESULT_KEYS, add extract_cc_review_text(), add cc_result_to_graph_trace(), fix run_cc_teams process group kill)
  • src/app/data_models/evaluation_models.py (edit – add engine_type field to CompositeResult)
  • src/app/judge/evaluation_runner.py (edit – load reference reviews from PeerRead for all modes, accept cc_result/engine_type params, CC adapter branch)
  • src/app/app.py (edit – add cc_result param to main(), CC engine branch that skips _run_agent_execution())
  • src/run_cli.py (edit – pass engine=engine, cc_result=cc_result to main(), remove MAS-after-CC pattern)
  • src/gui/pages/run_app.py (edit – CC branch in _execute_query_background(), add CC teams checkbox)
  • tests/engines/test_cc_engine.py (edit – tests for extract_cc_review_text, cc_result_to_graph_trace, "result" in _RESULT_KEYS)
  • tests/cli/test_cc_engine_wiring.py (edit – test CLI passes engine+cc_result to main, CC does not invoke MAS)
  • tests/judge/test_evaluation_runner.py (edit – test reference reviews loaded for all modes, CC result adapter path)

Feature 2: Graph Visualization Polish for All Execution Modes

Description: Feature 1 builds GraphTraceData and nx.DiGraph for CC runs. This feature handles the visualization layer: the Agent Graph page must distinguish between no-execution-yet, empty graph (CC solo), and populated graph (MAS or CC teams). CC Tier 3 graph metrics need “informational” labeling since they aren’t comparable to MAS scores. CCResult.team_artifacts already retains parsed events from the JSONL stream (per cc_engine.py:111-112).

Acceptance Criteria:

  • AC1: CC solo produces an nx.DiGraph (may be minimal — single node) displayed on Agent Graph page
  • AC2: CC teams produces an nx.DiGraph showing team member nodes and delegation edges
  • AC3: Empty graphs (0 nodes, 0 edges) display a descriptive warning (e.g., “CC solo mode — no agent interactions to display”) instead of generic “No agent interaction data available”
  • AC4: MAS graph visualization continues to work unchanged
  • AC5: Tier 3 graph metrics from CC runs are labeled “informational — not comparable to MAS scores” in evaluation display
  • AC6: make validate passes with no regressions

Technical Requirements:

  • In agent_graph.py: distinguish between graph is None (no execution yet), an empty graph (execution produced no interactions — show a mode-specific message using CompositeResult.engine_type), and a populated graph (see the sketch after this list)
  • For Tier 3 metrics on CC runs: when engine_type starts with "cc", prefix metric labels with “Informational” in evaluation display
  • Graph building itself is handled by Feature 1 (cc_result_to_graph_trace() + build_interaction_graph())
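
A minimal sketch of the three-way branch, assuming Streamlit rendering and the engine_type field added in Feature 1; _render_graph() and the non-CC message text are illustrative placeholders:

```python
import networkx as nx
import streamlit as st


def render_agent_graph(graph: nx.DiGraph | None, engine_type: str = "mas") -> None:
    """Sketch: distinguish no-execution-yet, empty graph, and populated graph."""
    if graph is None:
        st.info("Run an execution to see the agent interaction graph.")  # no execution yet
        return
    if graph.number_of_nodes() == 0 and graph.number_of_edges() == 0:
        if engine_type == "cc_solo":
            st.warning("CC solo mode — no agent interactions to display")
        else:
            st.warning("Execution produced no agent interactions.")
        return
    _render_graph(graph)  # existing drawing path, unchanged for MAS
```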

Files:

  • src/gui/pages/agent_graph.py (edit – differentiate empty vs missing graph messages, CC graph labeling)
  • src/gui/pages/evaluation_results.py (edit – CC-specific Tier 3 labeling when engine_type is CC)
  • tests/test_gui/test_agent_graph.py (new – graph rendering for MAS, CC solo, CC teams)

Feature 3: Expand Inference Provider Registry and Update Stale Models

Description: The current PROVIDER_REGISTRY has 12 providers but is missing many popular OpenAI-compatible inference providers. Key omissions: Groq, Fireworks AI, DeepSeek, Mistral, SambaNova, Nebius, Cohere. The anthropic provider entry falls through to the generic OpenAIChatModel handler in create_llm_model() instead of using PydanticAI’s native Anthropic support. Several existing config_chat.json entries have stale/deprecated model IDs – two are live bugs: huggingface uses facebook/bart-large-mnli (a classification model, not chat – will fail immediately) and together uses Llama-3.3-70B-Instruct-Turbo-Free (removed Jul 2025 – will fail silently). Multiple max_content_length values are wrong (e.g., cerebras says 8192 but gpt-oss-120b has 128K context; grok says 15000 but should be 131K). Values must reflect the maximum token usage allowed on each provider’s free tier before requests get blocked. See Inference-Providers.md for the full provider analysis.

Acceptance Criteria:

  • AC1: PROVIDER_REGISTRY includes the following new providers: groq, fireworks, deepseek, mistral, sambanova, nebius, cohere
  • AC2: Each new provider has correct env_key, base_url, and model_name_prefix in PROVIDER_REGISTRY
  • AC3: Each new provider has a matching entry in config_chat.json with best free-tier model and correct max_content_length
  • AC4: Live bug fixed: huggingface model updated from facebook/bart-large-mnli (classification, not chat) to meta-llama/Meta-Llama-3.3-70B-Instruct
  • AC5: Live bug fixed: together model updated from removed Llama-3.3-70B-Instruct-Turbo-Free to meta-llama/Llama-3.3-70B-Instruct-Turbo
  • AC6: Existing stale config_chat.json entries updated to current models: gemini-2.0-flash, gpt-4.1-mini (openai + github), grok-3-mini, claude-sonnet-4-20250514, qwen/qwen3-next-80b-a3b-instruct:free (openrouter), llama3.3:latest (ollama)
  • AC7: max_content_length in config_chat.json reflects the maximum token usage allowed on each provider’s free tier before requests get rate-limited or blocked (per Inference-Providers.md “Key Limit” column)
  • AC8: create_llm_model() handles anthropic provider using PydanticAI’s native AnthropicModel instead of the generic OpenAI-compatible fallback
  • AC9: create_llm_model() handles groq with OpenAIModelProfile(openai_supports_strict_tool_definition=False) (same as existing cerebras handling)
  • AC10: GUI Settings page provider dropdown automatically includes all new providers (already dynamic from PROVIDER_REGISTRY.keys())
  • AC11: CLI --chat-provider accepts all new provider names and validates against PROVIDER_REGISTRY at argument parsing time
  • AC12: make validate passes with no regressions

Technical Requirements:

  • Add entries to PROVIDER_REGISTRY in src/app/data_models/app_models.py with correct base URLs (see the sketch after this list):
      • groq: https://api.groq.com/openai/v1, env: GROQ_API_KEY
      • fireworks: https://api.fireworks.ai/inference/v1, env: FIREWORKS_API_KEY
      • deepseek: https://api.deepseek.com/v1, env: DEEPSEEK_API_KEY
      • mistral: https://api.mistral.ai/v1, env: MISTRAL_API_KEY
      • sambanova: https://api.sambanova.ai/v1, env: SAMBANOVA_API_KEY
      • nebius: https://api.studio.nebius.ai/v1, env: NEBIUS_API_KEY
      • cohere: https://api.cohere.com/v2, env: COHERE_API_KEY
  • Add matching entries to config_chat.json with models from Inference-Providers.md
  • Update stale existing config_chat.json model IDs and max_content_length values (see AC4-AC7 and analysis doc)
  • Add anthropic branch in create_llm_model() using from pydantic_ai.models.anthropic import AnthropicModel
  • Add groq branch in create_llm_model() with openai_supports_strict_tool_definition=False
  • Providers that need openai_supports_strict_tool_definition=False: groq, cerebras (already handled), fireworks, together, sambanova
  • Add choices=list(PROVIDER_REGISTRY.keys()) to CLI --chat-provider argparse definition for early validation
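
A minimal sketch of the new registry entries, assuming PROVIDER_REGISTRY maps provider names to the env_key / base_url / model_name_prefix fields named in AC2; the concrete entry type is whatever app_models.py already uses:

```python
# Sketch only: shape follows AC2 (env_key, base_url, model_name_prefix);
# adapt to the actual PROVIDER_REGISTRY entry type in app_models.py.
NEW_PROVIDER_ENTRIES = {
    "groq": {
        "env_key": "GROQ_API_KEY",
        "base_url": "https://api.groq.com/openai/v1",
        "model_name_prefix": "",
    },
    "deepseek": {
        "env_key": "DEEPSEEK_API_KEY",
        "base_url": "https://api.deepseek.com/v1",
        "model_name_prefix": "",
    },
    # fireworks, mistral, sambanova, nebius, and cohere follow the same shape,
    # using the base URLs and env keys listed above.
}
```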

Files:

  • src/app/data_models/app_models.py (edit – add 7 providers to PROVIDER_REGISTRY)
  • src/app/llms/models.py (edit – add anthropic, groq branches in create_llm_model(), update strict-tool handling)
  • src/app/config/config_chat.json (edit – add 7 provider config entries, fix 2 live bugs, update 7 stale models)
  • src/run_cli.py (edit – add choices= to --chat-provider argument)
  • tests/llms/test_models.py (edit – test new provider branches)

Feature 4: Judge Auto Mode – Conditional Settings Display

Description: When tier2_provider is set to "auto" in the GUI Settings page, the downstream Tier 2 LLM Judge controls (model, fallback provider, fallback model, fallback strategy, timeout) are still displayed. Since “auto” delegates provider selection to the runtime, these manual overrides are confusing and logically redundant. They should be hidden when “auto” is selected.

Acceptance Criteria:

  • AC1: When tier2_provider is "auto", the following controls are hidden: primary model selectbox, fallback provider, fallback model, fallback strategy
  • AC2: When tier2_provider is changed from "auto" to a specific provider, the hidden controls reappear immediately
  • AC3: Timeout and cost budget controls remain visible regardless of provider selection (they apply to all modes)
  • AC4: Session state values for hidden controls retain their defaults (not cleared when hidden)
  • AC5: make validate passes with no regressions

Technical Requirements:

  • In _render_tier2_llm_judge() in settings.py, wrap the model/fallback controls in an if selected_provider != "auto": conditional (see the sketch after this list)
  • Keep tier2_timeout_seconds and tier2_cost_budget_usd outside the conditional – they apply regardless
  • Ensure _build_judge_settings_from_session() in run_app.py still constructs a valid JudgeSettings when auto is selected (fields use defaults from the model)
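
A minimal sketch of the conditional inside _render_tier2_llm_judge(), assuming Streamlit widgets; labels, options, and defaults are illustrative, not the page's actual identifiers:

```python
import streamlit as st

provider_options = ["auto", "openai", "anthropic"]  # illustrative
model_options = ["gpt-4.1-mini"]                    # illustrative

selected_provider = st.selectbox("Tier 2 provider", options=provider_options)

if selected_provider != "auto":
    # Manual overrides only make sense when a specific provider is chosen.
    st.selectbox("Tier 2 model", options=model_options)
    st.selectbox("Fallback provider", options=provider_options)
    st.selectbox("Fallback model", options=model_options)
    st.selectbox("Fallback strategy", options=["sequential", "fail-fast"])  # illustrative values

# Timeout and cost budget stay outside the conditional; they apply to "auto" too.
st.number_input("Tier 2 timeout (seconds)", value=30)
st.number_input("Tier 2 cost budget (USD)", value=1.0)
```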

Files:

  • src/gui/pages/settings.py (edit – conditional display in _render_tier2_llm_judge())
  • tests/test_gui/test_settings_judge_auto.py (new – verify controls hidden/shown based on provider selection)

Feature 5: PydanticAI API Migration – manager.run(), RunContext, and Private Attribute Access

Description: agent_system.py:543-551 uses the deprecated manager.run() PydanticAI API with 3 FIXME markers and broad type: ignore directives (reportDeprecated, reportUnknownArgumentType, reportCallOverload, call-overload). The result.usage() call also requires type: ignore. Additionally, RunContext may be deprecated in the installed PydanticAI version (Review F6), and _model_name private attribute access at agent_system.py:537 should use the public model_name API (Review F23). Migrate all three patterns in one pass.

Acceptance Criteria:

  • AC1: manager.run() replaced with current PydanticAI API (non-deprecated call)
  • AC2: All type: ignore comments on lines 548 and 551 removed – pyright passes cleanly
  • AC3: All 3 FIXME comments (lines 543-544, 550) removed
  • AC4: Agent execution produces identical results (same execution_id, same result.output)
  • AC5: RunContext verified against installed PydanticAI version; updated to current name (e.g., AgentRunContext) if deprecated (Review F6)
  • AC6: _model_name private attribute access replaced with public model_name API (Review F23)
  • AC7: make validate passes with no new type errors or test failures

Technical Requirements:

  • Research current PydanticAI Agent.run() signature and migrate mgr_cfg dict unpacking accordingly
  • Verify result.usage() return type is properly typed after migration
  • Verify RunContext deprecation status: python -c "from pydantic_ai import RunContext; print(RunContext)". If deprecated, update all tool function signatures in agent_system.py and peerread_tools.py
  • Replace getattr(manager, "model")._model_name with getattr(manager, "model").model_name (public attribute), with fallback to "unknown" (see the sketch after this list)
  • Preserve trace_collector start/end calls and error handling structure
  • Mock PydanticAI agent in tests – never call real LLM providers
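
A minimal sketch of the public-attribute replacement described above, with the "unknown" fallback:

```python
# Sketch only: prefer the public model_name attribute over _model_name and
# fall back to "unknown" if the model or attribute is missing.
model = getattr(manager, "model", None)
model_name = getattr(model, "model_name", None) or "unknown"
```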

Files:

  • src/app/agents/agent_system.py (edit – lines 537-551, migrate manager.run(), fix _model_name, check RunContext)
  • src/app/tools/peerread_tools.py (edit – update RunContext import if deprecated)
  • tests/agents/test_agent_system.py (edit – update/add tests for migrated call)

Feature 6: Replace inspect.getsource Tests with Behavioral Tests

Description: Six test files use inspect.getsource(module) then assert string presence (e.g., 'engine != "cc"' in source). This pattern breaks on code reformatting, passes if the string appears anywhere in source, and couples tests to implementation rather than behavior. Identified as a top-3 anti-pattern by prevalence in the tests parallel review (H5, H6, M14, M15 – ~20 occurrences across 6 files).

Acceptance Criteria:

  • AC1: tests/utils/test_weave_optional.py – inspect.getsource replaced with behavioral test: import module with weave absent, verify op() is a callable no-op decorator (tests-review H5)
  • AC2: tests/gui/test_story012_a11y_fixes.py – all 11 inspect.getsource occurrences replaced with Streamlit mock-based assertions (tests-review H6)
  • AC3: tests/gui/test_story013_ux_fixes.py – source inspection replaced with behavioral widget assertions (tests-review H6)
  • AC4: tests/gui/test_story010_gui_report.py – 2 source inspections replaced with output assertions (tests-review H6)
  • AC5: tests/cli/test_cc_engine_wiring.py – 4 source inspections removed; behavioral tests already exist alongside (tests-review H6, M15)
  • AC6: tests/gui/test_prompts_integration.py – source file read + string assertion replaced with render function mock test (tests-review M14)
  • AC7: Zero occurrences of inspect.getsource remain in tests/ directory
  • AC8: make validate passes with no regressions

Technical Requirements:

  • Replace source-level string assertions with behavioral tests: call the function with relevant inputs and assert outputs (see the sketch after this list)
  • For UI tests, verify widgets called via Streamlit mocks instead of inspecting source
  • For CLI tests, remove redundant source inspections where behavioral parse_args tests already cover the logic
  • Run grep -r "inspect.getsource" tests/ to verify zero remaining occurrences
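
A minimal sketch of the behavioral replacement for the weave fallback test (AC1); the module path is hypothetical and should be adjusted to the project's actual layout:

```python
import importlib
import sys
from unittest.mock import patch


def test_op_is_noop_decorator_without_weave():
    """Assert behavior, not source text: the op() shim still decorates."""
    with patch.dict(sys.modules, {"weave": None}):  # simulate weave being absent
        module = importlib.import_module("app.utils.weave_optional")  # hypothetical path
        module = importlib.reload(module)  # re-run import-time fallback logic

        @module.op()
        def add(a: int, b: int) -> int:
            return a + b

        assert callable(module.op)
        assert add(2, 3) == 5  # wrapped function behaves unchanged
```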

Files:

  • tests/utils/test_weave_optional.py (edit)
  • tests/gui/test_story012_a11y_fixes.py (edit)
  • tests/gui/test_story013_ux_fixes.py (edit)
  • tests/gui/test_story010_gui_report.py (edit)
  • tests/cli/test_cc_engine_wiring.py (edit)
  • tests/gui/test_prompts_integration.py (edit)

Non-Functional Requirements

  • Report generation latency target: < 5s for rule-based suggestions, < 30s for LLM-assisted
  • No new external dependencies without PRD validation
  • Change comments: Every non-trivial code change must include a concise inline comment with sprint, story, and reason. Format: # S10-F{N}: {why}. Keep comments to one line. Omit for trivial changes (string edits, config values).

Out of Scope

Deferred from original Sprint 10 plan (not aligned with E2E parity goal):

  • GUI Sweep Page – full sweep GUI with progress indicators, multi-select papers, composition toggles. SweepRunner hardcodes MAS-first ordering, doesn’t support engine parameter, and sweep results shape differs from single-run session state format. Needs design work before implementation.
  • GUI Layout Refactor – sidebar tabs and page separation (cosmetic, not blocking E2E)
  • Data Layer Robustness – narrow exceptions + contradictory log (Review F9, F17)
  • Dispatch Chain Registry Refactor in datasets_peerread.py (Review F10)
  • create_llm_model() registry pattern refactor – the if/elif chain is fine for 19 providers
  • Provider health checks or connectivity validation
  • --judge-provider CLI validation – judge provider uses a separate settings model, not part of E2E parity
  • CC-specific Tier 3 graph metrics (delegation fan-out, task completion rate, teammate utilization) – MAS-specific metrics labeled “informational” for CC runs is sufficient for Sprint 10

Deferred test review findings (MEDIUM/LOW from tests-parallel-review-2026-02-21.md):

  • assert isinstance() replacements with behavioral assertions (H4, M1-M3) – ~30+ occurrences across 12 files
  • Subdirectory conftest.py creation for tests/agents/, tests/tools/, tests/evals/, tests/judge/ (M5, M6)
  • @pytest.mark.parametrize additions for provider tests and recommendation tests (M7, M8)
  • hasattr() replacements with behavioral tests (M4)
  • Weak assertion strengthening in test_suggestion_engine.py and test_report_generator.py (M18, L5)
  • Hardcoded relative path fix in test_peerread_tools_error_handling.py (H8)
  • tempfile -> tmp_path in integration tests (L7, L8)
  • @pytest.mark.slow markers on performance baselines (L10)

Deferred to future sprint (TBD acceptance criteria, low urgency):

  • Centralized Tool Registry with Module Allowlist (MAESTRO L7.2) – architectural, needs design
  • Plugin Tier Validation at Registration (MAESTRO L7.1) – architectural, needs design
  • Error Message Sanitization (MAESTRO) – TBD acceptance criteria
  • Configuration Path Traversal Protection (MAESTRO) – TBD acceptance criteria
  • GraphTraceData Construction Simplification (model_validate()) – TBD acceptance criteria
  • Timeout Bounds Enforcement – low urgency
  • Hardcoded Settings Audit – continuation of Sprint 7
  • Time Tracking Consistency Across Tiers – low urgency
  • BDD Scenario Tests for Evaluation Pipeline – useful but not blocking
  • Cerebras Structured Output Validation Retries – provider-specific edge case
  • PlantUML Diagram Audit – cosmetic, no user impact

Notes for Ralph Loop

Priority Order

  • P1 (E2E parity): STORY-010 (CC eval pipeline parity — biggest story, most files), STORY-011 (graph viz polish)
  • P2 (infrastructure): STORY-012 (providers), STORY-013 (judge auto UX)
  • P3 (code health): STORY-014 (PydanticAI migration), STORY-015 (source inspection tests)

Running Ralph in CC Agent Teams Mode

Ralph supports CC Agent Teams for inter-story parallelism. Use this when multiple stories can run concurrently (different files, no conflicts).

# Teams mode: lead coordinates, 2 teammates implement in parallel
make ralph_run TEAMS=true MAX_ITERATIONS=12 MODEL=opus

# Worktree + teams (isolated branch):
make ralph_run_worktree BRANCH=ralph/sprint10-e2e-parity TEAMS=true MAX_ITERATIONS=12

How teams mode works in Ralph (see CC-agent-teams-orchestration.md):

  • Sets CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 automatically when TEAMS=true
  • Lead picks primary story, delegates wave peers to teammates via the shared task list
  • Teammates implement in parallel; each runs make quick_validate on their story’s files
  • Lead runs make validate at wave boundaries before advancing to next wave
  • Scoped lint/tests in teams mode: only story-specific files are checked per teammate
  • Wave boundary detection: when primary story passes, Ralph checks if next story is in a new wave

Key limitations (from orchestration analysis):

  • No session resumption — if Ralph times out, restart from scratch
  • Task status can lag — teammates sometimes don’t mark tasks complete, blocking dependents
  • Linear token cost — each teammate is a separate Claude instance (~3x cost for 2 teammates)
  • Cross-story interference possible in teams mode — scoped validation mitigates but doesn’t eliminate

Recommendation for Sprint 10: Use TEAMS=true for Wave 1 (STORY-010 + STORY-012 are independent). Run Wave 2 sequentially — both stories depend on Wave 1 and share indirect file conflicts.

Notes for CC Agent Teams

  • Team Structure: Lead + 2 teammates max
  • Delegate Mode: Recommended – lead coordinates, teammates implement

File-Conflict Dependencies

Stories sharing files need blockedBy deps beyond logical depends_on.

| Story | Logical Dep + File-Conflict Dep | Shared File | Reason |
|---|---|---|---|
| STORY-011 | STORY-010 | cc_engine.py, evaluation_runner.py | Graph viz polish uses cc_result_to_graph_trace() added by F1 |
| STORY-014 | STORY-010 + STORY-012 | models.py (via agent_system.py provider usage) | PydanticAI migration touches agent_system.py, which imports from models.py; provider changes in STORY-012 must land first |

Orchestration Waves

Wave 1 (independent, no file conflicts):
  teammate-1: STORY-012 (F3 providers) then STORY-013 (F4 judge auto)
  teammate-2: STORY-010 (F1 CC eval pipeline parity) — largest story, give full wave

Wave 2 (after Wave 1 completes):
  teammate-1: STORY-011 (F2 graph viz polish, depends: STORY-010) then STORY-015 (F6 source inspection)
  teammate-2: STORY-014 (F5 PydanticAI migration, depends: STORY-010, STORY-012)

  • Quality Gates: Teammate runs make quick_validate; lead runs make validate after each wave
  • Teammate Prompt Template: Sprint 8 pattern with TDD [RED]/[GREEN] commit markers

Story Breakdown - Phase 1 (6 stories total):

  • Feature 1 → STORY-010: Connect all execution modes to the same three-tier evaluation pipeline. Fix 4 bugs (CLI engine pop, main() ignores engine, CC review text discarded, reference reviews never loaded). Add adapter layer: extract_cc_review_text(), cc_result_to_graph_trace() in cc_engine.py. Add engine_type to CompositeResult. Wire main() CC branch (skip MAS, pass cc_result). Fix CLI to pass engine+cc_result to main(). Fix GUI to run CC before main(). Load reference reviews from PeerRead for all modes. Files: src/app/engines/cc_engine.py, src/app/data_models/evaluation_models.py, src/app/judge/evaluation_runner.py, src/app/app.py, src/run_cli.py, src/gui/pages/run_app.py, tests/engines/test_cc_engine.py, tests/cli/test_cc_engine_wiring.py, tests/judge/test_evaluation_runner.py.

  • Feature 2 → STORY-011: Graph visualization polish for all execution modes (depends: STORY-010). Handle empty vs missing graphs on Agent Graph page. Label CC Tier 3 metrics as “informational.” Graph building itself is done by Feature 1. Files: src/gui/pages/agent_graph.py, src/gui/pages/evaluation_results.py, tests/test_gui/test_agent_graph.py.

  • Feature 3 → STORY-012: Expand inference provider registry and update stale models. Add 7 new providers (Groq, Fireworks, DeepSeek, Mistral, SambaNova, Nebius, Cohere). Fix 2 live bugs (huggingface classification model, together removed free model). Update 7 stale model IDs. Set max_content_length to free-tier token limits. Fix Anthropic to use native PydanticAI model. See docs/analysis/Inference-Providers.md. Files: src/app/data_models/app_models.py, src/app/llms/models.py, src/app/config/config_chat.json, src/run_cli.py, tests/llms/test_models.py.

  • Feature 4 → STORY-013: Judge auto mode – conditional settings display. Hide downstream Tier 2 controls when provider is “auto”. Files: src/gui/pages/settings.py, tests/test_gui/test_settings_judge_auto.py.

  • Feature 5 → STORY-014: PydanticAI API migration (depends: STORY-010, STORY-012). Migrate manager.run(), fix RunContext, replace _model_name. Files: src/app/agents/agent_system.py, src/app/tools/peerread_tools.py, tests/agents/test_agent_system.py.

  • Feature 6 → STORY-015: Replace inspect.getsource tests with behavioral tests. Rewrite ~20 inspect.getsource assertions across 6 test files. Files: tests/utils/test_weave_optional.py, tests/gui/test_story012_a11y_fixes.py, tests/gui/test_story013_ux_fixes.py, tests/gui/test_story010_gui_report.py, tests/cli/test_cc_engine_wiring.py, tests/gui/test_prompts_integration.py.