
PRD Sprint 6 Ralph


title: Product Requirements Document: Agents-eval Sprint 6
description: Benchmarking infrastructure, CC baseline completion, tool access refinement, security hardening (CVE mitigations, input sanitization, log scrubbing), and test quality improvements for the Agents-eval MAS evaluation framework.
version: 1.2.0
created: 2026-02-16
updated: 2026-02-16


Project Overview

Agents-eval evaluates multi-agent AI systems using the PeerRead dataset for scientific paper review assessment. The system generates reviews via a 4-agent delegation pipeline (Manager -> Researcher -> Analyst -> Synthesizer) and evaluates them through a three-tier engine: Tier 1 (traditional text metrics), Tier 2 (LLM-as-Judge), and Tier 3 (graph analysis).

Sprint 5 delivered runtime fixes, GUI enhancements, architectural improvements, code quality review (OWASP MAESTRO), and test suite audit across 17 stories.

Sprint 6 focuses on benchmarking infrastructure, baseline completion, tool access refinement, security hardening, and test quality across 15 stories:

  1. Cleanup (Features 1-2, 6): Remove Opik entirely, fix Phoenix Docker recipe, delete orphaned cc_otel module
  2. CC Baseline (Features 3-5): Fix adapter path handling, create collection scripts, wire paper extraction
  3. Benchmarking (Feature 7): Build MAS composition sweep infrastructure
  4. Tool Access (Features 8-9): Conditional review tool placement, enable review tools by default
  5. Security Hardening (Features 10-13): CVE mitigations, prompt input sanitization, log/trace scrubbing, security test suite
  6. Test Quality (Features 14-15): Increase coverage on critical modules, execute test audit refactoring
  7. Quick Win (bundled with Feature 2): Fix empty Agent Interaction Graph (one-line change)

Development Methodology

All implementation stories MUST follow these practices. Ralph Loop enforces this order.

TDD Workflow (Mandatory for all features)

  1. RED: Write failing tests first using testing-python skill. Tests define expected behavior before any implementation code exists.
  2. GREEN: Implement minimal code to pass tests using implementing-python skill. No extra functionality.
  3. REFACTOR: Clean up while keeping tests green. Run make validate before marking complete.

Test Tool Selection

| Tool | Use for | NOT for |
| --- | --- | --- |
| pytest | Core logic, unit tests, known edge cases (primary TDD tool) | Random inputs |
| Hypothesis | Property invariants, bounds, all-input guarantees | Snapshots, known cases |
| inline-snapshot | Regression, model dumps, complex structures | TDD red-green, ranges |

Decision rule: If the test wouldn’t catch a real bug, don’t write it. Test behavior, not implementation.

Mandatory Practices

  • Mock external dependencies (HTTP, LLM providers, file systems, subprocess) using @patch. Never call real APIs in unit tests.
  • Test behavior, not implementation — test observable outcomes (return values, side effects, error messages), not internal structure (isinstance checks, property existence, default constants).
  • Google-style docstrings for every new file, function, class, and method. Auto-generated documentation depends on this.
  • # Reason: comments for non-obvious logic (e.g., regex patterns, XML delimiter choices, fallback order).
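
For the mocking rule above, a minimal pytest sketch (the fetch_title function and the patched target are illustrative, not project code):

from unittest.mock import patch

import httpx


def fetch_title(url: str) -> str:
    """Stand-in for any project code that calls an external API."""
    return httpx.get(url).json()["title"]


def test_fetch_title_never_hits_the_network() -> None:
    # Mock the external dependency; assert only on observable behavior.
    with patch("httpx.get") as mock_get:
        mock_get.return_value.json.return_value = {"title": "A Paper"}
        assert fetch_title("https://example.org/paper/1") == "A Paper"
        mock_get.assert_called_once()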

Core Principles

  • KISS: Simplest solution that passes tests. Clear > clever.
  • DRY: Reuse existing patterns (CompositeResult, EvaluationPipeline, CCTraceAdapter). Don’t rebuild.
  • YAGNI: Implement only what acceptance criteria require. No speculative features.

Skills Usage

| Story type | Skills to invoke |
| --- | --- |
| Implementation (1-12, 14) | testing-python (RED) → implementing-python (GREEN) |
| Security tests (13) | testing-python (RED) → implementing-python (GREEN) |
| Test refactoring (15) | testing-python (for validation after deletions) |
| Codebase research | researching-codebase (before non-trivial implementation) |

Functional Requirements

Feature 1: Remove Opik Entirely

Description: Remove all Opik-related code, configuration, Docker infrastructure, Makefile targets, documentation, and tests from the project. Opik was replaced by Logfire + Phoenix in Sprint 4. Deprecated stubs (opik_instrumentation.py, OpikConfig) and the full Docker stack (docker-compose.opik.yaml, 11 services) remain as dead code. This cleanup removes ~800 lines of unused code and configuration.

Acceptance Criteria:

  • src/app/agents/opik_instrumentation.py deleted
  • OpikConfig class removed from src/app/utils/load_configs.py
  • docker-compose.opik.yaml deleted
  • Makefile targets removed: setup_opik, setup_opik_env, start_opik, stop_opik, clean_opik, status_opik
  • .env.example Opik variables removed (OPIK_URL_OVERRIDE, OPIK_WORKSPACE, OPIK_PROJECT_NAME)
  • .gitignore Opik entries removed (opik/, .opik_install_reported)
  • docs/howtos/opik-setup-usage-integration.md deleted
  • Test stubs deleted: tests/integration/test_opik_integration.py, tests/evals/test_opik_metrics.py
  • CONTRIBUTING.md Opik references removed (make commands, setup instructions)
  • No remaining imports or references to opik in src/app/ (verified via grep)
  • docs/analysis/CC-agent-teams-orchestration.md all Opik references (13 occurrences, verified via grep) updated to reflect Phoenix/Logfire
  • Keep load_configs.py with LogfireConfig intact (4 active consumers: agent_system.py, logfire_instrumentation.py, and 2 test files)
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • Delete files: src/app/agents/opik_instrumentation.py, docker-compose.opik.yaml, docs/howtos/opik-setup-usage-integration.md
  • Delete test files: tests/integration/test_opik_integration.py, tests/evals/test_opik_metrics.py
  • In src/app/utils/load_configs.py: delete OpikConfig class (the DEPRECATED class), keep LogfireConfig
  • In Makefile: delete all opik targets (setup_opik, setup_opik_env, start_opik, stop_opik, clean_opik, status_opik), remove setup_opik from setup_devc_full and setup_devc_ollama_full
  • In .env.example: remove Opik env vars (OPIK_URL_OVERRIDE, OPIK_WORKSPACE, OPIK_PROJECT_NAME)
  • In .gitignore: remove opik/ and .opik_install_reported entries
  • In CONTRIBUTING.md: remove Opik make commands from command reference table and setup instructions
  • Verify cleanup: grep -ri opik src/app/ returns no matches

Files:

  • src/app/agents/opik_instrumentation.py (delete)
  • src/app/utils/load_configs.py (edit — remove OpikConfig, keep LogfireConfig)
  • docker-compose.opik.yaml (delete)
  • Makefile (edit)
  • .env.example (edit)
  • .gitignore (edit)
  • CONTRIBUTING.md (edit)
  • docs/howtos/opik-setup-usage-integration.md (delete)
  • tests/integration/test_opik_integration.py (delete)
  • tests/evals/test_opik_metrics.py (delete)
  • docs/analysis/CC-agent-teams-orchestration.md (edit — update 13 Opik references)

Feature 2: Fix Phoenix Docker Recipe + Agent Graph Fix (P0 Quick Win Bundle)

Description: The current make start_phoenix recipe has three problems: (1) no volume mount — trace data is lost on docker rm, (2) missing gRPC port 4317 — only HTTP OTLP on 6006 is exposed, (3) no restart policy — container dies on devcontainer restart (exit code 255) and doesn’t come back. Additionally, make start_phoenix fails with “container name already in use” when a stopped container exists. Fix all four issues.

Bundled Quick Win: The Agent Interaction Graph tab in the GUI shows “No agent interaction data available” even when trace data exists because graph building is coupled to evaluation success (app.py:267 only builds graph when composite_result is not None). Fix: change conditional graph building to unconditional when execution_id exists (one-line change).

Acceptance Criteria:

  • make start_phoenix persists trace data across container restarts via Docker volume phoenix_data
  • Both OTLP endpoints exposed: HTTP on port 6006, gRPC on port 4317
  • Container auto-restarts after devcontainer restart (--restart unless-stopped)
  • make start_phoenix succeeds even when a stopped phoenix-tracing container exists (removes old container first)
  • make stop_phoenix stops container but preserves volume data
  • make status_phoenix shows container status and both port mappings
  • Phoenix UI accessible at http://localhost:6006 after make start_phoenix
  • OTLP traces received on both http://localhost:6006/v1/traces (HTTP) and localhost:4317 (gRPC)
  • Logfire SDK (logfire_instrumentation.py) continues to export traces successfully via HTTP endpoint
  • Tests: pytest test for Makefile recipe validation (recipe contains required flags)
  • Quick Win: Agent Interaction Graph renders when trace data exists, regardless of evaluation success (change app.py:267 from conditional to unconditional)
  • Quick Win: Graph renders correctly after --skip-eval runs and after failed evaluation
  • Tests: pytest test verifying _build_graph_from_trace() is called when execution_id exists and composite_result is None
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • Update start_phoenix recipe in Makefile:
start_phoenix:
  docker rm -f $(PHOENIX_CONTAINER_NAME) 2>/dev/null || true
  docker run -d --name $(PHOENIX_CONTAINER_NAME) \
    --restart unless-stopped \
    -p $(PHOENIX_PORT):$(PHOENIX_PORT) \
    -p 4317:4317 \
    -v phoenix_data:/mnt/data \
    -e PHOENIX_WORKING_DIR=/mnt/data \
    $(PHOENIX_IMAGE)
  • Update stop_phoenix to only stop (not remove) so volume persists
  • Update status_phoenix to show both port mappings
  • Add PHOENIX_GRPC_PORT := 4317 variable alongside existing PHOENIX_PORT
  • Phoenix does NOT support /v1/metrics — keep OTEL_METRICS_EXPORTER=none in logfire_instrumentation.py:70 as-is
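
A possible shape for the Makefile-recipe validation test listed below (tests/infra/test_makefile_recipes.py); the repo-root resolution and the exact needles are assumptions tied to the recipe as written above:

from pathlib import Path

# Assumption: tests/infra/test_makefile_recipes.py sits two levels below the repo root.
REPO_ROOT = Path(__file__).parents[2]


def test_start_phoenix_recipe_contains_required_flags() -> None:
    recipe = (REPO_ROOT / "Makefile").read_text()
    # Needles mirror the recipe shown above; adjust if variables replace literals.
    for needle in (
        "--restart unless-stopped",
        "phoenix_data:/mnt/data",
        "PHOENIX_WORKING_DIR=/mnt/data",
        "4317",
    ):
        assert needle in recipe, f"start_phoenix recipe is missing: {needle}"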

Files:

  • Makefile (edit)
  • src/app/app.py (edit — quick win graph fix at line 267)
  • tests/infra/test_makefile_recipes.py (new — Makefile recipe validation)
  • tests/app/test_app.py (update — graph fix behavior test; mock _build_graph_from_trace)

Feature 3: Fix CCTraceAdapter Path Handling

Description: The CC baseline infrastructure was built in Sprint 4 but has a teams mode path mismatch — adapter expects tasks/ as child of teams dir, but CC stores tasks at ~/.claude/tasks/{team-name}/ (sibling of ~/.claude/teams/). Fix the adapter to support both layouts.

Acceptance Criteria:

  • Teams mode adapter accepts separate teams_dir and tasks_dir parameters (or auto-discovers tasks/ as sibling)
  • Adapter works with real ~/.claude/teams/{name}/ + ~/.claude/tasks/{name}/ directory layout
  • Backward compatible: still works if tasks/ is a subdirectory of teams dir
  • CLI --cc-teams-dir accepts teams directory; tasks directory auto-discovered or specified separately
  • Tests: pytest tests with both directory layouts (sibling and child)
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • In CCTraceAdapter.__init__(): accept optional tasks_dir: Path | None parameter alongside existing teams_dir
  • When tasks_dir is None: auto-discover by checking teams_dir.parent.parent / "tasks" / teams_dir.name (sibling layout, i.e. ~/.claude/tasks/{team-name}/), then teams_dir / "tasks" (child layout)
  • In src/run_cli.py: add --cc-teams-tasks-dir optional flag that maps to tasks_dir parameter
  • Preserve existing behavior when tasks/ is a child directory (backward compatible)
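
A sketch of the auto-discovery order, assuming teams_dir points at the per-team directory (e.g. ~/.claude/teams/my-team); not the adapter's final code:

from pathlib import Path


def _resolve_tasks_dir(teams_dir: Path, tasks_dir: Path | None = None) -> Path | None:
    if tasks_dir is not None:
        return tasks_dir
    # Reason: CC stores tasks in a sibling tree (~/.claude/tasks/{team-name}/),
    # so check that layout first, then the legacy child layout.
    sibling = teams_dir.parent.parent / "tasks" / teams_dir.name
    if sibling.is_dir():
        return sibling
    child = teams_dir / "tasks"
    return child if child.is_dir() else None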

Files:

  • src/app/judge/cc_trace_adapter.py (edit)
  • tests/judge/test_cc_trace_adapter.py (update)
  • src/run_cli.py (edit — add --cc-teams-tasks-dir optional flag)

Feature 4: Create CC Artifact Collection Scripts

Description: CC doesn’t natively export artifacts in the format expected by CCTraceAdapter. Create bash scripts to collect solo session and teams mode artifacts into adapter-compatible directory structures.

Acceptance Criteria:

  • scripts/collect-cc-traces/collect-cc-solo.sh captures CC solo session data into adapter-expected format (metadata.json + tool_calls.jsonl)
  • scripts/collect-cc-traces/collect-cc-teams.sh copies ~/.claude/teams/{name}/ + ~/.claude/tasks/{name}/ into single adapter-compatible directory
  • Both scripts accept named parameters: --name <session/team-name> and --output-dir <path> (required)
  • Both scripts validate output directory structure matches adapter expectations
  • Exit code 0 on success, exit code 1 on validation failure (missing source dirs, malformed artifacts), exit code 2 on usage error (missing required params)
  • README in scripts/collect-cc-traces/ documents usage, examples, and exit codes
  • Tests: pytest tests invoking scripts via subprocess.run(), verifying exit codes and output directory structure
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • scripts/collect-cc-traces/collect-cc-solo.sh: parse --name and --output-dir args, locate CC session data in ~/.claude/projects/ or user-specified path, create metadata.json (session name, timestamp, model) and tool_calls.jsonl (one JSON object per tool call) in output dir
  • scripts/collect-cc-traces/collect-cc-teams.sh: parse --name and --output-dir args, copy ~/.claude/teams/{name}/config.json and ~/.claude/tasks/{name}/*.json into output dir preserving structure
  • Both scripts: validate output structure matches CCTraceAdapter expectations (required files exist, valid JSON), exit 1 on validation failure, exit 2 on usage error
  • Use set -euo pipefail for strict error handling in both scripts
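
One of the subprocess-based tests could look like this sketch (it assumes pytest runs from the repository root so the relative script path resolves):

import subprocess
from pathlib import Path

SCRIPT = Path("scripts/collect-cc-traces/collect-cc-solo.sh")


def test_collect_cc_solo_usage_error_exits_2() -> None:
    result = subprocess.run(
        ["bash", str(SCRIPT)],  # deliberately omit --name and --output-dir
        capture_output=True,
        text=True,
    )
    assert result.returncode == 2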

Files:

  • scripts/collect-cc-traces/collect-cc-solo.sh (new)
  • scripts/collect-cc-traces/collect-cc-teams.sh (new)
  • scripts/collect-cc-traces/README.md (new)
  • tests/scripts/test_collect_cc_scripts.py (new)

Feature 5: Wire Paper and Review Extraction

Description: evaluation_runner.py:101-106 passes empty strings for paper="" and review="" to evaluate_comprehensive(), making Tier 1 text similarity scores meaningless (near-zero). The manager run result contains both paper ID and generated review, but run_manager() only returns the execution_id string — discarding result.output. Fix: return the result object alongside execution_id, extract the review text and paper content, and pass them to the evaluation pipeline.

Acceptance Criteria:

  • run_manager() returns both execution_id and the manager result output (change return type from str to tuple[str, Any])
  • evaluation_runner.py receives ReviewGenerationResult.review.comments as the generated review text
  • Paper content loaded via PeerReadLoader.load_parsed_pdf_content(paper_id) using ReviewGenerationResult.paper_id
  • Fallback: if parsed PDF unavailable, use PeerReadPaper.abstract as paper content
  • Tier 1 metrics (cosine, jaccard, semantic similarity) produce non-zero scores with real content
  • CC baseline evaluations receive the same paper content (loaded by paper_id) for fair comparison
  • When review tools are disabled (no ReviewGenerationResult), gracefully pass empty strings (current behavior preserved)
  • Tests: pytest test verifying non-empty paper/review passed to pipeline
  • Tests: pytest test for fallback when parsed PDF is unavailable
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • In agent_system.py:510: change run_manager() return from str to tuple[str, Any], return (execution_id, result.output)
  • In app.py:112: destructure return: execution_id, manager_output = await run_manager(...)
  • In app.py:256: pass manager_output to _run_evaluation_if_enabled()
  • In evaluation_runner.py:101-106: extract fields:
      • review_text = manager_output.review.comments (from ReviewGenerationResult)
      • paper_id = manager_output.paper_id
      • paper_content = PeerReadLoader(...).load_parsed_pdf_content(paper_id) with abstract fallback
  • Pass extracted strings to pipeline.evaluate_comprehensive(paper=paper_content, review=review_text, ...)
  • Mock strategy: mock run_manager() return value, mock PeerReadLoader.load_parsed_pdf_content() for unit tests
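
A compact sketch of the extraction step; the helper name and the fallback_abstract parameter are illustrative, while the field accesses mirror the bullets above:

def _extract_paper_and_review(manager_output, loader, fallback_abstract: str = "") -> tuple[str, str]:
    """Return (paper_content, review_text) for pipeline.evaluate_comprehensive()."""
    if manager_output is None:
        # Review tools disabled: keep the current behavior of passing empty strings.
        return "", ""
    review_text = manager_output.review.comments
    paper_content = loader.load_parsed_pdf_content(manager_output.paper_id)
    if not paper_content:
        # Reason: not every paper has a parsed PDF; fall back to the abstract.
        paper_content = fallback_abstract
    return paper_content, review_text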

Files:

  • src/app/agents/agent_system.py (change run_manager() return type)
  • src/app/app.py (destructure return, pass to evaluation)
  • src/app/judge/evaluation_runner.py (extract content from result)
  • tests/judge/test_evaluation_runner.py (update)

Feature 6: Delete Orphaned cc_otel Module

Description: src/app/cc_otel/ is an orphaned module containing CCOtelConfig — a Pydantic settings model for configuring Claude Code’s OpenTelemetry environment variables from Python. This approach is fundamentally wrong: CC tracing is configured via infrastructure-level env vars (set in shell or .claude/settings.json), not application code. The module has no consumers — no imports of app.cc_otel exist anywhere in the codebase. The correct approach for CC baseline comparison is headless invocation via claude -p (Feature 7) with post-hoc artifact collection. This is independent of Opik removal (Feature 1) — cc_otel was for Claude Code OTel configuration, not Opik.

Acceptance Criteria:

  • src/app/cc_otel/ directory deleted (including __init__.py, config.py)
  • tests/cc_otel/ directory deleted (including test_cc_otel_config.py, test_cc_otel_instrumentation.py)
  • No remaining imports of app.cc_otel in codebase (verified via grep)
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • Delete src/app/cc_otel/ directory entirely (2 files: __init__.py, config.py)
  • Delete tests/cc_otel/ directory entirely (2 files: test_cc_otel_config.py, test_cc_otel_instrumentation.py)
  • Verify cleanup: grep -ri cc_otel src/app/ and grep -ri cc_otel tests/ return no matches

Files:

  • src/app/cc_otel/ (delete entire directory)
  • tests/cc_otel/ (delete entire directory)

Feature 7: MAS Composition Sweep Infrastructure

Description: Build automated benchmarking infrastructure to run the PydanticAI MAS evaluation pipeline across configurable agent composition variations and optionally invoke Claude Code in headless mode (claude -p) for CC baseline comparison. The default composition set is all 8 combinations of include_researcher / include_analyst / include_synthesiser toggles (2^3 = 8), but both the number of compositions and the agent toggles within each composition are configurable. Each composition runs a configurable number of repetitions on the same paper(s) for statistical significance. Results are aggregated with mean/stddev per metric per composition and output as both JSON (machine-readable) and Markdown (human-readable).

Acceptance Criteria:

  • SweepConfig Pydantic model defines: compositions (variable length), repetitions, paper_numbers, output_dir, cc options
  • Compositions are configurable: user can specify any subset of agent toggle combinations, not hardcoded to 8
  • Default generate_all_compositions() produces all 2^3 = 8 combinations as a convenience
  • Sweep runner executes N repetitions x M compositions x P papers through existing main() pipeline
  • Each run produces a CompositeResult stored in structured JSON output
  • If cc_baseline_enabled=True: sweep invokes claude -p in headless mode with the same paper review prompt used by the MAS, collects artifacts, and evaluates via CCTraceAdapter
  • CC headless invocation uses --output-format json for structured parsing of results
  • When cc_baseline_enabled=True and claude CLI not found (shutil.which("claude") returns None), sweep exits with clear error message
  • If pre-collected CC artifact directories provided instead, those are evaluated without re-running CC
  • Analysis module calculates per-composition statistics: mean, stddev, min, max for all 6 composite metrics
  • Markdown summary table generated with compositions as rows, metrics as columns, mean +/- stddev values
  • CLI entry point: python src/run_sweep.py --config sweep_config.json or python src/run_sweep.py --paper-numbers 1,2,3 --repetitions 3
  • make sweep Makefile target wrapping CLI with sensible defaults
  • Sweep results saved to results/sweeps/{timestamp}/ with results.json + summary.md
  • .gitignore includes results/sweeps/ to prevent committing large JSON result files
  • Reuses existing EvaluationPipeline, CompositeScorer, baseline_comparison.compare() — no new evaluation logic
  • Tests: pytest tests for sweep config validation, composition generation, results aggregation, runner error handling
  • Tests: pytest tests for sweep runner (mock main() and subprocess.run(), verify result collection and CC invocation)
  • Tests: Hypothesis property tests for statistical calculations (mean/stddev bounds)
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • src/app/benchmark/sweep_config.py (~70 lines): SweepConfig Pydantic model
      • compositions: list[AgentComposition] — defaults to all 8 combinations via generate_all_compositions()
      • AgentComposition model: {"include_researcher": bool, "include_analyst": bool, "include_synthesiser": bool}
      • repetitions: int = 3 — runs per composition per paper
      • paper_numbers: list[str] — PeerRead paper IDs
      • chat_provider: str — provider for all MAS runs
      • cc_baseline_enabled: bool = False — when True, invoke CC headless per paper
      • cc_solo_dir: Path | None — pre-collected CC solo artifacts (alternative to live CC runs)
      • cc_teams_dir: Path | None — pre-collected CC teams artifacts
      • output_dir: Path = Path("results/sweeps")
      • generate_all_compositions() -> list[AgentComposition] — produces all 2^3 = 8 toggle combinations
  • src/app/benchmark/sweep_runner.py (~180 lines): orchestration loop
      • run_sweep(config: SweepConfig) -> SweepResults — main entry
      • Calls main() from app.py for each composition x paper x repetition
      • Collects CompositeResult per run
      • When cc_baseline_enabled: invokes claude -p "Generate a structured peer review for paper '{paper_number}'" --output-format json via subprocess.run(), collects output to temp dir, parses via CCTraceAdapter
      • When pre-collected CC artifact dirs provided: evaluates once (same result across compositions)
  • src/app/benchmark/sweep_analysis.py (~100 lines): statistics and reporting
      • analyze(results: SweepResults) -> SweepSummary — per-composition stats
      • generate_markdown_report(summary: SweepSummary) -> str — table output
  • src/run_sweep.py (~50 lines): CLI argument parsing, loads config, calls runner
  • Makefile: add sweep target
  • CONTRIBUTING.md: add make sweep to command reference table
  • Mock strategy: mock app.main() to return synthetic CompositeResult, mock subprocess.run() for CC headless invocation, mock filesystem for output dir creation
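
A sketch of the default composition generator; field names follow the AgentComposition model above:

from itertools import product

from pydantic import BaseModel


class AgentComposition(BaseModel):
    include_researcher: bool
    include_analyst: bool
    include_synthesiser: bool


def generate_all_compositions() -> list[AgentComposition]:
    """All 2^3 = 8 toggle combinations, used as the SweepConfig default."""
    return [
        AgentComposition(include_researcher=r, include_analyst=a, include_synthesiser=s)
        for r, a, s in product((False, True), repeat=3)
    ]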

Files:

  • src/app/benchmark/__init__.py (new)
  • src/app/benchmark/sweep_config.py (new)
  • src/app/benchmark/sweep_runner.py (new)
  • src/app/benchmark/sweep_analysis.py (new)
  • src/run_sweep.py (new)
  • Makefile (edit)
  • .gitignore (edit - add results/sweeps/)
  • CONTRIBUTING.md (edit — add make sweep to command reference table)
  • tests/benchmark/test_sweep_config.py (new)
  • tests/benchmark/test_sweep_runner.py (new — mock main() and subprocess.run())
  • tests/benchmark/test_sweep_analysis.py (new)

Feature 8: Review Tools Conditional Access

Description: Sprint 5 STORY-016 moved PeerRead base tools from manager to researcher. However, review tools (generate_paper_review_content_from_template, save_paper_review, save_structured_review) are still added unconditionally to the manager via conditionally_add_review_tools(). When a researcher agent is present, review tools should be placed on the researcher (alongside base PeerRead tools and DuckDuckGo). When no researcher is present (single-agent mode), review tools should fall back to the manager so single-agent review generation continues to work.

Acceptance Criteria:

  • When include_researcher=True: review tools registered on researcher agent, not manager
  • When include_researcher=False: review tools registered on manager agent (single-agent fallback)
  • Manager retains only delegation tools (researcher(), analyst(), synthesiser()) in multi-agent mode
  • Researcher has: PeerRead base tools + review tools + duckduckgo_search_tool() in multi-agent mode
  • Single-agent mode produces correct review output (no regression)
  • Multi-agent mode delegates PeerRead + review operations to researcher (verified via trace data)
  • Tests: pytest tests for tool registration (which agent has which tools) in both modes
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • In src/app/agents/agent_system.py:
      • conditionally_add_review_tools() (line 462): add researcher parameter
      • When researcher is not None and enable=True: add review tools to researcher
      • When researcher is None and enable=True: add review tools to manager (fallback)
      • Pass researcher from get_manager() scope into conditionally_add_review_tools()
  • In src/app/tools/peerread_tools.py:
      • Rename add_peerread_review_tools_to_manager() to add_peerread_review_tools() (agent-agnostic name)
      • Function signature already accepts Agent[None, BaseModel] — no parameter change needed
  • Mock strategy: mock PydanticAI Agent to verify tool registration lists without LLM calls
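
The placement rule reduces to a small dispatch, sketched here with trimmed type annotations (the real function keeps its existing agent_system.py signature plus the new researcher parameter):

from app.tools.peerread_tools import add_peerread_review_tools  # renamed in this feature


def conditionally_add_review_tools(manager, researcher=None, *, enable: bool = True) -> None:
    if not enable:
        return
    # Reason: review tools belong next to the PeerRead base tools on the researcher;
    # fall back to the manager only in single-agent mode (no researcher present).
    target = researcher if researcher is not None else manager
    add_peerread_review_tools(target)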

Files:

  • src/app/agents/agent_system.py
  • src/app/tools/peerread_tools.py
  • tests/agents/test_agent_system.py (update)

Feature 9: Enable Review Tools by Default

Description: Review tools (--enable-review-tools) currently default to False, requiring explicit opt-in for review generation. Since the primary use case of this project is PeerRead paper review evaluation, review tools should be enabled by default. Users who want to run general queries without review tools can opt out via --no-review-tools.

Acceptance Criteria:

  • enable_review_tools defaults to True in main() signature (app.py)
  • CLI: --no-review-tools flag disables review tools (replaces opt-in with opt-out)
  • CLI: --enable-review-tools flag kept for backward compatibility (no-op since default is True)
  • GUI: Review tools checkbox in settings defaults to checked
  • Auto-enable logic from _prepare_query() still works (no regression when --paper-number provided)
  • Tests: pytest tests for default-on behavior and opt-out flag
  • Tests: inline-snapshot for CLI help text showing new flag
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • In src/app/app.py:203: change enable_review_tools: bool = False to enable_review_tools: bool = True
  • In src/run_cli.py: add --no-review-tools flag that sets enable_review_tools=False
  • Keep --enable-review-tools for backward compatibility (already True by default, becomes no-op)
  • In src/app/app.py:94: adjust OR logic — _prepare_query() auto-enable no longer needed since default is True, but keep for explicitness
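
If run_cli.py uses argparse, the opt-out / compatibility flag pair could look like this sketch (purely illustrative; adapt to whatever parser the CLI actually uses):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--no-review-tools",
    dest="enable_review_tools",
    action="store_false",
    default=True,  # new default: review tools enabled
    help="Disable PeerRead review tools (enabled by default).",
)
parser.add_argument(
    "--enable-review-tools",
    dest="enable_review_tools",
    action="store_true",
    help="Kept for backward compatibility; a no-op now that the default is True.",
)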

Files:

  • src/app/app.py
  • src/run_cli.py
  • tests/app/test_cli_baseline.py (update)

Feature 10: CVE Mitigations (SSRF URL Allowlist)

Description: The Sprint 5 MAESTRO security review (Finding CVE-1, docs/reviews/sprint5-code-review.md) identified CVE-2026-25580, a CRITICAL PydanticAI SSRF vulnerability allowing information disclosure via malicious URLs in message history. Agent tools that process URLs (PeerRead dataset downloads, DuckDuckGo search) need domain-allowlist validation to prevent SSRF attacks against internal services. CVE-2026-25640 (Stored XSS in PydanticAI web UI) does not affect this project since we don’t use clai web or Agent.to_web() — document this as a known advisory. CVE-2024-5206 (scikit-learn) is already mitigated by scikit-learn>=1.8.0 in pyproject.toml.

Acceptance Criteria:

  • validate_url() function enforces HTTPS-only and domain allowlist for all external requests
  • Allowlist includes: raw.githubusercontent.com, arxiv.org, api.openai.com, api.anthropic.com, api.cerebras.ai
  • ALLOWED_DOMAINS is a Pydantic BaseSettings field (not a hardcoded module-level frozenset), allowing override via environment variable or settings file
  • PeerRead dataset download URLs validated before httpx.Client.get() in datasets_peerread.py
  • URLs in agent tool responses validated before any HTTP requests
  • Blocked URLs raise ValueError with domain name (no URL echoing to prevent log injection)
  • CVE-2026-25640 documented in SECURITY.md advisory section (project does not use affected features)
  • Tests: pytest tests for URL validation (allowed domains, blocked domains, non-HTTPS, internal IPs)
  • Tests: Hypothesis property tests for URL parsing edge cases (unicode domains, IP addresses, port variations)
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • Create src/app/utils/url_validation.py (~40 lines):
from urllib.parse import urlparse

from pydantic_settings import BaseSettings

class UrlValidationSettings(BaseSettings):
    allowed_domains: frozenset[str] = frozenset({
        "raw.githubusercontent.com", "arxiv.org",
        "api.openai.com", "api.anthropic.com", "api.cerebras.ai",
    })

_settings = UrlValidationSettings()

def validate_url(url: str) -> str:
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError("Only HTTPS URLs allowed")
    if parsed.netloc not in _settings.allowed_domains:
        raise ValueError(f"URL domain not allowed: {parsed.netloc}")
    return url
  • In datasets_peerread.py: call validate_url() before client.get(url) in download functions
  • Create SECURITY.md with known advisory for CVE-2026-25640 (XSS — not applicable) and CVE-2026-25580 (SSRF — mitigated by URL allowlist)

Files:

  • src/app/utils/url_validation.py (new)
  • src/app/data_utils/datasets_peerread.py (edit — add URL validation before downloads)
  • SECURITY.md (new — known advisories)
  • tests/utils/test_url_validation.py (new)

Feature 11: LLM Prompt Input Sanitization

Description: The Sprint 5 MAESTRO review (Finding L1.1, HIGH) and parallel pipeline review (Item 1, CRITICAL) both identified unsanitized user input flowing into LLM prompts. llm_evaluation_managers.py:177-188 interpolates paper_excerpt and review via f-strings. peerread_tools.py:295 uses .format() with paper_title and paper_abstract from the PeerRead dataset. Malicious paper content could inject prompt instructions or trigger unintended LLM behavior. Add length-limited structured inputs and XML delimiter wrapping.

Acceptance Criteria:

  • Paper titles truncated to 500 chars, abstracts to 5000 chars, review text to 50000 chars before prompt insertion
  • User-controlled content wrapped in XML delimiters (<paper_content>...</paper_content>) in LLM judge prompts to separate instructions from data
  • peerread_tools.py template formatting uses string.Template.safe_substitute() instead of str.format() to prevent format string injection
  • Truncation happens at the sanitization boundary (before prompt construction), not ad-hoc per call site
  • Existing prompt behavior unchanged for well-formed inputs (no regression in evaluation quality)
  • Tests: pytest tests for truncation at boundary lengths
  • Tests: pytest tests for format string injection attempts (e.g., {__import__} in paper title)
  • Tests: Hypothesis property tests — for all strings, output length <= max_length + delimiter overhead, and output always contains XML delimiters
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • Create src/app/utils/prompt_sanitization.py (~40 lines):
      • sanitize_for_prompt(text: str, max_length: int, label: str) -> str — truncates and wraps in <{label}>...</{label}>
      • sanitize_paper_title(title: str) -> str — max 500 chars
      • sanitize_paper_abstract(abstract: str) -> str — max 5000 chars
      • sanitize_review_text(review: str) -> str — max 50000 chars
  • In llm_evaluation_managers.py:177-188: replace raw f-string interpolation with sanitized inputs
  • In peerread_tools.py:295: replace .format() with string.Template.safe_substitute()
  • Sanitization module is reusable for any future LLM prompt construction
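
A minimal sketch of the helpers, assuming plain character-count truncation and the label names shown (the final module may normalize whitespace or choose different labels):

def sanitize_for_prompt(text: str, max_length: int, label: str) -> str:
    """Truncate user-controlled text and wrap it in XML delimiters."""
    # Reason: delimiters keep data visibly separate from instructions in judge prompts.
    return f"<{label}>{text[:max_length]}</{label}>"


def sanitize_paper_title(title: str) -> str:
    return sanitize_for_prompt(title, 500, "paper_title")


def sanitize_paper_abstract(abstract: str) -> str:
    return sanitize_for_prompt(abstract, 5000, "paper_abstract")


def sanitize_review_text(review: str) -> str:
    return sanitize_for_prompt(review, 50_000, "review_text")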

Files:

  • src/app/utils/prompt_sanitization.py (new)
  • src/app/judge/llm_evaluation_managers.py (edit — use sanitized inputs in prompts)
  • src/app/tools/peerread_tools.py (edit — use safe_substitute for template formatting)
  • tests/utils/test_prompt_sanitization.py (new)

Feature 12: Log and Trace Data Scrubbing

Description: The Sprint 5 MAESTRO review identified three related data leakage risks: (1) no Logfire scrubbing patterns configured (Finding L4.2, HIGH), so trace data exported to Phoenix contains unredacted API keys and user content; (2) no Loguru log filtering (Finding L4.1, MEDIUM), so exception traces may contain local variables with API key values; (3) setup_llm_environment() in providers.py:80 logs env var names at INFO level. Add scrubbing patterns to both Logfire (trace export) and Loguru (file/console logging).

Acceptance Criteria:

  • Logfire configured with scrubbing patterns for: password, passwd, secret, auth, credential, api[._-]?key, token, jwt
  • Loguru file sink filters sensitive patterns from log messages before writing
  • setup_llm_environment() logs at DEBUG level instead of INFO (reduces exposure surface)
  • Exception traces from Loguru do not contain raw API key values (local variable scrubbing)
  • Trace data exported to Phoenix via OTLP has sensitive fields redacted
  • Existing logging behavior preserved for non-sensitive messages (no over-scrubbing)
  • Tests: pytest tests for Loguru filter (sensitive patterns redacted, normal messages pass through)
  • Tests: pytest tests for Logfire scrubbing configuration (patterns applied)
  • Tests: Hypothesis property tests — for all messages containing any SENSITIVE_PATTERNS match, output contains [REDACTED]
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • Create src/app/utils/log_scrubbing.py (~40 lines):
      • SENSITIVE_PATTERNS: list[str] — shared pattern list for both Loguru and Logfire
      • scrub_log_record(record: dict) -> dict — Loguru filter function
      • get_logfire_scrubbing_patterns() -> list[str] — returns patterns for Logfire configuration
  • In src/app/utils/log.py: add filter=scrub_log_record to the Loguru file sink
  • In src/app/common/log.py: consolidate with utils/log.py — replace duplicate loguru config with re-export: from app.utils.log import logger (DRY fix — both files are near-identical, but only utils/log.py will have scrubbing)
  • In src/app/agents/logfire_instrumentation.py: pass scrubbing_patterns to logfire.configure()
  • In src/app/llms/providers.py:80: change logger.info(f"Set environment variable: {env_var}") to logger.debug(...)
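
A sketch of the shared patterns and the Loguru filter; the redaction regex and the mutate-and-return-True filter style are assumptions to refine during implementation:

import re

SENSITIVE_PATTERNS: list[str] = [
    "password", "passwd", "secret", "auth", "credential",
    r"api[._-]?key", "token", "jwt",
]
_SENSITIVE_RE = re.compile(
    r"(" + "|".join(SENSITIVE_PATTERNS) + r")\S*\s*[=:]\s*\S+",
    re.IGNORECASE,
)


def get_logfire_scrubbing_patterns() -> list[str]:
    return SENSITIVE_PATTERNS


def scrub_log_record(record: dict) -> bool:
    """Redact key=value pairs whose key matches a sensitive pattern."""
    record["message"] = _SENSITIVE_RE.sub(r"\1=[REDACTED]", record["message"])
    return True  # Reason: Loguru filters return a truthy value to keep the record.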

Files:

  • src/app/utils/log_scrubbing.py (new)
  • src/app/utils/log.py (edit — add scrubbing filter to file sink)
  • src/app/common/log.py (edit — replace with re-export from utils/log.py)
  • src/app/agents/logfire_instrumentation.py (edit — configure Logfire scrubbing patterns)
  • src/app/llms/providers.py (edit — downgrade log level for env var setup)
  • tests/utils/test_log_scrubbing.py (new)

Feature 13: Security Test Suite

Description: The Sprint 5 MAESTRO review (Recommendations, Priority 4) explicitly tagged “Add comprehensive security test suite” for Sprint 6. Zero security-focused tests currently exist. Create tests/security/ with tests validating the security controls added by Features 10-12 and testing additional attack vectors identified in the review: plugin input size limits, tool registration scope, and prompt injection scenarios.

Acceptance Criteria:

  • tests/security/test_ssrf_prevention.py — SSRF attack vectors: internal IPs blocked, non-HTTPS blocked, AWS metadata endpoint, localhost, IDN homograph attacks
  • tests/security/test_prompt_injection.py — injection attempts in paper titles/abstracts rejected or sanitized
  • tests/security/test_sensitive_data_filtering.py — API key patterns filtered from logs and traces, Bearer tokens redacted
  • tests/security/test_input_size_limits.py — oversized inputs to plugin adapters rejected (DoS prevention)
  • tests/security/test_tool_registration.py — tools only registered from expected modules (no runtime injection)
  • All security tests use pytest with clear arrange/act/assert structure
  • Hypothesis property tests for input boundary fuzzing (oversized strings, unicode edge cases)
  • Security tests run as part of make test_all (no separate security test suite command needed)
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • Create tests/security/__init__.py
  • Create tests/security/test_ssrf_prevention.py — test validate_url() from Feature 10 with: allowed domains, blocked domains, HTTP (non-HTTPS), 169.254.169.254 (AWS metadata), localhost, 0.0.0.0, unicode domain IDN homograph attacks
  • Create tests/security/test_prompt_injection.py — test sanitize_for_prompt() from Feature 11 with: "Ignore previous instructions" payloads, format string attempts ({__import__}), oversized inputs, null bytes
  • Create tests/security/test_sensitive_data_filtering.py — test scrub_log_record() from Feature 12 with: messages containing api_key=sk-..., password=secret, Bearer token patterns
  • Create tests/security/test_input_size_limits.py — test plugin evaluate() with oversized agent_output (>100KB) and reference_texts (>10 items)
  • Create tests/security/test_tool_registration.py — verify agent tool lists match expected registrations per agent role
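
For illustration, one SSRF test might look like this sketch, assuming Feature 10's validate_url raises ValueError for every blocked vector:

import pytest

from app.utils.url_validation import validate_url  # added by Feature 10


@pytest.mark.parametrize(
    "url",
    [
        "http://raw.githubusercontent.com/x",  # non-HTTPS
        "https://169.254.169.254/latest/meta-data/",  # AWS metadata endpoint
        "https://localhost:6006/",
        "https://0.0.0.0/",
        "https://rаw.githubusercontent.com/x",  # IDN homograph ('а' is Cyrillic)
    ],
)
def test_validate_url_blocks_ssrf_vectors(url: str) -> None:
    with pytest.raises(ValueError):
        validate_url(url)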

Files:

  • tests/security/__init__.py (new)
  • tests/security/test_ssrf_prevention.py (new)
  • tests/security/test_prompt_injection.py (new)
  • tests/security/test_sensitive_data_filtering.py (new)
  • tests/security/test_input_size_limits.py (new)
  • tests/security/test_tool_registration.py (new)

Feature 14: Increase Coverage for Critical Modules

Description: The Sprint 5 MAESTRO review (Recommendations, Priority 5) identified five modules with critically low test coverage that handle core data loading, agent tools, and orchestration. These modules have high regression risk and are frequently modified across sprints. Add targeted behavioral tests to increase coverage before the test audit (Feature 15) removes low-value tests elsewhere.

Acceptance Criteria:

  • datasets_peerread.py: 27% -> 60% — tests for download error handling, URL construction, paper validation with missing fields, retry logic
  • peerread_tools.py: 22% -> 60% — tests for tool registration, PDF extraction error handling, content truncation, template loading
  • llms/models.py: 24% -> 50% — tests for model creation with different providers, error handling for unsupported models
  • agent_factories.py: 39% -> 60% — tests for agent creation with various toggle combinations, system prompt construction
  • agent_system.py: 47% -> 60% — tests for delegation flow, usage limit enforcement, single-agent fallback
  • All new tests verify behavior (error handling, data flow, edge cases), not implementation details
  • Coverage measured via make coverage_all before and after
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • Tests go in existing test directories (mirror src/app/ structure):
  • tests/data_utils/test_datasets_peerread.py (update — add download error, validation tests)
  • tests/agents/test_peerread_tools.py (update — add PDF extraction, truncation tests)
  • tests/llms/test_models.py (new or update — model creation tests)
  • tests/agents/test_agent_factories.py (new or update — agent creation tests)
  • tests/agents/test_agent_system.py (update — delegation and limit tests)
  • Mock external dependencies (HTTP, file system, LLM providers) — test logic, not network
  • Use Hypothesis for property tests on data validation (arbitrary missing fields, boundary values)

Files:

  • tests/data_utils/test_datasets_peerread.py (update)
  • tests/agents/test_peerread_tools.py (update)
  • tests/llms/test_models.py (new or update)
  • tests/agents/test_agent_factories.py (new or update)
  • tests/agents/test_agent_system.py (update)

Feature 15: Execute Test Audit Refactoring

Description: Sprint 5 STORY-011 produced docs/reviews/sprint5-test-audit.md — a detailed per-file audit with explicit keep/delete/refactor decisions for all test files. The audit was completed but the actual refactoring (deleting ~55 implementation-detail tests from 9 files) was not executed. This story executes the audit plan. Note: test_migration_cleanup.py is already deleted, and tests/cc_otel/ is deleted by Feature 6 (cc_otel removal).

Acceptance Criteria:

  • tests/evals/test_judge_settings.py: TestJudgeSettingsDefaults class deleted (13 tests verifying default constants)
  • tests/common/test_common_settings.py: 2 implementation-detail tests deleted (test_common_settings_defaults, test_common_settings_type_validation)
  • tests/utils/test_logfire_config.py: 3 tests deleted (test_logfire_config_from_settings_defaults, test_logfire_config_direct_instantiation, test_logfire_config_type_validation)
  • tests/judge/test_plugin_base.py: TestEvaluatorPluginABC class deleted (4 property-existence tests)
  • tests/judge/test_trace_store.py: basic CRUD and metadata-tracking tests deleted (tests dict-like behavior assumed by Python)
  • tests/judge/test_plugin_llm_judge.py: 3 tests deleted (isinstance check, name property, tier property)
  • tests/judge/test_plugin_traditional.py: 3 tests deleted (isinstance check, name property, tier property)
  • tests/judge/test_plugin_graph.py: 3 tests deleted (isinstance check, name property, tier property)
  • tests/evals/test_graph_analysis.py: review for field-existence or type-check tests; delete any found (skip if none exist)
  • No reduction in behavioral test coverage — only implementation-detail tests removed
  • make test_all passes with all remaining tests green
  • make validate passes
  • CHANGELOG.md updated

Technical Requirements:

  • Follow execution plan in docs/reviews/sprint5-test-audit.md exactly (Phase 2: Delete Implementation-Detail Tests)
  • Delete tests by removing specific test functions or classes, not entire files (files contain mix of keep and delete tests)
  • Run make test_all after each file modification to catch regressions immediately
  • Expected net reduction: ~55 tests from 9 files

Files:

  • tests/evals/test_judge_settings.py (edit)
  • tests/common/test_common_settings.py (edit)
  • tests/utils/test_logfire_config.py (edit)
  • tests/judge/test_plugin_base.py (edit)
  • tests/judge/test_trace_store.py (edit)
  • tests/judge/test_plugin_llm_judge.py (edit)
  • tests/judge/test_plugin_traditional.py (edit)
  • tests/judge/test_plugin_graph.py (edit)
  • tests/evals/test_graph_analysis.py (edit — if applicable)

Non-Functional Requirements

  • All sweep runs must complete within provider rate limits (no concurrent API calls within a single sweep iteration)
  • Phoenix Docker container must survive devcontainer restarts without trace data loss
  • Sweep results must be deterministic given same paper content and provider (modulo LLM non-determinism)
  • No new pip dependencies — reuse existing networkx, pydantic, arize-phoenix, logfire

Out of Scope

  • CC Agent Teams mode invocation from sweep (only CC solo headless mode via claude -p; teams requires manual setup)
  • CC OTel env var configuration in .claude/settings.json (infrastructure-level, not application code)
  • Phoenix cloud deployment or authentication setup
  • Sweep visualization dashboard (Markdown tables are sufficient for Sprint 6)
  • Heterogeneous model support in sweep (all agents use same LLM per composition)
  • GUI integration for sweep (CLI-only for Sprint 6)
  • Centralized tool registry with module allowlist (architecture improvement — Sprint 7+, per MAESTRO review L7.2)
  • Plugin tier validation at registration (prevents tier mismatch — Sprint 7+, per MAESTRO review L7.1)
  • Immutable trace storage / audit trail signing (low priority — Sprint 7+, per MAESTRO review L4.3)
  • Complete docstring coverage for llms/ and data_utils/ modules (Sprint 7+, per MAESTRO review CQ.1)
  • Removing API keys from os.environ entirely (PydanticAI requires env vars for provider auth — would need upstream changes)
  • Performance bottleneck remediation automation (auto-adjusting timeouts from historical data — Sprint 7+, per parallel review Item 3)
  • Additional evaluation fallback strategies beyond tier1_only (Sprint 7+, per parallel review Item 5)
  • Error message sanitization / information leakage prevention (sanitize error metadata sizes — Sprint 7+, per parallel review Item 2)
  • GraphTraceData construction simplification (replace manual .get() with model_validate() — Sprint 7+, per parallel review Item 8)
  • Timeout bounds enforcement (min/max limits on user-configurable timeouts — Sprint 7+, per parallel review Item 9)
  • Configuration path traversal protection (validate config paths against allowlist — Sprint 7+, per parallel review Item 10)
  • BDD scenario tests for evaluation pipeline (end-to-end user workflow tests — Sprint 7+, per parallel review Item 12)
  • Time tracking consistency across tiers (standardize timing pattern — Sprint 7+, per parallel review Item 7)
  • Hardcoded settings audit: search codebase for module-level constants (e.g., ALLOWED_DOMAINS, timeout values, default providers) that should be extracted into Pydantic BaseSettings or settings.json for runtime configurability (Sprint 7+, discovered during STORY-010)

Notes for Ralph Loop

Priority Order:

  • P0 (Quick Wins): STORY-001 (Opik removal), STORY-002 (Phoenix recipe + graph fix), STORY-006 (cc_otel deletion)
  • P1 (Security Hardening): STORY-010 (CVE mitigations), STORY-011 (input sanitization), STORY-012 (log scrubbing)
  • P1 (CC Baseline): STORY-003 (adapter paths), STORY-004 (collection scripts), STORY-005 (paper extraction)
  • P2 (Tool Access): STORY-008 (conditional access), STORY-009 (default enabled)
  • P2 (Test Quality): STORY-014 (coverage improvements), STORY-015 (audit execution)
  • P3 (Security Verification): STORY-013 (security test suite)
  • P3 (Benchmarking): STORY-007 (sweep infrastructure)

Split Option for STORY-007: If sweep implementation exceeds single-story scope, split into STORY-007a (config + runner) and STORY-007b (analysis + CLI + Makefile). Both remain P3.

File Conflict Notes:

  • peerread_tools.py: touched by STORY-008 (review tools) and STORY-011 (input sanitization) — different functions, no code conflict, but avoid parallel execution
  • logfire_instrumentation.py: touched by STORY-012 (log scrubbing) only — no conflict
  • agent_system.py: touched by STORY-005 (paper extraction) and STORY-008 (review tools) — different functions, avoid parallel execution

Story Breakdown - Phase 1 (15 stories total):

  • Feature 1 (Remove Opik) → STORY-001: Remove all Opik code, config, Docker, docs, and tests
  • Feature 2 (Phoenix Recipe) → STORY-002: Fix Phoenix Docker recipe with volume, ports, restart policy + Agent graph fix (one-line change bundled as P0 quick win)
  • Feature 3 (CC Adapter Paths) → STORY-003: Fix CCTraceAdapter path handling for sibling teams/tasks directories
  • Feature 4 (CC Collection Scripts) → STORY-004: Create CC artifact collection scripts (depends: STORY-003)
  • Feature 5 (Paper Extraction) → STORY-005: Wire paper and review extraction in evaluation runner
  • Feature 6 (Delete cc_otel) → STORY-006: Delete orphaned cc_otel module (independent of Opik)
  • Feature 7 (Composition Sweep) → STORY-007: Build MAS composition sweep infrastructure with statistical analysis (depends: STORY-003, STORY-004, STORY-005)
  • Feature 8 (Review Tools Conditional) → STORY-008: Move review tools to researcher when present, manager when single-agent (note: shares agent_system.py with STORY-005 — different functions, no dependency, but avoid parallel execution)
  • Feature 9 (Review Tools Default) → STORY-009: Enable review tools by default with opt-out flag (depends: STORY-008)
  • Feature 10 (CVE Mitigations) → STORY-010: Add SSRF URL allowlist and document known CVE advisories
  • Feature 11 (Input Sanitization) → STORY-011: Add prompt input sanitization with length limits and XML delimiters (note: shares peerread_tools.py with STORY-008 — different functions, avoid parallel execution)
  • Feature 12 (Log Scrubbing) → STORY-012: Configure Logfire scrubbing patterns and Loguru sensitive data filter
  • Feature 13 (Security Tests) → STORY-013: Create security test suite in tests/security/ (depends: STORY-010, STORY-011, STORY-012)
  • Feature 14 (Coverage Improvements) → STORY-014: Increase test coverage for 5 critical low-coverage modules
  • Feature 15 (Test Audit Execution) → STORY-015: Execute Sprint 5 test audit refactoring plan — delete ~55 implementation-detail tests (depends: STORY-014, STORY-006)