# Product Requirements Document - Agents-eval Sprint 12

## Project Overview
Agents-eval evaluates multi-agent AI systems using the PeerRead dataset. The system generates scientific paper reviews via a 4-agent delegation pipeline (Manager -> Researcher -> Analyst -> Synthesizer) and evaluates them through three tiers: traditional metrics, LLM-as-Judge, and graph analysis.
Sprint 12 goal: Fix CC teams mode classification and evaluation wiring. CC teams runs are misclassified as cc_solo because (1) the JSONL stream parser looks for event types (TeamCreate, Task) that CC never emits — real team events use type=system, subtype=task_started, and (2) engine_type is inferred from parsed artifacts instead of the user’s explicit mode selection. This causes downstream evaluation failures: Tier 3 graph analysis is skipped, coordination/tool metrics default to 0, and the results JSON reports the wrong engine.
Additionally, the composite scoring system has 5 bugs producing misleading evaluation results: (1) time_taken is always ~0.999 because _execute_tier1 passes two near-identical timestamps instead of actual agent execution duration, (2) Tier 3 returns all-zeros for empty trace data instead of triggering fallback, (3) evaluate_composite_with_trace (single-agent weight redistribution) exists but is never called from production code, (4) semantic_score duplicates cosine_score because BERTScore is disabled and the fallback delegates to the same cosine function, (5) task_success is binary 0/1 with a harsh 0.8 threshold providing no gradient for generative tasks.
## Current State

| Area | Status | Gap |
|---|---|---|
| CC teams engine_type | Broken | `engine_type` set to `"cc_solo"` even when CC teams mode is selected (app.py:262) |
| JSONL stream team event parsing | Broken | `_TEAM_EVENT_TYPES` expects `{"TeamCreate", "Task"}` but CC emits `{"type": "system", "subtype": "task_started"}` (cc_engine.py:34) |
| CC teams evaluation scores | Degraded | Tier 3 N/A, coordination_quality=0, tool_efficiency=0 because the graph trace has no team artifacts |
| cc_teams flag passthrough | Missing | `cc_teams` boolean consumed in CLI/GUI, never forwarded to `main()` or `_run_cc_engine_path()` |
| Tier 3 empty-trace handling | Broken | Empty tool_calls + agent_interactions returns an all-zero `Tier3Result` (not None), bypassing fallback (graph_analysis.py:224-269) |
| Single-agent weight redistribution | Dead code | `evaluate_composite_with_trace` never called from the production pipeline (evaluation_pipeline.py:279-303) |
| time_taken metric | Broken | Always ~0.999 — `_execute_tier1` passes two `time.time()` calls microseconds apart (evaluation_pipeline.py:161,173) |
| semantic_score duplication | Bug | `compute_semantic_similarity` delegates to `compute_cosine_similarity` — cosine gets 0.7 effective weight in the Tier 1 formula (traditional_metrics.py:232) |
| task_success binary cliff | Design flaw | Returns 0.0 or 1.0 at the 0.8 threshold — no gradient for generative tasks (traditional_metrics.py:278) |
| Output directory structure | Poor UX | All streams, traces, reviews, reports dumped flat in separate dirs — no per-run grouping, inconsistent timestamps, no cross-artifact linking (config_app.py:16-22) |
## Development Methodology
All implementation stories MUST follow these practices. Ralph Loop and CC Agent Teams enforce this order.
Full references: docs/best-practices/tdd-best-practices.md, docs/best-practices/testing-strategy.md, .claude/skills/testing-python/SKILL.md.
### TDD Workflow (Mandatory for all features)

Every feature follows the Red-Green-Refactor cycle. Invoke the `testing-python` skill for the RED phase and the `implementing-python` skill for the GREEN phase.

- RED: Write failing tests first using the `testing-python` skill. Tests define expected behavior before any implementation code exists. Use Arrange-Act-Assert (AAA) structure. Name tests `test_{module}_{component}_{behavior}`.
- GREEN: Implement minimal code to pass tests using the `implementing-python` skill. No extra functionality beyond what tests require.
- REFACTOR: Clean up while keeping tests green. Run `make quick_validate` (teammate) or `make validate` (lead/wave boundary) before marking complete.
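A minimal sketch of the RED-phase style described above, using the AAA structure and naming convention; `jaccard_similarity` is a hypothetical helper invented for illustration, not the project's actual API:

```python
def jaccard_similarity(text1: str, text2: str) -> float:
    """Hypothetical token-set Jaccard metric, used only to illustrate AAA."""
    a, b = set(text1.split()), set(text2.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def test_traditional_metrics_jaccard_similarity_overlapping_tokens():
    # Arrange
    reference = "the paper is sound"
    candidate = "the paper is novel"
    # Act
    score = jaccard_similarity(reference, candidate)
    # Assert: 3 shared tokens out of 5 total
    assert score == 0.6
```

In a real RED phase the test is written against the not-yet-implemented production function, so it fails first for the right reason.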
### Test Tool Selection
| Tool | Use for | NOT for |
|---|---|---|
| pytest | Core logic, unit tests, known edge cases (primary TDD tool) | Random inputs |
| Hypothesis | Property invariants, bounds, all-input guarantees | Snapshots, known cases |
| inline-snapshot | Regression, model dumps, complex structures | TDD red-green, ranges |
Decision rule: If the test wouldn’t catch a real bug, don’t write it. Test behavior, not implementation. See testing-strategy.md “Patterns to Remove” for anti-patterns.
### Mandatory Practices

- Mock external dependencies (HTTP, LLM providers, file systems, subprocess) using `@patch` with `spec=RealClass`. Never call real APIs in unit tests. A bare `MagicMock()` silently accepts any attribute — use `spec=` to constrain the mock to the real interface.
- Test behavior, not implementation — test observable outcomes (return values, side effects, error messages), not internal structure.
- Use the `tmp_path` fixture for all test filesystem operations. Never use `tempfile.mkdtemp()` or hardcoded paths (see AGENT_LEARNINGS "Test Filesystem Isolation").
- Google-style docstrings for every new file, function, class, and method. `# Reason:` comments for non-obvious logic. `# S12-F{N}:` change comments for non-trivial code changes.
- `make validate` MUST pass before any story is marked complete. No exceptions.
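The `spec=` rule above can be demonstrated with `MagicMock` directly (the same constraint `@patch(..., spec=...)` applies); `ReviewClient` and `get_title` are hypothetical names invented for this example:

```python
from unittest.mock import MagicMock


class ReviewClient:
    """Hypothetical external dependency, used only to illustrate spec=."""

    def fetch(self, paper_id: str) -> dict:
        raise NotImplementedError("real network call")


def get_title(client: ReviewClient, paper_id: str) -> str:
    return client.fetch(paper_id)["title"]


# spec= constrains the mock to ReviewClient's real interface.
mock_client = MagicMock(spec=ReviewClient)
mock_client.fetch.return_value = {"title": "Stub Review"}
assert get_title(mock_client, "acl_104") == "Stub Review"

# A bare MagicMock() would silently accept the typo below; with spec= it raises.
try:
    mock_client.fetchh  # intentional attribute typo
except AttributeError:
    pass
```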
### Skills Usage
| Story type | Skills to invoke |
|---|---|
| Implementation (all features) | testing-python (RED) → implementing-python (GREEN) |
| Codebase research | researching-codebase (before non-trivial implementation) |
### Quality Gates (Per Story and Per Wave)

Teammate (per story):

- Tests written FIRST (RED phase) using the `testing-python` skill
- Tests fail for the right reason before implementation begins
- Minimal implementation passes all tests (GREEN phase)
- `make quick_validate` passes (lint + type check + complexity + duplication)

Lead (per wave boundary):

- `make validate` passes (lint + type check + full test suite)
- No regressions in existing tests
- All story ACs verified before advancing to next wave
## Functional Requirements

### Feature 1: Fix CC Teams Stream Event Parsing
Description: The JSONL stream parser (parse_stream_json via _apply_event) checks for "type": "TeamCreate" and "type": "Task" events via the _TEAM_EVENT_TYPES set (cc_engine.py:34). However, CC’s actual stream-json output uses "type": "system" with "subtype": "task_started" (and "task_type": "local_agent") for team sub-agent events. The parser never matches real team events, so team_artifacts is always empty in production.
Observed in the CC teams JSONL stream (`cc_teams_66a8e8d4-..._.jsonl`):

```json
{"type":"system","subtype":"task_started","task_id":"a0310d0243dc18105","description":"Explore paper review codebase","task_type":"local_agent","session_id":"66a8e8d4-..."}
{"type":"system","subtype":"task_started","task_id":"a99881260fa015660","description":"Technical soundness review","task_type":"local_agent","session_id":"66a8e8d4-..."}
```

These events have `"type": "system"`, not `"TeamCreate"` or `"Task"`, so the check at `_apply_event` line 157 (`elif event_type in _TEAM_EVENT_TYPES`) never fires.
Acceptance Criteria:

- AC1: `_apply_event` captures `"type": "system", "subtype": "task_started"` events as team artifacts
- AC2: `_apply_event` captures `"type": "system", "subtype": "task_completed"` events as team artifacts
- AC3: `_TEAM_EVENT_TYPES` is removed or updated to reflect actual CC stream event types
- AC4: Existing `"type": "system", "subtype": "init"` handling is not broken (init events must NOT be captured as team artifacts)
- AC5: `parse_stream_json` returns populated `team_artifacts` when given a real CC teams stream
- AC6: `make validate` passes with no regressions
Technical Requirements:

- Update `_apply_event()` in `cc_engine.py` to detect team events by `type == "system"` AND `subtype in {"task_started", "task_completed"}` instead of checking `_TEAM_EVENT_TYPES`
- Remove or repurpose the `_TEAM_EVENT_TYPES` constant — the old values (`"TeamCreate"`, `"Task"`) do not appear in real CC output
- Keep the existing init event handler (`type == "system" and subtype == "init"`) — it must take priority over the new team artifact handler
- Order of checks in `_apply_event`: (1) init event, (2) result event, (3) team task events
Files:

- `src/app/engines/cc_engine.py` (edit – update `_apply_event`, remove/update `_TEAM_EVENT_TYPES`)
- `tests/engines/test_cc_engine.py` (edit – update `parse_stream_json` tests to use real event format, add tests for `task_started`/`task_completed` capture)
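The check order above can be sketched as follows; `apply_event` and the `state` accumulator are illustrative stand-ins for the real `cc_engine.py` internals, not its actual code:

```python
import json


def apply_event(event: dict, state: dict) -> None:
    """Sketch of the Feature 1 check order: init, then result, then team tasks."""
    event_type = event.get("type")
    subtype = event.get("subtype")
    if event_type == "system" and subtype == "init":
        # Init must NOT become a team artifact (AC4).
        state["session_id"] = event.get("session_id")
    elif event_type == "result":
        state["result"] = event
    elif event_type == "system" and subtype in {"task_started", "task_completed"}:
        # AC1/AC2: real CC team events use type=system with a task_* subtype.
        state.setdefault("team_artifacts", []).append(event)


# Usage against the real event shape observed in the stream:
state: dict = {}
line = '{"type":"system","subtype":"task_started","task_id":"a0310d0243dc18105","task_type":"local_agent"}'
apply_event(json.loads(line), state)
```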
### Feature 2: Pass `cc_teams` Flag Through to `engine_type` Assignment
Description: engine_type is set at app.py:262 based on whether cc_result.team_artifacts is non-empty: "cc_teams" if cc_result.team_artifacts else "cc_solo". This is fragile — if CC runs in teams mode but emits no parseable team events (Bug 1, or a short run), engine_type is wrong. The user’s explicit cc_teams flag is the source of truth for mode selection but is consumed in CLI (run_cli.py:115) and GUI (run_app.py:331) and never forwarded to main() or _run_cc_engine_path().
Acceptance Criteria:

- AC1: `main()` accepts a `cc_teams: bool = False` parameter
- AC2: `_run_cc_engine_path()` accepts a `cc_teams: bool` parameter
- AC3: `engine_type` is set from the `cc_teams` flag: `"cc_teams" if cc_teams else "cc_solo"` (not from `team_artifacts`)
- AC4: CLI (`run_cli.py`) passes `cc_teams` to `main()`
- AC5: GUI (`run_app.py:_execute_query_background`) passes `cc_teams` to `main()`
- AC6: When `cc_teams=True` and `team_artifacts` is empty, `engine_type` is still `"cc_teams"`
- AC7: When `cc_teams=False`, `engine_type` is `"cc_solo"` regardless of `team_artifacts` content
- AC8: `make validate` passes with no regressions
Technical Requirements:

- Add a `cc_teams: bool = False` parameter to the `main()` signature (app.py:334)
- Add a `cc_teams: bool` parameter to the `_run_cc_engine_path()` signature (app.py:218)
- Change app.py:262 from `"cc_teams" if cc_result.team_artifacts else "cc_solo"` to `"cc_teams" if cc_teams else "cc_solo"`
- CLI fix (run_cli.py:149): pass `cc_teams=cc_teams` to the `main()` call
- GUI fix (run_app.py:334): pass `cc_teams=cc_teams` to the `main()` call
- Forward `cc_teams` from `main()` to `_run_cc_engine_path()` at the CC branch call site
Files:

- `src/app/app.py` (edit – add `cc_teams` param to `main()` and `_run_cc_engine_path()`, fix `engine_type` assignment)
- `src/run_cli.py` (edit – pass `cc_teams` to `main()`)
- `src/gui/pages/run_app.py` (edit – pass `cc_teams` to `main()`)
- `tests/cli/test_cc_engine_wiring.py` (edit – update `engine_type` tests to use `cc_teams` flag instead of `team_artifacts` inference)
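The behavioral core of AC3/AC6/AC7 fits in one line; `resolve_engine_type` is a hypothetical name for the assignment that lives inline at app.py:262:

```python
def resolve_engine_type(cc_teams: bool, team_artifacts: list) -> str:
    """Sketch of the AC3 change: the user's explicit mode flag, not the
    parsed artifacts, decides the engine type."""
    return "cc_teams" if cc_teams else "cc_solo"


assert resolve_engine_type(True, []) == "cc_teams"        # AC6: empty artifacts, still teams
assert resolve_engine_type(False, ["task"]) == "cc_solo"  # AC7: flag wins over artifacts
```

The point of the change is that `team_artifacts` no longer participates in the decision at all, so a short teams run with no parseable team events still reports the right engine.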
### Feature 3: Skip Tier 3 for Empty Trace Data
Description: When GraphTraceData has empty tool_calls and empty agent_interactions (e.g., CC solo runs with no trace artifacts), evaluate_graph_metrics returns an all-zero Tier3Result. This non-None result bypasses the fallback strategy (_apply_fallback_strategy), silently penalizing the composite score by 0.334 (two metrics × 0.167 weight). The fix: return None from _execute_tier3 when trace data is empty, triggering the existing tier1_only fallback which creates neutral 0.5 scores.
Acceptance Criteria:

- AC1: `_execute_tier3` returns `(None, 0.0)` when `GraphTraceData` has empty `tool_calls` AND empty `agent_interactions`
- AC2: A log message at INFO level is emitted when Tier 3 is skipped due to empty trace
- AC3: `performance_monitor.record_tier_execution(3, 0.0)` is called for the skip case
- AC4: Existing Tier 3 behavior is unchanged when trace data has tool_calls or agent_interactions
- AC5: The `tier1_only` fallback strategy creates a neutral Tier 3 result (0.5 scores) when Tier 3 returns None
- AC6: `make validate` passes with no regressions
Technical Requirements:

- In `_execute_tier3` (evaluation_pipeline.py:323), after `trace_data = self._create_trace_data(execution_trace)`, add an early return guard checking `not trace_data.tool_calls and not trace_data.agent_interactions`
- Record tier execution with 0.0 time before returning to keep performance stats consistent
- The existing `_apply_fallback_strategy` (evaluation_pipeline.py:369) already handles `results.tier3 is None` by creating a `Tier3Result` with 0.5 scores — no changes needed there
Files:

- `src/app/judge/evaluation_pipeline.py` (edit – add empty-trace early return in `_execute_tier3`)
- `tests/evals/test_evaluation_pipeline.py` (edit – add test for empty-trace skip behavior)
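A minimal sketch of the early-return guard, assuming a trimmed-down `GraphTraceData` with only the two fields the guard inspects; the real guard sits inside `_execute_tier3` next to the `performance_monitor` call:

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class GraphTraceData:
    """Illustrative stand-in for the real model; only two fields modeled."""
    tool_calls: list = field(default_factory=list)
    agent_interactions: list = field(default_factory=list)


def execute_tier3_guard(trace_data: GraphTraceData):
    """Sketch of the AC1-AC3 early return."""
    if not trace_data.tool_calls and not trace_data.agent_interactions:
        logger.info("Tier 3 skipped: empty trace data, deferring to fallback strategy")
        # performance_monitor.record_tier_execution(3, 0.0) would be called here (AC3)
        return (None, 0.0)
    return None  # fall through to normal Tier 3 evaluation
```

Returning `None` for the tier result lets the existing `tier1_only` fallback substitute neutral 0.5 scores instead of the all-zero `Tier3Result` that currently penalizes the composite.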
### Feature 4: Wire `evaluate_composite_with_trace` into Production Pipeline
Description: CompositeScorer.evaluate_composite_with_trace detects single-agent mode from GraphTraceData and redistributes coordination_quality weight to remaining metrics. However, it is never called from production code — _generate_composite_score only calls evaluate_composite or evaluate_composite_with_optional_tier2. This means CC solo runs (and any single-agent execution) never benefit from weight redistribution, and coordination_quality=0 silently penalizes the composite score.
Acceptance Criteria:

- AC1: `_generate_composite_score` accepts an optional `trace_data: GraphTraceData | None` parameter
- AC2: When `trace_data` is provided and `results.is_complete()`, `evaluate_composite_with_trace` is called
- AC3: When `trace_data` is None, existing routing to `evaluate_composite`/`evaluate_composite_with_optional_tier2` is preserved
- AC4: `evaluate_comprehensive` retains the `GraphTraceData` object and passes it to `_generate_composite_score`
- AC5: CC solo runs with empty `agent_interactions` trigger single-agent detection and weight redistribution
- AC6: `make validate` passes with no regressions
Technical Requirements:

- In `evaluate_comprehensive` (evaluation_pipeline.py:476), retain a `GraphTraceData` reference when converting `execution_trace` to dict — currently the object is discarded after conversion
- Add a `trace_data: GraphTraceData | None = None` parameter to `_generate_composite_score` (evaluation_pipeline.py:279)
- New routing: if `trace_data is not None and results.is_complete()` → call `self.composite_scorer.evaluate_composite_with_trace(results, trace_data)`; otherwise fall through to existing logic
- `evaluate_composite_with_trace` already handles both single-agent and multi-agent cases internally (composite_scorer.py:456-517)
Files:

- `src/app/judge/evaluation_pipeline.py` (edit – update `_generate_composite_score` signature and routing, update `evaluate_comprehensive` to retain and pass trace data)
- `tests/evals/test_evaluation_pipeline.py` (edit – add test for trace-aware composite scoring path)
- `tests/evals/test_composite_scorer.py` (edit – add integration test for trace-aware path)
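The weight redistribution that this wiring unlocks can be sketched as below; the function name and weight values are illustrative assumptions, not the actual `CompositeScorer` internals (the source only states that `coordination_quality` weight is redistributed to the remaining metrics):

```python
def redistribute_weights(weights: dict[str, float], is_single_agent: bool) -> dict[str, float]:
    """Sketch of single-agent weight redistribution: drop coordination_quality
    and spread its weight proportionally so the total still sums to 1.0."""
    if not is_single_agent:
        return dict(weights)
    dropped = weights["coordination_quality"]
    remaining = {k: v for k, v in weights.items() if k != "coordination_quality"}
    total = sum(remaining.values())
    return {k: v + dropped * (v / total) for k, v in remaining.items()}


# Usage with illustrative weights (0.167 is the per-metric weight cited in Feature 3):
weights = {"coordination_quality": 0.167, "tool_efficiency": 0.167, "task_success": 0.666}
single = redistribute_weights(weights, is_single_agent=True)
```

Without this path, a single-agent run keeps `coordination_quality=0` at full weight, which is exactly the silent penalty the feature removes.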
### Feature 5: Propagate Actual Execution Timestamps to `time_taken` Metric
Description: time_taken is always ~0.999 because _execute_tier1 captures start_evaluation = time.time() and immediately passes time.time() as end_time — both timestamps are microseconds apart. The measure_execution_time formula exp(-duration) then returns exp(~0) ≈ 0.999. The actual agent execution (e.g., CC solo ran for 158 seconds) is never measured or propagated. The fix: capture wall-clock timestamps around the subprocess/agent execution and propagate them through the pipeline to _execute_tier1.
Acceptance Criteria:

- AC1: `CCResult` has `start_time: float` and `end_time: float` fields
- AC2: `run_cc_solo` captures `time.time()` before and after `subprocess.run()` and stores on `CCResult`
- AC3: `run_cc_teams` captures `time.time()` before and after the `Popen` block and stores on `CCResult`
- AC4: `run_evaluation_if_enabled` accepts `execution_start_time: float = 0.0` and `execution_end_time: float = 0.0`
- AC5: `evaluate_comprehensive` accepts and forwards `execution_start_time`/`execution_end_time` to `_execute_tier1`
- AC6: `_execute_tier1` uses external timestamps when non-zero, falls back to `time.time()` when zero
- AC7: MAS engine path captures timing around `run_manager()` and passes to evaluation
- AC8: CC engine path extracts `cc_result.start_time`/`cc_result.end_time` and passes to evaluation
- AC9: `make validate` passes with no regressions
Technical Requirements:

- Add `start_time: float = Field(default=0.0)` and `end_time: float = Field(default=0.0)` to `CCResult` (cc_engine.py:67-87)
- Wrap `subprocess.run()` in `run_cc_solo` (cc_engine.py:~380) with `time.time()` before/after
- Wrap the `Popen` block in `run_cc_teams` (cc_engine.py:~440) with `time.time()` before/after; set `start_time`/`end_time` on `CCResult` after construction
- Add `execution_start_time: float = 0.0` and `execution_end_time: float = 0.0` to `run_evaluation_if_enabled` (evaluation_runner.py:115); forward to `pipeline.evaluate_comprehensive`
- Add the same params to `evaluate_comprehensive` (evaluation_pipeline.py:476) and `_execute_tier1` (evaluation_pipeline.py:138)
- In `_execute_tier1`, replace `start_evaluation = time.time()`/`time.time()` with the external timestamps when non-zero
- In `_run_cc_engine_path` (app.py:218): pass `cc_result.start_time`/`cc_result.end_time`
- In `_run_mas_engine_path` (app.py:266): wrap `run_manager()` with `time.time()` before/after
Files:

- `src/app/engines/cc_engine.py` (edit – add timing fields to `CCResult`, capture in `run_cc_solo`/`run_cc_teams`)
- `src/app/app.py` (edit – capture and pass timing from both engine paths)
- `src/app/judge/evaluation_runner.py` (edit – add timing params, forward to pipeline)
- `src/app/judge/evaluation_pipeline.py` (edit – accept and use external timestamps in `evaluate_comprehensive` and `_execute_tier1`)
- `tests/evals/test_evaluation_pipeline.py` (edit – add test for timestamp propagation)
- `tests/judge/test_evaluation_runner.py` (edit – add timing params to call sites, add forward-propagation test)
- `tests/engines/test_cc_engine.py` (edit – verify `CCResult` timing fields populated)
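The AC6 fallback logic and the bug it fixes can be sketched together; `tier1_time_taken` is a hypothetical distillation of `_execute_tier1`'s time handling, and `measure_execution_time` mirrors only the `exp(-duration)` formula the description cites:

```python
import math
import time


def measure_execution_time(start_time: float, end_time: float) -> float:
    # Illustrative stand-in for the exp(-duration) scoring formula.
    return math.exp(-(end_time - start_time))


def tier1_time_taken(execution_start_time: float = 0.0,
                     execution_end_time: float = 0.0) -> float:
    """Sketch of AC6: prefer external timestamps when non-zero, otherwise fall
    back to the local clock (the pre-fix behavior that always yields ~0.999)."""
    if execution_start_time > 0.0 and execution_end_time > 0.0:
        start, end = execution_start_time, execution_end_time
    else:
        # Pre-fix behavior: two calls microseconds apart, so exp(~0) is ~0.999.
        start, end = time.time(), time.time()
    return measure_execution_time(start, end)
```

For the 158-second CC solo run cited above, `tier1_time_taken(t0, t0 + 158.0)` is effectively zero, while the zero-default fallback path stays near 0.999, which is exactly the gap this feature closes.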
### Feature 6: Deduplicate `semantic_score` from `cosine_score`
Description: compute_semantic_similarity (traditional_metrics.py:218) delegates to compute_cosine_similarity because BERTScore is disabled due to build issues. This means semantic_score == cosine_score always, giving cosine 0.7 effective weight in the Tier 1 formula (0.4 × semantic + 0.3 × cosine) while Jaccard gets only 0.2. The fix: use Levenshtein similarity (already available via textdistance in pyproject.toml, with compute_levenshtein_similarity already implemented in the same class) as the semantic fallback. This provides a distinct character-level sequence similarity signal.
Acceptance Criteria:

- AC1: `compute_semantic_similarity` delegates to `compute_levenshtein_similarity` instead of `compute_cosine_similarity`
- AC2: `semantic_score` and `cosine_score` produce different values for non-identical texts
- AC3: `semantic_score` returns 1.0 for identical texts and 0.0 for empty-vs-nonempty texts
- AC4: The `Tier1Result.semantic_score` field description is updated to reflect the Levenshtein-based calculation
- AC5: No new dependencies added — uses the existing `textdistance` library
- AC6: `make validate` passes with no regressions
Technical Requirements:

- In `compute_semantic_similarity` (traditional_metrics.py:218), change `return self.compute_cosine_similarity(text1, text2)` to `return self.compute_levenshtein_similarity(text1, text2)`
- Update the method's docstring and log message to say "Levenshtein" not "cosine similarity fallback"
- In `evaluation_models.py`, update the `Tier1Result.semantic_score` field description from "BERT-based" to "Levenshtein-based sequence similarity (BERTScore disabled)"
- `compute_levenshtein_similarity` already exists at traditional_metrics.py:190 with its own fallback chain
Files:

- `src/app/judge/traditional_metrics.py` (edit – change `compute_semantic_similarity` delegation)
- `src/app/data_models/evaluation_models.py` (edit – update `semantic_score` field description)
- `tests/evals/test_traditional_metrics.py` (edit – update semantic similarity tests; remove any assertions that `semantic == cosine`)
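For reference, normalized Levenshtein similarity can be sketched with the stdlib alone; the production code relies on the existing `textdistance` dependency instead, so this function is illustrative, not the project's implementation:

```python
def levenshtein_similarity(text1: str, text2: str) -> float:
    """Stdlib sketch of normalized Levenshtein similarity: 1 - distance/max_len.

    This is the distinct character-level signal Feature 6 swaps in for the
    disabled BERTScore, satisfying the AC3 edge cases explicitly.
    """
    if text1 == text2:
        return 1.0  # AC3: identical texts
    if not text1 or not text2:
        return 0.0  # AC3: empty vs nonempty
    # Classic two-row dynamic programming edit distance.
    prev = list(range(len(text2) + 1))
    for i, c1 in enumerate(text1, start=1):
        curr = [i]
        for j, c2 in enumerate(text2, start=1):
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + (c1 != c2)))  # substitution
        prev = curr
    distance = prev[-1]
    return 1.0 - distance / max(len(text1), len(text2))
```

Because it measures edit operations rather than token overlap, it diverges from cosine similarity on non-identical texts (AC2), which is the whole point of the deduplication.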
### Feature 7: Replace Binary `task_success` with Continuous Score
Description: assess_task_success (traditional_metrics.py:256) returns exactly 1.0 or 0.0 based on whether weighted similarity meets the 0.8 threshold. For generative review tasks where typical text similarity ranges 0.3–0.6, this almost always returns 0.0, providing zero useful signal in the composite score. The fix: use proportional credit min(1.0, similarity / threshold) which gives linear gradient below threshold and full credit at/above threshold.
Acceptance Criteria:

- AC1: `assess_task_success` returns a continuous float in `[0.0, 1.0]` instead of binary `{0.0, 1.0}`
- AC2: When weighted similarity >= threshold, returns 1.0
- AC3: When weighted similarity < threshold, returns `weighted_similarity / threshold` (proportional credit)
- AC4: When weighted similarity is 0.0, returns 0.0
- AC5: When threshold is 0.0, returns 0.0 (avoid division by zero)
- AC6: `make validate` passes with no regressions
Technical Requirements:

- In `assess_task_success` (traditional_metrics.py:256), replace `return 1.0 if overall_similarity >= threshold else 0.0` with `return min(1.0, overall_similarity / threshold) if threshold > 0.0 else 0.0`
- Update the method's docstring to document the continuous scoring behavior
- No config changes — the 0.8 threshold still represents the "full credit" target; the change is in how sub-threshold scores are handled
Files:

- `src/app/judge/traditional_metrics.py` (edit – change `assess_task_success` return logic)
- `tests/evals/test_traditional_metrics.py` (edit – update tests from binary assertions to continuous range checks)
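The one-line change above, isolated as a standalone sketch (the real method computes `overall_similarity` from weighted text metrics first):

```python
def assess_task_success(overall_similarity: float, threshold: float = 0.8) -> float:
    """Continuous proportional credit: linear gradient below the threshold,
    full credit at or above it, 0.0 for a zero threshold (AC5)."""
    return min(1.0, overall_similarity / threshold) if threshold > 0.0 else 0.0
```

A typical generative-task similarity of 0.48 now contributes 0.6 to the composite rather than falling off the 0.8 cliff to 0.0.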
### Feature 8: Consolidate Run Artifacts into Per-Run Directories
Description: Currently, run artifacts are scattered across 4 flat directories (logs/Agent_evals/cc_streams/, logs/Agent_evals/traces/, results/MAS_reviews/, results/reports/) with inconsistent naming and no per-run grouping. After 20+ runs, finding all artifacts for a single run requires cross-referencing execution IDs across directories. Filenames sort poorly because execution ID (hex hash) precedes the timestamp. Timestamp formats vary across writers (3 different formats). The fix: introduce an output/ directory with runs/ and sweeps/ subdirectories, a unified timestamp format, a RunContext that tracks the current run’s output path, and a metadata.json file that makes each run self-describing. Remove legacy path constants and all code writing to the old locations.
Current state (6 writers, 4 directories, 3 timestamp formats):

| Writer | Current path | Filename pattern | Timestamp format |
|---|---|---|---|
| `cc_engine.py:334` | `logs/Agent_evals/cc_streams/` | `cc_solo_{exec_id}_{ts}.json` | `%Y%m%dT%H%M%S` |
| `cc_engine.py:431` | `logs/Agent_evals/cc_streams/` | `cc_teams_{exec_id}_{ts}.jsonl` | `%Y%m%dT%H%M%S` |
| `trace_processors.py:312` | `logs/Agent_evals/traces/` | `trace_{exec_id}_{ts}.jsonl` | `%Y-%m-%dT%H-%M-%SZ` |
| `review_persistence.py:38` | `results/MAS_reviews/` | `{paper_id}_{ts}.json` | `%Y-%m-%dT%H-%M-%SZ` |
| `run_cli.py:164` | `results/reports/` | `{ts}.md` | `%Y%m%dT%H%M%S` |
| `sweep_runner.py:228` | `results/sweeps/{ts}/` | `results.json`, `summary.md` | `%Y%m%d_%H%M%S` |
Target state (unified output directory):

```
output/
  runs/
    {YYYYMMDD_HHMMSS}_{engine}_{paper_id}_{exec_id_8}/
      metadata.json    ← engine_type, paper_id, exec_id, timestamps, CLI args
      stream.json      ← CC solo output (if CC solo)
      stream.jsonl     ← CC teams output (if CC teams)
      trace.jsonl      ← MAS trace (if MAS)
      review.json      ← MAS review (if MAS)
      evaluation.json  ← pipeline results (currently in-memory only)
      report.md        ← evaluation report (if --generate-report)
    traces.db          ← shared SQLite trace index (across all runs)
  sweeps/
    {YYYYMMDD_HHMMSS}/
      results.json     ← raw per-evaluation scores
      summary.md       ← Markdown statistical summary
```
This feature is split into 3 stories to manage scope:

- STORY-008: `RunContext` + `metadata.json` + path constants — foundational infrastructure
- STORY-009: Migrate all 6 writers to use `RunContext` — the actual file moves
- STORY-010: Persist evaluation results to `evaluation.json` — new capability enabled by per-run dirs
#### 8.1 Introduce RunContext and Per-Run Directory Infrastructure (STORY-008)
Description: Create a RunContext dataclass that owns the per-run output directory. It is created at the start of each main() invocation with the run’s engine type, paper ID, and execution ID. It creates output/runs/{YYYYMMDD_HHMMSS}_{engine}_{paper_id}_{exec_id_8}/, writes metadata.json, and exposes path helpers (stream_path, trace_path, review_path, report_path, evaluation_path). Replace legacy path constants in config_app.py with single OUTPUT_PATH. Adopt unified timestamp format %Y%m%dT%H%M%S everywhere.
Acceptance Criteria:

- AC1: `RunContext` dataclass exists with fields: `engine_type`, `paper_id`, `execution_id`, `start_time`, `run_dir` (Path)
- AC2: `RunContext.create(engine_type, paper_id, execution_id)` creates the directory `output/runs/{YYYYMMDD_HHMMSS}_{engine}_{paper_id}_{exec_id_8}/` and writes `metadata.json`
- AC3: `metadata.json` contains: `engine_type`, `paper_id`, `execution_id`, `start_time` (ISO), `cli_args` (optional dict)
- AC4: Path helpers return correct filenames: `stream_path` → `stream.json`/`stream.jsonl` (based on engine_type), `trace_path` → `trace.jsonl`, `review_path` → `review.json`, `report_path` → `report.md`, `evaluation_path` → `evaluation.json`
- AC5: `OUTPUT_PATH = "output"` constant added to `config_app.py`
- AC6: Legacy constants `CC_STREAMS_PATH`, `MAS_REVIEWS_PATH`, `RESULTS_PATH` removed from `config_app.py`
- AC7: `LOGS_PATH` (Loguru logs) and `LOGS_BASE_PATH` remain unchanged — application logs are not per-run
- AC8: `JudgeSettings.trace_storage_path` default changed from `logs/Agent_evals/traces` to `output/runs` (fallback when `run_dir` is None)
- AC9: `main()` creates `RunContext` after engine execution completes (once `execution_id` is known) and passes it to evaluation and writer paths
- AC10: `output/` added to `.gitignore` (`results/` entry kept for existing artifacts)
- AC11: `make validate` passes with no regressions
Technical Requirements:

- New file `src/app/utils/run_context.py` with the `RunContext` dataclass (Pydantic model)
- `RunContext.create()` classmethod: generates the `run_dir` name from `datetime.now().strftime("%Y%m%dT%H%M%S")`, `engine_type`, `paper_id`, `execution_id[:8]`; calls `mkdir(parents=True)` under `output/runs/`; writes `metadata.json` via `model_dump_json()`
- Update `config_app.py`: add `OUTPUT_PATH = "output"`, remove `CC_STREAMS_PATH`, `MAS_REVIEWS_PATH`, `RESULTS_PATH`
- Update `app.py:main()`: create `RunContext` after the engine type is known, pass to `_run_cc_engine_path()` and `_run_mas_engine_path()`
- For CC paths: `RunContext` is created after `run_cc_solo`/`run_cc_teams` returns (execution_id only known after CC runs). The stream file is written to a temp location first, then moved into the run dir. This matches the existing pattern where cc_teams renames the stream file after extracting session_id.
- `ArtifactRegistry` calls updated to register paths from `RunContext`
- GUI evaluation page `default_traces_dir` (evaluation.py:320) updated to `"output/runs/"`
Files:

- `src/app/utils/run_context.py` (new – `RunContext` dataclass with path helpers and metadata writer)
- `src/app/config/config_app.py` (edit – add `OUTPUT_PATH`, remove `CC_STREAMS_PATH`, `MAS_REVIEWS_PATH`, `RESULTS_PATH`)
- `src/app/config/judge_settings.py` (edit – remove `trace_storage_path` default or point to `OUTPUT_PATH`)
- `src/app/app.py` (edit – create `RunContext` in `main()`, pass to engine/eval paths)
- `src/gui/pages/evaluation.py` (edit – update `default_traces_dir`)
- `.gitignore` (edit – add `output/`, keep `results/`)
- `tests/utils/test_run_context.py` (new – test directory creation, metadata.json content, path helpers)
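A stdlib-dataclass sketch of the `RunContext` shape described above; the PRD specifies a Pydantic model, so `model_dump_json()` would replace the manual `json.dumps` here, and the `output_root` parameter is added only so the sketch is testable:

```python
import json
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path


@dataclass
class RunContext:
    """Sketch of STORY-008: owns the per-run output directory and path helpers."""
    engine_type: str
    paper_id: str
    execution_id: str
    start_time: str
    run_dir: Path

    @classmethod
    def create(cls, engine_type: str, paper_id: str, execution_id: str,
               output_root: Path = Path("output")) -> "RunContext":
        now = datetime.now()
        ts = now.strftime("%Y%m%dT%H%M%S")
        run_dir = output_root / "runs" / f"{ts}_{engine_type}_{paper_id}_{execution_id[:8]}"
        run_dir.mkdir(parents=True, exist_ok=True)
        ctx = cls(engine_type, paper_id, execution_id, now.isoformat(), run_dir)
        # AC2/AC3: the run is self-describing via metadata.json.
        (run_dir / "metadata.json").write_text(json.dumps({
            "engine_type": engine_type,
            "paper_id": paper_id,
            "execution_id": execution_id,
            "start_time": ctx.start_time,
        }, indent=2))
        return ctx

    @property
    def stream_path(self) -> Path:
        # AC4: file extension depends on the engine mode.
        suffix = "jsonl" if self.engine_type == "cc_teams" else "json"
        return self.run_dir / f"stream.{suffix}"

    @property
    def trace_path(self) -> Path:
        return self.run_dir / "trace.jsonl"

    @property
    def review_path(self) -> Path:
        return self.run_dir / "review.json"

    @property
    def report_path(self) -> Path:
        return self.run_dir / "report.md"

    @property
    def evaluation_path(self) -> Path:
        return self.run_dir / "evaluation.json"
```

Centralizing the timestamp in the directory name is what lets STORY-009 strip the per-writer `strftime()` calls.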
#### 8.2 Migrate All Writers to Per-Run Directories (STORY-009, depends: STORY-008, STORY-005)
Description: Update all 6 file writers to use RunContext path helpers instead of constructing paths from legacy constants. Each writer receives RunContext (or run_dir: Path) and writes to the run directory. Remove timestamp generation from individual writers — RunContext owns the timestamp. Remove CC_STREAMS_PATH usage from cc_engine.py, LOGS_BASE_PATH/traces from trace_processors.py, MAS_REVIEWS_PATH from review_persistence.py, and hardcoded results/reports from run_cli.py.
Acceptance Criteria:

- AC1: `run_cc_solo` writes the stream to `run_context.stream_path` instead of `cc_streams/cc_solo_{exec_id}_{ts}.json`
- AC2: `run_cc_teams` writes the stream to `run_context.stream_path` instead of `cc_streams/cc_teams_{exec_id}_{ts}.jsonl`
- AC3: `TraceCollector._store_trace()` writes to `run_context.trace_path` instead of `traces/trace_{exec_id}_{ts}.jsonl`
- AC4: `ReviewPersistence.save_review()` writes to `run_context.review_path` instead of `MAS_reviews/{paper_id}_{ts}.json`
- AC5: CLI report save writes to `run_context.report_path` instead of `results/reports/{ts}.md`
- AC6: The `traces.db` SQLite database writes to `output/runs/traces.db` (shared across runs, not per-run)
- AC7: `review_loader.py` deleted — dead code (no imports in `src/`, no tests), references removed `MAS_REVIEWS_PATH`
- AC8: No code references `CC_STREAMS_PATH`, `MAS_REVIEWS_PATH`, `RESULTS_PATH`, or `LOGS_BASE_PATH/traces` for file writes
- AC9: `ArtifactRegistry` entries point to new per-run paths
- AC10: Sweep runner default `output_dir` changed from `results/sweeps/{ts}` to `output/sweeps/{ts}`
- AC11: The `--output-dir` CLI override on `run_sweep.py` still works
- AC12: `make validate` passes with no regressions
Technical Requirements:

- `cc_engine.py`: `run_cc_solo()` and `run_cc_teams()` accept a `run_dir: Path` parameter; write the stream to `run_dir / "stream.json"` (solo) or `run_dir / "stream.jsonl"` (teams); remove the `CC_STREAMS_PATH` import and local timestamp generation
- `trace_processors.py`: `TraceCollector.__init__()` accepts optional `run_dir: Path`; `_store_trace()` writes to `run_dir / "trace.jsonl"` when set; `traces.db` moves to `resolve_project_path(OUTPUT_PATH) / "runs" / "traces.db"` (shared index)
- `review_persistence.py`: `ReviewPersistence.__init__()` accepts optional `run_dir: Path`; `save_review()` writes to `run_dir / "review.json"` when set; remove the `MAS_REVIEWS_PATH` import
- `run_cli.py`: report save uses `run_context.report_path` instead of constructing `Path("results") / "reports" / f"{timestamp}.md"`
- `sweep_runner.py`: remove the `RESULTS_PATH` import (no default `output_dir` here — `SweepConfig.output_dir` is a required field)
- `run_sweep.py`: change the default `output_dir` from `f"results/sweeps/{ts}"` to `f"output/sweeps/{ts}"` (run_sweep.py:150 owns the default); update the `--output-dir` argparse default if hardcoded
- `app.py`: pass `RunContext` (or `run_dir`) to CC engine functions and trace/review components
- All writers: remove individual `strftime()` calls — the `RunContext` directory name carries the timestamp
Files:

- `src/app/engines/cc_engine.py` (edit – accept `run_dir`, write stream to run dir, remove `CC_STREAMS_PATH`)
- `src/app/judge/trace_processors.py` (edit – accept `run_dir`, write trace to run dir, move `traces.db`)
- `src/app/data_utils/review_persistence.py` (edit – accept `run_dir`, write review to run dir, remove `MAS_REVIEWS_PATH`)
- `src/app/data_utils/review_loader.py` (delete – dead code, no imports in src/, no tests)
- `src/run_cli.py` (edit – use `run_context.report_path` for report save)
- `src/app/app.py` (edit – plumb `RunContext` to all writers)
- `src/app/benchmark/sweep_runner.py` (edit – change default `output_dir` to `output/sweeps/`)
- `src/run_sweep.py` (edit – update default `--output-dir` if hardcoded)
- `tests/engines/test_cc_engine.py` (edit – update stream write tests to use `run_dir`)
- `tests/judge/test_trace_processors.py` (edit – update trace write tests)
- `tests/data_utils/test_review_persistence.py` (edit – update review write tests)
#### 8.3 Persist Evaluation Results to evaluation.json (STORY-010, depends: STORY-009)
Description: Evaluation pipeline results are currently returned in-memory and never written to disk (except indirectly via sweep results.json). With per-run directories, write the composite evaluation result to run_dir/evaluation.json after evaluate_comprehensive completes. This makes each run fully self-contained: stream/trace + review + evaluation + report all in one directory.
Acceptance Criteria:

- AC1: `evaluation.json` is written to `run_context.evaluation_path` after `evaluate_comprehensive` returns
- AC2: `evaluation.json` contains the full `CompositeResult` (tier1, tier2, tier3, composite scores)
- AC3: `evaluation.json` is only written when evaluation actually ran (not when `skip_eval=True`)
- AC4: `ArtifactRegistry` registers `evaluation.json` as an `"Evaluation"` artifact
- AC5: `make validate` passes with no regressions
Technical Requirements:

- In `run_evaluation_if_enabled` (evaluation_runner.py), after the pipeline returns results, write `result_dict` to `run_context.evaluation_path` via `json.dumps()` with `indent=2`
- Guard: only write if `run_context` is provided and results are non-None
- Register the artifact path in `ArtifactRegistry`
Files:

- `src/app/judge/evaluation_runner.py` (edit – write `evaluation.json` after pipeline completes)
- `tests/judge/test_evaluation_runner.py` (edit – verify `evaluation.json` written with correct content)
## Non-Functional Requirements
- No new external dependencies
- Scoring changes: Features 3–7 change evaluation score behavior. Existing score comparisons against historical runs will not be directly comparable after these changes.
- Output directory migration: Feature 8 consolidates all output under `output/` and removes legacy paths (`logs/Agent_evals/cc_streams/`, `logs/Agent_evals/traces/`, `results/`). Existing artifacts in those directories are not migrated. No backward compatibility layer.
- Change comments: Every non-trivial code change must include a concise inline comment with sprint, story, and reason. Format: `# S12-F{N}: {why}`. Keep comments to one line. Omit for trivial changes (string edits, config values).
Out of Scope¶
- CC-specific Tier 3 graph metrics (delegation fan-out, task completion rate, teammate utilization) — requires separate design
- Richer CC stream event parsing (tool use events, assistant messages) — only task lifecycle events needed for now
- GUI Sweep Page — deferred from Sprint 11
- `create_llm_model()` registry pattern refactor — deferred from Sprint 11
- BERTScore re-enablement — blocked by build issues, Levenshtein sufficient for deduplication
Notes for Ralph Loop¶
Priority Order¶
- P0 (bug fix): STORY-001 (stream event parsing — root cause), STORY-002 (cc_teams flag passthrough — enables correct engine_type)
- P1 (scoring fix): STORY-003 → STORY-004 (Tier 3 fallback + single-agent redistribution), STORY-005 (time_taken timestamps), STORY-006 (semantic dedup), STORY-007 (task_success continuous)
- P2 (UX): STORY-008 → STORY-009 → STORY-010 (per-run output directories)
Story Breakdown (10 stories total)¶
- Feature 1 → STORY-001: Fix CC teams stream event parsing. Update `_apply_event` to capture `task_started`/`task_completed` system events as team artifacts. Remove stale `_TEAM_EVENT_TYPES` constant. TDD: update existing `parse_stream_json` tests to use real CC event format, add new tests for task lifecycle events. Files: `src/app/engines/cc_engine.py`, `tests/engines/test_cc_engine.py`.
- Feature 2 → STORY-002: Pass `cc_teams` flag through to `engine_type` assignment (depends: STORY-001). Add `cc_teams` param to `main()` and `_run_cc_engine_path()`. Wire from CLI and GUI. Change `engine_type` to use flag instead of `team_artifacts` inference. TDD: update `test_cc_engine_wiring.py` tests. Files: `src/app/app.py`, `src/run_cli.py`, `src/gui/pages/run_app.py`, `tests/cli/test_cc_engine_wiring.py`.
- Feature 3 → STORY-003: Skip Tier 3 for empty trace data. Add early return in `_execute_tier3` when trace has no tool_calls or agent_interactions. Triggers existing fallback (neutral 0.5 scores). Files: `src/app/judge/evaluation_pipeline.py`, `tests/evals/test_evaluation_pipeline.py`.
- Feature 4 → STORY-004: Wire `evaluate_composite_with_trace` into production (depends: STORY-003). Update `_generate_composite_score` to accept trace data and route to `evaluate_composite_with_trace` for single-agent detection. Retain `GraphTraceData` ref in `evaluate_comprehensive`. Files: `src/app/judge/evaluation_pipeline.py`, `tests/evals/test_evaluation_pipeline.py`, `tests/evals/test_composite_scorer.py`.
- Feature 5 → STORY-005: Propagate actual execution timestamps to `time_taken` (depends: STORY-004). Add timing to `CCResult`, capture around subprocess in `run_cc_solo`/`run_cc_teams`, propagate through `evaluation_runner` → `evaluation_pipeline` → `_execute_tier1`. Files: `src/app/engines/cc_engine.py`, `src/app/app.py`, `src/app/judge/evaluation_runner.py`, `src/app/judge/evaluation_pipeline.py`, tests.
- Feature 6 → STORY-006: Deduplicate `semantic_score` from `cosine_score`. Change `compute_semantic_similarity` to use Levenshtein instead of cosine. Uses existing `textdistance` library. Files: `src/app/judge/traditional_metrics.py`, `src/app/data_models/evaluation_models.py`, `tests/evals/test_traditional_metrics.py`.
- Feature 7 → STORY-007: Replace binary `task_success` with continuous score. Change `assess_task_success` from `0/1` to `min(1.0, similarity/threshold)`. Files: `src/app/judge/traditional_metrics.py`, `tests/evals/test_traditional_metrics.py`.
- Feature 8.1 → STORY-008: Introduce `RunContext` and per-run directory infrastructure. Create `RunContext` dataclass with path helpers, `metadata.json` writer, unified timestamp. Add `OUTPUT_PATH`, remove `CC_STREAMS_PATH`/`MAS_REVIEWS_PATH`/`RESULTS_PATH`. Create in `main()`. Files: `src/app/utils/run_context.py` (new), `src/app/config/config_app.py`, `src/app/config/judge_settings.py`, `src/app/app.py`, `src/gui/pages/evaluation.py`, `tests/utils/test_run_context.py` (new).
- Feature 8.2 → STORY-009: Migrate all writers to per-run directories (depends: STORY-008, STORY-005). Update `cc_engine.py`, `trace_processors.py`, `review_persistence.py`, `run_cli.py`, `sweep_runner.py` to write via `RunContext`/`OUTPUT_PATH` paths. Delete dead `review_loader.py`. Remove legacy path constants usage. Files: `src/app/engines/cc_engine.py`, `src/app/judge/trace_processors.py`, `src/app/data_utils/review_persistence.py`, `src/app/data_utils/review_loader.py` (delete), `src/run_cli.py`, `src/app/benchmark/sweep_runner.py`, `src/app/app.py`, tests.
- Feature 8.3 → STORY-010: Persist evaluation results to `evaluation.json` (depends: STORY-009). Write `CompositeResult` to `run_dir/evaluation.json` after pipeline completes. Files: `src/app/judge/evaluation_runner.py`, `tests/judge/test_evaluation_runner.py`.
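The STORY-001 parsing fix can be sketched as below. The event shape (`type=system`, `subtype=task_started`/`task_completed`) comes from the sprint goal; the function signature and the `team_artifacts` accumulator are illustrative assumptions about `cc_engine.py` internals, not its actual code.

```python
def apply_event(event: dict, team_artifacts: list) -> None:
    """Sketch of the STORY-001 fix: capture CC task lifecycle events.

    Real CC team events arrive as {"type": "system", "subtype": ...};
    the old {"TeamCreate", "Task"} event-type check never matched them.
    """
    if event.get("type") != "system":
        return
    if event.get("subtype") in ("task_started", "task_completed"):
        team_artifacts.append(event)


def has_team_activity(team_artifacts: list) -> bool:
    """True when at least one task lifecycle event was captured."""
    return len(team_artifacts) > 0
```

Note that with STORY-002, `has_team_activity` would no longer drive `engine_type`; the user's explicit `cc_teams` flag would.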
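One way STORY-004's single-agent weight redistribution could work is sketched below. This is an assumption about `evaluate_composite_with_trace`'s behavior: the metric key names and the even-spread policy are illustrative, not taken from the codebase.

```python
def redistribute_weights(weights: dict, trace_agents: int) -> dict:
    """Sketch of single-agent weight redistribution (STORY-004).

    When the trace shows only one agent, coordination metrics carry no
    signal, so (under this assumed policy) the coordination weight is
    spread evenly over the remaining metrics.
    """
    if trace_agents > 1 or "coordination_quality" not in weights:
        return dict(weights)
    out = {k: v for k, v in weights.items() if k != "coordination_quality"}
    freed = weights["coordination_quality"] / len(out)
    return {k: v + freed for k, v in out.items()}
```

The weights still sum to the same total after redistribution, so composite scores stay on the same scale for solo and team runs.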
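For STORY-005, the core of the fix is timing the actual agent execution rather than taking two timestamps inside the evaluator. A generic sketch (the real change threads start/end through `CCResult` and the pipeline; the `timed_run` helper here is hypothetical):

```python
import time


def timed_run(fn):
    """Measure real execution duration around a callable (STORY-005 sketch).

    time.monotonic() is used because it is immune to wall-clock
    adjustments; the measured window is the agent run itself, not two
    near-identical timestamps taken after the fact.
    """
    start = time.monotonic()
    result = fn()
    end = time.monotonic()
    return result, end - start
```

In the actual story, the equivalent of `start`/`end` would be captured around the subprocess in `run_cc_solo`/`run_cc_teams` and propagated to `_execute_tier1` so `time_taken` reflects real duration instead of the constant ~0.999.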
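STORY-006 switches `compute_semantic_similarity` to Levenshtein via the project's existing `textdistance` library (e.g. `textdistance.levenshtein.normalized_similarity`). A dependency-free sketch of the same normalized similarity, for illustration only:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic DP, kept to one row of state."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]


def compute_semantic_similarity(reference: str, candidate: str) -> float:
    """STORY-006 sketch: edit-distance similarity, distinct from cosine."""
    longest = max(len(reference), len(candidate))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(reference, candidate) / longest
```

Because this measures character edits rather than token overlap, `semantic_score` stops mirroring `cosine_score` and contributes independent signal to the composite.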
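The STORY-007 formula `min(1.0, similarity/threshold)` is small enough to show directly; only the zero-threshold guard below is an added defensive assumption:

```python
def assess_task_success(similarity: float, threshold: float = 0.8) -> float:
    """STORY-007 sketch: continuous task success instead of binary 0/1.

    The score scales linearly with similarity and saturates at 1.0 once
    the threshold is reached, giving generative tasks a gradient rather
    than a cliff at 0.8.
    """
    if threshold <= 0:  # defensive guard, not part of the spec
        return 1.0 if similarity > 0 else 0.0
    return min(1.0, similarity / threshold)
```

For example, a review with similarity 0.4 now scores 0.5 instead of 0, while anything at or above 0.8 still scores 1.0.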
Notes for CC Agent Teams¶
Reference: docs/analysis/CC-agent-teams-orchestration.md
Teammate Definitions¶
| Teammate | Role | Model | Permissions | TDD Responsibility |
|---|---|---|---|---|
| Lead | Coordination, wave gates, `make validate` | sonnet | delegate mode | Runs full validation at wave boundaries |
| teammate-1 | Developer (src/ + tests/) | opus | acceptEdits | testing-python (RED) → implementing-python (GREEN) → `make quick_validate` |
| teammate-2 | Developer (traditional_metrics + tests) | opus | acceptEdits | testing-python (RED) → implementing-python (GREEN) → `make quick_validate` |
File-Conflict Dependencies¶
| Story | Logical Dep | Shared File / Reason |
|---|---|---|
| STORY-002 | STORY-001 | cc_engine.py (STORY-001 changes event parsing that STORY-002’s tests depend on) |
| STORY-004 | STORY-003 | evaluation_pipeline.py (STORY-003 changes _execute_tier3; STORY-004 changes _generate_composite_score in same file) |
| STORY-005 | STORY-004 | evaluation_pipeline.py (STORY-005 adds timestamp params to methods STORY-004 modified) |
| STORY-009 | STORY-008 | All writer files (STORY-009 uses RunContext from STORY-008) |
| STORY-009 | STORY-005 | cc_engine.py, app.py (STORY-005 adds timing fields that STORY-009’s writer migration must preserve) |
| STORY-010 | STORY-009 | evaluation_runner.py (STORY-010 adds evaluation.json write after STORY-009 plumbs RunContext) |
Orchestration Waves¶
Wave 0 (P0 bug fixes — sequential due to shared cc_engine.py):
teammate-1: STORY-001 (F1 stream event parsing fix) → STORY-002 (F2 cc_teams flag passthrough)
gate: lead runs `make validate`
Wave 1 (P1 scoring fixes — sequential on evaluation_pipeline.py, parallel on traditional_metrics.py):
teammate-1: STORY-003 (F3 Tier 3 empty-trace skip) → STORY-004 (F4 wire composite_with_trace) → STORY-005 (F5 timestamp propagation)
teammate-2: STORY-006 (F6 semantic dedup) → STORY-007 (F7 task_success continuous)
gate: lead runs `make validate`
Wave 2 (P2 output restructuring — sequential, touches many files):
teammate-1: STORY-008 (F8.1 RunContext infrastructure) → STORY-009 (F8.2 migrate writers) → STORY-010 (F8.3 evaluation.json)
shutdown teammate-2 after Wave 1 gate (no Wave 2 work assigned — saves token cost)
gate: lead runs `make validate`
Quality Gate Workflow¶
- Teammate completes story: runs `make quick_validate`, marks task completed via `TaskUpdate`
- Teammate picks next story: checks `TaskList` for unblocked pending tasks, claims via `TaskUpdate` with `owner`
- Wave boundary: when all stories in a wave are completed, lead runs `make validate` (full suite)
- Lead advances: if `make validate` passes, lead confirms sprint complete; if it fails, lead assigns fix tasks
- Shutdown: after Wave 2, lead sends `shutdown_request` to all teammates, then `TeamDelete`