# Core Principles 80/20

- Current State: 18% principle adherence (82% bloat)
- Target State: 80% principle adherence (80% code reduction)
- Value Impact: zero functionality lost (100% retention)
## Core Principles
### 1. Measure Review Quality vs Reference
Truth: Users need similarity scores (cosine, Jaccard, BERTScore) comparing generated reviews to references. Violation: Graph complexity analysis and LLM-as-Judge evaluation tiers measure *how* a review was produced, not *what* it says.
### 2. One-Tier Evaluation
Truth: Traditional metrics (cosine, Jaccard, BERTScore) are sufficient for quality assessment. Violation: The three-tier evaluation (Traditional → LLM Judge → Graph) is complexity theater.
### 3. Minimal Code, Maximum Value
Truth: Each dependency must justify its footprint (often >100MB) and maintenance burden. Violation: Four tracing systems (agentops, logfire, weave, opik), plus unused HuggingFace/PyTorch packages.
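The kept metrics are cheap enough to fit in a few lines. A minimal pure-Python sketch of cosine and Jaccard similarity (the repo presumably delegates to scikit-learn and textdistance; function names here are illustrative):

```python
import math
from collections import Counter

def jaccard_similarity(generated: str, reference: str) -> float:
    """Jaccard: size of the shared token set over the union of both sets."""
    a, b = set(generated.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def cosine_similarity(generated: str, reference: str) -> float:
    """Cosine: dot product of term-frequency vectors over their norms."""
    va, vb = Counter(generated.lower().split()), Counter(reference.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0
```

Identical texts score 1.0, disjoint texts 0.0 — exactly the "measure what, not how" contract Principle 1 asks for.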
## 80/20 Analysis
### Keep (20% - Core Value)
```yaml
evaluation_core:
  - src/app/evals/evaluation_pipeline.py: "Orchestrator (simplify to 50 lines)"
  - src/app/evals/traditional_metrics.py: "cosine/Jaccard/BERTScore only"
data_layer:
  - src/app/data_utils/datasets_peerread.py: "Dataset loader"
  - src/app/data_models/peerread_models.py: "Data contracts"
agent_runtime:
  - src/app/agents/agent_system.py: "Single agent runner"
total_files: 5
total_lines: ~800 (down from ~4000)
```
### Delete (80% - Bloat)
```yaml
tracing_theater:
  - "4 tracing systems → 0": "agentops, logfire, weave, opik"
  - "Impact": "Remove 4 dependencies, 500+ lines of config/integration code"
evaluation_bloat:
  - src/app/evals/graph_analysis.py: "NetworkX complexity theater"
  - src/app/evals/llm_evaluation_managers.py: "LLM APIs judging other APIs"
  - src/app/evals/composite_scorer.py: "Multi-tier orchestration"
  - "Impact": "Remove 3 files, 800+ lines, NetworkX dependency"
agent_zoo:
  - src/app/agents/orchestration.py: "Manager/Researcher/Analyst/Synthesizer"
  - src/app/agents/agent_factories.py: "Factory pattern for 1 agent type"
  - "Impact": "Merge to single agent, remove 2 files, 400+ lines"
config_sprawl:
  - src/app/evals/evaluation_config.py: "Multi-tier config"
  - src/app/utils/load_configs.py: "Over-abstracted config loading"
  - "Impact": "Simplify to single config file, remove 200+ lines"
total_deletion: ~2000 lines, 6 dependencies
```
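The config_sprawl entry collapses into one flat settings object. A hedged sketch of what that single config could look like — field names and defaults are assumptions, not the repo's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalConfig:
    """The one config: everything the single-tier pipeline needs."""
    model: str = "gpt-4o-mini"           # illustrative default, not the repo's
    dataset_path: str = "data/peerread"  # assumed location
    bertscore_enabled: bool = True
    timeout_s: float = 60.0

config = EvalConfig()  # no loaders, no factories, no YAML indirection
```

One frozen dataclass replaces both `evaluation_config.py` and `load_configs.py`: override values at construction time, nothing else.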
## Principle Violations (Hit List)
### Priority 1 (Immediate Deletion)
- Graph Analysis Module → violates Principle 1 (measure output, not complexity)
  - File: `src/app/evals/graph_analysis.py`
  - Dependency: NetworkX
  - Reason: Counting tool calls ≠ measuring review quality
- LLM-as-Judge Tier → violates Principle 2 (one-tier evaluation)
  - File: `src/app/evals/llm_evaluation_managers.py`
  - Reason: Using an expensive API to judge… other APIs
- Tracing Quadruplet → violates Principle 3 (minimal dependencies)
  - Dependencies: agentops, logfire, weave, opik
  - Reason: Four systems doing the same job (choose ONE or ZERO)
### Priority 2 (Next Sprint)
- Multi-Agent Orchestration → violates Principle 2 (one path)
  - Files: `orchestration.py`, `agent_factories.py`
  - Reason: Manager → Researcher → Analyst → Synthesizer when one agent suffices
- Composite Scoring → violates Principle 1 (measure output)
  - File: `src/app/evals/composite_scorer.py`
  - Reason: Complex formula combining tiers that shouldn't exist
### Priority 3 (Technical Debt)
- Performance Monitor → violates Principle 3 (minimal code)
  - File: `src/app/evals/performance_monitor.py`
  - Reason: Sophisticated timing when `time.time()` suffices
- Trace Processors → violates Principle 3 (minimal code)
  - File: `src/app/evals/trace_processors.py`
  - Reason: Processing traces we shouldn't collect
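The replacement for the performance monitor really is this small. A sketch using `time.perf_counter()` (monotonic, so better suited to measuring intervals than `time.time()`); the context-manager shape is an illustration, not the repo's API:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed():
    """Yield a dict that holds elapsed wall-clock seconds on exit."""
    result = {"elapsed_s": 0.0}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["elapsed_s"] = time.perf_counter() - start

with timed() as t:
    sum(range(100_000))  # stand-in for an agent run
print(f"took {t['elapsed_s']:.4f}s")
```

Ten lines of stdlib replace an entire monitoring module.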
## Streamlined Future Architecture
### Before (Current Bloat)
```text
40+ files → 4 tracing systems → 3 evaluation tiers →
multi-agent orchestration → graph complexity → composite scores
```
### After (Laser-Focused)
```python
# The ENTIRE evaluation pipeline
def evaluate(paper: str, agent: Agent) -> EvalResult:
    """Generate a review and compare it to the reference."""
    generated = agent.run(paper)
    reference = load_reference(paper)
    return EvalResult(
        cosine=calculate_cosine(generated, reference),
        jaccard=calculate_jaccard(generated, reference),
        bertscore=calculate_bertscore(generated, reference),
        execution_time=measure_time(),
    )

# That's it. ~15 lines vs 2000.
```
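The result contract can stay equally small. A hedged sketch — the repo would presumably use pydantic (it is on the keep list), but a stdlib dataclass stands in here, and the field set follows the cosine/Jaccard/BERTScore metrics the plan keeps rather than any actual model in the codebase:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EvalResult:
    """One flat record per evaluated paper — no tiers, no composite score."""
    cosine: float
    jaccard: float
    bertscore: float
    execution_time: float

result = EvalResult(cosine=0.91, jaccard=0.78, bertscore=0.88, execution_time=2.3)
print(asdict(result))  # trivially serializable for reports
```

Flat fields mean no composite scorer: readers compare metrics directly.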
### Dependencies Before → After
```yaml
delete:
  - agentops: "Tracing theater"
  - logfire: "Tracing theater"
  - weave: "Tracing theater"
  - opik: "Tracing theater"
  - networkx: "Graph theater"
  - torchmetrics: "Already disabled, remove entirely"
keep:
  - pydantic: "Data validation (core)"
  - pydantic-ai-slim: "Agent runtime (core)"
  - scikit-learn: "cosine/Jaccard metrics (core)"
  - textdistance: "Text similarity (core)"
  - httpx: "HTTP client (core)"
reduction: 60% fewer dependencies
```
## Implementation Roadmap
### Sprint 1: Core Elimination
```yaml
week_1:
  - Delete graph_analysis.py and the NetworkX dependency
  - Delete llm_evaluation_managers.py (Tier 2)
  - Remove 3 of 4 tracing systems (keep opik or NONE)
week_2:
  - Simplify evaluation_pipeline.py to a single tier
  - Delete composite_scorer.py
  - Remove performance_monitor.py (use basic timing)
```
### Sprint 2: Agent Consolidation
```yaml
week_3:
  - Merge multi-agent orchestration into a single agent
  - Delete agent_factories.py and orchestration.py
  - Simplify agent_system.py
week_4:
  - Consolidate config files
  - Remove trace_processors.py
  - Update documentation to reflect the simplified design
```
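The single agent the roadmap merges toward needs only one entry point. A hedged sketch — class, method, and prompt are illustrative; the repo's actual runner is built on pydantic-ai-slim, which is not shown here:

```python
class ReviewAgent:
    """One agent, one job: paper text in, review text out."""

    def __init__(self, llm_call):
        # llm_call: any callable prompt -> completion, e.g. a pydantic-ai wrapper
        self._llm_call = llm_call

    def run(self, paper: str) -> str:
        prompt = f"Write a peer review of the following paper:\n\n{paper}"
        return self._llm_call(prompt)

# Usage with a stub model — no Manager/Researcher/Analyst/Synthesizer chain:
agent = ReviewAgent(lambda prompt: "Solid contribution; baselines are weak.")
print(agent.run("paper text here"))
```

Injecting the LLM callable keeps the runner testable with a stub, which is what the factory pattern was supposedly for.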
## Success Criteria
- Code Reduction: 80% (4000 → 800 lines)
- Dependency Reduction: 60% (15 → 6 packages)
- Principle Adherence: 80% (up from 18%)
- User Workflows: 100% functional
- Maintainability: 10x improved
## Validation Checklist
- All PeerRead evaluation workflows still work
- cosine/Jaccard/BERTScore metrics still calculate correctly
- Agent generates reviews from papers
- Execution time measured accurately
- Zero feature regression for users
- Documentation updated to reflect simplicity
- `make validate` passes all checks
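The zero-regression items above reduce to an end-to-end smoke test. A self-contained sketch with stand-in metric and loader logic (nothing here is the repo's actual code — `evaluate_stub` and its hard-coded reference are illustrations):

```python
import time

def evaluate_stub(paper: str, review_fn) -> dict:
    """Tiny stand-in for the simplified pipeline: generate, compare, time."""
    start = time.perf_counter()
    generated = review_fn(paper)
    reference = "reference review text"  # stand-in for load_reference()
    overlap = set(generated.split()) & set(reference.split())
    union = set(generated.split()) | set(reference.split())
    return {
        "jaccard": len(overlap) / len(union) if union else 1.0,
        "execution_time": time.perf_counter() - start,
    }

# Smoke test: the pipeline runs end to end and scores sanely.
result = evaluate_stub("paper text", lambda p: "reference review text")
assert result["jaccard"] == 1.0
assert result["execution_time"] >= 0.0
print("smoke test passed")
```

A check like this, run under `make validate`, is enough to pin "workflows still work" and "metrics still calculate correctly" without resurrecting any of the deleted tiers.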
Bottom Line: Delete 2000 lines, remove 6 dependencies, keep 100% functionality. No complexity theater. Just measure review quality vs reference. That’s the job.