Agents-eval¶
Evaluate multi-agent AI systems objectively: a three-tiered framework for researchers and developers building autonomous agent teams
A Multi-Agent System (MAS) evaluation framework using PydanticAI that generates and evaluates scientific paper reviews through a three-tiered assessment approach: Tier 1 (Traditional Metrics), Tier 2 (LLM-as-a-Judge), and Tier 3 (Graph-Based Analysis).
I am a: User/Researcher | Human Developer | AI Agent
Quick Start¶
make setup_dev && make app_quickstart # downloads sample data, evaluates smallest paper
make app_cli ARGS="--help" # all CLI options
Common commands:
make app_cli ARGS="--paper-id=1105.1072" # evaluate a specific paper
make app_cli ARGS="--paper-id=1105.1072 --engine=cc" # Claude Code engine (requires claude CLI)
make app_cli ARGS="--paper-id=1105.1072 --engine=cc --cc-teams" # CC multi-agent orchestration
make app_sweep ARGS="--paper-ids 1105.1072 --repetitions 1 --all-compositions" # benchmark all 8 agent compositions
make app_batch_run ARGS="--paper-ids 1105.1072 --parallel 4" # parallel runs, resilient to errors
make app_batch_eval # summarize all runs into output/summary.md
All commands use the default provider (`github`). Set your API key in `.env` or pass `--chat-provider=<provider>`. See `.env.example`.
User/Researcher¶
- Documentation Site — Complete reference
- UserStory.md — User workflows, use cases, and acceptance criteria
- Agent Tools & CLI Reference — Tool signatures, CLI examples by category, troubleshooting
- Codespace — Immediate access in browser
Human Developer¶
- CONTRIBUTING.md — Commands, workflows, coding patterns
- architecture.md — Technical design and decisions
- roadmap.md — Development roadmap
- Development flow: Setup → Code → `make validate` → Commit
AI Agent¶
- READ FIRST: AGENTS.md — Behavioral rules and compliance requirements
- Technical Patterns: CONTRIBUTING.md — Implementation standards and commands
Project Outline¶
System: Multi-agent evaluation pipeline (Manager → Researcher → Analyst → Synthesizer) with PydanticAI, processing PeerRead scientific papers.
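The sequential hand-off above can be sketched in plain Python. This is an illustrative pattern only, not the project's actual PydanticAI agent definitions; the agent functions, the `ReviewDraft` type, and the findings strings are all hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class ReviewDraft:
    """Accumulates each agent's contribution as a paper moves through the pipeline."""
    paper_id: str
    findings: list[str] = field(default_factory=list)

def researcher(draft: ReviewDraft) -> ReviewDraft:
    draft.findings.append("researcher: gathered background context")
    return draft

def analyst(draft: ReviewDraft) -> ReviewDraft:
    draft.findings.append("analyst: assessed methodology")
    return draft

def synthesizer(draft: ReviewDraft) -> ReviewDraft:
    draft.findings.append("synthesizer: composed final review")
    return draft

def manager(paper_id: str) -> ReviewDraft:
    # The manager orchestrates the hand-off order; each agent enriches the draft.
    draft = ReviewDraft(paper_id=paper_id)
    for agent in (researcher, analyst, synthesizer):
        draft = agent(draft)
    return draft

review = manager("1105.1072")
print(len(review.findings))  # 3
```

The point of the pattern is that each stage receives the accumulated state of the previous stages, so the synthesizer sees both the researcher's and the analyst's output.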
Evaluation Approach: Tier 1 (Traditional Metrics) + Tier 2 (LLM-as-a-Judge) + Tier 3 (Graph-Based Analysis) → Composite scoring. See architecture.md for metric definitions.
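A composite score of this shape could, for example, be a weighted mean of the three tier scores. The weights below are placeholders, not the project's actual values; see architecture.md for the real metric definitions:

```python
def composite_score(
    tier1: float,  # Traditional Metrics score, in [0, 1]
    tier2: float,  # LLM-as-a-Judge score, in [0, 1]
    tier3: float,  # Graph-Based Analysis score, in [0, 1]
    weights: tuple[float, float, float] = (0.3, 0.4, 0.3),  # illustrative weights
) -> float:
    """Weighted mean of the three tier scores (weights are hypothetical)."""
    w1, w2, w3 = weights
    return (w1 * tier1 + w2 * tier2 + w3 * tier3) / (w1 + w2 + w3)

print(round(composite_score(0.8, 0.6, 0.7), 2))
```

Normalizing by the weight sum keeps the result in [0, 1] even if the weights do not sum to one.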
For version history see the CHANGELOG.
Diagrams: Customer Journey, Review Workflow, and Eval Metrics Sweep.
Examples¶
See src/examples/README.md for self-contained demonstrations:
basic_evaluation.py, judge_settings_customization.py, engine_comparison.py.
References¶
- AI Agent Evaluation Landscape — Frameworks, tools, datasets, benchmarks
- Tracing & Observation Methods — Observability analysis
- List of papers inspected
- Enhancement Recommendations
- Papers Meta Review
- Papers Comprehensive Analysis