Agents-eval¶
Evaluate multi-agent AI systems objectively: a three-tiered framework for researchers and developers building autonomous agent teams
A Multi-Agent System (MAS) evaluation framework using PydanticAI that generates and evaluates scientific paper reviews through a three-tiered assessment approach: Tier 1 (Traditional Metrics), Tier 2 (LLM-as-a-Judge), and Tier 3 (Graph-Based Analysis).
I am a: User/Researcher | Human Developer | AI Agent
Quick Start¶
make setup_dev && make app_quickstart # downloads sample data, evaluates smallest paper
make app_cli ARGS="--help" # all CLI options
Common commands:
make app_cli ARGS="--paper-id=1105.1072" # evaluate a specific paper
make app_cli ARGS="--paper-id=1105.1072 --engine=cc" # Claude Code engine (requires claude CLI)
make app_cli ARGS="--paper-id=1105.1072 --engine=cc --cc-teams" # CC multi-agent orchestration
make app_sweep ARGS="--paper-ids 1105.1072 --repetitions 1 --all-compositions" # benchmark all 8 agent compositions
make app_batch_run ARGS="--paper-ids 1105.1072 --parallel 4" # parallel runs, resilient to errors
make app_batch_eval # summarize all runs into output/summary.md
All commands use the default provider (`github`). Set your API key in `.env` or pass `--chat-provider=<provider>`. See `.env.example`.
User/Researcher¶
- Documentation Site — Complete reference
- UserStory.md — User workflows, use cases, and acceptance criteria
- Agent Tools & CLI Reference — Tool signatures, CLI examples by category, troubleshooting
- Codespace — Immediate access in browser
Human Developer¶
- CONTRIBUTING.md — Commands, workflows, coding patterns
- architecture.md — Technical design and decisions
- roadmap.md — Development roadmap
- Development flow: Setup → Code → `make validate` → Commit
AI Agent¶
- READ FIRST: AGENTS.md — Behavioral rules and compliance requirements
- Technical Patterns: CONTRIBUTING.md — Implementation standards and commands
Project Outline¶
System: Multi-agent evaluation pipeline (Manager → Researcher → Analyst → Synthesizer) with PydanticAI, processing PeerRead scientific papers.
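The sequential hand-off above can be sketched in plain Python. This is an illustrative pattern only, not the project's actual PydanticAI agent definitions; the agent functions, the `ReviewDraft` type, and the findings strings are all hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class ReviewDraft:
    """Accumulates each agent's contribution as a paper moves through the pipeline."""
    paper_id: str
    findings: list[str] = field(default_factory=list)

def researcher(draft: ReviewDraft) -> ReviewDraft:
    draft.findings.append("researcher: gathered background context")
    return draft

def analyst(draft: ReviewDraft) -> ReviewDraft:
    draft.findings.append("analyst: assessed methodology")
    return draft

def synthesizer(draft: ReviewDraft) -> ReviewDraft:
    draft.findings.append("synthesizer: composed final review")
    return draft

def manager(paper_id: str) -> ReviewDraft:
    # The manager orchestrates the hand-off order; each agent enriches the draft.
    draft = ReviewDraft(paper_id=paper_id)
    for agent in (researcher, analyst, synthesizer):
        draft = agent(draft)
    return draft

review = manager("1105.1072")
print(len(review.findings))  # 3
```

The point of the pattern is that each stage receives the accumulated state of the previous stages, so the synthesizer sees both the researcher's and the analyst's output.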
Evaluation Approach: Tier 1 (Traditional Metrics) + Tier 2 (LLM-as-a-Judge) + Tier 3 (Graph-Based Analysis) → Composite scoring. See architecture.md for metric definitions.
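A composite score of this shape could, for example, be a weighted mean of the three tier scores. The weights below are placeholders, not the project's actual values; see architecture.md for the real metric definitions:

```python
def composite_score(
    tier1: float,  # Traditional Metrics score, in [0, 1]
    tier2: float,  # LLM-as-a-Judge score, in [0, 1]
    tier3: float,  # Graph-Based Analysis score, in [0, 1]
    weights: tuple[float, float, float] = (0.3, 0.4, 0.3),  # illustrative weights
) -> float:
    """Weighted mean of the three tier scores (weights are hypothetical)."""
    w1, w2, w3 = weights
    return (w1 * tier1 + w2 * tier2 + w3 * tier3) / (w1 + w2 + w3)

print(round(composite_score(0.8, 0.6, 0.7), 2))
```

Normalizing by the weight sum keeps the result in [0, 1] even if the weights do not sum to one.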
For version history see the CHANGELOG.
Diagrams: Customer Journey, Review Workflow, and Eval Metrics Sweep.
Examples¶
See src/examples/README.md for self-contained demonstrations:
basic_evaluation.py, judge_settings_customization.py, engine_comparison.py.
References¶
- AI Agent Evaluation Landscape — Frameworks, tools, datasets, benchmarks
- Tracing & Observation Methods — Observability analysis
- List of papers inspected
- Enhancement Recommendations
- Papers Meta Review
- Papers Comprehensive Analysis