Agents-eval

Evaluate multi-agent AI systems objectively: a three-tiered framework for researchers and developers building autonomous agent teams

A Multi-Agent System (MAS) evaluation framework using PydanticAI that generates and evaluates scientific paper reviews through a three-tiered assessment approach: Tier 1 (Traditional Metrics), Tier 2 (LLM-as-a-Judge), and Tier 3 (Graph-Based Analysis).

Quick Start

make setup_dev && make app_quickstart    # downloads sample data, evaluates smallest paper
make app_cli ARGS="--help"               # all CLI options

Common commands:

make app_cli ARGS="--paper-id=1105.1072"                                          # evaluate a specific paper
make app_cli ARGS="--paper-id=1105.1072 --engine=cc"                              # Claude Code engine (requires claude CLI)
make app_cli ARGS="--paper-id=1105.1072 --engine=cc --cc-teams"                   # CC multi-agent orchestration
make app_sweep ARGS="--paper-ids 1105.1072 --repetitions 1 --all-compositions"    # benchmark all 8 agent compositions
make app_batch_run ARGS="--paper-ids 1105.1072 --parallel 4"                      # parallel runs, resilient to errors
make app_batch_eval                                                               # summarize all runs into output/summary.md

All commands use the default provider (github). Set your API key in .env or pass --chat-provider=<provider>. See .env.example.
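As a starting point, a minimal .env might look like the fragment below. The variable names here are assumptions for illustration only; consult .env.example in the repository for the authoritative names.

```shell
# Hypothetical .env sketch — variable names are illustrative, see .env.example
GITHUB_API_KEY=<your-token-here>      # key for the default provider (github)
CHAT_PROVIDER=github                  # override per run with --chat-provider=<provider>
```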

User/Researcher

Human Developer

  • CONTRIBUTING.md — Commands, workflows, coding patterns
  • architecture.md — Technical design and decisions
  • roadmap.md — Development roadmap
  • Development flow: Setup → Code → make validate → Commit

AI Agent

  • READ FIRST: AGENTS.md — Behavioral rules and compliance requirements
  • Technical Patterns: CONTRIBUTING.md — Implementation standards and commands

Project Outline

System: Multi-agent evaluation pipeline (Manager → Researcher → Analyst → Synthesizer) built with PydanticAI, processing papers from the PeerRead dataset.
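The pipeline above can be sketched as a shared state object handed down the agent chain. This is a hypothetical illustration of the orchestration shape only: the role names come from the source, but the real agents are PydanticAI agents backed by LLM calls, not the stub functions shown here.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewState:
    """State passed along the Manager -> Researcher -> Analyst -> Synthesizer chain."""
    paper_id: str
    findings: list[str] = field(default_factory=list)
    analysis: str = ""
    review: str = ""

def researcher(state: ReviewState) -> ReviewState:
    # Gather evidence about the paper (stubbed; the real agent queries an LLM).
    state.findings.append(f"key claims extracted from {state.paper_id}")
    return state

def analyst(state: ReviewState) -> ReviewState:
    # Assess the gathered findings.
    state.analysis = f"analysis of {len(state.findings)} finding(s)"
    return state

def synthesizer(state: ReviewState) -> ReviewState:
    # Produce the final review text from the analysis.
    state.review = f"Review of {state.paper_id}: {state.analysis}"
    return state

def manager(paper_id: str) -> ReviewState:
    # The manager orchestrates the delegate agents in order.
    state = ReviewState(paper_id=paper_id)
    for step in (researcher, analyst, synthesizer):
        state = step(state)
    return state

result = manager("1105.1072")
print(result.review)
```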

Evaluation Approach: Tier 1 (Traditional Metrics) + Tier 2 (LLM-as-a-Judge) + Tier 3 (Graph-Based Analysis) → Composite scoring. See architecture.md for metric definitions.

For version history see the CHANGELOG.

Diagrams
  • Customer Journey
  • Review Workflow
  • Eval Metrics Sweep

Examples

See src/examples/README.md for self-contained demonstrations: basic_evaluation.py, judge_settings_customization.py, engine_comparison.py.
