Agents-eval¶
This project aims to implement an evaluation pipeline for assessing the effectiveness of open-source agentic AI systems using the PeerRead dataset. The focus is on use-case-agnostic metrics that measure core capabilities such as task decomposition, tool integration, adaptability, and overall performance.
Status¶
(DRAFT) (WIP) ----> Not fully implemented yet
For version history have a look at the CHANGELOG.
Setup and Usage¶
make setup_prod
make setup_dev
or make setup_dev_ollama
make run_cli
or make run_cli ARGS="--help"
make run_gui
make test_all
Environment¶
.env.example contains example entries for the required API keys and environment variables.
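As a minimal illustration of how these variables are consumed at runtime (the variable name below is hypothetical and not taken from .env.example; use the names listed there):

```python
import os

# Read a provider API key from the environment; the variable name is illustrative.
api_key = os.getenv("INFERENCE_API_KEY")
if api_key is None:
    raise RuntimeError("Missing API key: copy .env.example to .env and fill in your keys.")
```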
Configuration¶
- config_app.py contains configuration constants for the application.
- config_chat.json contains the inference provider configuration and prompts. The inference endpoints used should adhere to the OpenAI Model Spec 2024-05-08, which is used by pydantic-ai's OpenAI-compatible models.
- config_eval.json contains evaluation metrics and their weights.
- data_models.py contains the pydantic data models for agent system configuration and results.
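As a rough sketch of how these files fit together (the field names below are illustrative assumptions, not the actual schema from data_models.py), the JSON configuration can be validated into pydantic models at startup:

```python
import json
from pathlib import Path

from pydantic import BaseModel


class ChatConfig(BaseModel):
    """Illustrative stand-in for the models defined in data_models.py."""

    providers: dict[str, dict]  # provider name -> endpoint/model settings (assumed shape)
    prompts: dict[str, str]     # prompt name -> prompt text (assumed shape)


def load_chat_config(path: str = "config_chat.json") -> ChatConfig:
    """Load and validate the chat configuration from JSON."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    return ChatConfig.model_validate(raw)
```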
Note¶
- The included chat configuration uses free inference endpoints, which are subject to change by the providers. See lists such as free-llm-api-resources to find other providers.
- The included chat configuration uses models that are also subject to change by the providers and have to be updated from time to time.
- The LLM used as judge is also determined by the chat configuration.
Documentation¶
Project Outline¶
# TODO
Customer Journey and User Story¶
Have a look at the example user story.
Show Customer Journey


Agents¶
Manager Agent¶
- Description: Oversees research and analysis tasks, coordinating the efforts of the research, analysis, and synthesizer agents to provide comprehensive answers to user queries. Delegates tasks and ensures the accuracy of the information.
- Responsibilities:
  - Coordinates the research, analysis, and synthesis agents.
  - Delegates research tasks to the Research Agent.
  - Delegates analysis tasks to the Analysis Agent.
  - Delegates synthesis tasks to the Synthesizer Agent.
  - Ensures the accuracy of the information.
- Location: src/app/agents/agent_system.py
Researcher Agent¶
- Description: Gathers and analyzes data relevant to a given topic, utilizing search tools to collect data and verifying the accuracy of assumptions, facts, and conclusions.
- Responsibilities:
  - Gathers and analyzes data relevant to the topic.
  - Uses search tools to collect data.
  - Checks the accuracy of assumptions, facts, and conclusions.
- Tools:
  - DuckDuckGo Search Tool
- Location: src/app/agents/agent_system.py
Analyst Agent¶
- Description: Checks the accuracy of assumptions, facts, and conclusions in the provided data, providing relevant feedback and ensuring data integrity.
- Responsibilities:
  - Checks the accuracy of assumptions, facts, and conclusions.
  - Provides relevant feedback if the result is not approved.
  - Ensures data integrity.
- Location: src/app/agents/agent_system.py
Synthesizer Agent¶
- Description: Outputs a well-formatted scientific report using the data provided, maintaining the original facts, conclusions, and sources.
- Responsibilities:
  - Outputs a well-formatted scientific report using the provided data.
  - Maintains the original facts, conclusions, and sources.
- Location: src/app/agents/agent_system.py
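The sketch below shows one way such a manager/researcher/analyst/synthesizer setup can be wired with pydantic-ai agent delegation. The model string, prompts, and tool names are placeholders; the actual implementation lives in src/app/agents/agent_system.py and is driven by config_chat.json.

```python
from pydantic_ai import Agent, RunContext

# Placeholder model string; the real provider/model comes from config_chat.json.
MODEL = "openai:gpt-4o-mini"

researcher = Agent(MODEL, system_prompt="Gather data on the topic and verify facts.")
analyst = Agent(MODEL, system_prompt="Check assumptions, facts, and conclusions; give feedback.")
synthesizer = Agent(MODEL, system_prompt="Write a well-formatted scientific report from the data.")

manager = Agent(
    MODEL,
    system_prompt="Coordinate research, analysis, and synthesis to answer the user query.",
)


@manager.tool
async def delegate_research(ctx: RunContext[None], query: str) -> str:
    """Delegate a research task to the researcher agent."""
    result = await researcher.run(query, usage=ctx.usage)
    return str(result.output)  # `.data` on older pydantic-ai versions


@manager.tool
async def delegate_analysis(ctx: RunContext[None], data: str) -> str:
    """Have the analyst agent validate the gathered data."""
    result = await analyst.run(data, usage=ctx.usage)
    return str(result.output)


@manager.tool
async def delegate_synthesis(ctx: RunContext[None], data: str) -> str:
    """Have the synthesizer agent produce the final report."""
    result = await synthesizer.run(data, usage=ctx.usage)
    return str(result.output)
```

The manager decides which downstream agent to call by choosing among its tools, so accuracy checks by the analyst can happen before the synthesizer produces the final report.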
Dataset used¶
PeerRead Scientific Paper Review Dataset¶
The system includes comprehensive integration with the PeerRead dataset for scientific paper review evaluation:
- Purpose: Generate and evaluate scientific paper reviews using the Multi-Agent System
- Architecture: Clean separation between review generation (MAS) and evaluation (external system)
- Workflow:
  1. MAS: PDF → Review Generation → Persistent Storage (src/app/data_utils/reviews/)
  2. External Evaluation: Load Reviews → Similarity Analysis → Results
- Documentation: See PeerRead Agent Usage Guide
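As a minimal sketch of the external evaluation step (the file layout and JSON field names are assumptions; the persisted format is defined by the MAS), stored reviews can be loaded and compared against the PeerRead reference reviews with a simple text-similarity measure:

```python
import json
from difflib import SequenceMatcher
from pathlib import Path

REVIEWS_DIR = Path("src/app/data_utils/reviews")  # persistent storage written by the MAS


def text_similarity(a: str, b: str) -> float:
    """Crude lexical similarity in [0, 1]; the real pipeline may use embeddings instead."""
    return SequenceMatcher(None, a, b).ratio()


def evaluate_reviews(reference_reviews: dict[str, str]) -> dict[str, float]:
    """Compare each generated review against the PeerRead reference review for the same paper.

    `reference_reviews` maps paper id -> reference review text.
    The field names "paper_id" and "review" are illustrative assumptions.
    """
    scores: dict[str, float] = {}
    for path in REVIEWS_DIR.glob("*.json"):
        record = json.loads(path.read_text(encoding="utf-8"))
        paper_id = record["paper_id"]
        if paper_id in reference_reviews:
            scores[paper_id] = text_similarity(record["review"], reference_reviews[paper_id])
    return scores
```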
Review Workflow¶
Show Review Workflow


LLM-as-a-Judge¶
# TODO
Custom Evaluation Metrics Baseline¶
As configured in config_eval.json.
{
  "evaluators_and_weights": {
    "planning_rational": "1/6",
    "task_success": "1/6",
    "tool_efficiency": "1/6",
    "coordination_quality": "1/6",
    "time_taken": "1/6",
    "text_similarity": "1/6"
  }
}
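The weights are stored as fraction strings, so a plausible way to combine per-metric scores into one overall score looks like the sketch below (function and variable names are illustrative; the actual implementation lives under src/app/evals):

```python
from fractions import Fraction


def overall_score(scores: dict[str, float], evaluators_and_weights: dict[str, str]) -> float:
    """Weighted sum of per-evaluator scores, with weights given as strings like "1/6"."""
    weights = {name: Fraction(expr) for name, expr in evaluators_and_weights.items()}
    total_weight = sum(weights.values())
    return float(sum(scores[name] * weights[name] for name in weights) / total_weight)


# Example with the baseline config above, assuming each score is normalized to [0, 1]:
weights_cfg = {
    "planning_rational": "1/6",
    "task_success": "1/6",
    "tool_efficiency": "1/6",
    "coordination_quality": "1/6",
    "time_taken": "1/6",
    "text_similarity": "1/6",
}
scores = {name: 0.5 for name in weights_cfg}  # dummy scores
print(overall_score(scores, weights_cfg))  # -> 0.5
```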
Eval Metrics Sweep¶
Eval Metrics Sweep


Tools available¶
Other pydantic-ai agents and the pydantic-ai DuckDuckGo Search Tool.
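A hedged example of attaching the DuckDuckGo search tool to a pydantic-ai agent (requires pydantic-ai's duckduckgo extra; the model string and prompt are placeholders):

```python
from pydantic_ai import Agent
from pydantic_ai.common_tools.duckduckgo import duckduckgo_search_tool

# The researcher agent gains web search capability via the common DuckDuckGo tool.
search_agent = Agent(
    "openai:gpt-4o-mini",  # placeholder; the real model comes from config_chat.json
    tools=[duckduckgo_search_tool()],
    system_prompt="Search the web for the given topic and summarize the findings.",
)
```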
Agentic System Architecture¶
Show MAS Overview


Show MAS Detailed


Project Repo Structure¶
|- .claude                 # AI agent framework and commands
|  |- commands
|     |- generate-frp.md   # FRP generation command
|     \- execute-frp.md    # FRP execution command
|- .devcontainer           # pre-configured dev env
|- .github                 # workflows
|- .streamlit              # config.toml
|- .vscode                 # extensions, settings
|- assets/images
|- context                 # AI agent context framework
|  |- config
|  |  \- paths.md          # path variables and definitions
|  |- templates
|  |  \- 2_frp_base.md     # FRP template with quality framework
|  |- features             # feature descriptions for FRP generation
|  |- FRPs                 # generated feature requirements prompts
|  |- examples             # code patterns and examples
|  \- logs                 # agent execution logs
|- docs
|- src                     # source code
|  |- app
|  |  |- agents
|  |  |- config
|  |  |- evals
|  |  |- utils
|  |  |- __init__.py
|  |  |- main.py
|  |  \- py.typed
|  |- examples
|  \- gui
|     \- run_gui.py
|- tests
|- .env.example            # example env vars
|- .gitignore
|- .gitmessage
|- AGENTS.md               # north star document for AI agents (agentsmd.com)
|- CHANGELOG.md            # short project history
|- CLAUDE.md               # points to AGENTS.md
|- Dockerfile              # create app image
|- LICENSE.md
|- Makefile                # helper scripts
|- mkdocs.yaml             # docs generated from docstrings
|- pyproject.toml          # project settings
|- README.md               # project description
\- uv.lock                 # resolved package versions
Landscape overview¶
Agentic System Frameworks¶
Agent-builder¶
Evaluation¶
- Focusing on agentic systems
  - AgentNeo
  - AutoGenBench
  - Langchain AgentEvals, trajectory or LLM-as-a-judge
  - Mosaic AI Agent Evaluation
  - RagaAI-Catalyst
  - AgentBench
- RAG oriented
  - RAGAs
- LLM apps
  - DeepEval
  - Langchain OpenEvals
  - MLFlow LLM Evaluate
  - DeepEval (DeepSeek)
Observation, Monitoring, Tracing¶
- AgentOps - Agency
- arize
- Langtrace
- LangSmith - Langchain
- Weave - Weights & Biases
- Pydantic Logfire
- comet Opik
- Langfuse
- helicone
- langwatch
Datasets¶
Scientific¶
- SWIF2T, Automated Focused Feedback Generation for Scientific Writing Assistance, 2024, 300 peer reviews citing weaknesses in scientific papers, with human evaluation
- PeerRead, A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications, 2018, 14K paper drafts and the corresponding accept/reject decisions, over 10K textual peer reviews written by experts for a subset of the papers, structured JSONL, clear labels
- BigSurvey, Generating a Structured Summary of Numerous Academic Papers: Dataset and Method, 2022, 7K survey papers and 430K referenced papers abstracts
- SciXGen, A Scientific Paper Dataset for Context-Aware Text Generation, 2021, 205k papers
- scientific_papers, 2018, two sets of long and structured documents, obtained from ArXiv and PubMed OpenAccess, 300k+ papers, total disk 7GB
Reasoning, Deduction, Commonsense, Logic¶
- LIAR, fake news detection, only 12.8k records, single label
- X-Fact, Benchmark Dataset for Multilingual Fact Checking, 31.1k records, large, multilingual
- MultiFC, A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims, 34.9k records
- FEVER, Fact Extraction and VERification, 185.4k records
- TODO GSM8K, bAbI, CommonsenseQA, DROP, LogiQA, MNLI
Planning, Execution¶
- Plancraft, an evaluation dataset for planning with LLM agents, both a text-only and multi-modal interface
- IDAT, A Multi-Modal Dataset and Toolkit for Building and Evaluating Interactive Task-Solving Agents
- PDEBench, set of benchmarks for scientific machine learning
- MatSci-NLP, evaluating the performance of natural language processing (NLP) models on materials science text
- TODO BigBench Hard, FSM Game
Tool Use, Function Invocation¶
- Trelis Function Calling
- KnowLM Tool
- StatLLM, statistical analysis tasks, LLM-generated SAS code, and human evaluation scores
- TODO ToolComp
Benchmarks¶
- SciArena: A New Platform for Evaluating Foundation Models in Scientific Literature Tasks
- AgentEvals CORE-Bench Leaderboard
- Berkeley Function-Calling Leaderboard
- Chatbot Arena LLM Leaderboard
- GAIA Leaderboard
- GalileoAI Agent Leaderboard
- WebDev Arena Leaderboard
- MiniWoB++: a web interaction benchmark for reinforcement learning
Research Agents¶
Further Reading¶
- List of papers inspected: further_reading
- Visualization of Papers inspected
- Agents-eval Enhancement Recommendations based on the Papers
- Papers Meta Review
- Papers Comprehensive Analysis
Note: Context Framework for AI Agents¶
This project includes a comprehensive context framework for AI coding agents. It can be used to implement new features using a top-down approach: the user provides feature descriptions, which are transformed into Feature Requirements Prompts (FRPs), which in turn are executed to produce the code implementation.
CLI/Extensions used¶
Core Components¶
- AGENTS.md: North star document with project patterns, conventions, and quality evaluation framework
- FRP Workflow: Feature Requirements Prompt generation and execution system
  1. context/templates/1_feature_description.md: User provides a feature description, e.g., by using this template
  2. .claude/commands/generate-frp.md: Creates comprehensive implementation prompts from feature descriptions
  3. .claude/commands/execute-frp.md: Executes features using generated FRPs with structured validation
Agent Development Workflow¶
- Follow AGENTS.md - Read project conventions, patterns, and quality standards
- Generate FRP - Use the generate-frp.md command for comprehensive feature planning and research
- Execute Implementation - Use the execute-frp.md command for structured development with quality gates
Quality Framework Integration¶
- Built-in quality evaluation with minimum thresholds (Context: 8/10, Clarity: 7/10, Alignment: 8/10, Success: 7/10)
- BDD/TDD approach integration following project patterns
- Automatic validation using unified command reference with error recovery
- TodoWrite tool integration for progress tracking and transparency
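Purely as an illustration of how these thresholds could be enforced as a gate (the names and structure below are assumptions, not the project's actual code):

```python
# Minimum quality thresholds from the framework (scores out of 10).
THRESHOLDS = {"context": 8, "clarity": 7, "alignment": 8, "success": 7}


def passes_quality_gate(scores: dict[str, int]) -> bool:
    """Return True only if every dimension meets its minimum threshold."""
    return all(scores.get(dim, 0) >= minimum for dim, minimum in THRESHOLDS.items())
```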
For AI Agents: Quick Start¶
- Read the North Star: Start with AGENTS.md for project patterns and conventions
- Generate FRP: Use the /generate-frp <feature-name> command in Claude Code
- Execute Implementation: Use the /execute-frp <feature-name> command with the generated FRP
- Follow Quality Gates: Ensure all AGENTS.md thresholds are met before proceeding