# Agents-eval
This project aims to implement an evaluation pipeline for assessing the effectiveness of open-source agentic AI systems across various use cases, focusing on use-case-agnostic metrics that measure core capabilities such as task decomposition, tool integration, adaptability, and overall performance.
## Status

(DRAFT) (WIP): not fully implemented yet.

For the version history, see the CHANGELOG.
## Setup and Usage

```sh
make setup_prod
make setup_dev
# or
make setup_dev_claude
# or
make setup_dev_ollama

make run_cli
# or
make run_cli ARGS="--help"

make run_gui
make test_all
```
### Configuration

- config_app.py contains configuration constants for the application.
- config_chat.json contains the inference provider configuration and prompts. The inference endpoints used should adhere to the OpenAI Model Spec 2024-05-08, which is what the pydantic-ai OpenAI-compatible models expect.
- config_eval.json contains the evaluation metrics and their weights.
- data_models.py contains the pydantic data models for agent system configuration and results.
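As a rough illustration of how these pieces fit together, the following is a minimal sketch of loading the chat configuration into pydantic models. The field names (providers, prompts, base_url, model_name) and the file path are assumptions for illustration, not the actual schema defined in data_models.py.

```python
"""Sketch: load config_chat.json into pydantic models (field names are hypothetical)."""
from pathlib import Path

from pydantic import BaseModel


class ProviderConfig(BaseModel):
    """Hypothetical shape of one inference provider entry."""
    base_url: str
    model_name: str


class ChatConfig(BaseModel):
    """Hypothetical top-level shape of config_chat.json."""
    providers: dict[str, ProviderConfig]
    prompts: dict[str, str]


config_path = Path("src/app/config/config_chat.json")  # assumed location
chat_config = ChatConfig.model_validate_json(config_path.read_text())
print(list(chat_config.providers))
```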
### Environment

.env.example contains example entries for the API keys and environment variables used by the application.
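A minimal sketch of how these variables could be picked up at runtime, assuming python-dotenv is available; the variable name used below is a placeholder, so use the names listed in .env.example.

```python
"""Sketch: read provider API keys from a .env file (variable name is a placeholder)."""
import os

from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # loads ./.env into the process environment

api_key = os.getenv("EXAMPLE_PROVIDER_API_KEY")  # placeholder; see .env.example
if not api_key:
    raise RuntimeError("Missing API key; copy .env.example to .env and fill it in.")
```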
## Customer Journey

Customer journey diagram (see assets/images in the repository).
## Note

- The contained chat configuration uses free inference endpoints, which are subject to change by the providers. See lists such as free-llm-api-resources to find other providers.
- The contained chat configuration uses models which are also subject to change by the providers and have to be updated from time to time.
- The LLM-as-judge evaluation likewise depends on the chat configuration.
## Documentation

## Project Outline
# TODO
## Agents

### Manager Agent

- Description: Oversees research and analysis tasks, coordinating the efforts of the research, analysis, and synthesizer agents to provide comprehensive answers to user queries. Delegates tasks and ensures the accuracy of the information (see the delegation sketch after the agent descriptions).
- Responsibilities:
    - Coordinates the research, analysis, and synthesis agents.
    - Delegates research tasks to the Research Agent.
    - Delegates analysis tasks to the Analysis Agent.
    - Delegates synthesis tasks to the Synthesizer Agent.
    - Ensures the accuracy of the information.
- Location: src/app/agents/agent_system.py
### Researcher Agent

- Description: Gathers and analyzes data relevant to a given topic, utilizing search tools to collect data and verifying the accuracy of assumptions, facts, and conclusions.
- Responsibilities:
    - Gathers and analyzes data relevant to the topic.
    - Uses search tools to collect data.
    - Checks the accuracy of assumptions, facts, and conclusions.
- Tools:
    - DuckDuckGo Search Tool
- Location: src/app/agents/agent_system.py
### Analyst Agent

- Description: Checks the accuracy of assumptions, facts, and conclusions in the provided data, providing relevant feedback and ensuring data integrity.
- Responsibilities:
    - Checks the accuracy of assumptions, facts, and conclusions.
    - Provides relevant feedback if the result is not approved.
    - Ensures data integrity.
- Location: src/app/agents/agent_system.py
### Synthesizer Agent

- Description: Outputs a well-formatted scientific report using the data provided, maintaining the original facts, conclusions, and sources.
- Responsibilities:
    - Outputs a well-formatted scientific report using the provided data.
    - Maintains the original facts, conclusions, and sources.
- Location: src/app/agents/agent_system.py
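The delegation pattern described above maps naturally onto pydantic-ai. The following is a minimal sketch under stated assumptions, not the project's actual agent_system.py: the model names, prompts, and the single delegation tool are placeholders, and only the manager-to-researcher hand-off is shown.

```python
"""Minimal sketch of manager-to-researcher delegation with pydantic-ai."""
from pydantic_ai import Agent, RunContext
from pydantic_ai.common_tools.duckduckgo import duckduckgo_search_tool

# Researcher: gathers data for a topic using the DuckDuckGo search tool.
researcher = Agent(
    "openai:gpt-4o-mini",  # placeholder model
    system_prompt="Gather data relevant to the topic and verify facts.",
    tools=[duckduckgo_search_tool()],
)

# Manager: coordinates the other agents and delegates research tasks.
manager = Agent(
    "openai:gpt-4o-mini",  # placeholder model
    system_prompt=(
        "Coordinate research, analysis, and synthesis to answer the query. "
        "Use the delegate_research tool for information gathering."
    ),
)


@manager.tool
async def delegate_research(ctx: RunContext[None], topic: str) -> str:
    """Delegate a research task to the researcher agent."""
    result = await researcher.run(topic, usage=ctx.usage)
    return result.output  # `.output` is `.data` in older pydantic-ai releases


if __name__ == "__main__":
    print(manager.run_sync("How are agentic AI systems evaluated?").output)
```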
## Datasets used
# TODO
## Evaluation metrics
# TODO
- Time to complete task (time_taken)
- Task success rate (task_success)
- Agent coordination (coordination_quality)
- Tool usage efficiency (tool_efficiency)
- Plan coherence (planning_rational)
- Text response quality (text_similarity)
- Autonomy vs. human intervention (HITL, user feedback)
- Reactivity (adapting to changes in tasks and environments)
- Memory consistency
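To make one of the listed metrics concrete, here is a deliberately naive sketch of a text_similarity evaluator based on difflib; the project's actual evaluator may use embeddings or an LLM judge instead.

```python
"""Naive text_similarity sketch: character-level ratio between response and reference."""
from difflib import SequenceMatcher


def text_similarity(response: str, reference: str) -> float:
    """Return a similarity score in [0, 1]."""
    return SequenceMatcher(None, response, reference).ratio()


print(text_similarity("Agents decompose tasks into subtasks.", "Agents break tasks into subtasks."))
```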
### Evaluation Metrics Baseline

As configured in config_eval.json.

```json
{
  "evaluators_and_weights": {
    "planning_rational": "1/6",
    "task_success": "1/6",
    "tool_efficiency": "1/6",
    "coordination_quality": "1/6",
    "time_taken": "1/6",
    "text_similarity": "1/6"
  }
}
```
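The fractional weights above can be folded into a single score as in the sketch below; the per-metric scores are made-up values and the aggregation is illustrative, not the project's actual evals code.

```python
"""Sketch: combine per-metric scores into one weighted score using the baseline weights."""
from fractions import Fraction

weights = {
    "planning_rational": Fraction("1/6"),
    "task_success": Fraction("1/6"),
    "tool_efficiency": Fraction("1/6"),
    "coordination_quality": Fraction("1/6"),
    "time_taken": Fraction("1/6"),
    "text_similarity": Fraction("1/6"),
}

# Hypothetical per-metric results, each normalized to [0, 1].
scores = {
    "planning_rational": 0.8,
    "task_success": 1.0,
    "tool_efficiency": 0.7,
    "coordination_quality": 0.9,
    "time_taken": 0.6,
    "text_similarity": 0.85,
}

total = sum(float(weight) * scores[name] for name, weight in weights.items())
print(f"weighted score: {total:.3f}")  # 0.808
```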
### Eval Metrics Sweep

Eval metrics sweep plot (see assets/images in the repository).
## Tools available

Other pydantic-ai agents (via delegation) and the pydantic-ai DuckDuckGo Search Tool.
## Agentic System Architecture

Agentic system architecture diagram (see assets/images in the repository).
## Project Repo Structure

```
|- .claude          # claude code config and commands
|- .devcontainer    # pre-configured dev env
|- .github          # workflows
|- .streamlit       # config.toml
|- .vscode          # extensions, settings
|- assets/images
|- docs
|- src              # source code
|  |- app
|  |  |- agents
|  |  |- config
|  |  |- evals
|  |  |- utils
|  |  |- __init__.py
|  |  |- main.py
|  |  \- py.typed
|  |- examples
|  |- gui
|  \- run_gui.py
|- tests
|- .env.example     # example env vars
|- .gitignore
|- .gitmessage
|- AGENTS.md        # common file like agentsmd.com
|- CHANGELOG.md     # short project history
|- CLAUDE.md        # points to AGENTS.md
|- Dockerfile       # create app image
|- LICENSE.md
|- Makefile         # helper scripts
|- mkdocs.yaml      # docs from docstrings
|- pyproject.toml   # project settings
|- README.md        # project description
\- uv.lock          # resolved package versions
```
## Landscape overview

### Agentic System Frameworks

### Agent-builder

### Evaluation
- Focusing on agentic systems
    - AgentNeo
    - AutoGenBench
    - Langchain AgentEvals
    - Mosaic AI Agent Evaluation
    - RagaAI-Catalyst
    - AgentBench
- RAG oriented
    - RAGAs
- LLM apps
    - DeepEval
    - Langchain OpenEvals
    - MLFlow LLM Evaluate
    - DeepEval (DeepSeek)
### Observation, Monitoring, Tracing

### Datasets

#### Scientific
- SWIF2T, Automated Focused Feedback Generation for Scientific Writing Assistance, 2024, 300 peer reviews citing weaknesses in scientific papers, accompanied by a human evaluation
- PeerRead, A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications, 2018, 14K paper drafts and the corresponding accept/reject decisions, over 10K textual peer reviews written by experts for a subset of the papers, structured JSONL, clear labels
- BigSurvey, Generating a Structured Summary of Numerous Academic Papers: Dataset and Method, 2022, 7K survey papers and 430K referenced papers abstracts
- SciXGen, A Scientific Paper Dataset for Context-Aware Text Generation, 2021, 205k papers
- scientific_papers, 2018, two sets of long and structured documents, obtained from ArXiv and PubMed OpenAccess, 300k+ papers, total disk 7GB
#### Reasoning, Deduction, Commonsense, Logic
- LIAR, fake news detection, only 12.8k records, single label
- X-Fact, Benchmark Dataset for Multilingual Fact Checking, 31.1k records, large, multilingual
- MultiFC, A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims, 34.9k records
- FEVER, Fact Extraction and VERification, 185.4k records
- TODO GSM8K, bAbI, CommonsenseQA, DROP, LogiQA, MNLI
#### Planning, Execution
- Plancraft, an evaluation dataset for planning with LLM agents, both a text-only and multi-modal interface
- IDAT, A Multi-Modal Dataset and Toolkit for Building and Evaluating Interactive Task-Solving Agents
- PDEBench, set of benchmarks for scientific machine learning
- MatSci-NLP, evaluating the performance of natural language processing (NLP) models on materials science text
- TODO BigBench Hard, FSM Game
#### Tool Use, Function Invocation
- Trelis Function Calling
- KnowLM Tool
- StatLLM, statistical analysis tasks, LLM-generated SAS code, and human evaluation scores
- TODO ToolComp
### Benchmarks
- SciArena: A New Platform for Evaluating Foundation Models in Scientific Literature Tasks
- AgentEvals CORE-Bench Leaderboard
- Berkeley Function-Calling Leaderboard
- Chatbot Arena LLM Leaderboard
- GAIA Leaderboard
- GalileoAI Agent Leaderboard
- WebDev Arena Leaderboard
- MiniWoB++: a web interaction benchmark for reinforcement learning
### Research Agents

## Further Reading
- [2506.18096] Deep Research Agents: A Systematic Examination And Roadmap, GitHub: ai-agents-2030/awesome-deep-research-agent
- [2504.19678] From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
- [2503.21460] Large Language Model Agent: A Survey on Methodology, Applications and Challenges
- [2503.16416] Survey on Evaluation of LLM-based Agents
- [2503.13657] Why Do Multi-Agent LLM Systems Fail?
- [2502.14776] SurveyX: Academic Survey Automation via Large Language Models
- [2502.05957] AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents
- [2502.02649] Fully Autonomous AI Agents Should Not be Developed
- [2501.16150] AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants
- [2501.06590] ChemAgent
- [2501.06322] Multi-Agent Collaboration Mechanisms: A Survey of LLMs
- [2501.04227] Agent Laboratory: Using LLM Agents as Research Assistants, AgentRxiv: Towards Collaborative Autonomous Research
- [2501.00881] Agentic Systems: A Guide to Transforming Industries with Vertical AI Agents
- [2412.04093] Practical Considerations for Agentic LLM Systems
- [2411.13768] Evaluation-driven Approach to LLM Agents
- [2411.10478] Large Language Models for Constructing and Optimizing Machine Learning Workflows: A Survey
- [2411.05285] A Taxonomy of AgentOps for Enabling Observability of Foundation Model Based Agents
- [2410.22457] Advancing Agentic Systems: Dynamic Task Decomposition, Tool Integration and Evaluation using Novel Metrics and Dataset
- [2408.06361] Large Language Model Agent in Financial Trading: A Survey
- [2408.06292] The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
- [2404.13501] A Survey on the Memory Mechanism of Large Language Model based Agents
- [2402.06360] CoSearchAgent: A Lightweight Collaborative Search Agent with Large Language Models
- [2402.02716] Understanding the planning of LLM agents: A survey
- [2402.01030] Executable Code Actions Elicit Better LLM Agents
- [2308.11432] A Survey on Large Language Model based Autonomous Agents