# Eval Metrics
Comprehensive catalog of evaluation metrics for AI agent systems, with definitions, use cases, and primary research references for each metric.
Related Document: Evaluation & Data Resources Landscape - Tools, platforms, datasets, and benchmarks for implementing these metrics
## Core Evaluation Metrics
### Text Generation Quality
See also: Traditional Metrics Libraries in landscape-evaluation-data-resources.md
#### BLEU (Bilingual Evaluation Understudy)
- Definition: N-gram precision metric measuring overlap between generated and reference text
- Use Case: Evaluate generated review similarity to reference reviews
- Strengths: Fast computation, established baseline
- Limitations: Ignores semantic meaning, favors exact matches
- Reference: BLEU: a Method for Automatic Evaluation of Machine Translation
#### ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- Definition: Recall-based metric measuring content overlap using n-grams and longest common subsequences
- Use Case: Assess coverage of key paper concepts in generated reviews
- Strengths: Captures content coverage, multiple variants (ROUGE-1, ROUGE-2, ROUGE-L)
- Limitations: Surface-level matching, no semantic understanding
- Reference: ROUGE: A Package for Automatic Evaluation of Summaries
#### BERTScore
- Definition: Contextual embedding-based similarity using pre-trained BERT models
- Use Case: Measure semantic similarity beyond lexical matching
- Strengths: Captures semantic meaning, correlates with human judgment
- Limitations: Computationally expensive, model-dependent
- Reference: BERTScore: Evaluating Text Generation with BERT
#### Semantic Similarity (Cosine)
- Definition: Vector similarity between sentence embeddings using cosine distance
- Use Case: Compare semantic content of generated vs reference reviews
- Strengths: Fast, captures semantic relationships
- Limitations: Single similarity score, no aspect-specific assessment
- Reference: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
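A minimal combined sketch of the four metrics above, assuming the commonly used `sacrebleu`, `rouge-score`, `bert-score`, and `sentence-transformers` packages; exact scores vary with library versions and embedding models:

```python
from sacrebleu import corpus_bleu                            # BLEU
from rouge_score import rouge_scorer                         # ROUGE
from bert_score import score as bert_score                   # BERTScore
from sentence_transformers import SentenceTransformer, util  # cosine similarity

generated = "The paper proposes a transformer-based review generator."
reference = "The authors introduce a transformer model that generates reviews."

# BLEU: n-gram precision against one reference (0-100 scale)
bleu = corpus_bleu([generated], [[reference]]).score

# ROUGE: unigram and longest-common-subsequence overlap
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge_scores = rouge.score(reference, generated)

# BERTScore: contextual-embedding similarity (precision, recall, F1 tensors)
_, _, bert_f1 = bert_score([generated], [reference], lang="en")

# Cosine similarity between sentence embeddings (Sentence-BERT style)
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([generated, reference], convert_to_tensor=True)
cosine = util.cos_sim(emb[0], emb[1]).item()

print(bleu, rouge_scores["rougeL"].fmeasure, bert_f1.item(), cosine)
```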
### LLM-as-a-Judge Quality Assessment
See also: Agent Evaluation & Benchmarking and LLM Evaluation & Benchmarking in landscape-evaluation-data-resources.md
#### Answer Relevancy
- Definition: LLM assessment of how well generated content addresses the input query/paper
- Use Case: Evaluate if generated reviews address key paper aspects
- Strengths: Contextual understanding, query-specific evaluation
- Limitations: LLM bias, requires careful prompt engineering
- Reference: G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
#### Faithfulness
- Definition: Degree to which generated content remains factually consistent with source material
- Use Case: Ensure generated reviews don’t hallucinate paper content
- Strengths: Detects factual inconsistencies, source-grounded
- Limitations: Requires clear source-target relationships
- Reference: TRUE: Re-evaluating Factual Consistency Evaluation
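Both of the metrics above can be scored with an LLM judge; a minimal sketch assuming DeepEval's `AnswerRelevancyMetric` and `FaithfulnessMetric` interface (an LLM judge must be configured, e.g. via `OPENAI_API_KEY` or a custom model):

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Summarize the strengths and weaknesses of this paper.",
    actual_output="The paper's ablation study convincingly isolates the effect of model size.",
    retrieval_context=["Section 4 reports ablations over three model sizes on two datasets."],
)

relevancy = AnswerRelevancyMetric(threshold=0.7)   # does the output address the input?
faithfulness = FaithfulnessMetric(threshold=0.7)   # is the output grounded in the context?

relevancy.measure(test_case)
faithfulness.measure(test_case)
print(relevancy.score, relevancy.reason)
print(faithfulness.score, faithfulness.reason)
```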
#### Hallucination Detection
- Definition: Identification of generated content not supported by source documents
- Use Case: Detect fabricated claims about paper methodology/results
- Strengths: Critical for academic accuracy, prevents misinformation
- Limitations: Difficult to define ground truth, context-dependent
- Reference: Survey of Hallucination in Natural Language Generation
#### Context Relevance
- Definition: Assessment of how well retrieved/provided context relates to the query
- Use Case: Evaluate if paper sections support generated review claims
- Strengths: RAG-specific, improves retrieval quality
- Limitations: Requires clear context-query relationships
- Reference: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- Landscape Reference: RAG System Evaluation
### Agent Performance Metrics
See also: Agent Evaluation & Benchmarking and Observability & Monitoring Platforms in landscape-evaluation-data-resources.md
#### Tool Selection Accuracy
- Definition: Percentage of correct tool choices for given tasks
- Use Case: Assess agent ability to select appropriate research tools
- Measurement: `correct_selections / total_selections`
- Strengths: Directly measures decision-making quality
- Limitations: Requires clear correct/incorrect labels
- Reference: ReAct: Synergizing Reasoning and Acting in Language Models
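A minimal sketch of this calculation over logged tool decisions; the `ToolDecision` record is a hypothetical logging format, not any framework's schema:

```python
from dataclasses import dataclass

@dataclass
class ToolDecision:
    task_id: str
    expected_tool: str   # ground-truth correct tool for this task
    selected_tool: str   # tool the agent actually chose

def tool_selection_accuracy(decisions: list[ToolDecision]) -> float:
    if not decisions:
        return 0.0
    correct = sum(d.selected_tool == d.expected_tool for d in decisions)
    return correct / len(decisions)

decisions = [
    ToolDecision("t1", "search_papers", "search_papers"),
    ToolDecision("t2", "extract_metadata", "summarize"),
]
print(tool_selection_accuracy(decisions))  # 0.5
```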
#### Response Time
- Definition: End-to-end processing time from input to output
- Use Case: Evaluate system performance for real-time applications
- Measurement: Wall clock time in seconds/milliseconds
- Strengths: Simple, directly impacts user experience
- Limitations: Hardware-dependent, varies with load
- Reference: The Computer Systems Performance Handbook (standard performance measurement)
#### Token Usage Efficiency
- Definition: Ratio of useful output tokens to total consumed tokens
- Use Case: Optimize LLM API costs and computational efficiency
- Calculation: `output_tokens / (input_tokens + output_tokens)`
- Strengths: Cost optimization, resource management
- Limitations: Doesn’t account for output quality
- Reference: Training language models to follow instructions with human feedback (RLHF efficiency considerations)
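A minimal sketch aggregating this ratio over per-call usage records; the field names are assumptions to adapt to your provider's usage object:

```python
def token_efficiency(calls: list[dict]) -> float:
    """Share of consumed tokens that are output tokens, aggregated over calls."""
    input_tokens = sum(c["input_tokens"] for c in calls)
    output_tokens = sum(c["output_tokens"] for c in calls)
    total = input_tokens + output_tokens
    return output_tokens / total if total else 0.0

calls = [
    {"input_tokens": 1200, "output_tokens": 300},
    {"input_tokens": 800, "output_tokens": 250},
]
print(round(token_efficiency(calls), 3))  # 0.216
```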
#### Path Convergence
- Definition: Ratio of minimum required steps to actual steps taken
- Use Case: Measure agent execution efficiency in completing evaluation tasks
- Calculation: `minimum_steps / actual_steps`
- Strengths: Quantifies workflow efficiency, identifies optimization opportunities
- Limitations: Requires determination of optimal path
- Reference: WebArena: A Realistic Web Environment for Building Autonomous Agents
- Landscape Reference: Arize Phoenix - Path Metrics
#### Tool Call Accuracy
- Definition: Percentage of successful tool calls compared to attempted calls
- Use Case: Evaluate agent reliability in tool selection and parameter extraction
- Calculation: `successful_tool_calls / total_tool_calls`
- Strengths: Direct measure of agent competency with tools
- Limitations: Requires clear success/failure definitions
- Reference: ReAct: Synergizing Reasoning and Acting in Language Models
- Landscape Reference: Arize Phoenix - LLM-as-a-Judge Templates
#### Behavioral Fingerprint Profile
- Definition: Diagnostic assessment revealing agent behavioral characteristics beyond performance metrics
- Use Case: Agent identity evaluation, consistency tracking, alignment validation
- Components: Personality clustering (ISTJ/ESTJ patterns), semantic robustness, sycophancy detection
- Strengths: Captures behavioral patterns, reveals alignment choices
- Limitations: Requires diagnostic prompt suite, model-dependent patterns
- Reference: Behavioral Fingerprinting of LLMs (Agents4Science 2025)
- Application: Inform Tier 3 Graph Analysis for agent identity persistence and behavioral consistency tracking across interactions
#### Cost-Performance Efficiency (Pareto)
- Definition: Average USD cost per task paired with average success rate (0-1) across a benchmark suite
- Use Case: Framework selection on the efficiency frontier — choose the framework achieving acceptable success at lowest cost
- Calculation: Plot (avg_cost_per_task, avg_success_rate) per framework; Pareto-optimal frameworks are those where no other framework dominates on both dimensions simultaneously
- Strengths: Makes cost/quality trade-off explicit; avoids optimizing only for accuracy at unbounded cost
- Limitations: Cost varies with model/provider pricing; task distribution must be representative
- Reference: [2602.22953] General Agent Evaluation / Exgentic (IBM Research, Feb 2026)
- Example: SmolAgents + Claude Opus 4.5 = 0.66 success at $4.39/task vs OpenAI MCP + Claude Opus 4.5 = 0.73 success at $8.54/task — 2× cost for 11% success gain
- Application: Use during framework selection for PeerRead batch evaluation to identify the cost-performance knee point
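A minimal sketch of the Pareto-frontier selection described above; the framework names and figures echo the example entry, plus one hypothetical dominated option:

```python
def pareto_frontier(results: dict[str, tuple[float, float]]) -> list[str]:
    """results maps framework -> (avg_cost_per_task, avg_success_rate).
    A framework is Pareto-optimal if no other framework is at least as cheap
    and at least as successful, with a strict improvement on one dimension."""
    optimal = []
    for name, (cost, success) in results.items():
        dominated = any(
            o_cost <= cost and o_success >= success and (o_cost < cost or o_success > success)
            for other, (o_cost, o_success) in results.items()
            if other != name
        )
        if not dominated:
            optimal.append(name)
    return optimal

results = {
    "SmolAgents + Opus 4.5": (4.39, 0.66),
    "OpenAI MCP + Opus 4.5": (8.54, 0.73),
    "Hypothetical C": (9.00, 0.70),   # dominated: costlier and less successful than MCP
}
print(pareto_frontier(results))  # ['SmolAgents + Opus 4.5', 'OpenAI MCP + Opus 4.5']
```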
### Multi-Agent Coordination Metrics
See also: Graph Analysis & Network Tools and Agent Frameworks in landscape-evaluation-data-resources.md
#### Step Efficiency
- Definition: Ratio of productive work steps to total execution steps
- Use Case: Identify coordination overhead in multi-agent workflows
- Calculation: `productive_steps / total_steps`
- Strengths: Measures workflow optimization effectiveness
- Limitations: Requires classification of step types
- Reference: Multi-agent coordination in distributed systems
- Landscape Reference: Arize Phoenix - Path Metrics
#### Centrality Measures
- Definition: Graph theory metrics measuring agent importance in coordination networks
- Variants: Betweenness, closeness, degree centrality
- Use Case: Identify coordination bottlenecks and key agents
- Strengths: Quantifies structural importance
- Limitations: Requires graph construction from interaction logs
- Reference: Networks: An Introduction (Newman, 2010)
- Landscape Reference: Graph Analysis & Network Tools
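A minimal NetworkX sketch computing the three centrality variants on a small, illustrative coordination graph built from interaction logs:

```python
import networkx as nx

# Directed graph: an edge (a, b) means agent a sent a message or handoff to agent b.
G = nx.DiGraph()
G.add_edges_from([
    ("manager", "retriever"),
    ("manager", "reviewer"),
    ("retriever", "reviewer"),
    ("reviewer", "manager"),
])

degree = nx.degree_centrality(G)            # how connected each agent is
betweenness = nx.betweenness_centrality(G)  # how often an agent sits on shortest paths
closeness = nx.closeness_centrality(G)      # how quickly an agent reaches the others

for agent in G.nodes:
    print(agent, round(degree[agent], 2), round(betweenness[agent], 2), round(closeness[agent], 2))
```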
#### Communication Overhead
- Definition: Ratio of coordination messages to productive work messages
- Use Case: Optimize agent communication efficiency
- Calculation: `coordination_messages / total_messages`
- Strengths: Measures coordination cost
- Limitations: Requires message classification
- Reference: Coulouris et al., “Distributed Systems: Concepts and Design” (5th Edition, 2012)
#### Task Distribution Balance
- Definition: Measure of workload evenness across agents using statistical variance
- Use Case: Ensure fair load balancing in multi-agent systems
- Calculation: `1 - std_dev(agent_tasks) / mean(agent_tasks)`
- Strengths: Quantifies load balancing effectiveness
- Limitations: Doesn’t account for task complexity differences
- Reference: Multi-agent coordination in distributed systems (coordination metrics)
- Landscape Reference: Agent Frameworks
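A minimal sketch of the balance formula above; values near 1 indicate an even workload, values near 0 indicate imbalance:

```python
import statistics

def task_distribution_balance(tasks_per_agent: dict[str, int]) -> float:
    counts = list(tasks_per_agent.values())
    mean = statistics.mean(counts)
    if mean == 0:
        return 1.0
    std = statistics.pstdev(counts)  # population std dev over the agent pool
    return max(0.0, 1 - std / mean)

print(task_distribution_balance({"agent_a": 10, "agent_b": 9, "agent_c": 11}))  # ≈0.92
```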
### Production Framework Metrics
Metrics derived from production evaluation frameworks and competition benchmarks
#### Plan Adherence
- Definition: Degree to which agent follows its own generated execution plan
- Use Case: Detect agents that generate good plans but deviate during execution
- Calculation: `planned_steps_executed / total_planned_steps`
- Strengths: Separates planning quality from execution discipline
- Limitations: Requires plan extraction and step matching
- Reference: DeepEval PlanAdherenceMetric
- Landscape Reference: DeepEval Framework
#### Argument Correctness
- Definition: Accuracy of arguments passed to tool calls beyond tool selection
- Use Case: Detect subtle tool misuse where the correct tool is called with wrong parameters
- Calculation: `correct_arguments / total_tool_calls`
- Strengths: Catches parameter-level errors missed by tool selection metrics
- Limitations: Requires ground truth for argument values
- Reference: DeepEval ArgumentCorrectnessMetric
#### Handoff Quality
- Definition: Effectiveness of context transfer between agents in multi-agent systems
- Use Case: Evaluate multi-agent coordination efficiency
- Calculation: `successful_handoffs / total_handoffs` with context preservation score
- Strengths: Quantifies multi-agent coordination effectiveness
- Limitations: Requires handoff event detection and context comparison
- Reference: Arize Phoenix Multi-Agent Evaluation
- Landscape Reference: Arize Phoenix
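A minimal sketch combining handoff success rate with a simple key-overlap proxy for context preservation; the handoff record structure is hypothetical:

```python
def context_preservation(sent: dict, received: dict) -> float:
    """Jaccard overlap of context keys as a rough preservation score."""
    sent_keys, received_keys = set(sent), set(received)
    if not sent_keys:
        return 1.0
    return len(sent_keys & received_keys) / len(sent_keys | received_keys)

def handoff_quality(handoffs: list[dict]) -> float:
    if not handoffs:
        return 0.0
    scores = [
        context_preservation(h["context_sent"], h["context_received"])
        for h in handoffs if h["succeeded"]
    ]
    return sum(scores) / len(handoffs)  # failed handoffs contribute 0

handoffs = [
    {"succeeded": True,
     "context_sent": {"paper_id": "P1", "section": "methods", "claims": []},
     "context_received": {"paper_id": "P1", "section": "methods"}},
    {"succeeded": False, "context_sent": {"paper_id": "P2"}, "context_received": {}},
]
print(round(handoff_quality(handoffs), 2))  # 0.33
```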
#### Semantic Outcome
- Definition: Meaningful task completion beyond surface-level metrics
- Use Case: Evaluate if agent achieved the underlying intent, not just literal task
- Measurement: LLM-as-Judge assessment of outcome meaningfulness
- Strengths: Captures intent fulfillment vs. task completion
- Limitations: Requires clear outcome definitions
- Reference: LangSmith Multi-turn Evals
#### Evaluator Alignment
- Definition: Consistency between different evaluation methods on same outputs
- Use Case: Validate evaluation reliability and detect metric drift
- Calculation: Correlation coefficient between evaluator scores
- Strengths: Ensures evaluation trustworthiness
- Limitations: Requires multiple evaluation methods
- Reference: Self-Improving LLM Evals (Arize), TEAM-PHI (Agents4Science 2025)
- Recent Advance: TEAM-PHI demonstrates that multi-LLM evaluator consensus via majority voting can match supervised evaluation without costly expert annotations
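A minimal sketch computing alignment as a rank correlation between two evaluators' scores on the same outputs (illustrative data; SciPy assumed available):

```python
from scipy.stats import spearmanr

# Scores from two evaluators on the same six outputs (illustrative values)
llm_judge_scores = [8, 6, 9, 4, 7, 5]
human_scores     = [7, 6, 9, 5, 8, 4]

rho, p_value = spearmanr(llm_judge_scores, human_scores)
print(f"alignment rho={rho:.2f} (p={p_value:.3f})")
```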
#### Fix Rate
- Definition: Percentage of issues successfully resolved in long-horizon tasks
- Use Case: Measure partial progress in complex multi-step tasks
- Calculation: `issues_fixed / total_issues_attempted`
- Strengths: Rewards incremental progress, not just full completion
- Limitations: Requires issue tracking and resolution verification
- Reference: SWE-EVO Benchmark (arXiv:2512.18470)
#### Rubric Alignment
- Definition: Degree to which agent output matches structured evaluation criteria
- Use Case: Self-grading against predefined quality rubrics
- Calculation: Weighted rubric criterion scores
- Strengths: Enables reproducible quality assessment
- Limitations: Requires rubric design and criterion weighting
- Reference: Rubric Rewards for AI Co-Scientists (arXiv:2512.23707)
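A minimal sketch of weighted rubric scoring; the criteria and weights are illustrative, not the referenced paper's rubric:

```python
# Each criterion is scored 0-1 by a grader or LLM judge; weights sum to 1.
rubric_weights = {"soundness": 0.4, "coverage": 0.3, "clarity": 0.2, "actionability": 0.1}

def rubric_alignment(criterion_scores: dict[str, float]) -> float:
    return sum(rubric_weights[c] * criterion_scores.get(c, 0.0) for c in rubric_weights)

print(rubric_alignment({"soundness": 0.9, "coverage": 0.7, "clarity": 0.8, "actionability": 0.5}))
# 0.4*0.9 + 0.3*0.7 + 0.2*0.8 + 0.1*0.5 = 0.78
```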
#### Elicitation Rate
- Definition: Percentage of evaluation runs achieving a threshold score (≥7/10)
- Use Case: Measure behavioral consistency across multiple runs
- Calculation: `runs_above_threshold / total_runs`
- Strengths: Simple, threshold-based behavioral assessment
- Limitations: Requires multiple runs and score threshold definition
- Reference: Bloom Framework (Anthropic)
- Landscape Reference: Bloom
#### Session Continuity
- Definition: Quality of state preservation across agent sessions or context windows
- Use Case: Evaluate long-running agent performance with context handoffs
- Measurement: State reconstruction accuracy after session boundary
- Strengths: Critical for production agents with context limits
- Limitations: Requires session boundary detection and state comparison
- Reference: Anthropic Effective Harnesses Pattern
#### Scope Adherence
- Definition: Alignment between planned work scope and actual executed work
- Use Case: Detect scope creep or premature completion in agents
- Calculation: `overlap(planned_scope, actual_scope) / planned_scope`
- Strengths: Identifies planning-execution gaps
- Limitations: Requires scope definition and extraction
- Reference: Anthropic Effective Harnesses Pattern
#### Verification Coverage
- Definition: Completeness of agent self-testing and validation
- Use Case: Measure how thoroughly agent verifies its own work
- Calculation: `verified_outputs / total_outputs`
- Strengths: Quantifies agent reliability and self-correction
- Limitations: Requires verification action detection
- Reference: Anthropic Effective Harnesses Pattern
### Observability-Based Metrics
See also: Observability & Monitoring Platforms in landscape-evaluation-data-resources.md
#### Trace Coverage
- Definition: Percentage of agent execution paths captured in observability traces
- Use Case: Ensure comprehensive monitoring of agent behavior
- Calculation: `traced_execution_paths / total_execution_paths`
- Strengths: Validates observability completeness
- Limitations: Requires trace path definition
- Reference: OpenTelemetry Specification
- Landscape Reference: AgentNeo - Observability Platform
#### Error Recovery Rate
- Definition: Percentage of errors that agents successfully recover from
- Use Case: Measure system resilience and self-correction capability
- Calculation: `recovered_errors / total_errors`
- Strengths: Quantifies system robustness
- Limitations: Requires error classification and recovery detection
- Reference: Fault tolerance in distributed systems
- Landscape Reference: Browser Use - Self-Correcting Architecture
#### Memory Utilization Efficiency
- Definition: Ratio of relevant retrieved memory to total memory accessed
- Use Case: Optimize agent memory systems and context management
- Calculation: `relevant_memory_retrieved / total_memory_accessed`
- Strengths: Measures memory system effectiveness
- Limitations: Requires relevance assessment
- Reference: MemGPT: Towards LLMs as Operating Systems
- Landscape Reference: Letta - Advanced Memory Architecture
### Security & Safety Metrics
See also: AI Model Testing & Validation Platforms in landscape-evaluation-data-resources.md
#### Hallucination Rate
- Definition: Percentage of generated content not supported by source material
- Use Case: Ensure factual accuracy in academic review generation
- Calculation: `hallucinated_statements / total_statements`
- Strengths: Critical for academic integrity
- Limitations: Requires ground truth verification
- Reference: Survey of Hallucination in Natural Language Generation
- Landscape Reference: Patronus AI - Hallucination Detection
#### Bias Detection Score
- Definition: Quantified measurement of bias in agent outputs across demographic groups
- Use Case: Ensure fair evaluation across diverse academic content
- Calculation: Statistical variance in performance across protected attributes
- Strengths: Promotes fair and equitable agent behavior
- Limitations: Requires demographic data and bias definitions
- Reference: Bias in AI Systems
- Landscape Reference: Patronus AI - Bias Assessment
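A minimal sketch quantifying bias as the variance and maximum gap in mean scores across groups; the group labels and scores are illustrative, not real evaluation data:

```python
import statistics

def bias_score(scores_by_group: dict[str, list[float]]) -> dict[str, float]:
    group_means = {g: statistics.mean(s) for g, s in scores_by_group.items()}
    means = list(group_means.values())
    return {
        "variance_across_groups": statistics.pvariance(means),
        "max_gap": max(means) - min(means),
    }

print(bias_score({
    "group_a": [0.82, 0.79, 0.85],
    "group_b": [0.70, 0.74, 0.72],
}))
```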
#### Prompt Injection Resistance
- Definition: Ability to maintain intended behavior despite malicious input attempts
- Use Case: Prevent manipulation of academic evaluation processes
- Measurement: Success rate against standardized injection attacks
- Strengths: Essential for production security
- Limitations: Requires comprehensive attack vectors
- Reference: Prompt injection attacks against large language models
- Landscape Reference: Giskard - Security Testing
#### LLM Evaluator Vulnerability
- Definition: Susceptibility of LLM-based evaluators to manipulation strategies
- Use Case: Validate robustness of LLM-as-Judge evaluation tier
- Attack Vectors: TooGoodGains (exaggerating metrics), BaselineSelect (cherry-picking), StatTheater (statistical facades), CoherencePolish (presentation quality), ProofGap (hiding oversights)
- Observed Rate: 67-82% acceptance rates for manipulated content
- Critical Issue: Concern-acceptance conflict where reviewers flag issues yet assign acceptance scores
- Strengths: Identifies critical LLM judge vulnerabilities
- Limitations: Requires adversarial testing framework
- Reference: BadScientist (Agents4Science 2025)
- Mitigation: Implement adversarial robustness validation and meta-evaluation to detect manipulation
## Implementation Frameworks
### Evaluation Platform Integration
For comprehensive implementation guidance, see Agent Evaluation & Benchmarking in landscape-evaluation-data-resources.md
- AutoGenBench: Docker-isolated evaluation with benchmark performance metrics
- Swarms Agent Evaluation: Continuous monitoring with real-time performance tracking
- DeepEval: 30+ LLM-as-a-judge metrics with pytest integration
- Braintrust Agent Evaluation: Architecture-specific assessment with custom scorers
- Google ADK Evaluation: Trajectory analysis with multi-turn conversation testing
### Observability Tool Integration
For detailed technical analysis, see Observability & Monitoring Platforms in landscape-evaluation-data-resources.md
- Pydantic Logfire: First-party PydanticAI instrumentation via `logfire.instrument_pydantic_ai()` with OTel-based tracing
- Comet Opik: OpenTelemetry-compatible spans with local deployment
- Arize Phoenix: Path convergence metrics with LLM-as-a-judge templates
- Langfuse: Comprehensive prompt management with evaluation integration
- AgentNeo: Decorator-based tracing with SQLite storage
- TruLens: RAG Triad metrics with multi-step workflow assessment
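A minimal sketch of the Pydantic Logfire route listed above, assuming the `logfire` package and an already-authenticated project (credentials set up via the Logfire CLI or environment); exact setup steps may differ by version:

```python
import logfire

logfire.configure()               # reads project credentials from the environment
logfire.instrument_pydantic_ai()  # emit OpenTelemetry spans for PydanticAI agent/model calls
```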
### Graph Analysis Integration
For network analysis capabilities, see Graph Analysis & Network Tools in landscape-evaluation-data-resources.md
- NetworkX: Centrality measures and coordination pattern analysis
- LangGraph: Stateful agent workflow orchestration with conditional logic
- PyTorch Geometric: Graph neural networks for agent behavior modeling
- NetworKit: High-performance graph analysis with parallel processing
## Additional Resources
Framework implementations and practical guidance on using these metrics