# Eval Metrics
Comprehensive catalog of evaluation metrics for AI agent systems, with definitions, use cases, and primary research references for each metric.
Related Document: Evaluation & Data Resources Landscape - Tools, platforms, datasets, and benchmarks for implementing these metrics
## Core Evaluation Metrics
### Text Generation Quality
See also: Traditional Metrics Libraries in landscape-evaluation-data-resources.md
#### BLEU (Bilingual Evaluation Understudy)
- Definition: N-gram precision metric measuring overlap between generated and reference text
- Use Case: Evaluate generated review similarity to reference reviews
- Strengths: Fast computation, established baseline
- Limitations: Ignores semantic meaning, favors exact matches
- Reference: BLEU: a Method for Automatic Evaluation of Machine Translation
#### ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- Definition: Recall-based metric measuring content overlap using n-grams and longest common subsequences
- Use Case: Assess coverage of key paper concepts in generated reviews
- Strengths: Captures content coverage, multiple variants (ROUGE-1, ROUGE-2, ROUGE-L)
- Limitations: Surface-level matching, no semantic understanding
- Reference: ROUGE: A Package for Automatic Evaluation of Summaries
#### BERTScore
- Definition: Contextual embedding-based similarity using pre-trained BERT models
- Use Case: Measure semantic similarity beyond lexical matching
- Strengths: Captures semantic meaning, correlates with human judgment
- Limitations: Computationally expensive, model-dependent
- Reference: BERTScore: Evaluating Text Generation with BERT
#### Semantic Similarity (Cosine)
- Definition: Vector similarity between sentence embeddings using cosine distance
- Use Case: Compare semantic content of generated vs reference reviews
- Strengths: Fast, captures semantic relationships
- Limitations: Single similarity score, no aspect-specific assessment
- Reference: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
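A minimal combined sketch of the four metrics above, assuming the commonly used `sacrebleu`, `rouge-score`, `bert-score`, and `sentence-transformers` packages; exact scores vary with library versions and embedding models:

```python
from sacrebleu import corpus_bleu                            # BLEU
from rouge_score import rouge_scorer                         # ROUGE
from bert_score import score as bert_score                   # BERTScore
from sentence_transformers import SentenceTransformer, util  # cosine similarity

generated = "The paper proposes a transformer-based review generator."
reference = "The authors introduce a transformer model that generates reviews."

# BLEU: n-gram precision against one reference (0-100 scale)
bleu = corpus_bleu([generated], [[reference]]).score

# ROUGE: unigram and longest-common-subsequence overlap
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge_scores = rouge.score(reference, generated)

# BERTScore: contextual-embedding similarity (precision, recall, F1 tensors)
_, _, bert_f1 = bert_score([generated], [reference], lang="en")

# Cosine similarity between sentence embeddings (Sentence-BERT style)
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([generated, reference], convert_to_tensor=True)
cosine = util.cos_sim(emb[0], emb[1]).item()

print(bleu, rouge_scores["rougeL"].fmeasure, bert_f1.item(), cosine)
```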
### LLM-as-a-Judge Quality Assessment
See also: Agent Evaluation & Benchmarking and LLM Evaluation & Benchmarking in landscape-evaluation-data-resources.md
#### Answer Relevancy
- Definition: LLM assessment of how well generated content addresses the input query/paper
- Use Case: Evaluate if generated reviews address key paper aspects
- Strengths: Contextual understanding, query-specific evaluation
- Limitations: LLM bias, requires careful prompt engineering
- Reference: G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
#### Faithfulness
- Definition: Degree to which generated content remains factually consistent with source material
- Use Case: Ensure generated reviews don’t hallucinate paper content
- Strengths: Detects factual inconsistencies, source-grounded
- Limitations: Requires clear source-target relationships
- Reference: TRUE: Re-evaluating Factual Consistency Evaluation
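Both of the metrics above can be scored with an LLM judge; a minimal sketch assuming DeepEval's `AnswerRelevancyMetric` and `FaithfulnessMetric` interface (an LLM judge must be configured, e.g. via `OPENAI_API_KEY` or a custom model):

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Summarize the strengths and weaknesses of this paper.",
    actual_output="The paper's ablation study convincingly isolates the effect of model size.",
    retrieval_context=["Section 4 reports ablations over three model sizes on two datasets."],
)

relevancy = AnswerRelevancyMetric(threshold=0.7)   # does the output address the input?
faithfulness = FaithfulnessMetric(threshold=0.7)   # is the output grounded in the context?

relevancy.measure(test_case)
faithfulness.measure(test_case)
print(relevancy.score, relevancy.reason)
print(faithfulness.score, faithfulness.reason)
```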
#### Hallucination Detection
- Definition: Identification of generated content not supported by source documents
- Use Case: Detect fabricated claims about paper methodology/results
- Strengths: Critical for academic accuracy, prevents misinformation
- Limitations: Difficult to define ground truth, context-dependent
- Reference: Survey of Hallucination in Natural Language Generation
#### Context Relevance
- Definition: Assessment of how well retrieved/provided context relates to the query
- Use Case: Evaluate if paper sections support generated review claims
- Strengths: RAG-specific, improves retrieval quality
- Limitations: Requires clear context-query relationships
- Reference: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- Landscape Reference: RAG System Evaluation
### Agent Performance Metrics
See also: Agent Evaluation & Benchmarking and Observability & Monitoring Platforms in landscape-evaluation-data-resources.md
#### Tool Selection Accuracy
- Definition: Percentage of correct tool choices for given tasks
- Use Case: Assess agent ability to select appropriate research tools
- Measurement: `correct_selections / total_selections`
- Strengths: Directly measures decision-making quality
- Limitations: Requires clear correct/incorrect labels
- Reference: ReAct: Synergizing Reasoning and Acting in Language Models
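A minimal sketch of this calculation over logged tool decisions; the `ToolDecision` record is a hypothetical logging format, not any framework's schema:

```python
from dataclasses import dataclass

@dataclass
class ToolDecision:
    task_id: str
    expected_tool: str   # ground-truth correct tool for this task
    selected_tool: str   # tool the agent actually chose

def tool_selection_accuracy(decisions: list[ToolDecision]) -> float:
    if not decisions:
        return 0.0
    correct = sum(d.selected_tool == d.expected_tool for d in decisions)
    return correct / len(decisions)

decisions = [
    ToolDecision("t1", "search_papers", "search_papers"),
    ToolDecision("t2", "extract_metadata", "summarize"),
]
print(tool_selection_accuracy(decisions))  # 0.5
```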
#### Response Time
- Definition: End-to-end processing time from input to output
- Use Case: Evaluate system performance for real-time applications
- Measurement: Wall clock time in seconds/milliseconds
- Strengths: Simple, directly impacts user experience
- Limitations: Hardware-dependent, varies with load
- Reference: The Computer Systems Performance Handbook (standard performance measurement)
#### Token Usage Efficiency
- Definition: Ratio of useful output tokens to total consumed tokens
- Use Case: Optimize LLM API costs and computational efficiency
- Calculation: `output_tokens / (input_tokens + output_tokens)`
- Strengths: Cost optimization, resource management
- Limitations: Doesn’t account for output quality
- Reference: Training language models to follow instructions with human feedback (RLHF efficiency considerations)
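A minimal sketch aggregating this ratio over per-call usage records; the field names are assumptions to adapt to your provider's usage object:

```python
def token_efficiency(calls: list[dict]) -> float:
    """Share of consumed tokens that are output tokens, aggregated over calls."""
    input_tokens = sum(c["input_tokens"] for c in calls)
    output_tokens = sum(c["output_tokens"] for c in calls)
    total = input_tokens + output_tokens
    return output_tokens / total if total else 0.0

calls = [
    {"input_tokens": 1200, "output_tokens": 300},
    {"input_tokens": 800, "output_tokens": 250},
]
print(round(token_efficiency(calls), 3))  # 0.216
```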
#### Path Convergence
- Definition: Ratio of minimum required steps to actual steps taken
- Use Case: Measure agent execution efficiency in completing evaluation tasks
- Calculation: `minimum_steps / actual_steps`
- Strengths: Quantifies workflow efficiency, identifies optimization opportunities
- Limitations: Requires determination of optimal path
- Reference: WebArena: A Realistic Web Environment for Building Autonomous Agents
- Landscape Reference: Arize Phoenix - Path Metrics
#### Tool Call Accuracy
- Definition: Percentage of successful tool calls compared to attempted calls
- Use Case: Evaluate agent reliability in tool selection and parameter extraction
- Calculation: `successful_tool_calls / total_tool_calls`
- Strengths: Direct measure of agent competency with tools
- Limitations: Requires clear success/failure definitions
- Reference: ReAct: Synergizing Reasoning and Acting in Language Models
- Landscape Reference: Arize Phoenix - LLM-as-a-Judge Templates
#### Behavioral Fingerprint Profile
- Definition: Diagnostic assessment revealing agent behavioral characteristics beyond performance metrics
- Use Case: Agent identity evaluation, consistency tracking, alignment validation
- Components: Personality clustering (ISTJ/ESTJ patterns), semantic robustness, sycophancy detection
- Strengths: Captures behavioral patterns, reveals alignment choices
- Limitations: Requires diagnostic prompt suite, model-dependent patterns
- Reference: Behavioral Fingerprinting of LLMs (Agents4Science 2025)
- Application: Inform Tier 3 Graph Analysis for agent identity persistence and behavioral consistency tracking across interactions
#### Cost-Performance Efficiency (Pareto)
- Definition: Average USD cost per task paired with average success rate (0-1) across a benchmark suite
- Use Case: Framework selection on the efficiency frontier — choose the framework achieving acceptable success at lowest cost
- Calculation: Plot (avg_cost_per_task, avg_success_rate) per framework; Pareto-optimal frameworks are those where no other framework dominates on both dimensions simultaneously
- Strengths: Makes cost/quality trade-off explicit; avoids optimizing only for accuracy at unbounded cost
- Limitations: Cost varies with model/provider pricing; task distribution must be representative
- Reference: [2602.22953] General Agent Evaluation / Exgentic (IBM Research, Feb 2026)
- Example: SmolAgents + Claude Opus 4.5 = 0.66 success at $4.39/task vs OpenAI MCP + Claude Opus 4.5 = 0.73 success at $8.54/task — 2× cost for 11% success gain
- Application: Use during framework selection for PeerRead batch evaluation to identify the cost-performance knee point
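A minimal sketch of the Pareto-frontier selection described above; the framework names and figures echo the example entry, plus one hypothetical dominated option:

```python
def pareto_frontier(results: dict[str, tuple[float, float]]) -> list[str]:
    """results maps framework -> (avg_cost_per_task, avg_success_rate).
    A framework is Pareto-optimal if no other framework is at least as cheap
    and at least as successful, with a strict improvement on one dimension."""
    optimal = []
    for name, (cost, success) in results.items():
        dominated = any(
            o_cost <= cost and o_success >= success and (o_cost < cost or o_success > success)
            for other, (o_cost, o_success) in results.items()
            if other != name
        )
        if not dominated:
            optimal.append(name)
    return optimal

results = {
    "SmolAgents + Opus 4.5": (4.39, 0.66),
    "OpenAI MCP + Opus 4.5": (8.54, 0.73),
    "Hypothetical C": (9.00, 0.70),   # dominated: costlier and less successful than MCP
}
print(pareto_frontier(results))  # ['SmolAgents + Opus 4.5', 'OpenAI MCP + Opus 4.5']
```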
### Multi-Agent Coordination Metrics
See also: Graph Analysis & Network Tools and Agent Frameworks in landscape-evaluation-data-resources.md
#### Step Efficiency
- Definition: Ratio of productive work steps to total execution steps
- Use Case: Identify coordination overhead in multi-agent workflows
- Calculation: `productive_steps / total_steps`
- Strengths: Measures workflow optimization effectiveness
- Limitations: Requires classification of step types
- Reference: Multi-agent coordination in distributed systems
- Landscape Reference: Arize Phoenix - Path Metrics
#### Centrality Measures
- Definition: Graph theory metrics measuring agent importance in coordination networks
- Variants: Betweenness, closeness, degree centrality
- Use Case: Identify coordination bottlenecks and key agents
- Strengths: Quantifies structural importance
- Limitations: Requires graph construction from interaction logs
- Reference: Networks: An Introduction (Newman, 2010)
- Landscape Reference: Graph Analysis & Network Tools
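A minimal NetworkX sketch computing the three centrality variants on a small, illustrative coordination graph built from interaction logs:

```python
import networkx as nx

# Directed graph: an edge (a, b) means agent a sent a message or handoff to agent b.
G = nx.DiGraph()
G.add_edges_from([
    ("manager", "retriever"),
    ("manager", "reviewer"),
    ("retriever", "reviewer"),
    ("reviewer", "manager"),
])

degree = nx.degree_centrality(G)            # how connected each agent is
betweenness = nx.betweenness_centrality(G)  # how often an agent sits on shortest paths
closeness = nx.closeness_centrality(G)      # how quickly an agent reaches the others

for agent in G.nodes:
    print(agent, round(degree[agent], 2), round(betweenness[agent], 2), round(closeness[agent], 2))
```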
#### Communication Overhead
- Definition: Ratio of coordination messages to productive work messages
- Use Case: Optimize agent communication efficiency
- Calculation: `coordination_messages / total_messages`
- Strengths: Measures coordination cost
- Limitations: Requires message classification
- Reference: Coulouris et al., “Distributed Systems: Concepts and Design” (5th Edition, 2012)
#### Task Distribution Balance
- Definition: Measure of workload evenness across agents using statistical variance
- Use Case: Ensure fair load balancing in multi-agent systems
- Calculation: `1 - std_dev(agent_tasks) / mean(agent_tasks)`
- Strengths: Quantifies load balancing effectiveness
- Limitations: Doesn’t account for task complexity differences
- Reference: Multi-agent coordination in distributed systems (coordination metrics)
- Landscape Reference: Agent Frameworks
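A minimal sketch of the balance formula above; values near 1 indicate an even workload, values near 0 indicate imbalance:

```python
import statistics

def task_distribution_balance(tasks_per_agent: dict[str, int]) -> float:
    counts = list(tasks_per_agent.values())
    mean = statistics.mean(counts)
    if mean == 0:
        return 1.0
    std = statistics.pstdev(counts)  # population std dev over the agent pool
    return max(0.0, 1 - std / mean)

print(task_distribution_balance({"agent_a": 10, "agent_b": 9, "agent_c": 11}))  # ≈0.92
```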
### Production Framework Metrics
Metrics derived from production evaluation frameworks and competition benchmarks
#### Plan Adherence
- Definition: Degree to which agent follows its own generated execution plan
- Use Case: Detect agents that generate good plans but deviate during execution
- Calculation: `planned_steps_executed / total_planned_steps`
- Strengths: Separates planning quality from execution discipline
- Limitations: Requires plan extraction and step matching
- Reference: DeepEval PlanAdherenceMetric
- Landscape Reference: DeepEval Framework
#### Argument Correctness
- Definition: Accuracy of arguments passed to tool calls beyond tool selection
- Use Case: Detect subtle tool misuse where the correct tool is called with wrong parameters
- Calculation: `correct_arguments / total_tool_calls`
- Strengths: Catches parameter-level errors missed by tool selection metrics
- Limitations: Requires ground truth for argument values
- Reference: DeepEval ArgumentCorrectnessMetric
#### Handoff Quality
- Definition: Effectiveness of context transfer between agents in multi-agent systems
- Use Case: Evaluate multi-agent coordination efficiency
- Calculation: `successful_handoffs / total_handoffs` with context preservation score
- Strengths: Quantifies multi-agent coordination effectiveness
- Limitations: Requires handoff event detection and context comparison
- Reference: Arize Phoenix Multi-Agent Evaluation
- Landscape Reference: Arize Phoenix
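A minimal sketch combining handoff success rate with a simple key-overlap proxy for context preservation; the handoff record structure is hypothetical:

```python
def context_preservation(sent: dict, received: dict) -> float:
    """Jaccard overlap of context keys as a rough preservation score."""
    sent_keys, received_keys = set(sent), set(received)
    if not sent_keys:
        return 1.0
    return len(sent_keys & received_keys) / len(sent_keys | received_keys)

def handoff_quality(handoffs: list[dict]) -> float:
    if not handoffs:
        return 0.0
    scores = [
        context_preservation(h["context_sent"], h["context_received"])
        for h in handoffs if h["succeeded"]
    ]
    return sum(scores) / len(handoffs)  # failed handoffs contribute 0

handoffs = [
    {"succeeded": True,
     "context_sent": {"paper_id": "P1", "section": "methods", "claims": []},
     "context_received": {"paper_id": "P1", "section": "methods"}},
    {"succeeded": False, "context_sent": {"paper_id": "P2"}, "context_received": {}},
]
print(round(handoff_quality(handoffs), 2))  # 0.33
```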
#### Semantic Outcome
- Definition: Meaningful task completion beyond surface-level metrics
- Use Case: Evaluate if agent achieved the underlying intent, not just literal task
- Measurement: LLM-as-Judge assessment of outcome meaningfulness
- Strengths: Captures intent fulfillment vs. task completion
- Limitations: Requires clear outcome definitions
- Reference: LangSmith Multi-turn Evals
#### Evaluator Alignment
- Definition: Consistency between different evaluation methods on same outputs
- Use Case: Validate evaluation reliability and detect metric drift
- Calculation: Correlation coefficient between evaluator scores
- Strengths: Ensures evaluation trustworthiness
- Limitations: Requires multiple evaluation methods
- Reference: Self-Improving LLM Evals (Arize), TEAM-PHI (Agents4Science 2025)
- Recent Advance: TEAM-PHI demonstrates that multi-LLM evaluator consensus via majority voting can match supervised evaluation without costly expert annotations
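A minimal sketch computing alignment as a rank correlation between two evaluators' scores on the same outputs (illustrative data; SciPy assumed available):

```python
from scipy.stats import spearmanr

# Scores from two evaluators on the same six outputs (illustrative values)
llm_judge_scores = [8, 6, 9, 4, 7, 5]
human_scores     = [7, 6, 9, 5, 8, 4]

rho, p_value = spearmanr(llm_judge_scores, human_scores)
print(f"alignment rho={rho:.2f} (p={p_value:.3f})")
```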
#### Fix Rate
- Definition: Percentage of issues successfully resolved in long-horizon tasks
- Use Case: Measure partial progress in complex multi-step tasks
- Calculation: `issues_fixed / total_issues_attempted`
- Strengths: Rewards incremental progress, not just full completion
- Limitations: Requires issue tracking and resolution verification
- Reference: SWE-EVO Benchmark (arXiv:2512.18470)
#### Rubric Alignment
- Definition: Degree to which agent output matches structured evaluation criteria
- Use Case: Self-grading against predefined quality rubrics
- Calculation: Weighted rubric criterion scores
- Strengths: Enables reproducible quality assessment
- Limitations: Requires rubric design and criterion weighting
- Reference: Rubric Rewards for AI Co-Scientists (arXiv:2512.23707)
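A minimal sketch of weighted rubric scoring; the criteria and weights are illustrative, not the referenced paper's rubric:

```python
# Each criterion is scored 0-1 by a grader or LLM judge; weights sum to 1.
rubric_weights = {"soundness": 0.4, "coverage": 0.3, "clarity": 0.2, "actionability": 0.1}

def rubric_alignment(criterion_scores: dict[str, float]) -> float:
    return sum(rubric_weights[c] * criterion_scores.get(c, 0.0) for c in rubric_weights)

print(rubric_alignment({"soundness": 0.9, "coverage": 0.7, "clarity": 0.8, "actionability": 0.5}))
# 0.4*0.9 + 0.3*0.7 + 0.2*0.8 + 0.1*0.5 = 0.78
```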
#### Elicitation Rate
- Definition: Percentage of evaluation runs achieving a threshold score (≥7/10)
- Use Case: Measure behavioral consistency across multiple runs
- Calculation: `runs_above_threshold / total_runs`
- Strengths: Simple, threshold-based behavioral assessment
- Limitations: Requires multiple runs and score threshold definition
- Reference: Bloom Framework (Anthropic)
- Landscape Reference: Bloom
#### Session Continuity
- Definition: Quality of state preservation across agent sessions or context windows
- Use Case: Evaluate long-running agent performance with context handoffs
- Measurement: State reconstruction accuracy after session boundary
- Strengths: Critical for production agents with context limits
- Limitations: Requires session boundary detection and state comparison
- Reference: Anthropic Effective Harnesses Pattern
#### Scope Adherence
- Definition: Alignment between planned work scope and actual executed work
- Use Case: Detect scope creep or premature completion in agents
- Calculation: `overlap(planned_scope, actual_scope) / planned_scope`
- Strengths: Identifies planning-execution gaps
- Limitations: Requires scope definition and extraction
- Reference: Anthropic Effective Harnesses Pattern
#### Verification Coverage
- Definition: Completeness of agent self-testing and validation
- Use Case: Measure how thoroughly agent verifies its own work
- Calculation: `verified_outputs / total_outputs`
- Strengths: Quantifies agent reliability and self-correction
- Limitations: Requires verification action detection
- Reference: Anthropic Effective Harnesses Pattern
### Observability-Based Metrics
See also: Observability & Monitoring Platforms in landscape-evaluation-data-resources.md
#### Trace Coverage
- Definition: Percentage of agent execution paths captured in observability traces
- Use Case: Ensure comprehensive monitoring of agent behavior
- Calculation: `traced_execution_paths / total_execution_paths`
- Strengths: Validates observability completeness
- Limitations: Requires trace path definition
- Reference: OpenTelemetry Specification
- Landscape Reference: AgentNeo - Observability Platform
#### Error Recovery Rate
- Definition: Percentage of errors that agents successfully recover from
- Use Case: Measure system resilience and self-correction capability
- Calculation: `recovered_errors / total_errors`
- Strengths: Quantifies system robustness
- Limitations: Requires error classification and recovery detection
- Reference: Fault tolerance in distributed systems
- Landscape Reference: Browser Use - Self-Correcting Architecture
#### Memory Utilization Efficiency
- Definition: Ratio of relevant retrieved memory to total memory accessed
- Use Case: Optimize agent memory systems and context management
- Calculation: `relevant_memory_retrieved / total_memory_accessed`
- Strengths: Measures memory system effectiveness
- Limitations: Requires relevance assessment
- Reference: MemGPT: Towards LLMs as Operating Systems
- Landscape Reference: Letta - Advanced Memory Architecture
### Security & Safety Metrics
See also: AI Model Testing & Validation Platforms in landscape-evaluation-data-resources.md
#### Hallucination Rate
- Definition: Percentage of generated content not supported by source material
- Use Case: Ensure factual accuracy in academic review generation
- Calculation: `hallucinated_statements / total_statements`
- Strengths: Critical for academic integrity
- Limitations: Requires ground truth verification
- Reference: Survey of Hallucination in Natural Language Generation
- Landscape Reference: Patronus AI - Hallucination Detection
#### Bias Detection Score
- Definition: Quantified measurement of bias in agent outputs across demographic groups
- Use Case: Ensure fair evaluation across diverse academic content
- Calculation: Statistical variance in performance across protected attributes
- Strengths: Promotes fair and equitable agent behavior
- Limitations: Requires demographic data and bias definitions
- Reference: Bias in AI Systems
- Landscape Reference: Patronus AI - Bias Assessment
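A minimal sketch quantifying bias as the variance and maximum gap in mean scores across groups; the group labels and scores are illustrative, not real evaluation data:

```python
import statistics

def bias_score(scores_by_group: dict[str, list[float]]) -> dict[str, float]:
    group_means = {g: statistics.mean(s) for g, s in scores_by_group.items()}
    means = list(group_means.values())
    return {
        "variance_across_groups": statistics.pvariance(means),
        "max_gap": max(means) - min(means),
    }

print(bias_score({
    "group_a": [0.82, 0.79, 0.85],
    "group_b": [0.70, 0.74, 0.72],
}))
```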
#### Prompt Injection Resistance
- Definition: Ability to maintain intended behavior despite malicious input attempts
- Use Case: Prevent manipulation of academic evaluation processes
- Measurement: Success rate against standardized injection attacks
- Strengths: Essential for production security
- Limitations: Requires comprehensive attack vectors
- Reference: Prompt injection attacks against large language models
- Landscape Reference: Giskard - Security Testing
#### LLM Evaluator Vulnerability
- Definition: Susceptibility of LLM-based evaluators to manipulation strategies
- Use Case: Validate robustness of LLM-as-Judge evaluation tier
- Attack Vectors: TooGoodGains (exaggerating metrics), BaselineSelect (cherry-picking), StatTheater (statistical facades), CoherencePolish (presentation quality), ProofGap (hiding oversights)
- Observed Rate: 67-82% acceptance rates for manipulated content
- Critical Issue: Concern-acceptance conflict where reviewers flag issues yet assign acceptance scores
- Strengths: Identifies critical LLM judge vulnerabilities
- Limitations: Requires adversarial testing framework
- Reference: BadScientist (Agents4Science 2025)
- Mitigation: Implement adversarial robustness validation and meta-evaluation to detect manipulation
## Implementation Frameworks
### Evaluation Platform Integration
For comprehensive implementation guidance, see Agent Evaluation & Benchmarking in landscape-evaluation-data-resources.md
- AutoGenBench: Docker-isolated evaluation with benchmark performance metrics
- Swarms Agent Evaluation: Continuous monitoring with real-time performance tracking
- DeepEval: 30+ LLM-as-a-judge metrics with pytest integration
- Braintrust Agent Evaluation: Architecture-specific assessment with custom scorers
- Google ADK Evaluation: Trajectory analysis with multi-turn conversation testing
### Observability Tool Integration
For detailed technical analysis, see Observability & Monitoring Platforms in landscape-evaluation-data-resources.md
- Pydantic Logfire: First-party PydanticAI instrumentation via `logfire.instrument_pydantic_ai()` with OTel-based tracing
- Comet Opik: OpenTelemetry-compatible spans with local deployment
- Arize Phoenix: Path convergence metrics with LLM-as-a-judge templates
- Langfuse: Comprehensive prompt management with evaluation integration
- AgentNeo: Decorator-based tracing with SQLite storage
- TruLens: RAG Triad metrics with multi-step workflow assessment
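A minimal sketch of the Pydantic Logfire route listed above, assuming the `logfire` package and an already-authenticated project (credentials set up via the Logfire CLI or environment); exact setup steps may differ by version:

```python
import logfire

logfire.configure()               # reads project credentials from the environment
logfire.instrument_pydantic_ai()  # emit OpenTelemetry spans for PydanticAI agent/model calls
```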
### Graph Analysis Integration
For network analysis capabilities, see Graph Analysis & Network Tools in landscape-evaluation-data-resources.md
- NetworkX: Centrality measures and coordination pattern analysis
- LangGraph: Stateful agent workflow orchestration with conditional logic
- PyTorch Geometric: Graph neural networks for agent behavior modeling
- NetworKit: High-performance graph analysis with parallel processing
## Additional Resources
Framework implementations and practical guidance on using these metrics