Browse chronologically by year/month to track research evolution
Filter by tags like [EVAL], [SAFETY], [MAS] to find papers by topic
Follow cross-references with explanations to discover related work
Use thematic clusters at the end for quick topic-based navigation
Search arXiv IDs to quickly locate specific papers

Document Features¶

260+ papers covering 2020-2026 research
14 thematic tags for categorization
Cross-references with relationship explanations
Chronological organization preserving research timeline
Thematic clustering summary for quick navigation

Research Integration Analysis - Analysis of research trends and integration patterns across these papers

Thematic Clusters¶

Evaluation & Benchmarking [EVAL] [BENCH]:

General benchmarks: 2308.03688 (AgentBench), 2404.06411 (AgentQuest), 2401.13178 (AgentBoard), 2311.12983 (GAIA)
Web agents: 2307.13854 (WebArena), 2401.13649 (VisualWebArena), 2410.06703 (ST-WebAgentBench), 2404.07972 (OSWorld), 2412.05467 (BrowserGym), 2504.01382 (Online-Mind2Web), 2207.01206 (WebShop)
Tool evaluation: 2307.16789 (ToolLLM), 2310.03128 (MetaTool), 2406.12045 (τ-bench), 2506.07982 (τ²-bench), 2304.08244 (API-Bank EMNLP 2023), BFCL
Scientific: 2407.13168 (SciCode), 2409.11363 (CORE-Bench)
Enterprise: 2509.10769 (AgentArch), 2511.14136 (CLEAR framework), 2412.14161 (TheAgentCompany), 2411.07763 (Spider 2.0), 2411.02305 (CRMArena), 2508.00828 (Finance), 2501.14654 (MedAgentBench)
Code/SE: 2407.18901 (AppWorld), SWE-bench verified, 2404.10952 (USACO), 2507.05558 (Smart Contract)
Safety/Security: 2504.14064 (DoomArena), 2504.18575 (WASP), 2506.02548 (CyberGym)
Gaming/Embodied: 2407.13943 (Werewolf), 2310.08367 (Minecraft), 2010.03768 (ALFWorld), 2407.18416 (PersonaGym)
Multi-agent: 2503.01935 (MultiAgentBench), 2512.08296 (scaling agent systems), 2507.05178 (CREW)
Safety: 2402.05044 (SALAD-Bench ACL 2024), 2412.14470 (Agent-SafetyBench), 2412.13178 (SafeAgentBench), 2410.09024 (AgentHarm ICLR 2025)
Recent 2025-2026: 2510.02271 (InfoMosaic-Bench), 2510.02190 (Deep Research), 2510.01670 (BLIND-ACT), 2512.12791 (assessment framework), TEAM-PHI (de-identification), Behavioral Fingerprinting (LLM profiles), Strategic Reasoning (digital twin)
Observability/Production: 2601.00481 (MAESTRO), 2602.10133 (AgentTrace), 2512.04123 (measuring agents), 2601.19583 (architecture-aware metrics), 2512.18311 (monitorability)
General agent eval: 2602.22953 (Exgentic, Open General Agent Leaderboard, Unified Protocol)
Surveys: 2503.16416 (evaluation survey), 2507.21504 (LLM agents survey), 2411.13768 (evaluation-driven), 2501.11067 (IntellAgent)

Architecture & System Design [ARCH]:

Foundation: 2308.11432 (foundational survey), 2404.11584 (architecture landscape), 2510.09244 (fundamentals)
Frameworks: 2508.10146 (agentic AI frameworks), 2501.10114 (infrastructure), 2601.01743 (AI agent systems), 2602.10479 (goal-directed systems)
Surveys: 2510.25445 (comprehensive survey), 2503.23037 (agentic LLMs), 2506.01438 (architectural frameworks)
Governance: 2508.03858 (governance protocol), 2503.00237 (systems theory)

Safety & Risk Management [SAFETY]:

Constitutional AI: 2212.08073 (foundational), 2406.07814 (collective), 2501.17112 (inverse)
Core frameworks: 2302.10329 (harms analysis), 2506.04133 (TRiSM), 2408.02205 (guardrails), 2507.06134 (OpenAgentSafety), MITRE ATLAS, OWASP MAESTRO
Standards: NIST AI RMF 1.0, ISO/IEC 42001 (AI management system), ISO/IEC 23894 (AI risk management)
Security: 2510.23883 (agentic AI security), 2512.06659 (cybersecurity evolution), BadScientist (AI publishing vulnerabilities)
Safety benchmarks: 2402.05044 (SALAD-Bench ACL 2024), 2412.14470 (Agent-SafetyBench), 2412.13178 (SafeAgentBench), 2410.09024 (AgentHarm ICLR 2025)
Monitoring: 2507.11473 (CoT monitorability), 2512.18311 (monitoring monitorability), 2512.20798 (constraint violations), 2601.00911 (privacy-preserving)
Reports: 2510.13653 (AI safety first update), 2511.19863 (AI safety second update)
Recent 2025: 2510.02286 (adversarial dialogue), 2510.01586 (AdvEvo-MARL), 2510.01569 (InvThink), 2510.02204 (reasoning-execution gaps)
Multi-agent: 2503.13657 (MAS failures), 2402.04247 (safeguarding over autonomy), Hierarchical Delegated Oversight (scalable alignment)
Self-correction: Architectural Immune System (materials discovery)

Tool Use & Integration [TOOL]:

Benchmarks: 2307.16789 (ToolLLM), 2310.03128 (MetaTool), 2406.12045 (τ-bench), 2304.08244 (API-Bank EMNLP 2023), BFCL
Surveys: 2405.17935 (tool learning), 2404.11584 (tool calling architectures)
Augmentation: 2506.04625 (Tool-MVR meta-verification), 2511.18194 (agent-as-graph), 2512.16214 (PDE-Agent)
MCP applications: 2512.03955 (Blocksworld MCP), 2510.02139 (BioinfoMCP), 2509.06917 (Paper2Agent)
Recent 2025: 2510.01524 (WALT web agents), 2510.01179 (TOUCAN datasets), 2510.02271 (InfoMosaic-Bench), 2512.03420 (HarnessAgent)
Applications: 2410.22457 (tool integration), 2410.09713 (agentic IR)

Multi-Agent Systems [MAS]:

Collaboration: 2507.05178 (CREW benchmark), 2501.06322 (collaboration mechanisms), 2512.20845 (MAR reflexion)
Benchmarks: 2503.01935 (MultiAgentBench), 2512.08296 (scaling agent systems), 2505.12371 (MedAgentBoard), Job Marketplaces (OpenReview)
Analysis: 2503.13657 (failure analysis), 2505.21298 (LLMs miss the mark), 2511.02303 (lazy to deliberation)
Applications: 2507.02097 (recommender systems), 2512.20618 (LongVideoAgent), 2512.16214 (PDE-Agent), Echo (pharmacovigilance), Drug Discovery (Alzheimer’s), PsySpace (space missions), Evolutionary Boids (agent societies)
Oversight: Hierarchical Delegated Oversight (scalable alignment)
Observability: 2602.10133 (AgentTrace), 2601.00481 (MAESTRO)
Recent 2026: 2601.03328 (design patterns evaluation), 2602.10479 (goal-directed systems)

Planning & Reasoning [PLAN]:

ReAct family: 2210.03629 (ReAct), 2411.00927 (ReSpAct), 2310.04406 (LATS)
Core: 2402.02716 (planning survey), 2508.03682 (self-questioning), 2512.14474 (model-first reasoning)
Training: 2508.00344 (PilotRL global planning), 2510.01833 (plan-then-action), 2511.02303 (lazy to deliberation)
Multi-agent: 2512.20845 (MAR), 2512.08296 (scaling agent systems)
Applications: 2410.22457 (task decomposition), 2404.11584 (reasoning architectures), 2512.03955 (Blocksworld MCP)

Scientific Discovery [SCI]:

Research agents: 2506.18096 (deep research), 2508.00414 (cognitive kernel), 2509.06917 (Paper2Agent)
Discovery: 2408.06292 (AI scientist), 2503.08979 (scientific discovery survey), Beyond Adam (symbolic optimization), Architectural Immune System (self-correcting)
Domain applications: AlphaGenome (genomics), Drug Discovery (multi-target Alzheimer’s)

Code Generation [CODE]:

Surveys: 2508.00083 (comprehensive survey), 2508.11126 (agentic programming), 2511.18538 (code foundation models)
SE 3.0: 2507.15003 (AI teammates), 2510.21413 (context engineering), 2512.14012 (professional developers)
Automation: 2505.18646 (SEW self-evolving), 2504.17192 (Paper2Code), 2510.09721 (software engineering benchmarks)
Explanations: 2507.22414 (symbolic explanations), 2402.01030 (executable actions)
Recent 2025: 2510.02185 (FalseCrashReducer), 2510.01379 (multi-LLM orchestration), 2510.01003 (repository memory), 2512.03420 (HarnessAgent)
Applications: 2506.13131 (AlphaEvolve), 2410.14393 (debug agents)

Memory Systems [MEM]:

Surveys: 2512.13564 (memory in AI agents), 2512.23343 (AI meets brain), 2404.13501 (memory mechanisms)
Frameworks: 2601.03236 (MAGMA multi-graph), 2601.01885 (agentic memory), 2602.20478 (Codified Context), 2502.12110 (A-Mem), 2501.13956 (Zep temporal KG)
Learning: 2512.18950 (MACLA hierarchical procedural), 2511.18423 (GAM deep research), 2509.25250 (long-running agents)
Applications: 2510.01003 (repository memory), 2508.11120 (marketing MAS), 2510.11290 (AI-Agent School dual memory)
Production platforms: Cognee (knowledge graph engine, $7.5M seed Feb 2026), Mem0 ($24M, graph memory), LangMem (LangGraph-native)

Self-Improvement & Reflection [AUTO]:

Self-reflection: 2303.11366 (Reflexion foundation), 2405.06682 (self-reflection effects), 2512.20845 (MAR)
Recursive improvement: 2407.18219 (recursive introspection), 2410.04444 (Gödel Agent)
Training approaches: 2406.01495 (Re-ReST), 2508.15805 (ALAS autonomous learning), 2508.00344 (PilotRL)
Workflows: 2505.18646 (SEW self-evolving), 2505.22967 (MermaidFlow), 2506.04625 (Tool-MVR)
Human guidance: 2507.17131 (HITL self-improvement), 2508.07407 (self-evolving survey)

Future Research Areas¶

The following areas represent emerging or under-explored topics in agentic AI research that warrant additional investigation:

Advanced Multi-Modal Agents - Integration of vision, audio, and text processing for comprehensive environmental understanding beyond current multi-modal benchmarks.

Long-Term Memory & Retrieval - Advanced memory architectures for persistent knowledge retention and contextual recall across extended agent interactions.

Human-AI Collaboration - Frameworks for seamless human-agent teamwork, including explanation mechanisms, trust calibration, and collaborative decision-making.

Adversarial Robustness - Agent resilience against adversarial attacks, prompt injection, and manipulation attempts in production environments.

Automated Code Generation Agents - Next-generation coding assistants with advanced debugging, testing, and architectural design capabilities.

Edge & Resource-Constrained Deployment - Efficient agent architectures for mobile devices, IoT systems, and bandwidth-limited environments.

Governance & Policy Implementation - Practical frameworks for regulatory compliance, audit trails, and policy enforcement in agent systems.

Long-Term Autonomy & Reliability - Systems capable of sustained autonomous operation with minimal human intervention over extended periods.

Domain Transfer & Generalization - Techniques for rapid agent adaptation across different domains with minimal retraining or fine-tuning.

Priority Research Focus¶

Based on current gaps and transformative potential, three areas warrant immediate attention:

1. Compositional Self-Improvement - Moving beyond single-agent reflection to systems that can redesign their own architectures, create specialized sub-agents, and evolve coordination protocols. This represents the next leap from current self-reflection work toward truly recursive intelligence.

2. Persistent Contextual Memory - Current agents lack genuine episodic memory across sessions. Developing memory systems that maintain context, relationships, and learned preferences over months or years is critical for practical deployment and user trust.

3. Robust Human-Agent Teaming - Most current work treats humans as either supervisors or users. Research on agents as true collaborators—with theory-of-mind, explanation capabilities, and dynamic role adaptation—is essential for high-stakes domains like healthcare, research, and decision-making.

2026-02¶

[2602.22953] General Agent Evaluation, exgentic.ai [EVAL] [BENCH] cs.AI
IBM Research framework proposing a Unified Protocol for fair, reproducible general agent evaluation without domain-specific tuning; introduces first Open General Agent Leaderboard across 5 agent implementations × 6 environments (AppWorld, BrowseComp+, SWEbenchV, τ²); top: OpenAI MCP + Claude Opus 4.5 = 0.73 avg success
Cost-performance Pareto analysis (avg USD per task) enables framework selection on efficiency frontier
Cross-ref: 2602.10133 (AgentTrace), 2601.00481 (MAESTRO), 2503.16416 (evaluation survey)
[2602.10479] From Prompt-Response to Goal-Directed Systems: The Evolution of Agentic AI Software Architecture [ARCH] [MAS] [SURVEY] cs.SE cs.AI
Reference architecture for production-grade LLM agents, taxonomy of multi-agent topologies with failure modes, enterprise hardening checklist covering governance, observability, and reproducibility
Cross-ref: 2601.01743 (agent system architectures), 2601.03328 (MAS design patterns), 2508.10146 (agentic frameworks)
[2602.10133] AgentTrace: A Structured Logging Framework for Agent System Observability [COMP] [EVAL] [MAS] cs.AI cs.SE
First open standard for structured agent logging via schema-based protocol spanning cognitive, operational, and contextual traces; enables fine-grained debugging, failure attribution, and transparent governance
Cross-ref: 2601.00481 (MAESTRO evaluation suite), 2512.04123 (measuring agents in production), 2508.02121 (AgentOps survey)
[2601.19583] Toward Architecture-Aware Evaluation Metrics for LLM Agents [EVAL] [ARCH] cs.SE cs.AI
Links agent architectural components (planners, memory, tool routers) to observable behaviors and appropriate evaluation metrics; enables targeted and actionable evaluation
Cross-ref: 2512.12791 (assessment framework), 2503.16416 (evaluation survey), 2507.21504 (LLM agents survey)
[2601.00481] MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability [EVAL] [MAS] [BENCH] [COMP] cs.MA cs.AI
Standardizes MAS configuration and exports framework-agnostic execution traces with system-level signals (latency, cost, failures); 12 representative MAS across popular frameworks show architecture is the dominant driver of resource profiles and cost-latency-accuracy trade-offs
Cross-ref: 2602.10133 (AgentTrace), 2512.04123 (measuring agents), 2508.02121 (AgentOps survey)
[2512.18311] Monitoring Monitorability [SAFETY] [EVAL] [COMP] cs.AI cs.LG
Proposes monitorability metric and evaluation archetypes (intervention, process, outcome-property) for chain-of-thought monitoring; finds longer CoTs are more monitorable and smaller models at higher reasoning effort can yield higher monitorability
Cross-ref: 2512.12791 (assessment framework), 2601.01743 (agent architectures survey)
[2512.04123] Measuring Agents in Production [EVAL] [COMP] cs.SE cs.AI
Interview-based study (306 survey responses, 20 in-depth interviews across 26 domains) arguing agent evaluation must move beyond correctness metrics to assess reliability under varying autonomy levels
Cross-ref: 2512.12791 (assessment framework), 2601.00481 (MAESTRO), 2503.16416 (evaluation survey)
[2602.20478] Codified Context: Infrastructure for AI Agents in a Complex Codebase [MEM] [ARCH] cs.SE cs.AI
Three-tier context architecture (hot-memory constitution + 19 specialist agents + 34-doc cold-memory knowledge base) validated across 283 sessions on 108K LOC C# distributed system; 24.2% knowledge-to-code ratio; MCP retrieval service for on-demand spec loading; context drift detector
Cross-ref: 2601.19583 (architecture-aware metrics), 2602.10479 (agentic architecture)

2026-01¶

[2601.03328] LLM-Enabled Multi-Agent Systems: Empirical Evaluation and Insights into Emerging Design Patterns & Paradigms [MAS] [EVAL] [ARCH] cs.MA cs.AI
Empirical evaluation of LLM-based multi-agent systems with analysis of emerging design patterns and architectural paradigms
Cross-ref: 2507.05178 (CREW benchmark), 2501.06322 (collaboration mechanisms), 2506.01438 (architectural frameworks)
[2601.03236] MAGMA: A Multi-Graph based Agentic Memory Architecture for AI Agents [MEM] [ARCH] cs.AI cs.LG
Multi-graph memory architecture representing memories across semantic, temporal, causal, and entity graphs with hierarchical intent-aware querying
Cross-ref: 2512.13564 (memory systems survey), 2601.01885 (agentic memory), 2404.13501 (memory mechanisms)
[2601.01885] Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents [MEM] [ARCH] cs.AI cs.CL
Unified framework integrating long-term and short-term memory management as tool-based actions in agent policy
Cross-ref: 2512.13564 (memory systems), 2404.13501 (memory survey), 2511.18423 (GAM)
[2601.01743] AI Agent Systems: Architectures, Applications, and Evaluation [ARCH] [SAFETY] [SURVEY] cs.AI
Comprehensive survey of AI agent system architectures with focus on tool-centric safety risks and security challenges
Cross-ref: 2508.10146 (agentic frameworks), 2510.23883 (agentic AI security), 2506.04133 (TRiSM)
[2601.00911] Device-Native Autonomous Agents for Privacy-Preserving Negotiations [SAFETY] [SPEC] cs.AI cs.CR
Framework for privacy-preserving autonomous agent negotiations running on local devices
Cross-ref: 2510.01815 (human-AI teaming), 2503.06416 (negotiation competition)

2025-12¶

[2512.23343] AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents [MEM] [SURVEY] [ARCH] cs.AI cs.HC
Unified survey bridging cognitive neuroscience and AI agent memory systems with agentic perspective on external memory design
Cross-ref: 2512.13564 (memory age survey), 2404.13501 (memory mechanisms), 2601.03236 (MAGMA)
[2512.20845] MAR: Multi-Agent Reflexion Improves Reasoning Abilities in LLMs [MAS] [PLAN] [AUTO] cs.AI cs.CL
Multi-agent extension of Reflexion with diverse reasoning personas and judge model for unified reflection synthesis
Cross-ref: 2303.11366 (Reflexion foundation), 2405.06682 (self-reflection effects), 2501.06322 (collaboration mechanisms)
[2512.20798] A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents [BENCH] [SAFETY] [EVAL] cs.AI cs.CR
Benchmark evaluating agents that independently take unethical or dangerous actions toward KPI achievement
Cross-ref: 2510.01670 (BLIND-ACT), 2507.06134 (OpenAgentSafety), 2506.04133 (TRiSM)
[2512.20618] LongVideoAgent: Multi-Agent Reasoning with Long Videos [MAS] [SPEC] cs.CV cs.AI
Multi-agent framework for reasoning over long video content with distributed processing
Cross-ref: 2508.10494 (MAGUS multimodal), 2501.06322 (collaboration mechanisms)
[2512.18950] Learning Hierarchical Procedural Memory for LLM Agents through Bayesian Selection and Contrastive Refinement [MEM] [AUTO] cs.AI cs.LG
MACLA system constructing hierarchical procedural memory 2,800× faster than parameter-training baselines
Accepted at AAMAS 2026
Cross-ref: 2512.13564 (memory systems), 2511.18423 (GAM), 2404.13501 (memory survey)
[2512.16214] PDE-Agent: A toolchain-augmented multi-agent framework for PDE solving [MAS] [TOOL] [SPEC] cs.AI cs.CE
First toolchain-augmented multi-agent framework for automated PDE solving from natural language descriptions
Cross-ref: 2511.18194 (agent-as-graph), 2405.17935 (tool learning), 2501.06322 (collaboration)
[2512.14474] Model-First Reasoning LLM Agents: Reducing Hallucinations through Explicit Problem Modeling [PLAN] [ARCH] cs.AI cs.CL
Two-phase paradigm where LLM constructs explicit problem model before solution planning, reducing constraint violations
Cross-ref: 2210.03629 (ReAct), 2402.02716 (planning survey), 2510.01833 (plan-then-action)
[2512.14012] Professional Software Developers Don’t Vibe, They Control [CODE] [SPEC] cs.SE cs.AI
Analysis of professional developer requirements for AI coding agents emphasizing control over automation
Cross-ref: 2508.00083 (code generation survey), 2507.15003 (SE 3.0)
[2512.13564] Memory in the Age of AI Agents [MEM] [SURVEY] cs.AI cs.CL
Comprehensive survey of memory as core capability for foundation model-based agents addressing field fragmentation
Cross-ref: 2404.13501 (memory mechanisms survey), 2512.23343 (neuroscience perspective), 2601.01885 (agentic memory)
[2512.23707] Training AI Co-Scientists Using Rubric Rewards [SCI] [AUTO] [EVAL] cs.AI cs.CL
RL training for research plan generation with self-grading rubrics; 70% expert preference, cross-domain validation
Cross-ref: 2303.11366 (Reflexion), 2506.18096 (deep research agents), 2508.00414 (cognitive kernel)
[2512.18470] SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios [BENCH] [CODE] [EVAL] cs.SE cs.AI
Long-horizon evolution benchmark; 48 tasks, avg 21 files, 874 tests; introduces Fix Rate metric for partial progress
Cross-ref: 2508.00083 (code generation survey), 2507.15003 (SE 3.0), 2512.14012 (professional developers)
[2512.12791] Beyond Task Completion: An Assessment Framework for Evaluating Agentic AI Systems [EVAL] [ARCH] [SAFETY] cs.AI cs.SE
Assessment framework across four pillars (LLM, Memory, Tools, Environment) with static and dynamic analysis phases
Cross-ref: 2511.14136 (CLEAR framework), 2503.16416 (evaluation survey), 2506.04133 (TRiSM)
[2512.10398] Confucius Code Agent: Scalable Agent Scaffolding for Real-World Codebases [CODE] [ARCH] cs.SE cs.AI
Hierarchical working memory + persistent note-taking for large codebases; 54.3% on SWE-Bench-Pro
Cross-ref: 2512.18470 (SWE-EVO), 2512.13564 (memory systems), 2508.00083 (code generation)
[2512.08296] Towards a Science of Scaling Agent Systems [MAS] [BENCH] [ARCH] cs.AI cs.MA
Controlled evaluation of five agent architectures across 180 configurations spanning diverse benchmarks
Cross-ref: 2507.05178 (CREW benchmark), 2509.10769 (AgentArch), 2501.06322 (collaboration)
[2512.06659] The Evolution of Agentic AI in Cybersecurity: From Single LLM Reasoners to Multi-Agent Systems and Autonomous Pipelines [SPEC] [MAS] [SURVEY] cs.CR cs.AI
Survey of agentic AI evolution in cybersecurity from tool-augmented agents to autonomous investigative pipelines
Cross-ref: 2510.01751 (cybersecurity framework), 2510.01654 (CLASP), 2511.18194 (agent-as-graph)
[2512.03955] Benchmark for Planning and Control with Large Language Model Agents: Blocksworld with Model Context Protocol [BENCH] [PLAN] [TOOL] cs.AI cs.RO
Planning and control benchmark using MCP for Blocksworld domain evaluation
Cross-ref: 2402.02716 (planning survey), 2510.02139 (BioinfoMCP), 2509.06917 (Paper2Agent MCP)
[2512.03420] HarnessAgent: Scaling Automatic Fuzzing Harness Construction with Tool-Augmented LLM Pipelines [CODE] [TOOL] [SPEC] cs.SE cs.CR
Tool-augmented agentic framework for automated fuzzing harness construction at scale
Cross-ref: 2510.02185 (FalseCrashReducer), 2405.17935 (tool learning)

2025-11¶

[2511.19863] International AI Safety Report 2025: Second Key Update: Technical Safeguards and Risk Management [SAFETY] [SURVEY] cs.AI cs.CY
Second key update covering advances in adversarial training, data curation, and monitoring systems for AI safety
Cross-ref: 2510.13653 (first key update), 2506.04133 (TRiSM), 2508.03858 (governance protocol)
[2511.18538] From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence [CODE] [SURVEY] [ARCH] cs.SE cs.AI
Practical guide covering evolution from code foundation models to agentic applications
Cross-ref: 2508.00083 (code generation survey), 2508.11126 (agentic programming), 2507.15003 (SE 3.0)
[2511.18423] General Agentic Memory Via Deep Research [MEM] [ARCH] cs.AI cs.IR
GAM framework following just-in-time compilation principle for optimized runtime contexts with simple offline memory
Cross-ref: 2512.13564 (memory survey), 2601.01885 (agentic memory), 2512.18950 (MACLA)
[2511.18194] Agent-as-a-Graph: Knowledge Graph-Based Tool and Agent Retrieval for LLM Multi-Agent Systems [MAS] [TOOL] [ARCH] cs.AI cs.IR
Knowledge graph representation for tools and agents with weighted reciprocal rank fusion for reranking
Cross-ref: 2405.17935 (tool learning), 2512.16214 (PDE-Agent), 2501.06322 (collaboration)
[2511.14136] Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems [EVAL] [BENCH] [ARCH] cs.AI cs.SE
CLEAR framework (Cost, Latency, Efficacy, Assurance, Reliability) for enterprise agent evaluation with ρ=0.83 production correlation
Cross-ref: 2509.10769 (AgentArch enterprise), 2512.12791 (assessment framework), 2503.16416 (evaluation survey)
[2511.02303] Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation [MAS] [PLAN] cs.AI cs.CL
Framework transitioning from lazy to deliberative reasoning in multi-agent LLM systems
Cross-ref: 2512.20845 (MAR), 2501.06322 (collaboration mechanisms), 2402.02716 (planning survey)

2025-10¶

[2510.26887] The Denario Project: Deep Knowledge AI Agents for Scientific Discovery [SCI] [MAS] [AUTO] cs.AI
Multi-agent system for scientific research: idea generation, code execution, paper drafting; generated 11 AI-authored papers across disciplines
Cross-ref: 2506.18096 (deep research agents), 2502.14776 (SurveyX)
[2510.25445] Agentic AI: A Comprehensive Survey of Architectures, Applications, and Future Directions [SURVEY] [ARCH] cs.AI
Comprehensive survey of agentic AI covering architectures, diverse applications, and future research directions
Cross-ref: 2308.11432 (foundational survey), 2503.21460 (LLM agent survey), 2508.10146 (frameworks)
[2510.23883] Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges [SAFETY] [EVAL] [SURVEY] cs.CR cs.AI
Survey of agentic AI security covering attack methodologies, defense strategies, benchmarks, and open challenges
Cross-ref: 2507.06134 (OpenAgentSafety), 2510.01654 (CLASP), 2506.04133 (TRiSM), 2601.01743 (AI agent systems)
[2510.21413] Context Engineering for AI Agents in Open-Source Software [CODE] [ARCH] cs.SE cs.AI
Framework for deliberate context design and information structuring for LLMs in open-source development
Cross-ref: 2508.00083 (code generation survey), 2507.15003 (SE 3.0), 2510.09244 (building agents fundamentals)
[2510.13653] International AI Safety Report 2025: First Key Update: Capabilities and Risk Implications [SAFETY] [SURVEY] cs.AI cs.CY
First key update with capability improvements and implications for biological, cyber, monitoring, and controllability risks
Authors: Yoshua Bengio and 72 others
Cross-ref: 2511.19863 (second key update), 2302.10329 (harms analysis), 2506.04133 (TRiSM)
[2510.11290] Evolution in Simulation: AI-Agent School with Dual Memory for High-Fidelity Educational Dynamics [MAS] [MEM] [SPEC] cs.AI cs.CL
LLM-based agents simulate complex educational dynamics with dual memory (experience/knowledge bases) enabling self-evolving cognitive development
Accepted at EMNLP 2025
Cross-ref: 2512.13564 (memory systems), 2404.13501 (memory mechanisms), 2510.01297 (SimCity urban simulation)
Simulating Two-Sided Job Marketplaces with AI Agents, gh/upwork/simploy [MAS] [SPEC] [BENCH] OpenReview
LLM agents demonstrate reasoning capabilities create fundamentally different market behaviors compared to rule-based simulations; reveals trade-offs between transaction volume and match quality
Published: 08 Oct 2025 (Agents4Science 2025 Conference)
Cross-ref: 2510.01297 (SimCity urban simulation), 2501.06322 (collaboration mechanisms)
Echo: A multi-agent AI system for patient-centered pharmacovigilance [MAS] [SPEC] OpenReview
Four specialized agents (Explorer, Analyzer, Verifier, Proposer) mine Reddit health communities identifying 640 drug-symptom associations including novel signals absent from FDA databases
Published: 08 Oct 2025 (Agents4Science 2025 Conference)
Cross-ref: 2507.16940 (AURA medical agent), 2508.21803 (clinical problem detection)
Multi-target Parallel Drug Discovery with Multi-agent Orchestration, gh/UAB-SPARC/agentic-drug-discovery [MAS] [SCI] [SPEC] OpenReview
End-to-end multi-agent framework for Alzheimer’s disease drug discovery generating novel compounds with favorable characteristics for four protein targets
Published: 08 Oct 2025 (Agents4Science 2025 Conference)
Cross-ref: 2501.06590 (ChemAgent), 2510.02139 (BioinfoMCP)
PsySpace: Simulating Emergent Psychological Dynamics in Long-Duration Space Missions using Multi-Agent LLMs [MAS] [SPEC] [EVAL] OpenReview
Multi-agent LLM framework simulating astronaut crew psychology with dual-component architecture (static personality profiles + dynamic stress/loneliness vectors) replicating third-quarter effect
Published: 08 Oct 2025 (Agents4Science 2025 Conference)
Cross-ref: 2509.24877 (social science LLMs), 2510.01815 (human-AI teaming)
Simulating Strategic Reasoning: A Digital Twin Approach to AI Advisors in Decision-Making [ARCH] [SPEC] [EVAL] OpenReview
Digital twin framework modeling senior strategist reasoning revealing LLM performance gap between simple and complex multi-step strategic decision-making
Published: 08 Oct 2025 (Agents4Science 2025 Conference)
Cross-ref: 2510.01815 (human-AI teaming), 2512.14474 (model-first reasoning)
Towards Automatic Evaluation and Selection of PHI De-identification Models via Multi-Agent Collaboration [MAS] [EVAL] [SPEC] OpenReview
TEAM-PHI framework using multiple LLM evaluators with majority voting to automate clinical de-identification model selection without costly expert annotations
Published: 08 Oct 2025 (Agents4Science 2025 Conference)
Cross-ref: 2507.21504 (evaluation survey), 2501.11067 (IntellAgent evaluation)
BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers? [SAFETY] [SCI] [EVAL] OpenReview
Exposes critical vulnerability in AI-driven scientific publishing with five manipulation strategies (TooGoodGains, BaselineSelect, StatTheater, CoherencePolish, ProofGap) achieving 67-82% acceptance rates fooling LLM reviewers
Published: 08 Oct 2025 (Agents4Science 2025 Conference)
Cross-ref: 2510.01359 (security assessment code agents), 2502.02649 (autonomy concerns)
Beyond Adam: AI-Authored Discovery of Symbolic Optimization Rules [AUTO] [SCI] [CODE] OpenReview
Algorithmic Greenhouse demonstrates end-to-end autonomous AI authorship discovering interpretable optimization rules competitive with SGD, Momentum, and Adam
Published: 08 Oct 2025 (Agents4Science 2025 Conference)
Cross-ref: 2506.13131 (AlphaEvolve coding agent), 2507.21046 (self-evolving survey)
Behavioral Fingerprinting of Large Language Models [EVAL] [BENCH] [ARCH] OpenReview
Diagnostic Prompt Suite analyzing 18 models revealing behavioral profiles beyond performance metrics; documents ISTJ/ESTJ personality clustering reflecting deliberate alignment choices
Published: 08 Oct 2025 (Agents4Science 2025 Conference)
Cross-ref: 2507.17257 (agent identity evals), 2411.13768 (evaluation-driven approach)
Scalable Oversight in Multi-Agent Systems: Provable Alignment via Delegated Debate and Hierarchical Verification [MAS] [SAFETY] [ARCH] OpenReview
Hierarchical Delegated Oversight (HDO) framework with PAC-Bayesian bounds on misalignment risk enabling weak overseers to delegate verification through structured debates
Published: 08 Oct 2025 (Agents4Science 2025 Conference)
Cross-ref: 2506.04133 (TRiSM safety framework), 2508.03858 (governance protocol)
The Architectural Immune System: A Framework for Correcting Synthetic Fallacies in AI-Driven Science [SCI] [SAFETY] [ARCH] OpenReview
Self-correcting framework for AI-driven materials discovery detecting statistically implausible results; integrates ten tools including adversarial critique and database validation
Published: 08 Oct 2025 (Agents4Science 2025 Conference)
Cross-ref: 2503.08979 (scientific discovery survey), 2408.06292 (AI scientist automation)
Survival of the Useful: Evolutionary Boids as a Sandbox for Agent Societies [MAS] [AUTO] [ARCH] OpenReview
Combines Boids-style coordination (cohesion, separation, alignment) with evolutionary selection for agent societies; observe-reflect-build cycle generates self-contained tools through decentralized rules
Published: 08 Oct 2025 (Agents4Science 2025 Conference)
Cross-ref: 2505.22954 (Darwin Godel Machine), 2508.07407 (self-evolving survey)
[2510.09721] A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System [SURVEY] [CODE] [BENCH] cs.SE cs.AI
Survey analyzing 100 MAS LLMs benchmarks and evaluations papers published 2023-2025
Cross-ref: 2508.00083 (code generation), 2503.16416 (evaluation survey), 2507.02825 (benchmark practices)
[2510.09244] Fundamentals of Building Autonomous LLM Agents [ARCH] [SURVEY] cs.AI cs.CL
Technical fundamentals covering agent perception, memory, reasoning, planning, and execution capabilities
Cross-ref: 2308.11432 (foundational survey), 2404.11584 (architecture landscape), 2503.21460 (LLM agent survey)
[2510.02297] Interactive Training: Feedback-Driven Neural Network Optimization [AUTO] [ARCH] cs.LG
Framework enabling real-time human or AI agent intervention during neural network training with dynamic optimization parameter adjustments
Cross-ref: 2507.17131 (HITL self-improvement), 2405.06682 (feedback effects)
[2510.02286] Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks [SAFETY] [AUTO] cs.LG
DialTree-RPO reinforcement learning framework for discovering multi-turn attack strategies in dialogue settings
Cross-ref: 2510.01586 (adversarial safety), 2506.04133 (TRiSM safety framework)
[2510.02271] InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents [BENCH] [TOOL] [EVAL] cs.CL
Benchmark testing agents’ ability to integrate general web search with domain-specific tools across six domains
Cross-ref: 2405.17935 (tool learning survey), 2505.15872 (InfoDeepSeek RAG)
[2510.02263] RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems [PLAN] [AUTO] cs.AI
Two-player reinforcement learning approach enabling LLMs to generate and use reasoning abstractions
Cross-ref: 2510.01833 (plan-then-action), 2402.02716 (planning survey)
[2510.02250] The Unreasonable Effectiveness of Scaling Agents for Computer Use [ARCH] [BENCH] cs.AI
Behavior Best-of-N (bBoN) method for scaling computer-use agents via multiple rollouts and trajectory selection
Cross-ref: 2510.01670 (BLIND-ACT benchmark), 2501.16150 (computer use review)
[2510.02245] ExGRPO: Learning to Reason from Experience [AUTO] [PLAN] cs.LG
Framework for organizing and prioritizing valuable reasoning experiences in reinforcement learning
Cross-ref: 2405.06682 (self-reflection effects), 2406.01495 (Re-ReST)
[2510.02227] More Than One Teacher: Adaptive Multi-Guidance Policy Optimization [AUTO] [ARCH] cs.CL
Adaptive Multi-Guidance Policy Optimization (AMPO) leveraging multiple teacher models for enhanced exploration
Cross-ref: 2510.02245 (ExGRPO), 2507.17131 (HITL guidance)
[2510.02209] StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets? [BENCH] [SPEC] cs.LG
Multi-month benchmark evaluating LLM agents’ capabilities in real-world stock trading with sequential decision-making
Cross-ref: 2408.06361 (financial trading survey), 2501.00881 (industry applications)
[2510.02204] Say One Thing, Do Another? Diagnosing Reasoning-Execution Gaps in VLM-Powered Mobile-Use Agents [SAFETY] [EVAL] cs.CL
Framework for diagnosing misalignments between reasoning and execution in vision-language mobile agents
Cross-ref: 2510.01670 (blind goal-directedness), 2501.16150 (computer use review)
[2510.02190] A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents [BENCH] [SCI] [EVAL] cs.AI
Comprehensive benchmark for Deep Research Agents with 214 expert-curated queries and multidimensional scoring
Cross-ref: 2506.18096 (deep research agents), 2501.04227 (research assistants)
[2510.02185] FalseCrashReducer: Mitigating False Positive Crashes in OSS-Fuzz-Gen Using Agentic AI [CODE] [MAS] cs.SE
AI-driven strategies reducing false positives in multi-agent fuzz driver generation systems
Cross-ref: 2503.14713 (test generation), 2410.14393 (debug agents)
[2510.02157] Agentic Reasoning and Refinement through Semantic Interaction [PLAN] [ARCH] cs.HC
VIS-ReAct two-agent framework for report refinement using semantic interactions in human-LLM collaboration
Cross-ref: 2507.17131 (HITL self-improvement), 2411.00927 (ReSpAct)
[2510.02139] BioinfoMCP: A Unified Platform Enabling MCP Interfaces in Agentic Bioinformatics [SPEC] [MAS] [TOOL] cs.MA
Platform converting bioinformatics tools to MCP-compliant servers for natural-language interaction
Cross-ref: 2510.01724 (MetaboT), 2501.06590 (ChemAgent)
[2510.02087] Cooperative Guidance for Aerial Defense in Multiagent Systems [MAS] [SPEC] cs.MA
Cooperative guidance framework for multi-drone aerial defense with time-constrained interception
Cross-ref: 2510.01869 (TACOS multi-drone), 2507.05178 (CREW benchmark)
[2510.01869] TACOS: Task Agnostic COordinator of a multi-drone System [MAS] [ARCH] cs.RO
Natural language control framework for multi-UAV systems using LLMs for high-level task delegation
Cross-ref: 2510.02087 (aerial defense), 2501.06322 (collaboration mechanisms)
[2510.01833] Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning [PLAN] [AUTO] cs.AI
Two-stage framework generating high-level guidance then using RL to optimize reasoning trajectories
Cross-ref: 2402.02716 (planning survey), 2310.04406 (LATS)
[2510.01815] Human-AI Teaming Co-Learning in Military Operations [SAFETY] [ARCH] cs.AI
Co-learning model for human-AI teams with adjustable autonomy and multi-layered control
Cross-ref: 2507.17131 (HITL self-improvement), 2402.04247 (safeguarding priority)
[2510.01751] A cybersecurity AI agent selection and decision support framework [SPEC] [SAFETY] [ARCH] cs.AI
Framework aligning AI agent architectures with NIST Cybersecurity Framework for threat response
Cross-ref: 2510.01654 (CLASP security), 2506.04133 (TRiSM)
[2510.01724] MetaboT: AI-based agent for natural language-based interaction with metabolomics knowledge graphs [SPEC] [MAS] [TOOL] cs.AI
Multi-agent system translating natural language to SPARQL queries for metabolomics knowledge graphs
Cross-ref: 2510.02139 (BioinfoMCP), 2501.06590 (ChemAgent)
[2510.01670] Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness [BENCH] [SAFETY] [EVAL] cs.AI
BLIND-ACT benchmark systematically evaluating agents’ tendency to pursue goals without considering feasibility
Cross-ref: 2510.02250 (computer use scaling), 2510.02204 (reasoning-execution gaps)
[2510.01654] SoK: Measuring What Matters for Closed-Loop Security Agents [BENCH] [SAFETY] [SPEC] cs.CL
CLASP framework for evaluating autonomous cybersecurity agents with Closed-Loop Capability Score
Cross-ref: 2510.01751 (cybersecurity framework), 2506.04133 (TRiSM)
[2510.01635] MIMIC: Integrating Diverse Personality Traits for Better Game Testing Using Large Language Model [SPEC] [CODE] cs.SE
Framework integrating personality traits into gaming agents for improved test coverage
Cross-ref: 2503.14713 (test generation), 2505.22583 (GitGoodBench)
[2510.01586] AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution [SAFETY] [MAS] [AUTO] cs.AI
Multi-agent RL framework improving safety by jointly optimizing attackers and defenders
Cross-ref: 2510.02286 (adversarial attacks), 2506.04133 (TRiSM)
[2510.01569] InvThink: Towards AI Safety via Inverse Reasoning [SAFETY] [ARCH] cs.AI
Method for LLMs to enumerate and analyze potential harms before generating responses
Cross-ref: 2501.17112 (inverse constitutional AI), 2508.03858 (governance protocol)
[2510.01553] IoDResearch: Deep Research on Private Heterogeneous Data [SCI] [MAS] cs.IR
Multi-agent framework for deep research on private heterogeneous scientific data with knowledge graphs
Cross-ref: 2510.02190 (deep research benchmark), 2506.18096 (deep research agents)
[2510.01538] TimeSeriesScientist: A General-Purpose AI Agent for Time Series Analysis [SCI] [SPEC] cs.LG
LLM-driven framework with specialized agents for time series forecasting and analysis
Cross-ref: 2501.04227 (research assistants), 2506.18096 (deep research agents)
[2510.01531] Information Seeking for Robust Decision Making under Partial Observability [PLAN] [ARCH] cs.AI
InfoSeeker framework integrating planning with information seeking for decision-making under uncertainty
Cross-ref: 2402.02716 (planning survey), 2410.09713 (agentic IR)
[2510.01524] WALT: Web Agents that Learn Tools [TOOL] [ARCH] cs.CV
Framework for web agents reverse-engineering website functionalities as reusable tools
Cross-ref: 2405.17935 (tool learning), 2510.02271 (InfoMosaic-Bench)
[2510.01427] A Tale of LLMs and Induced Small Proxies: Scalable Agents for Knowledge Mining [SCI] [ARCH] cs.AI
Falconer collaborative framework combining LLM reasoning with lightweight proxy models for knowledge mining
Cross-ref: 2510.01553 (IoDResearch), 2506.18096 (deep research agents)
[2510.01379] Beyond Single LLMs: Enhanced Code Generation via Multi-Stage Performance-Guided LLM Orchestration [CODE] [ARCH] cs.SE
Multi-stage orchestration framework routing coding tasks to optimal LLMs across programming languages
Cross-ref: 2507.22414 (code explanations), 2506.13131 (AlphaEvolve)
[2510.01359] Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks [SAFETY] [CODE] cs.CR
Security evaluation of code-generating AI agents through systematic jailbreaking attack testing
Cross-ref: 2510.01569 (InvThink safety), 2506.04133 (TRiSM)
[2510.01297] SimCity: Multi-Agent Urban Development Simulation with Rich Interactions [MAS] [SPEC] cs.MA
Multi-agent framework for macroeconomic simulation using LLMs modeling heterogeneous urban agents
Cross-ref: 2507.05178 (CREW benchmark), 2501.06322 (collaboration mechanisms)
[2510.01179] TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments [TOOL] [BENCH] cs.LG
Large dataset of tool-agentic interactions using real-world Model Context Protocols for training
Cross-ref: 2405.17935 (tool learning survey), 2510.02271 (InfoMosaic-Bench)
[2510.01003] Improving Code Localization with Repository Memory [CODE] [MEM] cs.SE
Augmenting language agents with repository memory leveraging commit history for code understanding
Cross-ref: 2404.13501 (memory mechanisms), 2410.14393 (debug agents)

2025-09¶

[2509.25250] Memory Management and Contextual Consistency for Long-Running Low-Code Agents [MEM] [ARCH] cs.AI cs.SE
Memory management framework ensuring contextual consistency in long-running low-code agent systems
Cross-ref: 2512.13564 (memory survey), 2511.18423 (GAM), 2404.13501 (memory mechanisms)
[2509.10769] AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise [BENCH] [EVAL] [ARCH] cs.AI cs.SE
Benchmark evaluating 18 agent architecture configurations on enterprise use cases across orchestration, prompting, memory, and tools
Cross-ref: 2511.14136 (CLEAR framework), 2512.08296 (scaling agent systems), 2308.03688 (AgentBench)
[2509.06917] Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents [RESEARCH] [AUTO] [TOOL] cs.AI cs.SE
Framework converting research papers into interactive AI agents using Model Context Protocol servers with automated testing
Systematically analyzes papers and codebases to construct MCP servers for dynamic knowledge dissemination
Cross-ref: 2505.18705 (AI-Researcher), 2312.07559 (PaperQA), 2501.04227 (research assistants)
[2509.24877] The Emergence of Social Science of Large Language Models [SURVEY] [ARCH] cs.AI
Systematic review of 270 studies examining LLM social interactions and computational taxonomy
Cross-ref: 2503.21460 (LLM agent survey), 2308.11432 (foundational survey)
[2509.00629] Can Multi-turn Self-refined Single Agent LMs with Retrieval Solve Hard Coding Problems? [CODE] [BENCH] cs.CL
ICPC benchmark with 254 competitive programming problems achieving 42.2% solve rate with self-refinement
Cross-ref: 2505.22583 (GitGoodBench), 2410.14393 (debug agents)
[2509.00625] NetGent: Agent-Based Automation of Network Application Workflows [AUTO] [SPEC] cs.AI
AI agent framework compiling natural-language workflow rules into executable code for network automation
Cross-ref: 2505.22967 (MermaidFlow workflows), 2502.05957 (AutoAgent)
[2509.00616] TimeCopilot [SCI] [SPEC] cs.LG
Open-source agentic framework for time series forecasting combining models with LLMs and natural language explanations
Cross-ref: 2510.01538 (TimeSeriesScientist), 2501.04227 (research assistants)
[2509.00581] SQL-of-Thought: Multi-agentic Text-to-SQL with Guided Error Correction [MAS] [CODE] cs.DB
Multi-agent framework for natural language to SQL conversion with schema linking and error correction
Cross-ref: 2508.15809 (Chain-of-Query), 2402.01030 (executable actions)
[2509.00559] Social World Models [ARCH] [SPEC] cs.AI
Structured social world representation for agent reasoning about social interactions and theory-of-mind
Cross-ref: 2509.24877 (social science LLMs), 2510.01815 (human-AI teaming)
[2509.00531] MobiAgent: A Systematic Framework for Customizable Mobile Agents [SPEC] [ARCH] cs.MA
Comprehensive mobile agent system with advanced GUI perception and planning capabilities
Cross-ref: 2501.16150 (computer use review), 2510.02250 (computer use scaling)
[2509.24380] Agentic Services Computing [SURVEY] [ARCH] cs.SE
Lifecycle-driven framework for Agentic Service Computing examining multi-agent service design and evolution
Cross-ref: 2508.10146 (agentic frameworks), 2501.10114 (infrastructure)
[2509.23988] LLM/Agent-as-Data-Analyst: A Survey [SURVEY] [SPEC] cs.AI
Review of LLM and agent techniques for data analysis across different modalities
Cross-ref: 2503.21460 (LLM agent survey), 2501.04227 (research assistants)
[2510.00078] Adaptive and Resource-efficient Agentic AI Systems for Mobile and Embedded Devices: A Survey [SURVEY] [ARCH] cs.LG
Survey of adaptive, resource-efficient agentic AI for mobile and embedded device deployment
Cross-ref: 2508.10146 (agentic frameworks), 2501.10114 (infrastructure)

2025-08¶

[2508.00828] FinanceBench: Measuring Finance Domain Knowledge of LLM Agents [BENCH] [SPEC] cs.CL
Benchmark for evaluating agents on SEC filing research and financial analysis automation
[2508.11126] AI Agentic Programming: A Survey of Techniques, Challenges, and Opportunities [CODE] [SURVEY] cs.SE cs.AI
Survey of agentic programming paradigm where LLM agents autonomously plan, execute, and interact with development tools
Cross-ref: 2508.00083 (code generation survey), 2511.18538 (code foundation models), 2507.15003 (SE 3.0)
[2508.00344] PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning [AUTO] [PLAN] cs.AI cs.LG
Global planning-guided RL framework achieving state-of-the-art with LLaMA3.1-8B-Instruct surpassing GPT-4o by 3.60%
Cross-ref: 2510.01833 (plan-then-action), 2402.02716 (planning survey), 2406.01495 (Re-ReST)
[2508.00083] A Survey on Code Generation with LLM-based Agents [CODE] [SURVEY] cs.SE cs.AI
Comprehensive survey of LLM code generation agents covering autonomy and expanded scope throughout SDLC
Cross-ref: 2508.11126 (agentic programming), 2511.18538 (code foundation models), 2507.15003 (SE 3.0)
[2508.21803] Automated Clinical Problem Detection from SOAP Notes using a Collaborative Multi-Agent LLM Architecture [MAS] [SPEC] cs.AI
Collaborative multi-agent system for clinical problem identification with hierarchical debate for diagnostic conclusions
Cross-ref: 2507.16940 (AURA medical agent), 2501.06322 (collaboration mechanisms)
[2508.15809] Chain-of-Query: Unleashing the Power of LLMs in SQL-Aided Table Understanding via Multi-Agent Collaboration [MAS] [CODE] cs.CL
Multi-agent framework for SQL generation and table understanding with clause-by-clause generation strategy
Cross-ref: 2509.00581 (SQL-of-Thought), 2402.01030 (executable actions)
[2508.15805] ALAS: Autonomous Learning Agent for Self-Updating Language Models [AUTO] [ARCH] cs.AI cs.LG
Autonomous learning framework for continuous self-updating of language models with data acquisition and fine-tuning pipeline
Cross-ref: 2507.17131 (HITL self-improvement), 2410.04444 (Gödel Agent recursive improvement)
[2508.11120] Towards Reliable Multi-Agent Systems for Marketing Applications via Reflection, Memory, and Planning [MAS] [MEM] [SPEC] cs.CL
Multi-agent framework for audience curation with iterative planning, tool verification, and long-term memory
Cross-ref: 2404.13501 (memory mechanisms), 2501.06322 (collaboration mechanisms)
[2508.11030] Families’ Vision of Generative AI Agents for Household Safety [SPEC] [SAFETY] cs.HC
Multi-agent system design for household safety with privacy-preserving principles and family-centric roles
Cross-ref: 2510.01815 (human-AI teaming), 2506.04133 (TRiSM safety)
[2508.10572] Towards Agentic AI for Multimodal-Guided Video Object Segmentation [SPEC] [ARCH] cs.CV
Multi-modal agent system for video object segmentation using LLMs for dynamic workflow generation
Cross-ref: 2408.08632 (multimodal benchmarking), 2507.16940 (AURA multimodal)
[2508.10494] A Unified Multi-Agent Framework for Universal Multimodal Understanding and Generation [MAS] [ARCH] cs.LG
MAGUS modular framework for multimodal understanding and generation via multi-agent collaboration
Cross-ref: 2408.08632 (multimodal benchmarking), 2501.06322 (collaboration mechanisms)
[2508.10146] Agentic AI Frameworks: Architectures, Protocols, and Design Challenges [ARCH] [SURVEY] cs.AI cs.SE
Systematic review of leading agentic AI frameworks including CrewAI, LangGraph, AutoGen, and MetaGPT with architectural analysis
Cross-ref: 2502.05957 (AutoAgent framework), 2501.00881 (industry applications)
[2508.07407] A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems [SURVEY] [AUTO] cs.AI cs.LG
Comprehensive survey of self-evolving agents covering evolutionary techniques, environmental feedback, and lifelong learning paradigms
Cross-ref: 2507.21046 (self-evolving survey), 2505.22954 (Darwin Godel Machine)
[2508.03858] MI9 - Agent Intelligence Protocol: Runtime Governance forAgentic AI Systems [SAFETY] [ARCH] cs.AI cs.CR
Runtime governance framework for ensuring safe and controllable agentic AI systems
Cross-ref: 2408.02205 (complementary safety layers), 2506.04133 (similar safety framework), 2302.10329 (foundational risk analysis)
[2508.03682] SELF-QUESTIONING LANGUAGE MODELS [PLAN] [ARCH] cs.CL cs.AI
Framework for improving LLM reasoning through self-generated questions and introspective analysis
Cross-ref: 2402.02716 (broader planning context), 2411.13768 (evaluation methodology overlap)
[2508.00414] Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training [SCI] [ARCH] cs.AI cs.LG
Training framework for developing specialized research agents with enhanced cognitive capabilities
Cross-ref: 2506.18096 (research agent foundations), 2501.04227 (practical research applications)
[2508.00032] Strategic Communication and Language Bias in Multi-Agent LLM Coordination [MAS] [ARCH] cs.MA
Examines how communication influences cooperative behavior in multi-agent LLM systems
Cross-ref: 2507.05178 (CREW benchmark), 2501.06322 (collaboration mechanisms)

2025-07¶

[2507.05558] Can LLM Agents Exploit Smart Contract Vulnerabilities? [BENCH] [SAFETY] cs.CR
Benchmark for vulnerability exploitation assessment in smart contracts
[2507.21504] Evaluation and Benchmarking of LLM Agents: A Survey [EVAL] [BENCH] [SURVEY] cs.AI cs.CL
Taxonomy of LLM agent evaluation covering objectives (behavior, capabilities, reliability, safety) and process (interaction, datasets, metrics, environments)
Cross-ref: 2503.16416 (evaluation survey), 2507.02825 (benchmark best practices), 2411.13768 (evaluation-driven)
[2507.15003] The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering [CODE] [SURVEY] cs.SE cs.AI
SE 3.0 vision of intent-driven conversational development with autonomous AI teammates operating at task level
Cross-ref: 2508.00083 (code generation survey), 2508.11126 (agentic programming), 2511.18538 (code foundation models)
[2507.11473] Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety [SAFETY] [EVAL] cs.AI cs.LG
Framework for monitoring LLM agent chain-of-thought reasoning with holistic risk assessments
Cross-ref: 2510.13653 (AI safety report), 2506.04133 (TRiSM), 2508.03858 (governance protocol)
[2507.06134] OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety [SAFETY] [EVAL] [BENCH] cs.AI cs.CR
Comprehensive framework for evaluating AI agent safety in real-world deployment scenarios
Cross-ref: 2506.04133 (TRiSM framework), 2508.03858 (governance protocol), 2410.06703 (ST-WebAgentBench)
[2507.23276] How Far Are AI Scientists from Changing the World?, gh/ResearAI/Awesome-AI-Scientist [SCI] [SURVEY]
Survey of research on AI scientists, AI researchers, AI engineers, and a series of AI-driven research studies
Cross-ref: 2408.06292 (automated scientific discovery implementation), 2506.18096 (systematic research agent analysis)
[2507.23096] ChatVis: Large Language Model Agent for Generating Scientific Visualizations [CODE] [SPEC] cs.HC
LLM assistant for generating Python code for scientific visualizations using chain-of-thought and RAG
Cross-ref: 2507.22414 (code explanations), 2410.09713 (agentic IR)
[2507.23095] SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing [MAS] [ARCH] cs.CL
Framework for compositional layout and content editing with reward-guided refinement
Cross-ref: 2501.06322 (collaboration mechanisms), 2510.02157 (VIS-ReAct)
[2507.22800] The Multi-Agent Fault Localization System Based on Monte Carlo Tree Search Approach [MAS] [SPEC] cs.SE
Multi-agent system for root cause analysis in microservices using LLMs with knowledge-based approach
Cross-ref: 2510.01751 (cybersecurity framework), 2501.06322 (collaboration mechanisms)
[2507.17131] Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance [AUTO] [ARCH] cs.AI cs.HC
Framework for enabling agents to self-improve through human-in-the-loop guidance and knowledge gap assessment
Cross-ref: 2508.15805 (ALAS autonomous learning), 2405.06682 (self-reflection effects)
[2507.22414] AutoCodeSherpa: Symbolic Explanations in AI Coding Agents [CODE] [ARCH]
Framework for providing symbolic explanations of code generation decisions in AI coding agents
Cross-ref: 2402.01030 (code action effectiveness), 2506.13131 (evolutionary coding approach)
[2507.21046] A SURVEY OF SELF-EVOLVING AGENTS: ON PATH TO ARTIFICIAL SUPER INTELLIGENCE, gh/CharlesQ9/Self-Evolving-Agents [SURVEY] [ARCH]
Comprehensive survey of self-improving AI agents and their potential path toward artificial superintelligence
Cross-ref: 2505.22954 (theoretical self-evolution framework), 2507.17311 (domain-specific self-evolution)
[2507.18074] AlphaGo Moment for Model Architecture Discovery, gh/GAIR-NLP/ASI-Arch [ARCH] [AUTO]
Automated neural architecture search using AI agents for discovering novel model architectures
Cross-ref: 2408.08435 (broader automated design scope), 2506.16499 (ML automation methods)
[2507.17311] EarthLink: A Self-Evolving AI Agent forClimate Science [SCI] [SPEC]
Self-improving AI agent specialized for climate science research and analysis
Cross-ref: 2507.21046 (general self-evolution theory), 2501.06590 (similar scientific domain agent)
[2507.17257] Agent Identity Evals: Measuring Agentic Identity [EVAL] [BENCH]
Evaluation framework for measuring and understanding agent identity and persona consistency
Cross-ref: 2411.13768 (evaluation methodology synergy), 2503.16416 (comprehensive evaluation landscape)
[2507.16940] AURA: A Multi-Modal Medical Agent forUnderstanding, Reasoning & Annotation [SPEC] [ARCH]
Multi-modal AI agent for medical data understanding, clinical reasoning, and annotation tasks
Cross-ref: 2408.08632 (multimodal benchmarking context), 2404.13501 (memory for complex reasoning)
[2507.10584] ARPaCCino: An Agentic-RAG for Policy as CodeCompliance [COMP] [TOOL]
Agentic RAG system for automated policy-as-code compliance checking and enforcement
Cross-ref: 2505.15872 (RAG benchmarking methods), 2410.09713 (agentic retrieval techniques)
[2507.05178] CREW-WILDFIRE: Benchmarking AgenticMulti-Agent Collaborations at Scale [BENCH] [MAS]
Large-scale benchmark for evaluating collaborative multi-agent systems in complex scenarios
Cross-ref: 2501.06322 (collaboration mechanism design), 2503.13657 (failure mode analysis)
[2507.02825] Establishing Best Practices for Building RigorousAgentic Benchmarks [BENCH] [EVAL]
Guidelines and methodology for creating robust evaluation benchmarks for agentic AI systems
Cross-ref: 2404.06411 (AgentQuest), 2308.03688 (AgentBench)
[2507.02097] The Future is Agentic: Definitions, Perspectives, and OpenChallenges of Multi-Agent Recommender Systems [MAS] [SURVEY]
Survey of multi-agent recommender systems, definitions, current perspectives, and future research directions
Cross-ref: 2501.06322 (collaboration mechanisms), 2507.05178 (CREW benchmark)

2025-06¶

[2506.07982] τ²-bench: A Benchmark for Tool Use with Dual-Control User-Agent Interactions [BENCH] [TOOL] [EVAL] cs.AI cs.CL
Extension of τ-bench with dual-control evaluation for user-agent tool interactions
Cross-ref: 2406.12045 (τ-bench), 2307.16789 (ToolLLM)
[2506.04625] Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning [TOOL] [AUTO] cs.AI cs.CL
Tool-MVR framework achieving state-of-the-art on StableToolBench, surpassing ToolLLM by 23.9% and GPT-4 by 15.3%
Cross-ref: 2307.16789 (ToolLLM), 2405.17935 (tool learning survey), 2406.12045 (τ-bench)
[2506.02548] CyberGym: Real CVE Vulnerability Assessment Benchmark [BENCH] [SAFETY] cs.CR
Security benchmark for evaluating agents on real CVE vulnerability detection and assessment
[2506.23408] Do LLMs Dream of Discrete Algorithms? [PLAN] [ARCH] cs.LG
Neurosymbolic approach augmenting LLMs with logic-based reasoning modules for improved agent planning precision
Cross-ref: 2210.03629 (ReAct planning), 2310.04406 (LATS reasoning)
[2506.23329] IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering [BENCH] [EVAL] cs.CV
Benchmark challenging vision-language agents to recreate 3D scene structures through tool use
Cross-ref: 2408.08632 (multimodal benchmarking), 2510.02271 (InfoMosaic-Bench)
AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model, gh/google-deepmind/alphagenome [SCI] [SPEC] [ARCH] bioRxiv
Google DeepMind’s unified DNA sequence model predicting functional genomic tracks at single base pair resolution across diverse modalities; matches or exceeds strongest models on 24/26 variant effect prediction evaluations
Published: 27 Jun 2025 (bioRxiv)
Cross-ref: 2510.01724 (MetaboT domain-specific), 2501.06590 (ChemAgent scientific)
[2506.23306] GATSim: Urban Mobility Simulation with Generative Agents [MAS] [SPEC] cs.AI
Urban mobility simulation framework using generative agents with adaptive behaviors and memory systems
Cross-ref: 2510.01297 (SimCity urban simulation), 2404.13501 (memory mechanisms)
[2506.18096] Deep Research Agents: A Systematic Examination And Roadmap, gh/ai-agents-2030/awesome-deep-research-agent [SCI] [SURVEY]
Comprehensive examination of AI agents for research tasks with roadmap for future development
Cross-ref: 2507.23276 (AI scientist impact assessment), 2501.04227 (practical research implementation)
[2506.16499] ML-Master: Towards AI-for-AI via Integration ofExploration and Reasoning [AUTO] [ARCH]
AI system for automated machine learning through integrated exploration and reasoning capabilities
Cross-ref: 2507.18074 (automated architecture search), 2411.10478 (workflow optimization survey)
[2506.13131] AlphaEvolve: A coding agent for scientific and algorithmic discovery [CODE] [SCI]
Evolutionary coding agent for automated scientific discovery and algorithm development
Cross-ref: 2507.22414 (code explanation methods), 2408.06292 (scientific discovery automation)
AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model, gh/google-deepmind/alphagenome [SCI] [SPEC] [ARCH] bioRxiv cs.AI q-bio.GN
Google DeepMind’s unified DNA sequence model predicting thousands of functional genomic tracks at single base pair resolution; matches or exceeds 24/26 variant effect prediction benchmarks
Trained on human and mouse genomes; provides API and tools for genome track and variant effect predictions from sequence
Cross-ref: 2510.01724 (MetaboT bioinformatics), 2501.06590 (ChemAgent domain applications)
[2506.04133] TRiSM for Agentic AI: A Review of Trust, Risk, and SecurityManagement in LLM-based Agentic Multi-Agent Systems [SAFETY] [MAS]
Framework for managing trust, risk, and security in LLM-based multi-agent systems
Cross-ref: 2508.03858 (runtime governance approach), 2408.02205 (layered safety model)
[2506.01438] Distinguishing Autonomous AI Agents from Collaborative Agentic Systems: A Comprehensive Framework for Understanding Modern Intelligent Architectures [ARCH] [SURVEY] cs.AI cs.MA
Framework for understanding the distinction between autonomous AI agents and collaborative agentic systems
Cross-ref: 2508.10146 (framework architectures), 2505.10468 (conceptual taxonomy)

2025-05¶

[2505.23135] VERINA: A Benchmark for Code Verification and Proof Generation [BENCH] [CODE] cs.SE
Code verification and automated proof generation benchmark
[2505.18878] CRMArena-Pro: Expanded Business Scenario Diversity [BENCH] [SPEC] cs.AI
Extended CRM benchmark with expanded business scenario coverage
[2505.21298] Large Language Models Miss the Multi-Agent Mark [MAS] [EVAL] cs.AI cs.CL
Analysis of LLM limitations in multi-agent scenarios with evaluation of collaborative performance gaps
Cross-ref: 2503.01935 (MultiAgentBench), 2507.05178 (CREW benchmark), 2501.06322 (collaboration mechanisms)
[2505.18646] SEW: Self-Evolving Agentic Workflows for Automated Code Generation [CODE] [AUTO] cs.SE cs.AI
Self-evolving framework automatically generating and optimizing multi-agent workflows without hand-crafted designs
Cross-ref: 2508.00083 (code generation survey), 2507.21046 (self-evolving survey), 2408.08435 (automated design)
[2505.12371] MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks [BENCH] [MAS] [SPEC] cs.AI cs.CL
Comprehensive medical benchmark for multi-agent collaboration across visual QA, lay summaries, EHR prediction, and workflow automation
Cross-ref: 2507.16940 (AURA medical), 2508.21803 (clinical problem detection), 2507.05178 (CREW)
[2505.22967] MermaidFlow: Redefining Agentic WorkflowGeneration via Safety-Constrained EvolutionaryProgramming, gh/chengqiArchy/MermaidFlow [AUTO] [SAFETY]
Safety-constrained evolutionary programming approach for agentic workflow generation
Cross-ref: 2408.08435 (automated design), 2507.21046 (self-evolving survey)
[2505.10468] AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges [SURVEY] [ARCH] cs.AI cs.CY
Comprehensive conceptual taxonomy distinguishing AI agents from agentic AI with application analysis
Cross-ref: 2506.01438 (architectural frameworks), 2308.11432 (foundational agent survey)
[2405.17935] Tool Learning with Foundation Models [TOOL] [SURVEY]
Comprehensive survey of tool learning capabilities in foundation models and LLMs for agentic applications
[2505.22954] Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents [ARCH] [AUTO]
Framework for open-ended evolution of self-improving AI agents based on Gödel machine principles
Cross-ref: 2507.21046 (self-evolving survey), 2408.01768 (living systems)
[2505.22583] GitGoodBench: A Novel Benchmark For Evaluating Agentic PerformanceOn Git, infodeepseek.github.io [BENCH] [CODE]
Benchmark for evaluating AI agent performance on Git version control tasks and workflows
Cross-ref: 2308.03688 (AgentBench), 2404.06411 (AgentQuest)
[2505.19764] Agentic Predictor: Performance Prediction for Agentic Workflows via Multi-View Encoding [EVAL] [ARCH]
System for predicting agent performance in complex workflows using multi-view encoding techniques
Cross-ref: 2411.13768 (evaluation-driven), 2410.22457 (workflow metrics)
[2505.18946] SANNet: A Semantic-Aware Agentic AI Networking Framework for Multi-Agent Cross-Layer Coordination [MAS] [ARCH]
Networking framework for semantic-aware coordination in multi-agent AI systems
Cross-ref: 2507.05178 (collaboration benchmark), 2501.06322 (collaboration mechanisms)
[2505.15872] InfoDeepSeek: Benchmarking Agentic InformationSeeking for Retrieval-Augmented Generation [BENCH] [TOOL]
Benchmark for evaluating agentic information-seeking capabilities in RAG systems
Cross-ref: 2410.09713 (agentic IR), 2507.10584 (compliance RAG)
[2505.18705] AI-Researcher: Fully Autonomous Research System from Literature Review to Publication, gh/HKUDS/AI-Researcher [RESEARCH] [AUTO] NeurIPS 2025 Spotlight
Fully autonomous AI research system transforming scientific discovery from literature review to publication-ready manuscripts
Features Writer Agent for automatic paper generation and Scientist-Bench for systematic research quality evaluation
Cross-ref: 2408.14033 (MLR-Copilot), 2501.10120 (PaSa), 2509.06917 (Paper2Agent)

2025-04¶

[2504.05559] SciSciGPT: Advancing Human-AI Collaboration in the Science of Science [SCI] [ARCH] [MAS] cs.AI cs.DL
Open-source AI collaborator with LLM Agent capability maturity model for human-AI research partnerships
Cross-ref: 2506.18096 (deep research agents), 2509.06917 (Paper2Agent), 2508.00414 (cognitive kernel)
[2504.01382] Online-Mind2Web: Live Web Task Evaluation Benchmark [BENCH] [TOOL] cs.AI
Live web agent evaluation with 300 real-world tasks
[2504.14064] DoomArena: Security Threat Testing for Agent Frameworks [BENCH] [SAFETY] cs.CR
Security benchmark testing agent framework vulnerabilities and threat resilience
[2504.18575] WASP: Prompt Injection Attack Resilience Benchmark [BENCH] [SAFETY] cs.CR
Benchmark for evaluating agent resilience to prompt injection attacks
[2504.17192] Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning [CODE] [SCI] [AUTO] cs.SE cs.AI
Multi-agent framework transforming ML papers into functional code with planning, analysis, and modular generation
Cross-ref: 2509.06917 (Paper2Agent), 2508.00083 (code generation survey), 2505.18705 (AI-Researcher)
[2504.19678] From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
Comprehensive review of the evolution from LLM reasoning to fully autonomous AI agents
[2504.16902] Building A Secure Agentic AI ApplicationLeveraging Google’s A2A Protocol
Guide for building secure agentic AI applications using Google’s Agent-to-Agent protocol

2025-03¶

[2503.23037] Agentic Large Language Models, a survey [SURVEY] [ARCH] cs.AI cs.CL
Survey of agentic LLMs showing mutual benefits across retrieval, tool use, reflection, reasoning, and multi-agent collaboration
Cross-ref: 2503.21460 (LLM agent survey), 2308.11432 (foundational survey), 2510.09244 (fundamentals)
[2503.01935] MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents [BENCH] [MAS] [EVAL] cs.AI cs.MA
Comprehensive benchmark for multi-agent systems measuring collaboration and competition quality across diverse coordination protocols
Cross-ref: 2507.05178 (CREW benchmark), 2512.08296 (scaling agent systems), 2501.06322 (collaboration mechanisms)
[2503.21460] Large Language Model Agent: A Survey on Methodology, Applications and Challenges [SURVEY] [ARCH]
Comprehensive survey of LLM agents covering methodology, applications, and current challenges
Cross-ref: 2308.11432 (foundational survey), 2404.11584 (architecture landscape)
[2503.16416] Survey on Evaluation of LLM-based Agents [SURVEY] [EVAL]
Survey of evaluation methods and benchmarks for LLM-based agent systems
Cross-ref: 2411.13768 (evaluation-driven), 2507.02825 (benchmark best practices)
[2503.14713] TestForge: Feedback-Driven, Agentic Test Suite Generation [AUTO] [CODE]
Agentic system for automated test suite generation using feedback-driven approaches
Cross-ref: 2410.14393 (debug agents), 2402.01030 (executable code)
[2503.13657] Why Do Multi-Agent LLM Systems Fail? [MAS] [SAFETY]
Analysis of failure modes and challenges in multi-agent LLM systems
Cross-ref: 2507.05178 (MAS benchmarking), 2506.04133 (TRiSM safety)
[2503.08979] AGENTIC AI FOR SCIENTIFIC DISCOVERY: A SURVEY OF PROGRESS, CHALLENGES, AND FUTURE DIRECTION [SCI] [SURVEY]
Survey of agentic AI applications in scientific discovery with progress assessment and future directions
Cross-ref: 2408.06292 (AI scientist), 2506.18096 (deep research agents)
[2503.06416] Advancing AI Negotiations:New Theory and Evidence from a Large-ScaleAutonomous Negotiation Competition [MAS] [BENCH]
Theory and empirical evidence from large-scale autonomous agent negotiation competitions
Cross-ref: 2507.05178 (collaboration benchmark), 2501.06322 (collaboration mechanisms)
[2503.00237] Agentic AI Needs a Systems Theory [ARCH] [SURVEY]
Argument for developing systems theory approaches to understand and design agentic AI
Cross-ref: 2404.11584 (architecture landscape), 2503.21460 (methodology survey)

2025-02¶

[2502.06559] Can We Trust AI Benchmarks? An Interdisciplinary Review [BENCH] [EVAL] [SURVEY] cs.AI
Interdisciplinary review of ~100 studies on benchmark shortcomings: dataset biases, data contamination, construct validity, and gaming
Cross-ref: 2507.02825 (agentic benchmark checklist), 2308.03688 (AgentBench)
[2502.12110] A-Mem: Agentic Memory for LLM Agents [MEM] [ARCH] cs.AI cs.CL
Autonomous memory system with contextual description generation and connection establishment for continuous evolution
Cross-ref: 2512.13564 (memory survey), 2601.01885 (agentic memory), 2512.18950 (MACLA)
[2502.14776] SurveyX: Academic Survey Automation via Large Language Models [AUTO] [SCI]
Framework for automating academic survey generation and literature review using LLMs
Cross-ref: 2506.18096 (deep research agents), 2501.04227 (research assistants)
[2502.05957] AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents [AUTO] [ARCH]
Zero-code framework for creating and deploying LLM agents without programming requirements
Cross-ref: 2412.04093 (practical considerations), 2501.00881 (industry guide)
[2502.02649] Fully Autonomous AI Agents Should Not be Developed [SAFETY] [SURVEY]
Position paper arguing against development of fully autonomous AI agents with safety considerations
Cross-ref: 2302.10329 (harms analysis), 2402.04247 (safeguarding priority)

2025-01¶

[2501.13956] Zep: A Temporal Knowledge Graph Architecture for Agent Memory, getzep.com [MEM] [BENCH] cs.AI cs.CL
Introduces Zep, a memory layer service using Graphiti (temporally-aware KG engine) that outperforms MemGPT on DMR (94.8% vs 93.4%) and achieves +18.5% accuracy on LongMemEval; addresses static-document RAG limitations via dynamic synthesis of conversational and structured business data
Establishes LongMemEval as the more representative enterprise memory benchmark vs DMR
Cross-ref: 2601.03236 (MAGMA multi-graph), 2512.13564 (memory survey), 2404.13501 (memory mechanisms)
[2501.14654] MedAgentBench: Benchmark for Virtual EHR Healthcare Workflows [BENCH] [SPEC] cs.AI
Healthcare agent benchmark evaluating performance on virtual electronic health record workflows
[2501.11067] IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems, gh/plurai-ai/intellagent [EVAL] [MAS] [BENCH] cs.AI cs.CL
Multi-agent framework for comprehensive diagnosis and optimization of conversational agents using simulated realistic synthetic interactions
Provides systematic evaluation methodology to uncover agent blind spots and improve performance
Cross-ref: 2503.16416 (evaluation survey), 2507.02825 (benchmark best practices), 2411.13768 (evaluation-driven)
[2501.17112] Decoding Human Preferences in Alignment: An Improved Approach to Inverse Constitutional AI [SAFETY] [ARCH] cs.AI cs.LG
Improved approach for inverse constitutional AI and human preference alignment in agent systems
Cross-ref: 2406.07814 (collective constitutional AI), 2212.08073 (foundational constitutional AI)
[2501.10114] Infrastructure for AI Agents [ARCH] [COMP] cs.AI cs.SE
Infrastructure requirements and protocols for deploying AI agents in production environments
Cross-ref: 2508.10146 (framework architectures), 2412.04093 (practical considerations)
[2501.16150] AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants [SURVEY] [SPEC]
Review of AI agents for computer control, GUI automation, and operator assistance systems
Cross-ref: 2410.14393 (debug agents), 2503.14713 (test generation)
[2501.06590] ChemAgent [SCI] [SPEC]
AI agent system specialized for chemistry research and chemical compound analysis
Cross-ref: 2507.17311 (EarthLink climate), 2507.16940 (AURA medical)
[2501.06322] Multi-Agent Collaboration Mechanisms: A Survey of LLMs [MAS] [SURVEY]
Survey of collaboration mechanisms in multi-agent LLM systems and coordination strategies
Cross-ref: 2507.05178 (CREW benchmark), 2503.06416 (negotiation competition)
[2501.04227] Agent Laboratory: Using LLM Agents as Research Assitants, AgentRxiv:Towards Collaborative Autonomous Research [SCI] [ARCH]
Framework for using LLM agents as research assistants in academic and scientific workflows
Cross-ref: 2506.18096 (deep research agents), 2502.14776 (SurveyX)
[2501.00881] Agentic Systems: A Guide to Transforming Industries with Vertical AI Agents [SPEC] [SURVEY]
Guide for implementing vertical AI agents across different industries and use cases
Cross-ref: 2412.04093 (practical considerations), 2408.06361 (financial trading)
[2501.10120] PaSa: LLM-Powered Paper Search Agent with Reinforcement Learning [RESEARCH] [TOOL]
LLM-powered paper search agent using reinforcement learning trained on AutoScholarQuery dataset with 35k academic queries
Autonomous search workflow with tool invocation, paper reading, and reference filtering for comprehensive scholarly search
Cross-ref: 2505.18705 (AI-Researcher), 2312.07559 (PaperQA), 2501.04227 (Agent Laboratory)

2024-12¶

[2412.14470] Agent-SafetyBench: Evaluating the Safety of LLM Agents [BENCH] [SAFETY] [EVAL] cs.AI cs.CL
Safety benchmark with 349 interaction environments and 2,000 test cases evaluating 8 risk categories and 10 failure modes
Cross-ref: 2412.13178 (SafeAgentBench), 2507.06134 (OpenAgentSafety), 2402.05044 (SALAD-Bench)
[2412.14161] TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks [BENCH] [EVAL] cs.AI cs.SE
Benchmark testing agents in simulated software company environment with real-world consequential tasks
Cross-ref: 2509.10769 (AgentArch enterprise), 2511.14136 (CLEAR framework), 2308.03688 (AgentBench)
[2412.13178] SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents [BENCH] [SAFETY] [SPEC] cs.AI cs.RO
First comprehensive benchmark for safety-aware task planning with 750 tasks covering 10 hazards in embodied agents
Cross-ref: 2412.14470 (Agent-SafetyBench), 2507.06134 (OpenAgentSafety), 2512.20798 (constraint violations)
[2412.17149] A Multi-AI Agent System for Autonomous Optimization of Agentic AISolutions via Iterative Refinement and LLM-Driven Feedback Loop [MAS] [AUTO]
Multi-agent system for autonomous optimization of agentic AI solutions using iterative refinement and LLM feedback loops
Cross-ref: 2408.08435 (automated design), 2507.18074 (architecture discovery)
[2412.04093] Practical Considerations for Agentic LLM Systems [ARCH] [COMP]
Practical guidance for implementing and deploying agentic LLM systems in production
Cross-ref: 2411.05285 (agentops taxonomy), 2502.05957 (AutoAgent)
[2412.05467] BrowserGym: A Gym Environment for Web Task Automation [BENCH] [TOOL] cs.AI
Web agent benchmarking ecosystem with standardized evaluation framework
[2412.17259] LegalAgentBench: Evaluating LLM Agents in Legal Domain [BENCH] [SPEC] cs.CL
Benchmark for evaluating agents in Chinese legal domain tasks

2024-11¶

[2411.13768] Evaluation-driven Approach to LLM Agents [EVAL] [ARCH]
Framework for designing and improving LLM agents through evaluation-driven development
Cross-ref: 2503.16416 (comprehensive evaluation taxonomy), 2507.17257 (specific identity evaluation methods)
[2411.13543] BALROG: BENCHMARKING AGENTIC LLM ANDVLM REASONING ON GAMES [BENCH] [EVAL]
Benchmark for evaluating agentic reasoning capabilities of LLMs and VLMs in game environments
Cross-ref: 2308.03688 (foundational agent benchmarking), 2404.06411 (modular benchmark design)
[2411.10478] Large Language Models for Constructing and Optimizing Machine Learning Workflows: A Survey [AUTO] [SURVEY]
Survey of LLMs for automated machine learning workflow construction and optimization
Cross-ref: 2506.16499 (practical ML automation), 2507.18074 (automated architecture search)
[2411.00927] ReSpAct: Harmonizing Reasoning, Speaking, and Acting Towards Building Large Language Model-Based Conversational AI Agents [ARCH] [PLAN] cs.AI cs.CL
Extension of ReAct framework for conversational AI agents with integrated reasoning, speaking, and acting
Cross-ref: 2210.03629 (foundational ReAct), 2403.14589 (ReAct training autonomy)
[2411.05285] A taxonomy of agentops for enabling observability of foundation model based agents [COMP] [ARCH]
Taxonomy and framework for observability and operations of foundation model-based agents
Cross-ref: 2412.04093 (practical considerations), 2507.10584 (compliance RAG)
[2411.07763] Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows [BENCH] [SPEC] cs.DB
Enterprise text-to-SQL benchmark with real-world database workflows
[2411.02305] CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks [BENCH] [SPEC] cs.AI
Benchmark for evaluating agents on enterprise CRM professional scenarios

2024-10¶

[2410.09024] AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [BENCH] [SAFETY] [EVAL] cs.AI cs.CR
Benchmark measuring harmful behaviors in LLM agents across malicious and benign agentic scenarios
Accepted at ICLR 2025
Cross-ref: 2512.20798 (constraint violations), 2507.06134 (OpenAgentSafety), 2510.23883 (agentic AI security)
[2410.22457] Advancing Agentic Systems: Dynamic Task Decomposition, Tool Integration and Evaluation using Novel Metrics and Dataset [EVAL] [TOOL] [PLAN]
Framework for dynamic task decomposition and tool integration in agentic systems with evaluation metrics
Cross-ref: 2405.17935 (foundational tool learning theory), 2402.02716 (planning mechanism foundations)
[2410.14393] Debug Smarter, Not Harder: AI Agents for Error Resolution in Computational Notebooks [CODE] [AUTO]
AI agents for automated debugging and error resolution in computational notebook environments
Cross-ref: 2503.14713 (automated testing synergy), 2402.01030 (executable code foundations)
[2410.07959] Compl-AI Framework: A Technical Interpretation and LLM Benchmarking [BENCH] [COMP]
Technical framework for interpreting and benchmarking LLM compliance and capabilities
Cross-ref: 2411.05285 (observability framework overlap), 2412.04093 (deployment considerations)
[2410.06703] ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents [BENCH] [SAFETY] cs.AI cs.CR
Benchmark for evaluating safety and trustworthiness of web agents in enterprise environments
Cross-ref: 2307.13854 (WebArena foundation), 2401.13649 (VisualWebArena)
[2410.04444] Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement [AUTO] [ARCH] cs.AI cs.LG
Self-referential framework inspired by Gödel machines enabling recursive self-improvement without predefined routines
Cross-ref: 2505.22954 (Darwin Godel Machine), 2508.15805 (ALAS autonomous learning)
[2410.02810] StateAct: State Tracking and Reasoning for Acting and Planning with Large Language Models [PLAN] [ARCH] cs.AI cs.CL
Framework for state tracking and reasoning in LLM-based agents for improved planning and acting
Cross-ref: 2210.03629 (ReAct foundation), 2310.04406 (LATS reasoning)
[2410.09713] Agentic Information Retrieval [TOOL] [ARCH]
Framework for agentic approaches to information retrieval and knowledge discovery
Cross-ref: 2505.15872 (InfoDeepSeek), 2507.10584 (policy compliance RAG)
[2408.08435] AUTOMATED DESIGN OF AGENTIC SYSTEMS [AUTO] [ARCH]
Automated methodology for designing and optimizing agentic AI systems
Cross-ref: 2507.18074 (architecture discovery), 2506.16499 (ML-Master)
[2408.01768] Building Living Software Systems with Generative & Agentic AI [ARCH] [AUTO]
Approach for creating self-evolving software systems using generative and agentic AI
Cross-ref: 2507.21046 (self-evolving survey), 2505.22954 (Darwin Godel)

2024-09¶

[2409.11363] CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark [BENCH] [SCI] [EVAL] cs.AI
Computational reproducibility benchmark for assessing agent ability to verify research claims
Cross-ref: 2407.13168 (SciCode), 2412.14161 (TheAgentCompany)

2024-08¶

[2408.14033] MLR-Copilot: Autonomous Machine Learning Research Framework, gh/du-nlp-lab/MLR-Copilot [RESEARCH] [AUTO]
Autonomous machine learning research framework with three-phase pipeline for idea generation, implementation, and validation
Mimics researchers’ thought processes for systematic ML research automation and executable research contributions
Cross-ref: 2505.18705 (AI-Researcher), 2501.10120 (PaSa), 2408.06292 (AI Scientist)
[2408.06361] Large Language Model Agent in Financial Trading: A Survey
Survey of LLM agents in financial trading applications and market analysis
[2408.06292] The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Framework for fully automated scientific discovery using AI agents
[2408.08632] A Survey on Benchmarks of Multimodal Large Language Models [BENCH] [SURVEY]
Comprehensive survey of benchmarks for evaluating multimodal LLMs and their capabilities
Cross-ref: 2411.13543 (BALROG games), 2507.16940 (AURA multimodal)
[2408.02205] A Taxonomy of Multi-layered Runtime Guardrails for Designing Foundation Model-based Agents: Swiss Cheese Model for AI Safety by Design [SAFETY] [ARCH]
Taxonomy of multi-layered runtime guardrails for safe foundation model-based agent design using Swiss cheese safety model
Cross-ref: 2508.03858 (governance protocol), 2506.04133 (TRiSM)

2024-07¶

[2407.13168] SciCode: A Research Coding Benchmark Curated by Scientists [BENCH] [CODE] [SCI] cs.AI cs.SE
Scientific domain code problems requiring research-level understanding
Cross-ref: 2507.15003 (SE 3.0), 2508.00083 (code generation survey)
[2407.18901] AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents [BENCH] [CODE] cs.SE
Multi-app coding agent benchmark with controllable environments
[2407.18416] PersonaGym: Evaluating Persona Agents and LLMs [BENCH] [SPEC] cs.CL
Benchmark for persona-following agent evaluation
[2407.13943] Werewolf Arena: Strategic Reasoning in LLM Agents [BENCH] [SPEC] cs.AI
Social deduction game benchmark for strategic reasoning assessment
[2407.18219] Recursive Introspection: Teaching Language Model Agents How to Self-Improve [AUTO] [ARCH] cs.AI cs.LG
RISE framework for fine-tuning LLMs to introduce recursive introspection and self-improvement capabilities
Cross-ref: 2405.06682 (self-reflection effects), 2410.04444 (Gödel Agent recursive)

2024-06¶

[2406.12045] τ-bench: A Benchmark for Tool-Agent-User Interaction [BENCH] [TOOL] [EVAL] cs.AI cs.CL
Benchmark evaluating agents on tool use, user interaction, and domain-specific rule adherence; introduces pass^k consistency metric
Cross-ref: 2506.07982 (τ²-bench), 2307.16789 (ToolLLM), 2308.03688 (AgentBench)
[2406.01495] Re-ReST: Reflection-Reinforced Self-Training for Language Agents [AUTO] [ARCH] cs.AI cs.LG
Reflection-reinforced self-training approach using environmental feedback to enhance sample quality and agent performance
Cross-ref: 2303.11366 (Reflexion foundation), 2407.18219 (recursive introspection)

2024-05¶

[2405.06682] Self-Reflection in LLM Agents: Effects on Problem-Solving Performance [AUTO] [EVAL] cs.AI cs.CL
Empirical study demonstrating significant improvement in problem-solving through self-reflection mechanisms
Cross-ref: 2407.18219 (recursive introspection), 2303.11366 (Reflexion framework)

2024-04¶

[2404.07972] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [BENCH] [EVAL] [TOOL] cs.AI
Comprehensive OS/web task benchmark across multiple applications with real-world grounding
Cross-ref: 2307.13854 (WebArena), 2412.14161 (TheAgentCompany), 2401.13649 (VisualWebArena)
[2404.13501] A Survey on the Memory Mechanism of Large Language Model based Agents [MEM] [SURVEY]
Survey of memory mechanisms and architectures in LLM-based agent systems
Cross-ref: 2507.16940 (complex reasoning memory needs), 2503.21460 (broader agent architecture context)
[2404.11584] Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling [ARCH] [SURVEY] [PLAN] [TOOL]
Survey of emerging AI agent architectures focusing on reasoning, planning, and tool calling capabilities
Cross-ref: 2405.17935 (tool integration foundations), 2402.02716 (planning mechanism details)
[2404.06411] AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents [BENCH] [EVAL]
Modular benchmark framework for measuring progress and improvement in LLM agent capabilities
Cross-ref: 2308.03688 (comprehensive benchmarking precedent), 2401.13178 (multi-turn evaluation focus)
[2404.10952] Can Language Models Solve Olympiad Programming? [BENCH] [CODE] cs.AI
USACO benchmark for programming competition problem-solving

2024-02¶

[2402.05044] SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [BENCH] [SAFETY] [EVAL] cs.CL cs.AI
Hierarchical safety benchmark with large scale, rich diversity, and intricate three-level taxonomy
Accepted at ACL 2024 (Findings)
Cross-ref: 2412.14470 (Agent-SafetyBench), 2507.06134 (OpenAgentSafety), 2510.23883 (agentic AI security)
[2402.06360] CoSearchAgent: A Lightweight Collaborative Search Agent with Large Language Models [TOOL] [MAS]
Lightweight collaborative search agent system using LLMs for information retrieval
Cross-ref: 2410.09713 (agentic IR), 2505.15872 (InfoDeepSeek)
[2402.04247] Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science [SAFETY] [SCI]
Analysis of safety risks and the need to prioritize safeguarding over autonomy in scientific LLM agents
Cross-ref: 2302.10329 (harms analysis), 2502.02649 (autonomy concerns)
[2402.02716] Understanding the planning of LLM agents: A survey [PLAN] [SURVEY]
Survey of planning mechanisms and strategies in LLM-based agent systems
Cross-ref: 2404.11584 (reasoning architectures), 2508.03682 (self-questioning)
[2402.01030] Executable Code Actions Elicit Better LLM Agents [CODE] [ARCH]
Framework showing how executable code actions improve LLM agent performance
Cross-ref: 2507.22414 (code explanations), 2503.14713 (test generation)

2024-01¶

[2401.13178] AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [BENCH] [EVAL]
Analytical evaluation framework for multi-turn interactions and performance assessment of LLM agents
Cross-ref: 2308.03688 (broader agent evaluation scope), 2404.06411 (modular evaluation approach)

2023-08¶

[2308.11432] A Survey on Large Language Model based Autonomous Agents [SURVEY] [ARCH] cs.AI cs.CL
Foundational survey of LLM-based autonomous agents, covering architecture, capabilities, and applications
Cross-ref: 2503.21460 (methodology evolution), 2404.11584 (architectural advances)
[2308.03688] AgentBench: Evaluating LLMs as Agents [BENCH] [EVAL] cs.AI cs.CL
Comprehensive benchmark for evaluating LLMs as autonomous agents across diverse tasks and environments
Cross-ref: 2404.06411 (modular benchmark evolution), 2401.13178 (multi-turn evaluation specialization)

2023-04¶

[2304.08244] API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs [BENCH] [TOOL] [EVAL] cs.CL cs.AI
Comprehensive benchmark with 73 API tools, 314 tool-use dialogues, and 753 API calls for evaluating planning, retrieval, and calling
Published at EMNLP 2023; includes 1,888 training dialogues from 2,138 APIs across 1,000 domains
Cross-ref: 2307.16789 (ToolLLM), 2310.03128 (MetaTool), 2405.17935 (tool learning survey)
[2304.05376] ChemCrow: LLM Chemistry Agent with Expert-Designed Tools [RESEARCH] [SCI] [TOOL]
LLM chemistry agent augmented with 18 expert-designed tools for organic synthesis, drug discovery, and materials design
Autonomous synthesis planning and execution with emergent capabilities from tool combination
Cross-ref: 2310.10632 (BioPlanner), 2501.06590 (ChemAgent), 2505.18705 (AI-Researcher)

2023-03¶

[2303.11366] Reflexion: Language Agents with Verbal Reinforcement Learning [AUTO] [ARCH] cs.AI cs.CL
Foundational framework for self-reflective agents using verbal reinforcement learning and iterative improvement
Cross-ref: 2405.06682 (self-reflection effects), 2406.01495 (Re-ReST extension)

2023-02¶

[2302.10329] Harms from Increasingly Agentic Algorithmic Systems [SAFETY] [SURVEY]
Analysis of potential harms and risks from increasingly autonomous algorithmic systems
Cross-ref: 2508.03858 (governance solutions), 2506.04133 (risk management framework)

2023-07¶

[2307.16789] ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [TOOL] [BENCH] cs.AI cs.CL
Framework for training LLMs to master real-world APIs with comprehensive tool benchmarking
Cross-ref: 2405.17935 (tool learning survey), 2406.12045 (τ-bench evaluation)
[2307.13854] WebArena: A Realistic Web Environment for Building Autonomous Agents [BENCH] [SPEC] cs.AI cs.HC
Realistic web environment benchmark for evaluating autonomous agents on web-based tasks
Cross-ref: 2401.13649 (VisualWebArena), 2410.06703 (ST-WebAgentBench safety)

2023-10¶

[2310.10632] BioPlanner: Automated AI Approach for Protocol Planning in Biology, gh/bioplanner/bioplanner [RESEARCH] [SCI]
Automated protocol generation for biological experiments using LLMs with BIOPROT dataset of 9,000+ protocols
Generates accurate experimental protocols from natural language with real-world laboratory validation
Cross-ref: 2505.18705 (AI-Researcher), 2304.05376 (ChemCrow), 2501.06590 (ChemAgent)
[2310.04406] Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models [PLAN] [ARCH] cs.AI cs.CL
LATS framework integrating Monte Carlo Tree Search with LM reasoning, acting, and planning capabilities
Cross-ref: 2210.03629 (ReAct foundation), 2410.02810 (StateAct)
[2310.03128] MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use [BENCH] [TOOL] cs.AI cs.CL
Benchmark for evaluating LLM tool selection and usage decision-making capabilities
[2310.08367] Minecraft Gaming Agent Benchmark [BENCH] [SPEC] cs.AI
Open-ended game environment for evaluating agent learning and exploration
Cross-ref: 2307.16789 (ToolLLM), 2406.12045 (τ-bench)

2023-12¶

[2312.07559] PaperQA: Open-Source RAG Agent for Scientific Literature Question Answering, gh/Future-House/paper-qa [RESEARCH] [TOOL]
RAG agent for answering questions over scientific literature with hallucination reduction and provenance tracking
Information retrieval across full-text articles with source attribution for transparent evaluation evidence
Cross-ref: 2505.18705 (AI-Researcher), 2501.10120 (PaSa), 2509.06917 (Paper2Agent)

2023-11¶

[2311.12983] GAIA: a benchmark for General AI Assistants [BENCH] [EVAL] cs.AI cs.CL
Benchmark with 466 real-world questions requiring reasoning, multi-modality, web browsing, and tool-use proficiency
Human performance: 92% vs GPT-4 with plugins: 15%; leaderboard at https://huggingface.co/gaia-benchmark
Cross-ref: 2308.03688 (AgentBench), 2404.06411 (AgentQuest), 2307.13854 (WebArena)

2022-12¶

[2212.08073] Constitutional AI: Harmlessness from AI Feedback [SAFETY] [ARCH] cs.AI cs.LG
Foundational constitutional AI approach for training harmless AI systems through AI feedback
Cross-ref: 2406.07814 (collective constitutional AI), 2501.17112 (inverse constitutional AI)

2022-10¶

[2210.03629] ReAct: Synergizing Reasoning and Acting in Language Models [PLAN] [ARCH] cs.AI cs.CL
Foundational ReAct framework for interleaving reasoning and acting in language model agents
Cross-ref: 2411.00927 (ReSpAct extension), 2310.04406 (LATS integration)

2022-07¶

[2207.01206] WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents [BENCH] [SPEC] cs.AI
E-commerce web interaction benchmark for evaluating grounded language agents

2020-10¶

[2010.03768] ALFWorld: Aligning Text and Embodied Environments for Interactive Learning [BENCH] [SPEC] cs.AI
Text-to-embodied task alignment benchmark for interactive agent learning

Practitioner Resources¶

Industry blog posts and engineering articles providing implementation insights and production patterns.

Effective Harnesses for Long-Running Agents - Anthropic Engineering (2025)
Two-agent harness pattern: Initializer + Coding agents for context window management
Key patterns: JSON feature lists, git-based state tracking, incremental development
Failure modes: Premature completion, undocumented progress, testing gaps, setup confusion
Cross-ref: 2512.13564 (memory systems), 2509.25250 (long-running agents), 2510.01003 (repository memory)
Inspect AI - UK AI Safety Institute (2025)
100+ pre-built evaluations, three-component model (datasets, solvers, scorers)
Direct PydanticAI support, MCP integration, multi-agent compositions
Cross-ref: 2507.21504 (evaluation taxonomy), 2503.16416 (evaluation survey)
Bloom - Anthropic (2025)
Four-stage behavioral evaluation: Understanding → Ideation → Rollout → Judgment
Elicitation rate metric (≥7/10 threshold), meta-judge for suite-level analysis
Cross-ref: 2507.06134 (OpenAgentSafety), 2412.14470 (Agent-SafetyBench)
Petri - Anthropic (2025)
Auditor/Target/Judge architecture for alignment auditing, built on Inspect AI
Multi-turn audits, transcript scoring (deception, oversight subversion, harmful content)
Cross-ref: 2410.09024 (AgentHarm), 2402.05044 (SALAD-Bench)
DeepEval AI Agent Evaluation Guide - Confident AI (2025)
Three-layer evaluation model: Reasoning (plan quality/adherence), Action (tool/argument correctness), Execution (task completion/efficiency)
Component-level metric attachment via @observe() decorator pattern
GEval framework for custom LLM-as-Judge criteria using plain English definitions
Cross-ref: 2503.16416 (evaluation survey), 2507.21504 (LLM agents survey)
Pydantic Evals - Pydantic (2025)
Span-based evaluation using OpenTelemetry for internal agent behavior analysis
Loosely coupled framework evaluating any callable (not dependent on pydantic-ai)
Flexible scoring (0.0-1.0 float) with Logfire integration for web-based visualization
Philosophy: “Correctness depends on how the answer was reached, not just the final output”
Cross-ref: 2411.05285 (AgentOps observability taxonomy), 2503.16416 (evaluation survey)
Arize Phoenix Multi-Agent Evaluation - Arize (2025)
Three evaluation strategies: Agent Handoff, System-Level, Coordination
Multi-level metrics: Agent, Interaction, System, User performance measurement
Five coordination patterns: Network, Supervisor, Hierarchical, Tool-calling, Custom Workflow
Handoff evaluation: Appropriateness, information transfer, timing
Cross-ref: 2501.06322 (collaboration mechanisms), 2503.13657 (MAS failures), 2512.08296 (scaling agent systems)
Claude Evaluation Framework - Anthropic (2025)
SMART success criteria (Specific, Measurable, Achievable, Relevant); grading hierarchy: Code-based (fastest) → LLM-based (nuanced) → Human (flexible)
Best practice: Volume over quality; encourage reasoning before scoring
Bloom correlation: Claude Opus 4.1 (0.86), Sonnet 4.5 (0.75) with human scores
Cross-ref: Bloom (alignment.anthropic.com), 2503.16416 (evaluation survey)
Pydantic Logfire - Pydantic (2025-2026)
First-party OpenTelemetry-based observability for PydanticAI agents via logfire.instrument_pydantic_ai()
Three instrumentation paths: Logfire cloud, raw OpenTelemetry with custom TracerProvider, or hybrid routing to alternative backends
Multi-language SDKs (Python, TypeScript, Rust); follows OpenTelemetry GenAI Semantic Conventions
Cross-ref: Pydantic Evals (above), 2602.10133 (AgentTrace), 2601.00481 (MAESTRO)
How to Build a Production Agentic App, the Pydantic Way - Pydantic (2026)
End-to-end guide combining Pydantic AI (agents), Logfire (observability), Pydantic Evals (evaluation), and FastAPI (serving)
Demonstrates full agentic stack: agent → instrument → evaluate → deploy pattern
Cross-ref: Pydantic Evals (above), Pydantic Logfire (above)
OpenTelemetry AI Agent Observability Blog - OpenTelemetry (2025)
Establishes need for standardized agent observability; covers OpenTelemetry GenAI semantic conventions for agent tracing
Cross-ref: 2508.02121 (AgentOps survey), 2602.10133 (AgentTrace)
OTel GenAI Agentic Systems Semantic Conventions Proposal - OpenTelemetry (2025)
Defines attributes for tracing tasks, actions, agents, teams, artifacts, and memory in OpenTelemetry
Standardizes telemetry across complex AI workflows for traceability, reproducibility, and analysis
Cross-ref: 2601.00481 (MAESTRO), 2602.10133 (AgentTrace)
otel-tui - ymtdzzz (2025)
Terminal-based OpenTelemetry trace viewer; single binary accepting OTLP on ports 4317/4318
Zero-infrastructure local debugging; referenced in PydanticAI docs as alternative local backend
Cross-ref: Pydantic Logfire (above), Arize Phoenix (trace_observe_methods.md)
MITRE ATLAS - MITRE (2021-2026)
Adversarial Threat Landscape for Artificial-Intelligence Systems; ATT&CK-style framework for AI/ML threats
2026 updates add agentic AI attack surfaces: runtime decision manipulation, credential abuse, tool misuse, AI Service API (AML.T0096)
Cross-ref: 2510.23883 (agentic AI security), 2506.04133 (TRiSM), OWASP MAESTRO (below)
OWASP MAESTRO Framework - OWASP GenAI Security Project (2025)
Multi-Agent Environment, Security, Threat, Risk, and Outcome; 7-layer threat modeling for multi-agent systems
Applies OWASP ASI threat taxonomy to MAS: Tool Misuse, Intent Manipulation, Privilege Compromise; companion to MITRE ATLAS
Cross-ref: MITRE ATLAS (above), 2503.13657 (MAS failures), 2601.00911 (privacy-preserving agents)
NIST AI Risk Management Framework (AI RMF 1.0) - NIST (2023)
Four core functions: Govern, Map, Measure, Manage for trustworthy AI lifecycle risk management
Flexible, voluntary framework; official crosswalk to ISO/IEC 42001 available from NIST
Cross-ref: ISO 42001 (below), ISO 23894 (below), 2506.04133 (TRiSM)
ISO/IEC 42001:2023 - ISO/IEC (2023)
World’s first AI management system standard; requirements for establishing, implementing, and maintaining an AIMS
Covers ethical considerations, transparency, continuous learning, auditability, and data handling
Cross-ref: NIST AI RMF (above), ISO 23894 (below)
ISO/IEC 23894:2023 - ISO/IEC (2023)
AI risk management guidance; provides principles and processes for managing risk specific to AI systems
Complements ISO 42001 (management system) with focused risk assessment and treatment methodology
Cross-ref: ISO 42001 (above), NIST AI RMF (above)
See docs/analysis/ai-security-governance-frameworks.md for detailed comparative analysis of all four frameworks applied to Agents-eval

Further Reading

Overview¶

Usage¶