
Further Reading

Overview

This document is a curated collection of research papers on agentic AI systems, evaluation frameworks, and related topics. Papers are organized chronologically to show how the research has evolved, with thematic tags and cross-references for efficient navigation.

Usage

  • Browse chronologically by year/month to track research evolution
  • Filter by tags like [EVAL], [SAFETY], [MAS] to find papers by topic
  • Follow cross-references with explanations to discover related work
  • Use thematic clusters at the end for quick topic-based navigation
  • Search arXiv IDs to quickly locate specific papers

Document Features

  • 260+ papers covering 2020-2026 research
  • 14 thematic tags for categorization
  • Cross-references with relationship explanations
  • Chronological organization preserving research timeline
  • Thematic clustering summary for quick navigation

Paper Tags and Categories Legend

  • [ARCH] - Architecture and system design
  • [AUTO] - Automation and workflow
  • [BENCH] - Benchmarking and performance measurement
  • [CODE] - Code generation and programming
  • [COMP] - Compliance and observability
  • [EVAL] - Evaluation frameworks and benchmarks
  • [MAS] - Multi-agent systems
  • [MEM] - Memory mechanisms
  • [PLAN] - Planning and reasoning
  • [SAFETY] - Safety, governance, and risk management
  • [SCI] - Scientific discovery and research
  • [SPEC] - Domain-specific applications
  • [SURVEY] - Survey and review papers
  • [TOOL] - Tool use and integration

Thematic Clusters

Evaluation & Benchmarking [EVAL] [BENCH]:

  • General benchmarks: 2308.03688 (AgentBench), 2404.06411 (AgentQuest), 2401.13178 (AgentBoard), 2311.12983 (GAIA)
  • Web agents: 2307.13854 (WebArena), 2401.13649 (VisualWebArena), 2410.06703 (ST-WebAgentBench), 2404.07972 (OSWorld), 2412.05467 (BrowserGym), 2504.01382 (Online-Mind2Web), 2207.01206 (WebShop)
  • Tool evaluation: 2307.16789 (ToolLLM), 2310.03128 (MetaTool), 2406.12045 (τ-bench), 2506.07982 (τ²-bench), 2304.08244 (API-Bank EMNLP 2023), BFCL
  • Scientific: 2407.13168 (SciCode), 2409.11363 (CORE-Bench)
  • Enterprise: 2509.10769 (AgentArch), 2511.14136 (CLEAR framework), 2412.14161 (TheAgentCompany), 2411.07763 (Spider 2.0), 2411.02305 (CRMArena), 2508.00828 (Finance), 2501.14654 (MedAgentBench)
  • Code/SE: 2407.18901 (AppWorld), SWE-bench verified, 2404.10952 (USACO), 2507.05558 (Smart Contract)
  • Security: 2504.14064 (DoomArena), 2504.18575 (WASP), 2506.02548 (CyberGym)
  • Gaming/Embodied: 2407.13943 (Werewolf), 2310.08367 (Minecraft), 2010.03768 (ALFWorld), 2407.18416 (PersonaGym)
  • Multi-agent: 2503.01935 (MultiAgentBench), 2512.08296 (scaling agent systems), 2507.05178 (CREW)
  • Safety: 2402.05044 (SALAD-Bench ACL 2024), 2412.14470 (Agent-SafetyBench), 2412.13178 (SafeAgentBench), 2410.09024 (AgentHarm ICLR 2025)
  • Recent 2025-2026: 2510.02271 (InfoMosaic-Bench), 2510.02190 (Deep Research), 2510.01670 (BLIND-ACT), 2512.12791 (assessment framework), TEAM-PHI (de-identification), Behavioral Fingerprinting (LLM profiles), Strategic Reasoning (digital twin)
  • Observability/Production: 2601.00481 (MAESTRO), 2602.10133 (AgentTrace), 2512.04123 (measuring agents), 2601.19583 (architecture-aware metrics), 2512.18311 (monitorability)
  • General agent eval: 2602.22953 (Exgentic, Open General Agent Leaderboard, Unified Protocol)
  • Surveys: 2503.16416 (evaluation survey), 2507.21504 (LLM agents survey), 2411.13768 (evaluation-driven), 2501.11067 (IntellAgent)

Architecture & System Design [ARCH]:

  • Foundation: 2308.11432 (foundational survey), 2404.11584 (architecture landscape), 2510.09244 (fundamentals)
  • Frameworks: 2508.10146 (agentic AI frameworks), 2501.10114 (infrastructure), 2601.01743 (AI agent systems), 2602.10479 (goal-directed systems)
  • Surveys: 2510.25445 (comprehensive survey), 2503.23037 (agentic LLMs), 2506.01438 (architectural frameworks)
  • Governance: 2508.03858 (governance protocol), 2503.00237 (systems theory)

Safety & Risk Management [SAFETY]:

  • Constitutional AI: 2212.08073 (foundational), 2406.07814 (collective), 2501.17112 (inverse)
  • Core frameworks: 2302.10329 (harms analysis), 2506.04133 (TRiSM), 2408.02205 (guardrails), 2507.06134 (OpenAgentSafety), MITRE ATLAS, OWASP MAESTRO
  • Standards: NIST AI RMF 1.0, ISO/IEC 42001 (AI management system), ISO/IEC 23894 (AI risk management)
  • Security: 2510.23883 (agentic AI security), 2512.06659 (cybersecurity evolution), BadScientist (AI publishing vulnerabilities)
  • Safety benchmarks: 2402.05044 (SALAD-Bench ACL 2024), 2412.14470 (Agent-SafetyBench), 2412.13178 (SafeAgentBench), 2410.09024 (AgentHarm ICLR 2025)
  • Monitoring: 2507.11473 (CoT monitorability), 2512.18311 (monitoring monitorability), 2512.20798 (constraint violations), 2601.00911 (privacy-preserving)
  • Reports: 2510.13653 (AI safety first update), 2511.19863 (AI safety second update)
  • Recent 2025: 2510.02286 (adversarial dialogue), 2510.01586 (AdvEvo-MARL), 2510.01569 (InvThink), 2510.02204 (reasoning-execution gaps)
  • Multi-agent: 2503.13657 (MAS failures), 2402.04247 (safeguarding over autonomy), Hierarchical Delegated Oversight (scalable alignment)
  • Self-correction: Architectural Immune System (materials discovery)

Tool Use & Integration [TOOL]:

  • Benchmarks: 2307.16789 (ToolLLM), 2310.03128 (MetaTool), 2406.12045 (τ-bench), 2304.08244 (API-Bank EMNLP 2023), BFCL
  • Surveys: 2405.17935 (tool learning), 2404.11584 (tool calling architectures)
  • Augmentation: 2506.04625 (Tool-MVR meta-verification), 2511.18194 (agent-as-graph), 2512.16214 (PDE-Agent)
  • MCP applications: 2512.03955 (Blocksworld MCP), 2510.02139 (BioinfoMCP), 2509.06917 (Paper2Agent)
  • Recent 2025: 2510.01524 (WALT web agents), 2510.01179 (TOUCAN datasets), 2510.02271 (InfoMosaic-Bench), 2512.03420 (HarnessAgent)
  • Applications: 2410.22457 (tool integration), 2410.09713 (agentic IR)

Multi-Agent Systems [MAS]:

  • Collaboration: 2507.05178 (CREW benchmark), 2501.06322 (collaboration mechanisms), 2512.20845 (MAR reflexion)
  • Benchmarks: 2503.01935 (MultiAgentBench), 2512.08296 (scaling agent systems), 2505.12371 (MedAgentBoard), Job Marketplaces (OpenReview)
  • Analysis: 2503.13657 (failure analysis), 2505.21298 (LLMs miss the mark), 2511.02303 (lazy to deliberation)
  • Applications: 2507.02097 (recommender systems), 2512.20618 (LongVideoAgent), 2512.16214 (PDE-Agent), Echo (pharmacovigilance), Drug Discovery (Alzheimer’s), PsySpace (space missions), Evolutionary Boids (agent societies)
  • Oversight: Hierarchical Delegated Oversight (scalable alignment)
  • Observability: 2602.10133 (AgentTrace), 2601.00481 (MAESTRO)
  • Recent 2026: 2601.03328 (design patterns evaluation), 2602.10479 (goal-directed systems)

Planning & Reasoning [PLAN]:

  • ReAct family: 2210.03629 (ReAct), 2411.00927 (ReSpAct), 2310.04406 (LATS)
  • Core: 2402.02716 (planning survey), 2508.03682 (self-questioning), 2512.14474 (model-first reasoning)
  • Training: 2508.00344 (PilotRL global planning), 2510.01833 (plan-then-action), 2511.02303 (lazy to deliberation)
  • Multi-agent: 2512.20845 (MAR), 2512.08296 (scaling agent systems)
  • Applications: 2410.22457 (task decomposition), 2404.11584 (reasoning architectures), 2512.03955 (Blocksworld MCP)

Scientific Discovery [SCI]:

  • Research agents: 2506.18096 (deep research), 2508.00414 (cognitive kernel), 2509.06917 (Paper2Agent)
  • Discovery: 2408.06292 (AI scientist), 2503.08979 (scientific discovery survey), Beyond Adam (symbolic optimization), Architectural Immune System (self-correcting)
  • Domain applications: AlphaGenome (genomics), Drug Discovery (multi-target Alzheimer’s)

Code Generation [CODE]:

  • Surveys: 2508.00083 (comprehensive survey), 2508.11126 (agentic programming), 2511.18538 (code foundation models)
  • SE 3.0: 2507.15003 (AI teammates), 2510.21413 (context engineering), 2512.14012 (professional developers)
  • Automation: 2505.18646 (SEW self-evolving), 2504.17192 (Paper2Code), 2510.09721 (software engineering benchmarks)
  • Explanations: 2507.22414 (symbolic explanations), 2402.01030 (executable actions)
  • Recent 2025: 2510.02185 (FalseCrashReducer), 2510.01379 (multi-LLM orchestration), 2510.01003 (repository memory), 2512.03420 (HarnessAgent)
  • Applications: 2506.13131 (AlphaEvolve), 2410.14393 (debug agents)

Memory Systems [MEM]:

  • Surveys: 2512.13564 (memory in AI agents), 2512.23343 (AI meets brain), 2404.13501 (memory mechanisms)
  • Frameworks: 2601.03236 (MAGMA multi-graph), 2601.01885 (agentic memory), 2602.20478 (Codified Context), 2502.12110 (A-Mem), 2501.13956 (Zep temporal KG)
  • Learning: 2512.18950 (MACLA hierarchical procedural), 2511.18423 (GAM deep research), 2509.25250 (long-running agents)
  • Applications: 2510.01003 (repository memory), 2508.11120 (marketing MAS), 2510.11290 (AI-Agent School dual memory)
  • Production platforms: Cognee (knowledge graph engine, $7.5M seed Feb 2026), Mem0 ($24M, graph memory), LangMem (LangGraph-native)

Self-Improvement & Reflection [AUTO]:

  • Self-reflection: 2303.11366 (Reflexion foundation), 2405.06682 (self-reflection effects), 2512.20845 (MAR)
  • Recursive improvement: 2407.18219 (recursive introspection), 2410.04444 (Gödel Agent)
  • Training approaches: 2406.01495 (Re-ReST), 2508.15805 (ALAS autonomous learning), 2508.00344 (PilotRL)
  • Workflows: 2505.18646 (SEW self-evolving), 2505.22967 (MermaidFlow), 2506.04625 (Tool-MVR)
  • Human guidance: 2507.17131 (HITL self-improvement), 2508.07407 (self-evolving survey)

Future Research Areas

The following areas represent emerging or under-explored topics in agentic AI research that warrant additional investigation:

Advanced Multi-Modal Agents - Integration of vision, audio, and text processing for comprehensive environmental understanding beyond current multi-modal benchmarks.

Long-Term Memory & Retrieval - Advanced memory architectures for persistent knowledge retention and contextual recall across extended agent interactions.

Human-AI Collaboration - Frameworks for seamless human-agent teamwork, including explanation mechanisms, trust calibration, and collaborative decision-making.

Adversarial Robustness - Agent resilience against adversarial attacks, prompt injection, and manipulation attempts in production environments.

Automated Code Generation Agents - Next-generation coding assistants with advanced debugging, testing, and architectural design capabilities.

Edge & Resource-Constrained Deployment - Efficient agent architectures for mobile devices, IoT systems, and bandwidth-limited environments.

Governance & Policy Implementation - Practical frameworks for regulatory compliance, audit trails, and policy enforcement in agent systems.

Long-Term Autonomy & Reliability - Systems capable of sustained autonomous operation with minimal human intervention over extended periods.

Domain Transfer & Generalization - Techniques for rapid agent adaptation across different domains with minimal retraining or fine-tuning.

Priority Research Focus

Based on current gaps and transformative potential, three areas warrant immediate attention:

1. Compositional Self-Improvement - Moving beyond single-agent reflection to systems that can redesign their own architectures, create specialized sub-agents, and evolve coordination protocols. This represents the next leap from current self-reflection work toward truly recursive intelligence.

2. Persistent Contextual Memory - Current agents lack genuine episodic memory across sessions. Developing memory systems that maintain context, relationships, and learned preferences over months or years is critical for practical deployment and user trust.

3. Robust Human-Agent Teaming - Most current work treats humans as either supervisors or users. Research on agents as true collaborators—with theory-of-mind, explanation capabilities, and dynamic role adaptation—is essential for high-stakes domains like healthcare, research, and decision-making.

2026-02

  • [2602.22953] General Agent Evaluation, exgentic.ai [EVAL] [BENCH] cs.AI
  • IBM Research framework proposing a Unified Protocol for fair, reproducible general agent evaluation without domain-specific tuning; introduces first Open General Agent Leaderboard across 5 agent implementations × 6 environments (AppWorld, BrowseComp+, SWEbenchV, τ²); top: OpenAI MCP + Claude Opus 4.5 = 0.73 avg success
  • Cost-performance Pareto analysis (avg USD per task) enables framework selection on efficiency frontier
  • Cross-ref: 2602.10133 (AgentTrace), 2601.00481 (MAESTRO), 2503.16416 (evaluation survey)
  • [2602.10479] From Prompt-Response to Goal-Directed Systems: The Evolution of Agentic AI Software Architecture [ARCH] [MAS] [SURVEY] cs.SE cs.AI
  • Reference architecture for production-grade LLM agents, taxonomy of multi-agent topologies with failure modes, enterprise hardening checklist covering governance, observability, and reproducibility
  • Cross-ref: 2601.01743 (agent system architectures), 2601.03328 (MAS design patterns), 2508.10146 (agentic frameworks)
  • [2602.10133] AgentTrace: A Structured Logging Framework for Agent System Observability [COMP] [EVAL] [MAS] cs.AI cs.SE
  • First open standard for structured agent logging via schema-based protocol spanning cognitive, operational, and contextual traces; enables fine-grained debugging, failure attribution, and transparent governance
  • Cross-ref: 2601.00481 (MAESTRO evaluation suite), 2512.04123 (measuring agents in production), 2508.02121 (AgentOps survey)
  • [2601.19583] Toward Architecture-Aware Evaluation Metrics for LLM Agents [EVAL] [ARCH] cs.SE cs.AI
  • Links agent architectural components (planners, memory, tool routers) to observable behaviors and appropriate evaluation metrics; enables targeted and actionable evaluation
  • Cross-ref: 2512.12791 (assessment framework), 2503.16416 (evaluation survey), 2507.21504 (LLM agents survey)
  • [2601.00481] MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability [EVAL] [MAS] [BENCH] [COMP] cs.MA cs.AI
  • Standardizes MAS configuration and exports framework-agnostic execution traces with system-level signals (latency, cost, failures); 12 representative MAS across popular frameworks show architecture is the dominant driver of resource profiles and cost-latency-accuracy trade-offs
  • Cross-ref: 2602.10133 (AgentTrace), 2512.04123 (measuring agents), 2508.02121 (AgentOps survey)
  • [2512.18311] Monitoring Monitorability [SAFETY] [EVAL] [COMP] cs.AI cs.LG
  • Proposes monitorability metric and evaluation archetypes (intervention, process, outcome-property) for chain-of-thought monitoring; finds longer CoTs are more monitorable and smaller models at higher reasoning effort can yield higher monitorability
  • Cross-ref: 2512.12791 (assessment framework), 2601.01743 (agent architectures survey)
  • [2512.04123] Measuring Agents in Production [EVAL] [COMP] cs.SE cs.AI
  • Interview-based study (306 survey responses, 20 in-depth interviews across 26 domains) arguing agent evaluation must move beyond correctness metrics to assess reliability under varying autonomy levels
  • Cross-ref: 2512.12791 (assessment framework), 2601.00481 (MAESTRO), 2503.16416 (evaluation survey)
  • [2602.20478] Codified Context: Infrastructure for AI Agents in a Complex Codebase [MEM] [ARCH] cs.SE cs.AI
  • Three-tier context architecture (hot-memory constitution + 19 specialist agents + 34-doc cold-memory knowledge base) validated across 283 sessions on 108K LOC C# distributed system; 24.2% knowledge-to-code ratio; MCP retrieval service for on-demand spec loading; context drift detector
  • Cross-ref: 2601.19583 (architecture-aware metrics), 2602.10479 (agentic architecture)
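The cost-performance Pareto analysis noted for the general-agent leaderboard above can be sketched as a simple non-domination filter over (cost, success) points. The configuration names and numbers below are illustrative, not figures from any paper.

```python
# Sketch of a cost-performance efficiency frontier for framework selection:
# keep only configurations not dominated by another configuration that is
# both no costlier and no less successful (and strictly better on one axis).

def pareto_frontier(points):
    """points: list of (name, avg_cost_usd, avg_success).
    Returns the non-dominated set, sorted by cost ascending."""
    frontier = []
    for name, cost, success in points:
        dominated = any(
            (c <= cost and s >= success) and (c < cost or s > success)
            for _, c, s in points
        )
        if not dominated:
            frontier.append((name, cost, success))
    return sorted(frontier, key=lambda p: p[1])

# Illustrative data only.
configs = [
    ("big-harness", 0.40, 0.73),   # highest success
    ("small-open", 0.15, 0.55),    # cheapest
    ("mid-agent", 0.45, 0.60),     # dominated: costs more, succeeds less
]
frontier = pareto_frontier(configs)
```

Here `mid-agent` is dropped because `big-harness` beats it on both axes; the frontier is the cheapest-to-best chain a team would actually choose from.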

2026-01

2025-12

2025-11

2025-10

2025-09

2025-08

2025-07

2025-06

2025-05

2025-04

2025-03

2025-02

2025-01

2024-12

2024-11

2024-10

2024-09

2024-08

2024-07

2024-06

2024-05

2024-04

2024-02

2024-01

2023-12

2023-11

2023-10

2023-08

2023-07

2023-04

  • [2304.08244] API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs [BENCH] [TOOL] [EVAL] cs.CL cs.AI
  • Benchmark with 73 API tools, 314 tool-use dialogues, and 753 API calls for evaluating planning, retrieval, and calling
  • Published at EMNLP 2023; includes 1,888 training dialogues from 2,138 APIs across 1,000 domains
  • Cross-ref: 2307.16789 (ToolLLM), 2310.03128 (MetaTool), 2405.17935 (tool learning survey)
  • [2304.05376] ChemCrow: LLM Chemistry Agent with Expert-Designed Tools [SCI] [TOOL]
  • LLM chemistry agent augmented with 18 expert-designed tools for organic synthesis, drug discovery, and materials design
  • Autonomous synthesis planning and execution with emergent capabilities from tool combination
  • Cross-ref: 2310.10632 (BioPlanner), 2501.06590 (ChemAgent), 2505.18705 (AI-Researcher)

2023-03

2023-02

2022-12

2022-10

2022-07

2020-10

Practitioner Resources

Industry blog posts and engineering articles providing implementation insights and production patterns.

  • Effective Harnesses for Long-Running Agents - Anthropic Engineering (2025)
  • Two-agent harness pattern: Initializer + Coding agents for context window management
  • Key patterns: JSON feature lists, git-based state tracking, incremental development
  • Failure modes: Premature completion, undocumented progress, testing gaps, setup confusion
  • Cross-ref: 2512.13564 (memory systems), 2509.25250 (long-running agents), 2510.01003 (repository memory)
  • Inspect AI - UK AI Safety Institute (2025)
  • 100+ pre-built evaluations, three-component model (datasets, solvers, scorers)
  • Direct PydanticAI support, MCP integration, multi-agent compositions
  • Cross-ref: 2507.21504 (evaluation taxonomy), 2503.16416 (evaluation survey)
  • Bloom - Anthropic (2025)
  • Four-stage behavioral evaluation: Understanding → Ideation → Rollout → Judgment
  • Elicitation rate metric (≥7/10 threshold), meta-judge for suite-level analysis
  • Cross-ref: 2507.06134 (OpenAgentSafety), 2412.14470 (Agent-SafetyBench)
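Bloom's elicitation-rate metric above reduces to a thresholded fraction of judged rollouts. A minimal sketch, with made-up judge scores:

```python
# Elicitation-rate style metric: the fraction of rollouts whose judge score
# meets a threshold (Bloom's threshold is >= 7 on a 1-10 scale).

def elicitation_rate(judge_scores, threshold=7):
    """judge_scores: iterable of 1-10 judge ratings for one behavior's
    rollouts. Returns the fraction at or above the threshold."""
    scores = list(judge_scores)
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)

# Illustrative scores: 3 of 5 rollouts elicit the behavior.
rate = elicitation_rate([9, 7, 3, 6, 8])
```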
  • Petri - Anthropic (2025)
  • Auditor/Target/Judge architecture for alignment auditing, built on Inspect AI
  • Multi-turn audits, transcript scoring (deception, oversight subversion, harmful content)
  • Cross-ref: 2410.09024 (AgentHarm), 2402.05044 (SALAD-Bench)
  • DeepEval AI Agent Evaluation Guide - Confident AI (2025)
  • Three-layer evaluation model: Reasoning (plan quality/adherence), Action (tool/argument correctness), Execution (task completion/efficiency)
  • Component-level metric attachment via @observe() decorator pattern
  • GEval framework for custom LLM-as-Judge criteria using plain English definitions
  • Cross-ref: 2503.16416 (evaluation survey), 2507.21504 (LLM agents survey)
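The component-level metric attachment described above can be illustrated with a plain-Python decorator in the spirit of DeepEval's `@observe()` pattern. This is not the DeepEval API: `observe`, `METRICS`, and the metric function here are hypothetical stand-ins.

```python
# Plain-Python sketch of attaching a metric to one component of an agent
# pipeline via a decorator, so each layer (reasoning, action, execution)
# can be scored independently. All names are invented for illustration.
import functools

METRICS = []  # collected (component, metric_name, score) records

def observe(metric):
    """Attach a scoring function to a single pipeline component."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            out = fn(*args, **kwargs)
            METRICS.append((fn.__name__, metric.__name__, metric(out)))
            return out
        return wrapper
    return decorator

def plan_length_ok(plan):
    """Toy reasoning-layer metric: plans over 5 steps score 0."""
    return 1.0 if len(plan) <= 5 else 0.0

@observe(plan_length_ok)
def make_plan(task):
    return [f"step {i} of {task}" for i in range(3)]

make_plan("demo")  # records a score for the planning component
```

The real framework layers richer metrics (plan adherence, tool-argument correctness) on the same idea: scores are captured where the component runs, not only at the final output.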
  • Pydantic Evals - Pydantic (2025)
  • Span-based evaluation using OpenTelemetry for internal agent behavior analysis
  • Loosely coupled framework evaluating any callable (not dependent on pydantic-ai)
  • Flexible scoring (0.0-1.0 float) with Logfire integration for web-based visualization
  • Philosophy: “Correctness depends on how the answer was reached, not just the final output”
  • Cross-ref: 2411.05285 (AgentOps observability taxonomy), 2503.16416 (evaluation survey)
  • Arize Phoenix Multi-Agent Evaluation - Arize (2025)
  • Three evaluation strategies: Agent Handoff, System-Level, Coordination
  • Multi-level metrics: Agent, Interaction, System, User performance measurement
  • Five coordination patterns: Network, Supervisor, Hierarchical, Tool-calling, Custom Workflow
  • Handoff evaluation: Appropriateness, information transfer, timing
  • Cross-ref: 2501.06322 (collaboration mechanisms), 2503.13657 (MAS failures), 2512.08296 (scaling agent systems)
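The handoff-evaluation criteria above (appropriateness, information transfer) can be sketched as rule checks over a handoff record; the field names and skill registry below are invented for illustration, not Phoenix's schema.

```python
# Toy handoff evaluation: did the handoff go to an agent whose declared
# skills cover the task (appropriateness), and did it carry any context
# (information transfer)? Real systems add timing and richer checks.

AGENT_SKILLS = {"researcher": {"search"}, "coder": {"code", "test"}}

def evaluate_handoff(handoff):
    """handoff: dict with 'to', 'task_skill', and 'context' keys.
    Returns per-criterion booleans."""
    appropriate = handoff["task_skill"] in AGENT_SKILLS.get(handoff["to"], set())
    info_transferred = bool(handoff["context"].strip())
    return {"appropriate": appropriate, "info_transferred": info_transferred}
```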
  • Claude Evaluation Framework - Anthropic (2025)
  • SMART success criteria (Specific, Measurable, Achievable, Relevant); grading hierarchy: Code-based (fastest) → LLM-based (nuanced) → Human (flexible)
  • Best practice: Volume over quality; encourage reasoning before scoring
  • Bloom correlation: Claude Opus 4.1 (0.86), Sonnet 4.5 (0.75) with human scores
  • Cross-ref: Bloom (alignment.anthropic.com), 2503.16416 (evaluation survey)
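The code-based tier of the grading hierarchy above (the fastest, cheapest rung) can be as simple as exact-match and containment checks; a minimal sketch with illustrative cases:

```python
# Code-based grader: deterministic pass/fail checks that run before any
# LLM-based or human grading is needed. Cases below are made up.

def code_grade(output, expected, mode="exact"):
    """Return True if the model output passes the check."""
    if mode == "exact":
        return output.strip() == expected.strip()
    if mode == "contains":
        return expected.lower() in output.lower()
    raise ValueError(f"unknown mode: {mode}")

cases = [
    ("Paris", "Paris", "exact"),
    ("The capital is Paris.", "paris", "contains"),
]
results = [code_grade(o, e, m) for o, e, m in cases]
```

Anything these checks cannot express (tone, reasoning quality) escalates to the LLM-based or human tiers.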
  • Pydantic Logfire - Pydantic (2025-2026)
  • First-party OpenTelemetry-based observability for PydanticAI agents via logfire.instrument_pydantic_ai()
  • Three instrumentation paths: Logfire cloud, raw OpenTelemetry with custom TracerProvider, or hybrid routing to alternative backends
  • Multi-language SDKs (Python, TypeScript, Rust); follows OpenTelemetry GenAI Semantic Conventions
  • Cross-ref: Pydantic Evals (above), 2602.10133 (AgentTrace), 2601.00481 (MAESTRO)
  • How to Build a Production Agentic App, the Pydantic Way - Pydantic (2026)
  • End-to-end guide combining Pydantic AI (agents), Logfire (observability), Pydantic Evals (evaluation), and FastAPI (serving)
  • Demonstrates full agentic stack: agent → instrument → evaluate → deploy pattern
  • Cross-ref: Pydantic Evals (above), Pydantic Logfire (above)
  • OpenTelemetry AI Agent Observability Blog - OpenTelemetry (2025)
  • Establishes need for standardized agent observability; covers OpenTelemetry GenAI semantic conventions for agent tracing
  • Cross-ref: 2508.02121 (AgentOps survey), 2602.10133 (AgentTrace)
  • OTel GenAI Agentic Systems Semantic Conventions Proposal - OpenTelemetry (2025)
  • Defines attributes for tracing tasks, actions, agents, teams, artifacts, and memory in OpenTelemetry
  • Standardizes telemetry across complex AI workflows for traceability, reproducibility, and analysis
  • Cross-ref: 2601.00481 (MAESTRO), 2602.10133 (AgentTrace)
  • otel-tui - ymtdzzz (2025)
  • Terminal-based OpenTelemetry trace viewer; single binary accepting OTLP on ports 4317/4318
  • Zero-infrastructure local debugging; referenced in PydanticAI docs as alternative local backend
  • Cross-ref: Pydantic Logfire (above), Arize Phoenix (trace_observe_methods.md)
  • MITRE ATLAS - MITRE (2021-2026)
  • Adversarial Threat Landscape for Artificial-Intelligence Systems; ATT&CK-style framework for AI/ML threats
  • 2026 updates add agentic AI attack surfaces: runtime decision manipulation, credential abuse, tool misuse, AI Service API (AML.T0096)
  • Cross-ref: 2510.23883 (agentic AI security), 2506.04133 (TRiSM), OWASP MAESTRO (below)
  • OWASP MAESTRO Framework - OWASP GenAI Security Project (2025)
  • Multi-Agent Environment, Security, Threat, Risk, and Outcome; 7-layer threat modeling for multi-agent systems
  • Applies OWASP ASI threat taxonomy to MAS: Tool Misuse, Intent Manipulation, Privilege Compromise; companion to MITRE ATLAS
  • Cross-ref: MITRE ATLAS (above), 2503.13657 (MAS failures), 2601.00911 (privacy-preserving agents)
  • NIST AI Risk Management Framework (AI RMF 1.0) - NIST (2023)
  • Four core functions: Govern, Map, Measure, Manage for trustworthy AI lifecycle risk management
  • Flexible, voluntary framework; official crosswalk to ISO/IEC 42001 available from NIST
  • Cross-ref: ISO 42001 (below), ISO 23894 (below), 2506.04133 (TRiSM)
  • ISO/IEC 42001:2023 - ISO/IEC (2023)
  • World’s first AI management system standard; requirements for establishing, implementing, and maintaining an AIMS
  • Covers ethical considerations, transparency, continuous learning, auditability, and data handling
  • Cross-ref: NIST AI RMF (above), ISO 23894 (below)
  • ISO/IEC 23894:2023 - ISO/IEC (2023)
  • AI risk management guidance; provides principles and processes for managing risk specific to AI systems
  • Complements ISO 42001 (management system) with focused risk assessment and treatment methodology
  • Cross-ref: ISO 42001 (above), NIST AI RMF (above)
  • See docs/analysis/ai-security-governance-frameworks.md for detailed comparative analysis of all four frameworks applied to Agents-eval