<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://qte77.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://qte77.github.io/" rel="alternate" type="text/html" /><updated>2026-01-16T15:28:12+00:00</updated><id>https://qte77.github.io/feed.xml</id><title type="html">Recap on ML</title><subtitle>Recap on ML</subtitle><author><name>qte77</name></author><entry><title type="html">Agentx Agentbeats Writeup</title><link href="https://qte77.github.io/agentx-agentbeats-writeup/" rel="alternate" type="text/html" title="Agentx Agentbeats Writeup" /><published>2026-01-15T00:00:00+00:00</published><updated>2026-01-15T00:00:00+00:00</updated><id>https://qte77.github.io/agentx-agentbeats-writeup</id><content type="html" xml:base="https://qte77.github.io/agentx-agentbeats-writeup/"><![CDATA[<h1 id="graphjudge-measuring-how-agents-collaborate">GraphJudge: Measuring How Agents Collaborate</h1>

<blockquote>
  <p>Measure how, not just whether</p>
</blockquote>

<h2 id="about-agentbeats--agentic-ai-learning">About AgentBeats &amp; Agentic AI Learning</h2>

<p>GraphJudge is built for the
<a href="https://rdi.berkeley.edu/agentx-agentbeats">AgentBeats competition</a>, part of
the RDI Foundation’s initiative to advance agent evaluation infrastructure.
AgentBeats establishes a standardized framework (A2A protocol) for
benchmarking AI agents through competitive and collaborative tasks.</p>

<p>This competition runs alongside the
<a href="https://agenticai-learning.org">Agentic AI Learning MOOC</a>—a comprehensive
course teaching agent system design, evaluation, and deployment. The course
materials at <a href="https://docs.agentbeats.org">docs.agentbeats.org</a> and
<a href="https://docs.agentbeats.dev/tutorial/">docs.agentbeats.dev/tutorial</a> provide
hands-on experience building green (assessor) and purple (evaluated) agents
using the A2A protocol.</p>

<p>GraphJudge contributes to this ecosystem by introducing graph-based
coordination assessment—a novel evaluation methodology that complements
existing task-completion benchmarks with structural analysis of agent
interactions.</p>

<h2 id="the-problem-success-isnt-the-whole-story">The Problem: Success Isn’t the Whole Story</h2>

<p>When you evaluate multi-agent systems today, you typically ask: “Did they
complete the task?” But here’s what that misses—two agents might both succeed
at a task, yet one does it through elegant coordination while the other
stumbles through with redundant communication and bottlenecks. Traditional
benchmarks can’t tell the difference.</p>

<p>Think of it like evaluating team projects in school. Getting an A on the final
deliverable doesn’t tell you whether the team collaborated effectively or if
one person did all the work while others copied notes at the last minute. We
need to measure <strong>how</strong> agents work together, not just whether they succeed.</p>

<h2 id="our-approach-graph-based-coordination-analysis">Our Approach: Graph-Based Coordination Analysis</h2>

<p>GraphJudge is a graph-centric evaluation framework built for the AgentBeats
competition that measures <strong>coordination network complexity against execution
outcomes</strong>. We capture interaction traces as agents communicate, then
transform these traces into directed graphs where nodes represent agents and
edges represent their communications. This isn’t just bookkeeping—it reveals
the structure of collaboration and whether agents achieve results through
efficient coordination or convoluted communication patterns.</p>

<p>We extract <strong>structural metrics</strong> that quantify what’s actually happening:</p>

<ul>
  <li><strong>Centrality</strong>: Which agents are coordination hubs vs peripheral
participants?</li>
  <li><strong>Density</strong>: How connected is the communication network?</li>
  <li><strong>Efficiency</strong>: Are agents taking direct paths or bouncing messages around?</li>
</ul>

<p>These NetworkX-based graph metrics form our primary evaluation tier (Tier 1),
complemented by latency analysis that tracks performance bottlenecks. Together,
they provide quantitative measures of coordination quality that you can compare
across different agent systems.</p>
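<p>As a minimal sketch of this Tier 1 computation, the three metrics above can be read straight off a NetworkX digraph. The agent names and trace below are hypothetical, purely for illustration:</p>

```python
import networkx as nx

# Hypothetical interaction trace: (sender, receiver) message pairs
trace = [("planner", "coder"), ("planner", "critic"),
         ("coder", "critic"), ("critic", "planner")]

G = nx.DiGraph()
G.add_edges_from(trace)

centrality = nx.degree_centrality(G)                  # hubs vs. peripheral agents
density = nx.density(G)                               # how connected the network is
efficiency = nx.global_efficiency(G.to_undirected())  # directness of communication paths

print(centrality, density, efficiency)
```

<p>Here the planner and critic emerge as the busiest nodes, while density and efficiency summarize the network as a whole; <code>global_efficiency</code> is defined for undirected graphs, hence the conversion.</p>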

<h3 id="beyond-pure-numbers-the-llm-as-judge-layer">Beyond Pure Numbers: The LLM-as-Judge Layer</h3>

<p>Graphs tell you the structure, but what about the quality of interactions?
That’s where our <strong>Tier 2 LLM-as-judge</strong> comes in. We use real LLM API calls
(with rule-based fallback) to provide qualitative assessment of coordination
patterns—did agents adapt their strategies? Did they share information
effectively? This semantic layer complements the quantitative graph metrics
with behavioral insights.</p>
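<p>A minimal sketch of the judge-with-fallback pattern, where <code>call_llm</code> is a hypothetical stand-in for a real LLM client and the keyword heuristic is purely illustrative, not GraphJudge's actual rule set:</p>

```python
def judge_coordination(transcript, call_llm=None):
    """Score coordination quality 0-10, falling back to rules if the LLM fails."""
    prompt = f"Rate coordination quality 0-10:\n{transcript}"
    if call_llm is not None:
        try:
            return float(call_llm(prompt))
        except Exception:
            pass  # LLM unavailable or returned garbage: fall through to rules
    # Rule-based fallback: reward information-sharing keywords (illustrative only)
    hits = sum(kw in transcript.lower() for kw in ("plan", "share", "confirm"))
    return min(10.0, hits * 3.0)

print(judge_coordination("Agent A: I will share the plan."))  # no LLM -> fallback
```

<p>The point of the pattern is that evaluation never hard-fails on API errors; the fallback keeps scores comparable, if coarser.</p>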

<p>For consistency validation, <strong>Tier 3 text metrics</strong> measure response
similarity across multiple runs, ensuring reproducibility in evaluation.</p>
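<p>One way to sketch the Tier 3 consistency check using only the standard library; the specific similarity metric here is an assumption, not necessarily the one GraphJudge ships:</p>

```python
from difflib import SequenceMatcher
from itertools import combinations

def consistency_score(responses):
    """Mean pairwise similarity (0..1) across repeated evaluation runs."""
    pairs = list(combinations(responses, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

runs = ["route via planner"] * 3
print(consistency_score(runs))  # identical runs score 1.0
```

<p>A score near 1.0 across runs indicates the evaluation is reproducible; a drifting score flags nondeterminism worth investigating.</p>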

<h2 id="origins-building-on-agents-eval">Origins: Building on Agents-eval</h2>

<p>GraphJudge is derived from
<a href="https://github.com/qte77/Agents-eval">Agents-eval</a>, a PeerRead-based
benchmark for autonomous research agent systems. We adapted its evaluation
philosophy—measuring the quality of agent behavior, not just outcomes—to the
AgentBeats context. Where Agents-eval focuses on research paper assessment
using text similarity metrics, GraphJudge pivots to graph structural analysis
for general multi-agent coordination.</p>

<p>This isn’t a fork—it’s an architectural adaptation. We took the core insight
that agent evaluation needs multiple complementary metrics and specialized it
for coordination assessment through graph theory.</p>

<h2 id="implementation-a2a-compliant-and-production-ready">Implementation: A2A-Compliant and Production-Ready</h2>

<p>GraphJudge operates as an A2A-compliant assessor, exposing standard endpoints
that any purple agent can interact with. The evaluation flow is
straightforward:</p>

<ol>
  <li>Purple agent submits evaluation request via A2A protocol</li>
  <li>GraphJudge captures interaction traces during task execution</li>
  <li>Traces → Directed graph → Structural metrics extraction</li>
  <li>Three-tier evaluation produces comprehensive coordination scores</li>
  <li>Results returned as structured A2A artifacts</li>
</ol>

<p>The complete agentic graph benchmark architecture is visualized below, showing
the full evaluation pipeline from trace capture through multi-tier scoring:</p>

<p><img src="images/RDI-AgentX-Architecture-light.png" alt="Agentic Graph Benchmark Architecture" /></p>

<p>We validated the framework on a baseline purple agent across 5 independent
runs, achieving <strong>perfect reproducibility</strong> (0% variance across all metrics).
This isn’t just about proving correctness—it demonstrates that our evaluation
is stable and fair for comparing different agent implementations.</p>

<p>Deployment is containerized via Docker, with results integrating directly into
the <a href="https://github.com/qte77/RDI-AgentX-AgentBeats-Competition-Leaderboard">AgentBeats
leaderboard</a>
for transparent comparison. The agent is registered at
<a href="https://agentbeats.dev/qte77/graphjudge">agentbeats.dev/qte77/graphjudge</a>.</p>

<h2 id="why-this-matters">Why This Matters</h2>

<p>No existing AgentBeats benchmark quantifies coordination quality through graph
structural analysis. GraphJudge fills that gap by providing researchers with
actionable insights into <strong>how effectively agents collaborate</strong>.</p>

<p>You don’t just get a pass/fail grade—you get metrics that reveal:</p>

<ul>
  <li>Communication bottlenecks in your agent network</li>
  <li>Centralization vs distributed coordination patterns</li>
  <li>Performance characteristics under different workloads</li>
  <li>Behavioral adaptability through qualitative assessment</li>
</ul>

<p>This enables evidence-based improvements to multi-agent system design. You can
see exactly where coordination breaks down and iterate accordingly.</p>

<h2 id="development-insights--contributions">Development Insights &amp; Contributions</h2>

<h3 id="lessons-learned">Lessons Learned</h3>

<p><strong>Ralph Loop TDD</strong>: Enforcing TEST-first then IMPL proved challenging. The
Ralph loop naturally wants to implement before testing, requiring scaffolding
through linting rules, Claude Code skills (<code class="language-plaintext highlighter-rouge">.claude/skills/</code>), and core
principles (<code class="language-plaintext highlighter-rouge">.claude/rules/</code>) to maintain TDD discipline. Interestingly,
specialized subagents became less critical than initially expected:
well-structured skills and rules provide sufficient guidance for the main agent.</p>

<p><strong>AgentBeats Submission</strong>: The submission process is comprehensive—requiring
both green (assessor) and purple (evaluated) agents, a main agent repository
plus separate leaderboard repository, registration on agentbeats.dev, GitHub
workflow permissions configuration, container package tokens for GHCR
publishing, Docker image deployment, demo video creation, abstract writing,
MOOC article contribution, and finally a multi-page submission form with a tight
deadline. Each component serves a purpose (reproducibility, transparency,
education), though coordinating everything in time tests your project
management skills. The resulting infrastructure is well-designed for the agent
ecosystem’s long-term growth.</p>

<p><strong>Time Constraints</strong>: Competition deadlines unfortunately cut development time
short, limiting implementation of advanced features like interactive graph
visualizations, Phase 2 ART training on traces, and comprehensive plugin
ecosystem expansion. The current release prioritizes core graph-based
coordination assessment with proven reproducibility, establishing a foundation
for future enhancements. The agentic benchmark architecture visualization
(<code class="language-plaintext highlighter-rouge">assets/AgenticBenchArch.png</code>) documents the intended full system design.</p>

<h3 id="technical-contributions">Technical Contributions</h3>

<p>GraphJudge introduces three novel elements to AgentBeats:</p>

<ol>
  <li><strong>Custom trace engine</strong>: Captures interaction patterns during task execution,
transforming A2A message flows into directed graphs for structural analysis</li>
  <li><strong>Network complexity scoring</strong>: Combines graph metrics (where lower
complexity often indicates efficient coordination) with LLM-as-judge
qualitative assessment of MAS execution quality</li>
  <li><strong>Plugin architecture</strong>: Future-ready extensibility enabling domain-specific
evaluators—demonstrated through the text metrics module designed for
Agents-eval’s PeerRead dataset assessment</li>
</ol>

<p>This architecture balances quantitative structural analysis with qualitative
behavioral assessment, while remaining extensible for specialized evaluation
contexts.</p>

<h2 id="categories--contribution">Categories &amp; Contribution</h2>

<p><strong>Competition Categories</strong>: Multi-agent Evaluation, Research Agent</p>

<p><strong>Core Contribution</strong>: First AgentBeats benchmark measuring coordination
quality through graph structural analysis, enabling researchers to understand
not just if agents coordinate, but how effectively.</p>

<p>GraphJudge pioneers <strong>agentified benchmarking</strong> for multi-agent systems—using
automated evaluation agents to assess coordination quality. This approach is
demonstrated through integration with agents-eval, a research MAS that
evaluates autonomous agents on the PeerRead dataset. By combining graph-based
structural metrics with domain-specific evaluation plugins, GraphJudge
establishes a framework where assessment agents can be specialized for
different contexts while maintaining consistent coordination analysis.</p>

<h2 id="competition-compliance">Competition Compliance</h2>

<p>GraphJudge meets all official
<a href="https://rdi.berkeley.edu/agentx-agentbeats">AgentBeats competition requirements</a>:</p>

<ul>
  <li><strong>A2A Protocol</strong>: Universal agent interface with standard endpoints at
<code class="language-plaintext highlighter-rouge">/.well-known/agent.json</code></li>
  <li><strong>Docker Deployment</strong>: Containerized for <code class="language-plaintext highlighter-rouge">linux/amd64</code>, published to GHCR,
accepts CLI args (<code class="language-plaintext highlighter-rouge">--host</code>, <code class="language-plaintext highlighter-rouge">--port</code>, <code class="language-plaintext highlighter-rouge">--card-url</code>)</li>
  <li><strong>Reproducibility</strong>: Fresh state per assessment, task ID namespacing,
documented across 5 validation runs</li>
  <li><strong>Leaderboard Integration</strong>: DuckDB queries extract graph metrics,
coordination scores, and similarity measures from published results</li>
</ul>
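<p>The required CLI surface can be sketched with <code>argparse</code>; the agent-card fields below are illustrative placeholders, not the full A2A schema:</p>

```python
import argparse
import json

parser = argparse.ArgumentParser(description="GraphJudge assessor (sketch)")
parser.add_argument("--host", default="0.0.0.0")
parser.add_argument("--port", type=int, default=8000)
parser.add_argument("--card-url", dest="card_url", default=None)
args = parser.parse_args(["--port", "9001"])  # example invocation

# Minimal agent card, of the kind served from /.well-known/agent.json
agent_card = {
    "name": "graphjudge",
    "url": args.card_url or f"http://{args.host}:{args.port}",
}
print(json.dumps(agent_card))
```

<p>Accepting <code>--card-url</code> separately from <code>--host</code>/<code>--port</code> matters in containerized deployments, where the externally reachable URL differs from the bind address.</p>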

<p>Judging criteria alignment: Technical correctness (A2A-compliant, typed,
tested), reproducibility (0% variance documented), benchmark quality (graph
metrics reveal genuine coordination patterns), evaluation methodology
(three-tier quantitative + qualitative assessment), innovation (first
graph-based coordination benchmark in AgentBeats).</p>

<hr />

<p><strong>Agent Registry</strong>: <a href="https://agentbeats.dev/qte77/graphjudge">agentbeats.dev/qte77/graphjudge</a></p>

<p><strong>Repository</strong>:
<a href="https://github.com/qte77/RDI-AgentX-AgentBeats-Competition">github.com/qte77/RDI-AgentX-AgentBeats-Competition</a></p>

<p><strong>Leaderboard</strong>:
<a href="https://github.com/qte77/RDI-AgentX-AgentBeats-Competition-Leaderboard">github.com/qte77/RDI-AgentX-AgentBeats-Competition-Leaderboard</a></p>

<p><strong>References</strong>:
<a href="https://rdi.berkeley.edu/agentx-agentbeats">Competition Page</a> |
<a href="https://github.com/RDI-Foundation/agentbeats-tutorial">AgentBeats Tutorial</a> |
<a href="https://github.com/RDI-Foundation/green-agent-template">Green Agent Template</a> |
<a href="https://docs.agentbeats.dev/tutorial/">Documentation</a></p>]]></content><author><name>qte77</name></author><summary type="html"><![CDATA[GraphJudge: Measuring How Agents Collaborate]]></summary></entry><entry><title type="html">AI Agents-eval Comprehensive Analysis</title><link href="https://qte77.github.io/ai-agents-eval-comprehensive-analysis/" rel="alternate" type="text/html" title="AI Agents-eval Comprehensive Analysis" /><published>2025-08-09T00:00:00+00:00</published><updated>2025-08-09T00:00:00+00:00</updated><id>https://qte77.github.io/ai-agents-eval-comprehensive-analysis</id><content type="html" xml:base="https://qte77.github.io/ai-agents-eval-comprehensive-analysis/"><![CDATA[<h1 id="comprehensive-analysis-individual-paper-summaries">Comprehensive Analysis: Individual Paper Summaries</h1>

<p>The following paper reviews are based on the papers listed in <a href="https://github.com/qte77/Agents-eval/blob/main/docs/papers/further_reading.md">Further Reading</a>.
Refer to the <a href="https://claude.ai/public/artifacts/7761a54c-f49b-486b-9e28-7aa2de8b3c86">Paper Visualization</a>, which was inspired by <a href="https://papescape.org">Paperscape</a>.
This summary aims to enhance the <a href="https://github.com/qte77/Agents-eval">Agents-eval</a> project and was generated with help from Claude Sonnet 4 🙏🏼🌟🙌🏼💕🤗</p>

<h2 id="2025-08">2025-08</h2>

<h3 id="250803858-mi9---agent-intelligence-protocol-runtime-governance-for-agentic-ai-systems">[2508.03858] MI9 - Agent Intelligence Protocol: Runtime Governance for Agentic AI Systems</h3>

<p><strong>Evaluation Approach</strong>: Focuses on runtime governance and monitoring of agentic systems. Establishes protocols for continuous intelligence assessment and behavioral compliance monitoring during agent execution.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - runtime governance and intelligence protocols</li>
  <li><strong>Relevance for Agents-eval</strong>: High - runtime monitoring protocols for continuous evaluation</li>
  <li><strong>Concrete Example</strong>: Implement MI9 protocol adapters to monitor agent decision-making patterns and compliance metrics in real-time</li>
</ul>

<h3 id="250803682-self-questioning-language-models">[2508.03682] Self-Questioning Language Models</h3>

<p><strong>Evaluation Approach</strong>: Develops self-assessment mechanisms where models generate and answer their own evaluation questions. Creates introspective evaluation loops for model uncertainty and capability assessment.</p>

<ul>
  <li><strong>Focus</strong>: LLM-based systems with self-evaluation capabilities</li>
  <li><strong>Relevance for Agents-eval</strong>: High - self-questioning mechanisms for automated evaluation</li>
  <li><strong>Concrete Example</strong>: Integrate self-questioning modules that generate domain-specific evaluation questions for agents to assess their own performance</li>
</ul>

<h3 id="250800414-cognitive-kernel-pro-a-framework-for-deep-research-agents-and-agent-foundation-models-training">[2508.00414] Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training</h3>

<p><strong>Evaluation Approach</strong>: Establishes evaluation metrics for research-oriented agents, focusing on deep reasoning capabilities, research methodology adherence, and scientific output quality assessment.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - specifically research agents and foundation model training</li>
  <li><strong>Relevance for Agents-eval</strong>: Medium - research-specific evaluation methods</li>
  <li><strong>Concrete Example</strong>: Adapt cognitive kernel evaluation metrics to assess agent reasoning depth and research methodology compliance</li>
</ul>

<h2 id="2025-07">2025-07</h2>

<h3 id="250723276-how-far-are-ai-scientists-from-changing-the-world">[2507.23276] How Far Are AI Scientists from Changing the World?</h3>

<p><strong>Evaluation Approach</strong>: Evaluates AI scientist capabilities through impact assessment, scientific contribution analysis, and research output quality metrics. Includes benchmarks for scientific discovery and innovation potential.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - AI scientists and research agents</li>
  <li><strong>Relevance for Agents-eval</strong>: Medium - scientific impact evaluation methods</li>
  <li><strong>Concrete Example</strong>: Implement scientific contribution scoring system based on novelty, reproducibility, and potential impact metrics</li>
</ul>

<h3 id="250722414-autocodesherpa-symbolic-explanations-in-ai-coding-agents">[2507.22414] AutoCodeSherpa: Symbolic Explanations in AI Coding Agents</h3>

<p><strong>Evaluation Approach</strong>: Focuses on explainability evaluation for coding agents. Assesses quality of symbolic explanations, code reasoning transparency, and interpretability of agent decisions.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - coding agents with explainability features</li>
  <li><strong>Relevance for Agents-eval</strong>: Medium - explainability assessment for coding agents</li>
  <li><strong>Concrete Example</strong>: Add explainability evaluation module that scores agent explanations using symbolic reasoning clarity metrics</li>
</ul>

<h3 id="250721046-a-survey-of-self-evolving-agents-on-path-to-artificial-super-intelligence">[2507.21046] A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence</h3>

<p><strong>Evaluation Approach</strong>: Evaluates self-evolution capabilities in agents, including adaptation metrics, learning progression assessment, and capability expansion measurement over time.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - self-evolving autonomous agents</li>
  <li><strong>Relevance for Agents-eval</strong>: High - longitudinal evaluation of agent evolution</li>
  <li><strong>Concrete Example</strong>: Implement evolution tracking dashboard that monitors agent capability changes and adaptation rates over time</li>
</ul>

<h3 id="250718074-alphago-moment-for-model-architecture-discovery">[2507.18074] AlphaGo Moment for Model Architecture Discovery</h3>

<p><strong>Evaluation Approach</strong>: Evaluates automated architecture discovery agents, focusing on search efficiency, architecture quality, and optimization convergence metrics.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - architecture discovery agents</li>
  <li><strong>Relevance for Agents-eval</strong>: Low - highly specialized for architecture discovery</li>
  <li><strong>Concrete Example</strong>: Adapt architecture quality metrics for evaluating any agent’s internal structure optimization</li>
</ul>

<h3 id="250717311-earthlink-a-self-evolving-ai-agent-for-climate-science">[2507.17311] EarthLink: A Self-Evolving AI Agent for Climate Science</h3>

<p><strong>Evaluation Approach</strong>: Domain-specific evaluation for climate science agents, including scientific accuracy, prediction quality, and environmental impact assessment capabilities.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - domain-specific climate science agents</li>
  <li><strong>Relevance for Agents-eval</strong>: Low - highly domain-specific</li>
  <li><strong>Concrete Example</strong>: Extract domain-agnostic scientific accuracy evaluation methods for specialized knowledge agents</li>
</ul>

<h3 id="250717257-agent-identity-evals-measuring-agentic-identity">[2507.17257] Agent Identity Evals: Measuring Agentic Identity</h3>

<p><strong>Evaluation Approach</strong>: Develops identity consistency evaluation for agents, measuring personality persistence, behavioral coherence, and identity stability across interactions.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - agent identity and personality evaluation</li>
  <li><strong>Relevance for Agents-eval</strong>: High - identity consistency evaluation framework</li>
  <li><strong>Concrete Example</strong>: Implement identity coherence scoring system that tracks agent personality consistency across different tasks</li>
</ul>

<h3 id="250716940-aura-a-multi-modal-medical-agent-for-understanding-reasoning--annotation">[2507.16940] AURA: A Multi-Modal Medical Agent for Understanding, Reasoning &amp; Annotation</h3>

<p><strong>Evaluation Approach</strong>: Multi-modal evaluation for medical agents, including diagnostic accuracy, reasoning quality, and annotation precision across different medical data types.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - multi-modal medical agents</li>
  <li><strong>Relevance for Agents-eval</strong>: Medium - multi-modal evaluation techniques</li>
  <li><strong>Concrete Example</strong>: Adapt multi-modal evaluation pipeline for agents handling diverse data types (text, images, structured data)</li>
</ul>

<h3 id="250710584-arpaccino-an-agentic-rag-for-policy-as-code-compliance">[2507.10584] ARPaCCino: An Agentic-RAG for Policy as Code Compliance</h3>

<p><strong>Evaluation Approach</strong>: Compliance evaluation for RAG-based agents, focusing on policy adherence, regulatory compliance accuracy, and code compliance verification.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - RAG agents with compliance focus</li>
  <li><strong>Relevance for Agents-eval</strong>: Medium - compliance and policy adherence evaluation</li>
  <li><strong>Concrete Example</strong>: Build compliance evaluation module that checks agent outputs against predefined policy requirements</li>
</ul>

<h3 id="250705178-crew-wildfire-benchmarking-agentic-multi-agent-collaborations-at-scale">[2507.05178] CREW-WILDFIRE: Benchmarking Agentic Multi-Agent Collaborations at Scale</h3>

<p><strong>Evaluation Approach</strong>: Large-scale multi-agent evaluation using emergency response scenarios. Measures coordination, communication effectiveness, and collective decision-making in crisis situations.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - multi-agent collaborative systems</li>
  <li><strong>Relevance for Agents-eval</strong>: Medium - multi-agent collaboration evaluation</li>
  <li><strong>Concrete Example</strong>: Implement team coordination metrics that evaluate agent communication patterns and task distribution efficiency</li>
</ul>

<h3 id="250702825-establishing-best-practices-for-building-rigorous-agentic-benchmarks">[2507.02825] Establishing Best Practices for Building Rigorous Agentic Benchmarks</h3>

<p><strong>Evaluation Approach</strong>: Meta-evaluation methodology providing guidelines for benchmark design, reproducibility standards, and evaluation framework validation.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - evaluation methodology and best practices</li>
  <li><strong>Relevance for Agents-eval</strong>: Very High - foundational evaluation framework design</li>
  <li><strong>Concrete Example</strong>: Apply rigorous benchmark design principles including statistical validity, reproducibility checks, and bias detection protocols</li>
</ul>

<h3 id="250702097-the-future-is-agentic-definitions-perspectives-and-open-challenges-of-multi-agent-recommender-systems">[2507.02097] The Future is Agentic: Definitions, Perspectives, and Open Challenges of Multi-Agent Recommender Systems</h3>

<p><strong>Evaluation Approach</strong>: Evaluation framework for multi-agent recommender systems, including recommendation quality, user satisfaction, and system fairness assessment.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - multi-agent recommender systems</li>
  <li><strong>Relevance for Agents-eval</strong>: Low - highly domain-specific for recommender systems</li>
  <li><strong>Concrete Example</strong>: Extract collaborative filtering evaluation metrics for any multi-agent system with recommendation components</li>
</ul>

<h2 id="2025-06">2025-06</h2>

<h3 id="250618096-deep-research-agents-a-systematic-examination-and-roadmap">[2506.18096] Deep Research Agents: A Systematic Examination And Roadmap</h3>

<p><strong>Evaluation Approach</strong>: Comprehensive evaluation framework for research agents including literature review quality, hypothesis generation, experimental design, and scientific rigor assessment.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - deep research agents</li>
  <li><strong>Relevance for Agents-eval</strong>: High - systematic evaluation methodology for complex agent tasks</li>
  <li><strong>Concrete Example</strong>: Implement research methodology evaluation pipeline that scores agent performance on systematic investigation tasks</li>
</ul>

<h3 id="250616499-ml-master-towards-ai-for-ai-via-integration-of-exploration-and-reasoning">[2506.16499] ML-Master: Towards AI-for-AI via Integration of Exploration and Reasoning</h3>

<p><strong>Evaluation Approach</strong>: Evaluates AI systems that optimize other AI systems, focusing on meta-learning capabilities, optimization effectiveness, and reasoning integration quality.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - meta-AI optimization agents</li>
  <li><strong>Relevance for Agents-eval</strong>: Medium - meta-evaluation and optimization assessment</li>
  <li><strong>Concrete Example</strong>: Build meta-evaluation layer that assesses how well agents can evaluate and improve other agents</li>
</ul>

<h3 id="250613131-alphaevolve-a-coding-agent-for-scientific-and-algorithmic-discovery">[2506.13131] AlphaEvolve: A coding agent for scientific and algorithmic discovery</h3>

<p><strong>Evaluation Approach</strong>: Scientific coding evaluation including algorithm novelty, implementation correctness, computational efficiency, and scientific contribution assessment.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - scientific coding agents</li>
  <li><strong>Relevance for Agents-eval</strong>: Medium - scientific coding evaluation methods</li>
  <li><strong>Concrete Example</strong>: Implement algorithmic discovery scoring that evaluates code novelty, efficiency, and scientific validity</li>
</ul>

<h3 id="250604133-trism-for-agentic-ai-a-review-of-trust-risk-and-security-management-in-llm-based-agentic-multi-agent-systems">[2506.04133] TRiSM for Agentic AI: A Review of Trust, Risk, and Security Management in LLM-based Agentic Multi-Agent Systems</h3>

<p><strong>Evaluation Approach</strong>: Security and trust evaluation for multi-agent systems, including risk assessment, trust measurement, and security vulnerability evaluation.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - security and trust evaluation</li>
  <li><strong>Relevance for Agents-eval</strong>: High - safety and security evaluation framework</li>
  <li><strong>Concrete Example</strong>: Integrate TRiSM security evaluation modules that assess agent trustworthiness and risk levels</li>
</ul>

<h2 id="2025-05">2025-05</h2>

<h3 id="250522967-mermaidflow-redefining-agentic-workflow-generation-via-safety-constrained-evolutionary-programming">[2505.22967] MermaidFlow: Redefining Agentic Workflow Generation via Safety-Constrained Evolutionary Programming</h3>

<p><strong>Evaluation Approach</strong>: Evaluates workflow generation agents with safety constraints, including workflow quality, safety compliance, and evolutionary optimization effectiveness.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - workflow generation with safety constraints</li>
  <li><strong>Relevance for Agents-eval</strong>: High - safety-constrained evaluation methodology</li>
  <li><strong>Concrete Example</strong>: Implement safety-constrained workflow evaluation that checks agent outputs against safety requirements</li>
</ul>

<h3 id="250522954-darwin-godel-machine-open-ended-evolution-of-self-improving-agents">[2505.22954] Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents</h3>

<p><strong>Evaluation Approach</strong>: Long-term evolutionary evaluation of self-improving agents, including adaptation measurement, improvement trajectory analysis, and evolutionary fitness assessment.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - self-improving evolutionary agents</li>
  <li><strong>Relevance for Agents-eval</strong>: High - long-term agent evolution tracking</li>
  <li><strong>Concrete Example</strong>: Build evolution monitoring system that tracks agent self-improvement over extended periods</li>
</ul>

<h3 id="250522583-gitgoodbench-a-novel-benchmark-for-evaluating-agentic-performance-on-git">[2505.22583] GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git</h3>

<p><strong>Evaluation Approach</strong>: Domain-specific evaluation for software development agents using Git operations, measuring code management, collaboration skills, and workflow understanding.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - software development agents</li>
  <li><strong>Relevance for Agents-eval</strong>: Medium - domain-specific Git-based evaluation</li>
  <li><strong>Concrete Example</strong>: Adapt Git operation evaluation suite for any agent performing version control tasks</li>
</ul>

<h3 id="250519764-agentic-predictor-performance-prediction-for-agentic-workflows-via-multi-view-encoding">[2505.19764] Agentic Predictor: Performance Prediction for Agentic Workflows via Multi-View Encoding</h3>

<p><strong>Evaluation Approach</strong>: Predictive evaluation using multi-view encoding to forecast agent performance before full execution, enabling proactive optimization.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - predictive performance evaluation</li>
  <li><strong>Relevance for Agents-eval</strong>: High - predictive evaluation for efficiency optimization</li>
  <li><strong>Concrete Example</strong>: Implement performance prediction module that estimates agent success rates before task execution</li>
</ul>

<h3 id="250518946-sannet-a-semantic-aware-agentic-ai-networking-framework-for-multi-agent-cross-layer-coordination">[2505.18946] SANNet: A Semantic-Aware Agentic AI Networking Framework for Multi-Agent Cross-Layer Coordination</h3>

<p><strong>Evaluation Approach</strong>: Network-aware evaluation for multi-agent systems, including coordination efficiency, communication overhead, and semantic understanding assessment.</p>

<ul>
  <li><strong>Focus</strong>: Agentic systems - networked multi-agent coordination</li>
  <li><strong>Relevance for Agents-eval</strong>: Medium - network-aware agent evaluation</li>
  <li><strong>Concrete Example</strong>: Add network coordination metrics that evaluate agent communication efficiency and semantic alignment</li>
</ul>

<h3 id="250515872-infodeepseek-benchmarking-agentic-information-seeking-for-retrieval-augmented-generation">[2505.15872] InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation</h3>

<p><strong>Evaluation Approach</strong>: Information seeking evaluation for RAG agents, including search quality, information relevance, and retrieval effectiveness assessment.</p>

<p>• <strong>Focus</strong>: Agentic systems - information seeking and RAG agents
• <strong>Relevance for Agents-eval</strong>: Medium - information retrieval evaluation methods
• <strong>Concrete Example</strong>: Implement information seeking benchmark that evaluates agent query formulation and retrieval quality</p>

<h2 id="2025-04">2025-04</h2>

<h3 id="250419678-from-llm-reasoning-to-autonomous-ai-agents-a-comprehensive-review">[2504.19678] From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review</h3>

<p><strong>Evaluation Approach</strong>: Comprehensive review of evaluation methods spanning from LLM reasoning assessment to full autonomous agent evaluation, bridging traditional and agentic evaluation.</p>

<p>• <strong>Focus</strong>: Both LLM and agentic systems - comprehensive evaluation survey
• <strong>Relevance for Agents-eval</strong>: High - comprehensive evaluation methodology overview
• <strong>Concrete Example</strong>: Use survey taxonomy to structure evaluation categories from basic reasoning to full autonomy</p>

<h3 id="250416902-building-a-secure-agentic-ai-application-leveraging-googles-a2a-protocol">[2504.16902] Building A Secure Agentic AI Application Leveraging Google’s A2A Protocol</h3>

<p><strong>Evaluation Approach</strong>: Security evaluation for agentic applications using A2A protocol, focusing on authentication, authorization, and secure communication assessment.</p>

<p>• <strong>Focus</strong>: Agentic systems - secure application development
• <strong>Relevance for Agents-eval</strong>: Medium - security evaluation using A2A protocol
• <strong>Concrete Example</strong>: Implement A2A-based security evaluation that verifies agent authentication and secure communication protocols</p>

<h2 id="2025-03">2025-03</h2>

<h3 id="250321460-large-language-model-agent-a-survey-on-methodology-applications-and-challenges">[2503.21460] Large Language Model Agent: A Survey on Methodology, Applications and Challenges</h3>

<p><strong>Evaluation Approach</strong>: Survey of LLM agent evaluation methods across different applications, including capability assessment, application-specific metrics, and challenge identification.</p>

<p>• <strong>Focus</strong>: LLM-based agents - comprehensive methodology survey
• <strong>Relevance for Agents-eval</strong>: High - systematic agent evaluation methodology
• <strong>Concrete Example</strong>: Structure evaluation framework using survey’s methodology taxonomy for different agent applications</p>

<h3 id="250316416-survey-on-evaluation-of-llm-based-agents">[2503.16416] Survey on Evaluation of LLM-based Agents</h3>

<p><strong>Evaluation Approach</strong>: Comprehensive survey categorizing evaluation into capability assessment, behavioral analysis, and performance benchmarking with gap identification.</p>

<p>• <strong>Focus</strong>: LLM-based agents - systematic evaluation survey
• <strong>Relevance for Agents-eval</strong>: Very High - systematic evaluation framework
• <strong>Concrete Example</strong>: Implement three-tier evaluation system: capabilities, behaviors, and performance as suggested in survey</p>
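<p>A minimal sketch of such a three-tier report, assuming equal weighting within tiers and illustrative tier weights (the survey prescribes the capability/behavior/performance split, not these numbers):</p>

```python
from dataclasses import dataclass, field

@dataclass
class ThreeTierReport:
    """Scores in [0, 1] per metric, grouped into the survey's three tiers."""
    capabilities: dict = field(default_factory=dict)  # e.g. reasoning, tool use
    behaviors: dict = field(default_factory=dict)     # e.g. consistency
    performance: dict = field(default_factory=dict)   # e.g. task completion

    def tier_score(self, tier: dict) -> float:
        """Unweighted mean over one tier's metrics."""
        return sum(tier.values()) / len(tier) if tier else 0.0

    def overall(self, weights=(0.3, 0.3, 0.4)) -> float:
        """Weighted aggregate; the tier weights are an assumption."""
        tiers = (self.capabilities, self.behaviors, self.performance)
        return sum(w * self.tier_score(t) for w, t in zip(weights, tiers))
```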

<h3 id="250314713-testforge-feedback-driven-agentic-test-suite-generation">[2503.14713] TestForge: Feedback-Driven, Agentic Test Suite Generation</h3>

<p><strong>Evaluation Approach</strong>: Self-generating evaluation through automated test suite creation with feedback loops for continuous improvement and adaptation.</p>

<p>• <strong>Focus</strong>: Agentic systems - self-evaluating test generation
• <strong>Relevance for Agents-eval</strong>: High - automated test generation and self-evaluation
• <strong>Concrete Example</strong>: Build TestForge-inspired module that automatically generates evaluation tests based on agent performance feedback</p>

<h3 id="250313657-why-do-multi-agent-llm-systems-fail">[2503.13657] Why Do Multi-Agent LLM Systems Fail?</h3>

<p><strong>Evaluation Approach</strong>: Failure analysis evaluation focusing on identifying failure modes, root cause analysis, and system reliability assessment in multi-agent contexts.</p>

<p>• <strong>Focus</strong>: Multi-agent LLM systems - failure analysis evaluation
• <strong>Relevance for Agents-eval</strong>: High - failure mode detection and analysis
• <strong>Concrete Example</strong>: Implement failure analysis module that identifies common multi-agent failure patterns and root causes</p>
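<p>A rule-based sketch of such a module follows. The taxonomy keys and keyword rules are illustrative placeholders loosely inspired by the paper's categories, not its actual classifier:</p>

```python
# Placeholder failure taxonomy; a real module would use a learned or
# LLM-judge classifier rather than keyword matching.
FAILURE_PATTERNS = {
    "inter_agent_misalignment": ["ignored message", "contradicts", "duplicate work"],
    "task_verification": ["unverified", "no final check", "premature termination"],
    "specification_failure": ["missing constraint", "ambiguous goal"],
}

def classify_failures(log_lines: list[str]) -> dict[str, int]:
    """Count occurrences of each failure mode in an interaction log."""
    counts = {mode: 0 for mode in FAILURE_PATTERNS}
    for line in log_lines:
        lowered = line.lower()
        for mode, keywords in FAILURE_PATTERNS.items():
            if any(kw in lowered for kw in keywords):
                counts[mode] += 1
    return counts
```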

<h3 id="250308979-agentic-ai-for-scientific-discovery-a-survey-of-progress-challenges-and-future-direction">[2503.08979] Agentic AI for Scientific Discovery: A Survey of Progress, Challenges, and Future Direction</h3>

<p><strong>Evaluation Approach</strong>: Scientific discovery evaluation including research quality, discovery novelty, experimental design, and scientific impact assessment.</p>

<p>• <strong>Focus</strong>: Agentic systems - scientific discovery agents
• <strong>Relevance for Agents-eval</strong>: Medium - scientific discovery evaluation methods
• <strong>Concrete Example</strong>: Adapt scientific discovery metrics to evaluate any agent performing research or discovery tasks</p>

<h3 id="250306416-advancing-ai-negotiations-new-theory-and-evidence-from-a-large-scale-autonomous-negotiation-competition">[2503.06416] Advancing AI Negotiations: New Theory and Evidence from a Large-Scale Autonomous Negotiation Competition</h3>

<p><strong>Evaluation Approach</strong>: Negotiation performance evaluation including strategy effectiveness, outcome optimization, and competitive performance assessment.</p>

<p>• <strong>Focus</strong>: Agentic systems - autonomous negotiation agents
• <strong>Relevance for Agents-eval</strong>: Low - highly specialized for negotiation tasks
• <strong>Concrete Example</strong>: Extract strategic decision-making evaluation metrics for any agent performing competitive tasks</p>

<h3 id="250300237-agentic-ai-needs-a-systems-theory">[2503.00237] Agentic AI Needs a Systems Theory</h3>

<p><strong>Evaluation Approach</strong>: Systems-theoretic evaluation approach focusing on emergent properties, system behavior analysis, and complexity assessment.</p>

<p>• <strong>Focus</strong>: Agentic systems - systems theory approach to evaluation
• <strong>Relevance for Agents-eval</strong>: High - systems-level evaluation methodology
• <strong>Concrete Example</strong>: Implement systems-theory evaluation that assesses agent emergent properties and complex system behaviors</p>

<h2 id="2025-02">2025-02</h2>

<h3 id="250214776-surveyx-academic-survey-automation-via-large-language-models">[2502.14776] SurveyX: Academic Survey Automation via Large Language Models</h3>

<p><strong>Evaluation Approach</strong>: Automated survey generation and analysis evaluation, including survey quality, response analysis accuracy, and research methodology compliance.</p>

<p>• <strong>Focus</strong>: LLM-based systems - automated survey generation
• <strong>Relevance for Agents-eval</strong>: Low - highly specialized for survey automation
• <strong>Concrete Example</strong>: Extract automated evaluation generation methods for creating evaluation surveys</p>

<h3 id="250205957-autoagent-a-fully-automated-and-zero-code-framework-for-llm-agents">[2502.05957] AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents</h3>

<p><strong>Evaluation Approach</strong>: Zero-code agent evaluation focusing on automation quality, user experience, and framework effectiveness assessment.</p>

<p>• <strong>Focus</strong>: LLM-based agents - automated agent frameworks
• <strong>Relevance for Agents-eval</strong>: Medium - automated evaluation framework design
• <strong>Concrete Example</strong>: Implement zero-code evaluation interface that allows non-technical users to evaluate agents</p>

<h3 id="250202649-fully-autonomous-ai-agents-should-not-be-developed">[2502.02649] Fully Autonomous AI Agents Should Not be Developed</h3>

<p><strong>Evaluation Approach</strong>: Safety and ethics evaluation for autonomous agents, including risk assessment, ethical compliance, and safety constraint verification.</p>

<p>• <strong>Focus</strong>: Agentic systems - safety and ethics evaluation
• <strong>Relevance for Agents-eval</strong>: High - safety and ethics evaluation framework
• <strong>Concrete Example</strong>: Build safety evaluation module that assesses agent autonomy levels and associated risks</p>
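<p>A minimal sketch of such a gate, with autonomy levels and risk weights that are illustrative assumptions rather than values from the paper:</p>

```python
# Assumed risk weights per autonomy level; calibrate against your own policy.
AUTONOMY_RISK = {
    "tool_call_on_approval": 0.2,   # human confirms every action
    "bounded_autonomy": 0.5,        # acts freely within a whitelist
    "full_autonomy": 0.9,           # no human in the loop
}

def safety_gate(autonomy_level: str, irreversible_actions: int,
                max_risk: float = 0.6) -> bool:
    """Return True if the agent configuration passes the safety gate.
    Each planned irreversible action adds a fixed risk penalty."""
    risk = AUTONOMY_RISK[autonomy_level] + 0.1 * irreversible_actions
    return risk <= max_risk
```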

<h2 id="2025-01">2025-01</h2>

<h3 id="250116150-ai-agents-for-computer-use-a-review-of-instruction-based-computer-control-gui-automation-and-operator-assistants">[2501.16150] AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants</h3>

<p><strong>Evaluation Approach</strong>: Computer-use agent evaluation including accuracy metrics, user experience measures, safety assessments, and real-world usability testing.</p>

<p>• <strong>Focus</strong>: Agentic systems - computer-use and GUI automation agents
• <strong>Relevance for Agents-eval</strong>: Medium - computer-use evaluation methods
• <strong>Concrete Example</strong>: Implement GUI interaction evaluation suite that measures agent accuracy in computer control tasks</p>

<h3 id="250106590-chemagent">[2501.06590] ChemAgent</h3>

<p><strong>Evaluation Approach</strong>: Chemistry-specific agent evaluation including chemical knowledge accuracy, reaction prediction quality, and safety protocol compliance.</p>

<p>• <strong>Focus</strong>: Agentic systems - domain-specific chemistry agents
• <strong>Relevance for Agents-eval</strong>: Low - highly domain-specific for chemistry
• <strong>Concrete Example</strong>: Extract domain expertise evaluation methods for any specialized knowledge agent</p>

<h3 id="250106322-multi-agent-collaboration-mechanisms-a-survey-of-llms">[2501.06322] Multi-Agent Collaboration Mechanisms: A Survey of LLMs</h3>

<p><strong>Evaluation Approach</strong>: Collaboration mechanism evaluation including coordination efficiency, communication quality, and collective intelligence assessment.</p>

<p>• <strong>Focus</strong>: Multi-agent LLM systems - collaboration mechanisms
• <strong>Relevance for Agents-eval</strong>: Medium - multi-agent collaboration evaluation
• <strong>Concrete Example</strong>: Implement collaboration quality metrics that measure agent teamwork effectiveness</p>
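<p>One simple collaboration metric is turn balance, the normalized entropy of which agent speaks when. The metric choice is an assumption for illustration, not one prescribed by the survey:</p>

```python
import math
from collections import Counter

def turn_balance(messages: list[tuple[str, str]]) -> float:
    """Participation balance across agents: 1.0 means perfectly even
    turn-taking, values near 0 mean one agent dominates."""
    counts = Counter(sender for sender, _ in messages)
    if len(counts) < 2:
        return 0.0  # no collaboration to measure
    total = sum(counts.values())
    # Shannon entropy of the turn distribution, normalized by its maximum.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))
```

<p>Pairing this with message-relevance scoring (e.g. an LLM judge) would cover the communication-quality dimension as well.</p>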

<h3 id="250104227-agent-laboratory-using-llm-agents-as-research-assistants">[2501.04227] Agent Laboratory: Using LLM Agents as Research Assistants</h3>

<p><strong>Evaluation Approach</strong>: Research assistant evaluation including research quality, methodology compliance, and scientific contribution assessment.</p>

<p>• <strong>Focus</strong>: LLM-based agents - research assistance
• <strong>Relevance for Agents-eval</strong>: Medium - research assistance evaluation methods
• <strong>Concrete Example</strong>: Build research assistant evaluation that scores agent contributions to scientific workflows</p>

<h3 id="250100881-agentic-systems-a-guide-to-transforming-industries-with-vertical-ai-agents">[2501.00881] Agentic Systems: A Guide to Transforming Industries with Vertical AI Agents</h3>

<p><strong>Evaluation Approach</strong>: Industry-specific evaluation for vertical agents, including domain adaptation assessment, business impact measurement, and transformation effectiveness.</p>

<p>• <strong>Focus</strong>: Agentic systems - industry-specific vertical agents
• <strong>Relevance for Agents-eval</strong>: Medium - industry adaptation evaluation methods
• <strong>Concrete Example</strong>: Create industry adaptation evaluation framework that measures agent effectiveness across different domains</p>

<h2 id="2024-12">2024-12</h2>

<h3 id="241217149-a-multi-ai-agent-system-for-autonomous-optimization-of-agentic-ai-solutions-via-iterative-refinement-and-llm-driven-feedback-loop">[2412.17149] A Multi-AI Agent System for Autonomous Optimization of Agentic AI Solutions via Iterative Refinement and LLM-Driven Feedback Loop</h3>

<p><strong>Evaluation Approach</strong>: Iterative refinement evaluation with LLM-driven feedback loops, measuring optimization effectiveness, convergence quality, and system improvement.</p>

<p>• <strong>Focus</strong>: Agentic systems - multi-agent optimization systems
• <strong>Relevance for Agents-eval</strong>: High - iterative evaluation with feedback loops
• <strong>Concrete Example</strong>: Implement feedback-driven evaluation system that continuously refines evaluation criteria based on agent performance</p>
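<p>A toy version of such a feedback loop might adjust the pass threshold between evaluation rounds. The update rule and constants are assumptions for illustration:</p>

```python
def refine_threshold(pass_rate: float, threshold: float,
                     target: float = 0.5, step: float = 0.05) -> float:
    """Tighten the criterion when agents pass too easily, relax it
    when almost everything fails."""
    if pass_rate > target + 0.1:
        threshold = min(1.0, threshold + step)   # tests too easy: raise the bar
    elif pass_rate < target - 0.1:
        threshold = max(0.0, threshold - step)   # tests too hard: lower it
    return threshold

def evaluation_loop(results_per_round: list[list[bool]], threshold: float = 0.6) -> list[float]:
    """Run successive rounds, refining the criterion after each one."""
    history = []
    for results in results_per_round:
        pass_rate = sum(results) / len(results)
        threshold = refine_threshold(pass_rate, threshold)
        history.append(threshold)
    return history
```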

<h3 id="241204093-practical-considerations-for-agentic-llm-systems">[2412.04093] Practical Considerations for Agentic LLM Systems</h3>

<p><strong>Evaluation Approach</strong>: Practical deployment evaluation including system reliability, scalability assessment, maintenance requirements, and operational effectiveness.</p>

<p>• <strong>Focus</strong>: LLM-based agentic systems - practical deployment considerations
• <strong>Relevance for Agents-eval</strong>: High - practical deployment evaluation considerations
• <strong>Concrete Example</strong>: Add deployment readiness evaluation that assesses agent reliability and operational requirements</p>

<h2 id="2024-11">2024-11</h2>

<h3 id="241113768-evaluation-driven-approach-to-llm-agents">[2411.13768] Evaluation-driven Approach to LLM Agents</h3>

<p><strong>Evaluation Approach</strong>: Evaluation-driven development methodology where assessment guides optimization, focusing on continuous improvement and performance-based refinement.</p>

<p>• <strong>Focus</strong>: LLM-based agents - evaluation-driven development
• <strong>Relevance for Agents-eval</strong>: High - evaluation-driven development methodology
• <strong>Concrete Example</strong>: Implement development pipeline that uses evaluation results to automatically suggest agent improvements</p>

<h3 id="241113543-balrog-benchmarking-agentic-llm-and-vlm-reasoning-on-games">[2411.13543] BALROG: Benchmarking Agentic LLM and VLM Reasoning on Games</h3>

<p><strong>Evaluation Approach</strong>: Game-based reasoning evaluation using strategic environments to assess planning, decision-making, and competitive performance.</p>

<p>• <strong>Focus</strong>: Agentic systems - reasoning evaluation through games
• <strong>Relevance for Agents-eval</strong>: Medium - game-based evaluation methods
• <strong>Concrete Example</strong>: Create strategic reasoning benchmark using simplified game scenarios to evaluate agent decision-making</p>

<h3 id="241110478-large-language-models-for-constructing-and-optimizing-machine-learning-workflows-a-survey">[2411.10478] Large Language Models for Constructing and Optimizing Machine Learning Workflows: A Survey</h3>

<p><strong>Evaluation Approach</strong>: ML workflow construction evaluation including pipeline quality, optimization effectiveness, and workflow validity assessment.</p>

<p>• <strong>Focus</strong>: LLM-based systems - ML workflow construction
• <strong>Relevance for Agents-eval</strong>: Low - specialized for ML workflow construction
• <strong>Concrete Example</strong>: Extract workflow construction evaluation metrics for agents that build complex processes</p>

<h3 id="241105285-a-taxonomy-of-agentops-for-enabling-observability-of-foundation-model-based-agents">[2411.05285] A Taxonomy of AgentOps for Enabling Observability of Foundation Model Based Agents</h3>

<p><strong>Evaluation Approach</strong>: Operational observability evaluation through AgentOps taxonomy, focusing on runtime monitoring and system health assessment.</p>

<p>• <strong>Focus</strong>: Foundation model-based agents - operational monitoring
• <strong>Relevance for Agents-eval</strong>: High - operational monitoring and observability
• <strong>Concrete Example</strong>: Implement AgentOps monitoring dashboard that tracks agent operational metrics in real-time</p>
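<p>A minimal in-memory sketch of such a tracker; a real AgentOps stack would export these metrics to a dashboard, and the class and method names here are hypothetical:</p>

```python
import time
from collections import defaultdict

class AgentOpsTracker:
    """Aggregates timestamped operational metrics for agent runs in memory."""

    def __init__(self):
        self.metrics = defaultdict(list)

    def record(self, name: str, value: float) -> None:
        """Store one observation, e.g. latency or token cost."""
        self.metrics[name].append((time.time(), value))

    def summary(self, name: str) -> dict:
        """Simple aggregate view suitable for a dashboard panel."""
        values = [v for _, v in self.metrics[name]]
        return {"count": len(values),
                "mean": sum(values) / len(values) if values else 0.0,
                "max": max(values, default=0.0)}
```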

<h2 id="2024-10">2024-10</h2>

<h3 id="241022457-advancing-agentic-systems-dynamic-task-decomposition-tool-integration-and-evaluation-using-novel-metrics-and-dataset">[2410.22457] Advancing Agentic Systems: Dynamic Task Decomposition, Tool Integration and Evaluation using Novel Metrics and Dataset</h3>

<p><strong>Evaluation Approach</strong>: Novel metrics for dynamic task decomposition and tool integration, including adaptability measurement and decomposition quality assessment.</p>

<p>• <strong>Focus</strong>: Agentic systems - task decomposition and tool integration
• <strong>Relevance for Agents-eval</strong>: Very High - novel evaluation metrics and datasets
• <strong>Concrete Example</strong>: Implement dynamic task decomposition evaluation that scores agent ability to break down complex tasks</p>
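<p>One simple scoring choice is F1 overlap between the agent's subtasks and a reference decomposition. Exact string matching is a simplification; the paper's novel metrics are richer than this sketch:</p>

```python
def decomposition_score(subtasks: list[str], reference: set[str]) -> float:
    """F1 between predicted subtasks and a reference decomposition."""
    predicted = set(subtasks)
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)          # correctly identified subtasks
    precision = tp / len(predicted)
    recall = tp / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```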

<h3 id="241014393-debug-smarter-not-harder-ai-agents-for-error-resolution-in-computational-notebooks">[2410.14393] Debug Smarter, Not Harder: AI Agents for Error Resolution in Computational Notebooks</h3>

<p><strong>Evaluation Approach</strong>: Debugging agent evaluation including error detection accuracy, resolution effectiveness, and code improvement quality assessment.</p>

<p>• <strong>Focus</strong>: Agentic systems - debugging and error resolution agents
• <strong>Relevance for Agents-eval</strong>: Medium - debugging effectiveness evaluation
• <strong>Concrete Example</strong>: Build debugging evaluation suite that measures agent error detection and resolution capabilities</p>

<h3 id="241009713-agentic-information-retrieval">[2410.09713] Agentic Information Retrieval</h3>

<p><strong>Evaluation Approach</strong>: Information retrieval evaluation for autonomous agents, including search strategy assessment, relevance judgment, and retrieval effectiveness.</p>

<p>• <strong>Focus</strong>: Agentic systems - autonomous information retrieval
• <strong>Relevance for Agents-eval</strong>: Medium - information retrieval evaluation methods
• <strong>Concrete Example</strong>: Implement information retrieval evaluation that assesses agent search strategies and result quality</p>

<h3 id="240808435-automated-design-of-agentic-systems">[2408.08435] Automated Design of Agentic Systems</h3>

<p><strong>Evaluation Approach</strong>: Automated system design evaluation including design quality, system effectiveness, and automation level assessment.</p>

<p>• <strong>Focus</strong>: Agentic systems - automated system design
• <strong>Relevance for Agents-eval</strong>: Medium - automated design evaluation methods
• <strong>Concrete Example</strong>: Create design quality evaluation that scores agent-generated system architectures</p>

<h3 id="240801768-building-living-software-systems-with-generative--agentic-ai">[2408.01768] Building Living Software Systems with Generative &amp; Agentic AI</h3>

<p><strong>Evaluation Approach</strong>: Living systems evaluation including adaptability, evolution capability, and system lifespan assessment for generative and agentic systems.</p>

<p>• <strong>Focus</strong>: Agentic systems - living software systems
• <strong>Relevance for Agents-eval</strong>: Medium - adaptive system evaluation methods
• <strong>Concrete Example</strong>: Implement living system evaluation that tracks agent adaptation and evolution over time</p>

<h2 id="2024-08">2024-08</h2>

<h3 id="240806361-large-language-model-agent-in-financial-trading-a-survey">[2408.06361] Large Language Model Agent in Financial Trading: A Survey</h3>

<p><strong>Evaluation Approach</strong>: Financial trading evaluation including portfolio performance, risk management, market adaptation, and trading strategy effectiveness.</p>

<p>• <strong>Focus</strong>: LLM-based agents - financial trading applications
• <strong>Relevance for Agents-eval</strong>: Low - highly domain-specific for financial trading
• <strong>Concrete Example</strong>: Extract quantitative performance evaluation methods for any agent making sequential decisions</p>

<h3 id="240806292-the-ai-scientist-towards-fully-automated-open-ended-scientific-discovery">[2408.06292] The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery</h3>

<p><strong>Evaluation Approach</strong>: Scientific discovery evaluation including research novelty, experimental validity, publication quality, and scientific impact assessment.</p>

<p>• <strong>Focus</strong>: Agentic systems - automated scientific discovery
• <strong>Relevance for Agents-eval</strong>: Medium - scientific discovery evaluation methods
• <strong>Concrete Example</strong>: Implement scientific contribution evaluation that scores agent research outputs for novelty and validity</p>

<h2 id="2024-04">2024-04</h2>

<h3 id="240413501-a-survey-on-the-memory-mechanism-of-large-language-model-based-agents">[2404.13501] A Survey on the Memory Mechanism of Large Language Model based Agents</h3>

<p><strong>Evaluation Approach</strong>: Memory system evaluation including memory retention, retrieval accuracy, contextual relevance, and memory utilization effectiveness.</p>

<p>• <strong>Focus</strong>: LLM-based agents - memory mechanisms
• <strong>Relevance for Agents-eval</strong>: High - memory system evaluation methods
• <strong>Concrete Example</strong>: Build memory evaluation suite that tests agent memory retention, retrieval accuracy, and contextual usage</p>
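<p>A probe-based recall score could look like the following sketch. The <code>agent_answer</code> callable is an assumed interface; a full harness would also inject the facts into the agent's context and run distractor turns before probing:</p>

```python
def memory_recall(agent_answer, probes: dict[str, str]) -> float:
    """Score recall of previously injected facts.
    `agent_answer(question)` stands in for the agent's query interface."""
    correct = sum(
        1 for question, expected in probes.items()
        if agent_answer(question).strip().lower() == expected.lower()
    )
    return correct / len(probes)
```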

<h2 id="2024-02">2024-02</h2>

<h3 id="240206360-cosearchagent-a-lightweight-collaborative-search-agent-with-large-language-models">[2402.06360] CoSearchAgent: A Lightweight Collaborative Search Agent with Large Language Models</h3>

<p><strong>Evaluation Approach</strong>: Collaborative search evaluation including search coordination, result quality, collaboration effectiveness, and search strategy assessment.</p>

<p>• <strong>Focus</strong>: LLM-based agents - collaborative search
• <strong>Relevance for Agents-eval</strong>: Low - specialized for collaborative search
• <strong>Concrete Example</strong>: Extract collaborative task evaluation methods for any multi-agent coordination scenario</p>

<h3 id="240202716-understanding-the-planning-of-llm-agents-a-survey">[2402.02716] Understanding the planning of LLM agents: A survey</h3>

<p><strong>Evaluation Approach</strong>: Planning capability evaluation including plan quality, execution effectiveness, adaptation ability, and strategic thinking assessment.</p>

<p>• <strong>Focus</strong>: LLM-based agents - planning capabilities
• <strong>Relevance for Agents-eval</strong>: High - planning evaluation methods
• <strong>Concrete Example</strong>: Implement planning evaluation suite that scores agent strategic thinking and plan execution quality</p>
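<p>A lightweight validity check can chain preconditions through a STRIPS-style state model. The step format below is an assumption for illustration, not a scheme from the survey:</p>

```python
def plan_is_valid(plan: list[dict], initial_state: set[str]) -> bool:
    """Check that each step's preconditions are satisfied by the state
    produced so far. Steps are {'pre': set, 'add': set} dicts."""
    state = set(initial_state)
    for step in plan:
        if not step["pre"] <= state:   # unmet precondition: invalid plan
            return False
        state |= step["add"]
    return True

def plan_quality(plans: list[list[dict]], initial_state: set[str]) -> float:
    """Fraction of candidate plans that are executable from the initial state."""
    return sum(plan_is_valid(p, initial_state) for p in plans) / len(plans)
```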

<h3 id="240201030-executable-code-actions-elicit-better-llm-agents">[2402.01030] Executable Code Actions Elicit Better LLM Agents</h3>

<p><strong>Evaluation Approach</strong>: Code execution evaluation including code quality, execution success, error handling, and practical implementation effectiveness.</p>

<p>• <strong>Focus</strong>: LLM-based agents - executable code generation
• <strong>Relevance for Agents-eval</strong>: Medium - code execution evaluation methods
• <strong>Concrete Example</strong>: Create code execution evaluation that measures agent coding accuracy and execution success rates</p>

<h2 id="2023-08">2023-08</h2>

<h3 id="230811432-a-survey-on-large-language-model-based-autonomous-agents">[2308.11432] A Survey on Large Language Model based Autonomous Agents</h3>

<p><strong>Evaluation Approach</strong>: Comprehensive autonomous agent evaluation including capability assessment, autonomy measurement, and performance benchmarking across multiple dimensions.</p>

<p>• <strong>Focus</strong>: LLM-based autonomous agents - comprehensive evaluation survey
• <strong>Relevance for Agents-eval</strong>: Very High - foundational comprehensive agent evaluation
• <strong>Concrete Example</strong>: Use survey’s evaluation framework as foundation for multi-dimensional agent assessment structure</p>

<h2 id="conclusion">Conclusion</h2>

<p>The comprehensive analysis of 50+ papers reveals a rapidly maturing field with clear consensus around key evaluation dimensions while highlighting significant opportunities for standardization and integration.</p>]]></content><author><name>qte77</name></author><category term="ml" /><category term="ai" /><category term="agents" /><category term="eval" /><summary type="html"><![CDATA[Comprehensive Analysis: Individual Paper Summaries]]></summary></entry><entry><title type="html">AI Agents-eval Enhancement Recommendations</title><link href="https://qte77.github.io/ai-agents-eval-enhancement-recommendations/" rel="alternate" type="text/html" title="AI Agents-eval Enhancement Recommendations" /><published>2025-08-09T00:00:00+00:00</published><updated>2025-08-09T00:00:00+00:00</updated><id>https://qte77.github.io/ai-agents-eval-enhancement-recommendations</id><content type="html" xml:base="https://qte77.github.io/ai-agents-eval-enhancement-recommendations/"><![CDATA[<h1 id="enhancement-recommendations-for-agents-eval-project">Enhancement Recommendations for Agents-eval Project</h1>

<p>This proposal is based on the <a href="https://github.com/qte77/qte77.github.io/blob/master/_posts/2025-08-09-ai-agents-eval-comprehensive-analysis.md">Comprehensive Analysis</a> and <a href="https://github.com/qte77/qte77.github.io/blob/master/_posts/2025-08-09-ai-agents-eval-papers-meta-review.md">Meta Review</a> of the papers listed in <a href="https://github.com/qte77/Agents-eval/blob/main/docs/papers/further_reading.md">Further Reading</a>.
It aims to enhance the <a href="https://github.com/qte77/Agents-eval">Agents-eval</a> project and was written with help from Claude Sonnet 4. 🙏🏼🌟🙌🏼💕🤗</p>

<h2 id="core-framework-enhancements">Core Framework Enhancements</h2>

<ol>
  <li><strong>Multi-Dimensional Evaluation Architecture</strong></li>
</ol>

<ul>
  <li>Implement a three-tier evaluation system</li>
  <li><strong>Capability Layer</strong>: Core competencies (reasoning, planning, tool use)</li>
  <li><strong>Behavioral Layer</strong>: Consistency, adaptability, interaction patterns</li>
  <li><strong>Performance Layer</strong>: Task completion, efficiency, real-world effectiveness</li>
  <li>Based on [2503.16416], [2308.11432], and [2504.19678]</li>
</ul>

<ol start="2">
  <li><strong>Dynamic Evaluation Pipeline</strong></li>
</ol>

<ul>
  <li><strong>Continuous Monitoring</strong>: Real-time performance tracking during agent execution</li>
  <li><strong>Adaptive Benchmarks</strong>: Evaluation criteria that evolve based on agent capabilities</li>
  <li><strong>Feedback Loops</strong>: Automatic refinement of evaluation based on results</li>
  <li>Using insights from [2507.21046], [2505.22954], and [2412.17149]</li>
</ul>

<ol start="3">
  <li><strong>Safety-First Evaluation Framework</strong></li>
</ol>

<ul>
  <li><strong>Risk Assessment Module</strong>: Evaluate potential harm and safety compliance</li>
  <li><strong>Ethical Compliance Checker</strong>: Verify alignment with ethical guidelines</li>
  <li><strong>Security Evaluation</strong>: Assess vulnerability and trustworthiness</li>
  <li>Incorporating [2506.04133], [2502.02649], and [2505.22967]</li>
</ul>

<h2 id="advanced-features-implementation">Advanced Features Implementation</h2>

<ol>
  <li><strong>Self-Evaluation Integration</strong></li>
</ol>

<ul>
  <li><strong>Self-Questioning Module</strong>: Agents generate their own evaluation questions</li>
  <li><strong>Identity Consistency Tracker</strong>: Monitor agent personality and behavior stability</li>
  <li><strong>Automated Test Generation</strong>: Dynamic creation of evaluation scenarios</li>
  <li>Based on [2508.03682], [2503.14713], and [2507.17257]</li>
</ul>

<ol start="2">
  <li><strong>Predictive Evaluation System</strong></li>
</ol>

<ul>
  <li><strong>Performance Prediction</strong>: Estimate success probability before full task execution</li>
  <li><strong>Resource Optimization</strong>: Predict computational requirements and optimize evaluation efficiency</li>
  <li><strong>Early Warning System</strong>: Identify potential failure modes before they occur</li>
  <li>From [2505.19764] insights</li>
</ul>

<ol start="3">
  <li><strong>Multi-Agent Coordination Assessment</strong></li>
</ol>

<ul>
  <li><strong>Collaboration Metrics</strong>: Measure teamwork effectiveness and communication quality</li>
  <li><strong>Failure Analysis</strong>: Identify and categorize multi-agent system failure modes</li>
  <li><strong>Emergent Behavior Detection</strong>: Track unexpected group behaviors and properties</li>
  <li>Incorporating [2507.05178], [2501.06322], and [2503.13657]</li>
</ul>

<h2 id="specialized-evaluation-modules">Specialized Evaluation Modules</h2>

<ol>
  <li><strong>Domain-Specific Evaluation Suites</strong></li>
</ol>

<ul>
  <li><strong>Scientific Research Module</strong>: Evaluate research methodology and contribution quality</li>
  <li><strong>Code Generation Suite</strong>: Assess programming capabilities and software development skills</li>
  <li><strong>Information Retrieval Evaluator</strong>: Test search strategies and information synthesis</li>
  <li><strong>Creative Tasks Assessor</strong>: Measure creative output quality and originality</li>
</ul>

<ol start="2">
  <li><strong>Explainability and Interpretability Assessment</strong></li>
</ol>

<ul>
  <li><strong>Decision Transparency Scorer</strong>: Evaluate clarity of agent reasoning processes</li>
  <li><strong>Explanation Quality Metrics</strong>: Assess understandability of agent explanations</li>
  <li><strong>Trust Calibration</strong>: Measure alignment between agent confidence and actual performance</li>
  <li>From [2507.22414] and related work</li>
</ul>

<ol start="3">
  <li><strong>Long-term Evolution Tracking</strong></li>
</ol>

<ul>
  <li><strong>Learning Progression Monitor</strong>: Track capability development over time</li>
  <li><strong>Adaptation Rate Measurement</strong>: Assess speed and quality of agent adaptation</li>
  <li><strong>Stability Analysis</strong>: Monitor long-term behavioral consistency and drift</li>
  <li>Inspired by [2505.22954] and [2507.21046]</li>
</ul>

<h2 id="infrastructure-and-usability-improvements">Infrastructure and Usability Improvements</h2>

<ol>
  <li><strong>AgentOps Integration</strong></li>
</ol>

<ul>
  <li><strong>Operational Dashboard</strong>: Real-time monitoring of agent health and performance</li>
  <li><strong>Alerting System</strong>: Notifications for performance degradation or anomalies</li>
  <li><strong>Resource Usage Tracking</strong>: Monitor computational costs and efficiency</li>
  <li>Based on [2411.05285]</li>
</ul>

<ol start="2">
  <li><strong>Zero-Code Evaluation Interface</strong></li>
</ol>

<ul>
  <li><strong>Visual Evaluation Builder</strong>: Drag-and-drop interface for creating evaluation pipelines</li>
  <li><strong>Template Library</strong>: Pre-built evaluation templates for common use cases</li>
  <li><strong>Automated Report Generation</strong>: Generate comprehensive evaluation reports without coding</li>
  <li>From [2502.05957]</li>
</ul>

<ol start="3">
  <li><strong>Benchmark Standardization Framework</strong></li>
</ol>

<ul>
  <li><strong>Reproducibility Standards</strong>: Ensure consistent evaluation across different environments</li>
  <li><strong>Statistical Validation</strong>: Built-in statistical significance testing and confidence intervals</li>
  <li><strong>Bias Detection</strong>: Automated detection and mitigation of evaluation biases</li>
  <li><strong>Cross-Platform Compatibility</strong>: Standardized evaluation protocols across different agent frameworks</li>
  <li>Based on [2507.02825]</li>
</ul>
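<p>For the statistical validation bullet, a percentile bootstrap gives a confidence interval around an agent's success rate without distributional assumptions. This is a minimal stdlib-only sketch:</p>

```python
import random

def bootstrap_ci(successes: list[bool], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for a success rate, so reported scores
    carry uncertainty instead of a bare point estimate."""
    rng = random.Random(seed)          # fixed seed for reproducible reports
    n = len(successes)
    rates = sorted(
        sum(rng.choices(successes, k=n)) / n for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

<p>Comparing two agents then reduces to checking whether their intervals overlap, or better, bootstrapping the difference of their rates directly.</p>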

<h2 id="implementation-priority-roadmap">Implementation Priority Roadmap</h2>

<h3 id="phase-1-foundation-high-priority">Phase 1: Foundation (High Priority)</h3>

<ul>
  <li>Multi-Dimensional Evaluation Architecture - Core framework structure</li>
  <li>Safety-First Evaluation Framework - Essential for responsible AI development</li>
  <li>Dynamic Evaluation Pipeline - Modern approach to continuous assessment</li>
  <li>Benchmark Standardization Framework - Ensures scientific rigor</li>
</ul>

<h3 id="phase-2-advanced-features-medium-priority">Phase 2: Advanced Features (Medium Priority)</h3>

<ul>
  <li>Self-Evaluation Integration - Automated evaluation capabilities</li>
  <li>Predictive Evaluation System - Efficiency optimization</li>
  <li>AgentOps Integration - Operational monitoring</li>
  <li>Memory System Evaluation - Based on [2404.13501]</li>
</ul>

<h3 id="phase-3-specialized-modules-lower-priority">Phase 3: Specialized Modules (Lower Priority)</h3>

<ul>
  <li>Domain-Specific Evaluation Suites - Specialized assessment capabilities</li>
  <li>Multi-Agent Coordination Assessment - For collaborative systems</li>
  <li>Long-term Evolution Tracking - Extended monitoring capabilities</li>
  <li>Zero-Code Interface - User experience enhancement</li>
</ul>

<h2 id="technical-implementation-considerations">Technical Implementation Considerations</h2>

<ol>
  <li><strong>Architecture Design</strong></li>
</ol>

<ul>
  <li><strong>Modular Structure</strong>: Each evaluation component should be independently deployable</li>
  <li><strong>Plugin System</strong>: Allow easy integration of new evaluation methods from emerging research</li>
  <li><strong>Scalable Infrastructure</strong>: Support evaluation of both single agents and large multi-agent systems</li>
  <li><strong>API-First Design</strong>: Enable integration with existing agent development workflows</li>
</ul>
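<p>The plugin-system idea above can be sketched as a small registry: each evaluation method registers itself under a name and the framework runs whatever is installed. The names below are hypothetical illustrations, not an existing Agents-eval API:</p>

```python
from typing import Callable, Dict

# Hypothetical registry; a real framework would likely use entry points or packages.
EVALUATORS: Dict[str, Callable[[dict], float]] = {}


def register(name: str):
    """Decorator that plugs a new evaluation method into the framework."""
    def wrap(fn: Callable[[dict], float]) -> Callable[[dict], float]:
        EVALUATORS[name] = fn
        return fn
    return wrap


@register("task_success")
def task_success(result: dict) -> float:
    # Minimal example metric: did the agent complete the task at all?
    return 1.0 if result.get("completed") else 0.0


def evaluate(result: dict) -> dict:
    """Run every registered evaluator independently (modular structure)."""
    return {name: fn(result) for name, fn in EVALUATORS.items()}
```

<p>Because each evaluator is independent, components stay individually deployable and new methods from emerging research can be added without touching the core loop.</p>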

<ol>
  <li><strong>Data Management</strong></li>
</ol>

<ul>
  <li><strong>Evaluation History Tracking</strong>: Maintain comprehensive logs of all evaluations</li>
  <li><strong>Performance Analytics</strong>: Built-in analytics for identifying trends and patterns</li>
  <li><strong>Comparative Analysis</strong>: Side-by-side comparison of different agents or versions</li>
  <li><strong>Export Capabilities</strong>: Support for various data formats and external analysis tools</li>
</ul>
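<p>Evaluation-history tracking and export mostly come down to a stable record schema. A minimal sketch, assuming a JSONL export target (the field names here are illustrative, not a fixed Agents-eval format):</p>

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


@dataclass
class EvalRecord:
    """One evaluation run; illustrative schema only."""
    agent: str
    metric: str
    score: float
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def export_jsonl(records):
    """Export history for external analysis tools: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r)) for r in records)


history = [EvalRecord("agent-a", "task_success", 0.82),
           EvalRecord("agent-b", "task_success", 0.75)]
```

<p>A flat, append-only log like this makes comparative analysis (side by side across agents or versions) a simple group-by, and JSONL imports cleanly into most analytics tools.</p>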

<ol>
  <li><strong>Integration Ecosystem</strong></li>
</ol>

<ul>
  <li><strong>Framework Compatibility</strong>: Support for major agent frameworks (LangChain, AutoGPT, etc.)</li>
  <li><strong>CI/CD Integration</strong>: Automated evaluation in development pipelines</li>
  <li><strong>Cloud Deployment</strong>: Scalable cloud-based evaluation services</li>
  <li><strong>Community Contributions</strong>: Framework for researchers to contribute new evaluation methods</li>
</ul>

<h2 id="success-metrics-for-agents-eval-project">Success Metrics for Agents-eval Project</h2>

<ol>
  <li><strong>Adoption Metrics</strong></li>
</ol>

<ul>
  <li>Number of integrated agent frameworks</li>
  <li>Community contributions and pull requests</li>
  <li>Usage across different domains and applications</li>
  <li>Academic citations and research adoption (explicitly out of scope for this project)</li>
</ul>

<ol>
  <li><strong>Quality Metrics</strong></li>
</ol>

<ul>
  <li>Evaluation accuracy and reliability</li>
  <li>Reproducibility of results across environments</li>
  <li>Coverage of different agent capabilities</li>
  <li>User satisfaction and ease of use</li>
</ul>

<ol>
  <li><strong>Impact Metrics</strong></li>
</ol>

<ul>
  <li>Improvement in agent development cycles</li>
  <li>Standardization adoption across the field</li>
  <li>Safety incidents prevented through evaluation</li>
  <li>Research acceleration and breakthrough enablement</li>
</ul>

<h2 id="conclusion">Conclusion</h2>

<p>The proposed enhancements would create a comprehensive, scientifically rigorous, and practically useful evaluation framework that serves both researchers developing new agent capabilities and practitioners deploying agents in real-world applications.
The modular architecture lets the system evolve with a rapidly advancing field while maintaining backward compatibility and scientific validity.
By implementing the identified best practices and novel methodologies, and by addressing critical gaps in current evaluation approaches, the Agents-eval project is positioned to become a foundational tool for the field.</p>]]></content><author><name>qte77</name></author><category term="ml" /><category term="ai" /><category term="agents" /><category term="eval" /><summary type="html"><![CDATA[Enhancement Recommendations for Agents-eval Project]]></summary></entry><entry><title type="html">AI Agents-eval Papers Meta Review</title><link href="https://qte77.github.io/ai-agents-eval-papers-meta-review/" rel="alternate" type="text/html" title="AI Agents-eval Papers Meta Review" /><published>2025-08-09T00:00:00+00:00</published><updated>2025-08-09T00:00:00+00:00</updated><id>https://qte77.github.io/ai-agents-eval-papers-meta-review</id><content type="html" xml:base="https://qte77.github.io/ai-agents-eval-papers-meta-review/"><![CDATA[<h1 id="papers-meta-review">Papers Meta Review</h1>

<p>This is a meta review for the project <a href="https://github.com/qte77/Agents-eval">Agents-eval</a>, based on the papers in <a href="https://github.com/qte77/Agents-eval/blob/main/docs/papers/further_reading.md">Further Reading</a>. Generated with help from Claude Sonnet 4 🙏🏼🌟🙌🏼💕🤗</p>

<h2 id="summary">Summary</h2>
<p>Current State of Agentic AI Evaluation: The field demonstrates rapid evolution from traditional LLM evaluation toward sophisticated frameworks for autonomous agents. Research spans from foundational evaluation methodologies to highly specialized domain-specific assessments.</p>

<h2 id="key-evaluation-dimensions-identified">Key Evaluation Dimensions Identified</h2>

<ul>
  <li>Autonomy Level Assessment: Measuring degrees of agent independence and decision-making capability</li>
  <li>Multi-Agent Coordination: Collaborative performance and emergent group behaviors</li>
  <li>Task Decomposition &amp; Planning: Dynamic planning capabilities and complex task management</li>
  <li>Tool Integration &amp; API Usage: Effective utilization of external resources and services</li>
  <li>Safety &amp; Security: Risk assessment, compliance verification, and secure operation</li>
  <li>Adaptability &amp; Evolution: Long-term learning and capability development</li>
  <li>Domain Expertise: Specialized knowledge application and domain-specific performance</li>
  <li>Explainability &amp; Interpretability: Transparency of decision-making processes</li>
  <li>Real-world Deployment: Practical usability and operational effectiveness</li>
</ul>

<h2 id="methodological-trends">Methodological Trends</h2>

<ul>
  <li>Shift toward Dynamic Evaluation: From static benchmarks to continuous monitoring and adaptive assessment</li>
  <li>Multi-Dimensional Assessment: Evaluating capabilities, behaviors, and outcomes simultaneously</li>
  <li>Domain-Specific Benchmarks: Specialized evaluations for particular applications (medical, financial, scientific)</li>
  <li>Self-Evaluation Integration: Agents that assess their own performance and generate improvements</li>
  <li>Safety-First Evaluation: Prioritizing risk assessment and ethical compliance</li>
  <li>Systems-Level Analysis: Evaluating emergent properties and complex system behaviors</li>
  <li>Predictive Evaluation: Forecasting performance before full execution for efficiency</li>
  <li>Longitudinal Assessment: Tracking agent evolution and learning over extended periods</li>
</ul>

<h2 id="critical-gaps-identified">Critical Gaps Identified</h2>

<ul>
  <li>Limited standardization across evaluation frameworks despite growing consensus on key dimensions</li>
  <li>Insufficient long-term behavioral pattern assessment and stability measurement</li>
  <li>Need for better metrics capturing true autonomy levels vs. automated task execution</li>
  <li>Lack of comprehensive safety and alignment evaluation standards across domains</li>
  <li>Missing integration between different evaluation approaches and methodologies</li>
  <li>Limited focus on evaluation framework validation and meta-evaluation quality</li>
</ul>

<h2 id="conclusion">Conclusion</h2>
<p>The comprehensive analysis of 50+ papers reveals a rapidly maturing field with clear consensus around key evaluation dimensions while highlighting significant opportunities for standardization and integration.</p>]]></content><author><name>qte77</name></author><category term="ml" /><category term="ai" /><category term="agents" /><category term="eval" /><summary type="html"><![CDATA[Papers Meta Review]]></summary></entry><entry><title type="html">AI Agents Tools List</title><link href="https://qte77.github.io/ai-agents-tools-list/" rel="alternate" type="text/html" title="AI Agents Tools List" /><published>2025-02-07T00:00:00+00:00</published><updated>2025-02-07T00:00:00+00:00</updated><id>https://qte77.github.io/ai-agents-tools-list</id><content type="html" xml:base="https://qte77.github.io/ai-agents-tools-list/"><![CDATA[<h1 id="ai-agents-tools-list">AI Agents Tools List</h1>

<h2 id="lists">Lists</h2>

<ul>
  <li><a href="https://github.com/slavakurilyak/awesome-ai-agents">300+ agent frameworks</a></li>
  <li><a href="https://github.com/ashishpatel26/500-AI-Agents-Projects">500+ curated list of use cases</a></li>
</ul>

<h2 id="frameworks">Frameworks</h2>

<ul>
  <li><a href="https://abacus.ai/">abacus.ai</a></li>
  <li><a href="https://www.salesforce.com/agentforce/">Agentforce 2.0 (Salesforce)</a></li>
  <li><a href="https://aws.amazon.com/bedrock/agents/">Amazon Bedrock AI Agent framework</a></li>
  <li><a href="https://github.com/microsoft/autogen">AutoGen (Microsoft)</a></li>
  <li><a href="https://agpt.co/">AutoGPT</a></li>
  <li><a href="https://www.crewai.com/">CrewAI</a></li>
  <li><a href="https://dspy.ai/">DSPy (Stanford)</a></li>
  <li><a href="https://langchain-ai.github.io/langgraph/">LangGraph (LangChain)</a></li>
  <li><a href="https://www.lyzr.ai/">Lyzr</a></li>
  <li><a href="https://ai.pydantic.dev">pydantic-ai</a></li>
  <li><a href="https://smolagents.org/">smolagents (Hugging Face)</a></li>
</ul>

<h2 id="research-agents">Research-Agents</h2>

<ul>
  <li><a href="https://huggingface.co/blog/open-deep-research">Open DeepResearch (Hugging Face)</a></li>
  <li><a href="https://openai.com/index/introducing-deep-research/">DeepResearch (OpenAI)</a></li>
  <li><a href="https://blog.google/products/gemini/google-gemini-deep-research/">DeepResearch (Google)</a></li>
  <li><a href="https://github.com/microsoft/RD-Agent">RD-Agent (Microsoft)</a></li>
  <li><a href="https://storm.genie.stanford.edu/">CO-STORM (Stanford)</a></li>
</ul>

<h2 id="benchmarks">Benchmarks</h2>

<ul>
  <li><a href="https://github.com/raga-ai-hub/agentneo">AgentNeo</a></li>
  <li><a href="https://www.agentops.ai/">AgentOps (Agency)</a></li>
  <li><a href="https://aka.ms/agbench">AutoGenBench (Microsoft)</a></li>
  <li><a href="https://docs.databricks.com/en/generative-ai/agent-evaluation/index.html">Mosaic AI Agent Evaluation</a></li>
  <li><a href="https://github.com/raga-ai-hub/RagaAI-Catalyst">RagaAI-Catalyst</a></li>
</ul>

<h2 id="tracing">Tracing</h2>

<ul>
  <li><a href="https://www.postman.com/ai-on-postman/postman-ai-agent-builder/overview">Postman AI Builder</a></li>
  <li><a href="https://wandb.ai/site/weave/">W&amp;B Weave - Weights &amp; Biases</a></li>
</ul>

<h2 id="ai-enhanced-workflows">AI-enhanced Workflows</h2>

<ul>
  <li><a href="https://botpress.com">BotPress</a></li>
  <li><a href="https://www.gumloop.com">gumloop.com</a></li>
  <li><a href="https://www.langflow.org">Langflow</a></li>
  <li><a href="https://www.mulesoft.com">Mulesoft (Salesforce)</a></li>
  <li><a href="https://n8n.io/">n8n.io</a></li>
  <li><a href="https://www.postman.com/ai-on-postman/postman-ai-agent-builder/overview">Postman AI Builder</a></li>
  <li><a href="https://Relay.app">Relay.app</a></li>
  <li><a href="https://rivet.ironcladapp.com/">Rivet</a></li>
  <li><a href="https://www.vellum.ai/">Vellum</a></li>
</ul>]]></content><author><name>qte77</name></author><category term="ml" /><category term="ai" /><category term="agents" /><category term="tools" /><category term="list" /><summary type="html"><![CDATA[AI Agents Tools List]]></summary></entry><entry><title type="html">AI Coding Tools List</title><link href="https://qte77.github.io/ai-coding-tools-list/" rel="alternate" type="text/html" title="AI Coding Tools List" /><published>2025-02-07T00:00:00+00:00</published><updated>2025-02-07T00:00:00+00:00</updated><id>https://qte77.github.io/ai-coding-tools-list</id><content type="html" xml:base="https://qte77.github.io/ai-coding-tools-list/"><![CDATA[<h1 id="ai-coding-tools-list">AI Coding Tools List</h1>

<h2 id="bug-search">Bug Search</h2>

<ul>
  <li><a href="https://logicstar.ai">logicstar.ai</a></li>
</ul>

<h2 id="ideterminal">IDE/Terminal</h2>

<ul>
  <li><a href="https://about.appsheet.com/home/">Google AppSheet</a></li>
  <li><a href="https://www.cursor.com/">Cursor AI</a></li>
  <li><a href="https://codegpt.co/">CodeGPT.co</a></li>
  <li><a href="https://www.codeguide.dev/">CodeGuide</a></li>
  <li><a href="https://devin.ai">Devin</a></li>
  <li><a href="https://idx.dev/">idx.dev</a>, idx.google.com</li>
  <li><a href="https://onlook.com">onlook.com</a>, cursor for designers</li>
  <li><a href="https://www.warp.dev/">warp.dev</a></li>
  <li><a href="https://windsurfai.org/">Windsurf</a> by Codeium</li>
  <li><a href="https://zed.dev/ai">zed.dev</a></li>
</ul>

<h2 id="infrastructure">Infrastructure</h2>

<ul>
  <li><a href="https://infra.new">infra.new</a></li>
</ul>

<h2 id="full-stack">Full-stack</h2>

<ul>
  <li><a href="https://bolt.diy">bolt.diy</a></li>
  <li><a href="https://bolt.new">bolt.new</a></li>
  <li><a href="https://bubble.io/">Bubble</a></li>
  <li><a href="https://www.builder.ai/">Builder.ai</a></li>
  <li><a href="https://lovable.dev/">lovable.dev</a></li>
  <li><a href="https://heyboss.xyz">heyboss.xyz</a></li>
  <li><a href="https://replit.com/">replit.com</a></li>
  <li><a href="https://smolagents.org">smolagents.org</a> (simple full-stack builder on home page)</li>
  <li><a href="https://softgen.ai/">softgen.ai</a></li>
  <li><a href="https://developer.apple.com/xcode/">Xcode</a> (Apple)</li>
</ul>

<h3 id="ui-dev">UI Dev</h3>

<ul>
  <li><a href="https://a0.dev/">a0</a></li>
  <li><a href="https://www.buzzy.buzz/">buzzy.buzz</a></li>
  <li><a href="https://www.figma.com/ai/">Figma AI</a></li>
  <li><a href="https://www.usegalileo.ai/">Galileo AI</a></li>
  <li><a href="https://www.magicpatterns.com/">Magic Patterns</a></li>
  <li><a href="https://www.superblocks.com/">Superblocks</a></li>
  <li><a href="https://www.tempolabs.ai/">TempoLabs</a></li>
  <li><a href="https://uizard.io">Uizard.io</a></li>
  <li><a href="https://v0.dev/">v0</a></li>
</ul>]]></content><author><name>qte77</name></author><category term="ml" /><category term="ai" /><category term="coding" /><category term="tools" /><category term="list" /><summary type="html"><![CDATA[AI Coding Tools List]]></summary></entry><entry><title type="html">Segformerbaseline Finetuning Results</title><link href="https://qte77.github.io/SegFormerBaseline-FineTuning-results/" rel="alternate" type="text/html" title="Segformerbaseline Finetuning Results" /><published>2024-06-08T00:00:00+00:00</published><updated>2024-06-08T00:00:00+00:00</updated><id>https://qte77.github.io/SegFormerBaseline-FineTuning-results</id><content type="html" xml:base="https://qte77.github.io/SegFormerBaseline-FineTuning-results/"><![CDATA[<h1 id="resultes-fine-tuning-pre-trained-segformer">Results of fine-tuning a pre-trained SegFormer</h1>

<ul>
  <li><a href="https://github.com/qte77/SegFormerQuantization/edit/main/PoC/SegFormer-fine-tune-half-baseline.py">SegFormer-fine-tune-half-baseline.py</a></li>
  <li>Model specs
    <ul>
      <li><a href="https://huggingface.co/nvidia/mit-b0">original pre-trained mit-b0</a></li>
      <li>fined (fine-tuned)</li>
      <li>fined_half (fine-tuned and weights halved)</li>
    </ul>
  </li>
</ul>
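<p>The <code>size</code> values in the runs below (1832.62 for the full-precision models vs. 916.31 for <code>fined_half</code>) follow directly from bytes per parameter: casting fp32 weights to fp16 halves the storage. A back-of-the-envelope sketch in plain Python (the parameter count of roughly 3.7 million for SegFormer-b0 is an approximation):</p>

```python
def weights_size(n_params: int, bytes_per_param: int) -> int:
    """Raw weight storage in bytes: parameters times bytes per parameter."""
    return n_params * bytes_per_param


n_params = 3_700_000          # roughly SegFormer-b0; exact count varies by head
fp32 = weights_size(n_params, 4)  # fp32: 4 bytes per parameter
fp16 = weights_size(n_params, 2)  # fp16: 2 bytes per parameter
ratio = fp32 / fp16               # halving precision halves size
```

<p>This is why the reported sizes differ by exactly a factor of two (1832.62 / 2 = 916.31) while the accuracy columns barely move.</p>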

<h2 id="gpu-t4">GPU T4</h2>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2024-05-18, GPU T4 x2, <span class="nv">train_n_epochs</span><span class="o">=</span>1000, <span class="nb">time </span>train 24 minutes

orig
         <span class="nv">size</span><span class="o">=</span>1832.62
         <span class="nv">mean_iou</span><span class="o">=</span>6.661906575535688e-06
         <span class="nv">mean_accuracy</span><span class="o">=</span>4.905447543698625e-05
         <span class="nv">overall_accuracy</span><span class="o">=</span>1.2687577909658102e-05
fined <span class="o">(</span>fine-tuned<span class="o">)</span>
         <span class="nv">size</span><span class="o">=</span>1832.62
         <span class="nv">mean_iou</span><span class="o">=</span>0.8767435186414186
         <span class="nv">mean_accuracy</span><span class="o">=</span>0.9333408857632106
         <span class="nv">overall_accuracy</span><span class="o">=</span>0.9786753534283421
fined_half <span class="o">(</span>fine-tuned and weights halfed<span class="o">)</span>
         <span class="nv">size</span><span class="o">=</span>916.31
         <span class="nv">mean_iou</span><span class="o">=</span>0.8763746372690113
         <span class="nv">mean_accuracy</span><span class="o">=</span>0.9341201476478558
         <span class="nv">overall_accuracy</span><span class="o">=</span>0.978795885418484
<span class="nt">---------------------</span>
orig
                   IoU       Acc
wall         0.000000  0.000000
floor        0.000047  0.000049
tree         0.000000  0.000000
ceiling      0.000000  0.000000
person       0.000000  0.000000
plant        0.000000  0.000000
seat         0.000000  0.000000
fence        0.000000  0.000000
column       0.000000  0.000000
signboard    0.000366  0.000736
streetlight  0.000000  0.000000
escalator    0.000000  0.000000
fountain     0.000000  0.000000
pot          0.000000  0.000000
ashcan       0.000000  0.000000
flag         0.000000  0.000000
<span class="nt">---------------------</span>
fined
                  IoU       Acc
wall         0.964517  0.987972
floor        0.922030  0.941426
tree         0.876874  0.917932
ceiling      0.990845  0.995715
person       0.642100  0.898230
plant        0.944452  0.977216
seat         0.893468  0.959987
fence        0.582727  0.643574
column       0.892548  0.929626
signboard    0.898165  0.918322
streetlight  0.988662  0.995434
escalator    0.945328  0.955779
fountain     0.964822  0.984560
pot          0.814856  0.929204
ashcan       0.783099  0.948805
flag         0.923404  0.949672
<span class="nt">---------------------</span>
fined_half
                  IoU       Acc
wall         0.962633  0.990550
floor        0.923439  0.942186
tree         0.873446  0.903782
ceiling      0.991342  0.995631
person       0.647707  0.871681
plant        0.942934  0.973982
seat         0.898811  0.967202
fence        0.588037  0.700803
column       0.895929  0.929840
signboard    0.885653  0.906181
streetlight  0.997722  1.000000
escalator    0.943396  0.954774
fountain     0.960984  0.985589
pot          0.799145  0.945638
ashcan       0.793785  0.959044
flag         0.917031  0.919037
</code></pre></div></div>

<h2 id="p100">P100</h2>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2024-05-20, P100, 1.40s/it, <span class="nv">train_n_epochs</span><span class="o">=</span>1000, <span class="nb">time </span>train 24 minutes

orig
	<span class="nv">size</span><span class="o">=</span>1832.62
	<span class="nv">mean_iou</span><span class="o">=</span>0.00012747152834123089
	<span class="nv">mean_accuracy</span><span class="o">=</span>0.015043250957378799
	<span class="nv">overall_accuracy</span><span class="o">=</span>0.0017223387012360873
fined
	<span class="nv">size</span><span class="o">=</span>1832.62
	<span class="nv">mean_iou</span><span class="o">=</span>0.8633531645402123
	<span class="nv">mean_accuracy</span><span class="o">=</span>0.9080789058286678
	<span class="nv">overall_accuracy</span><span class="o">=</span>0.975750866720166
fined_half
	<span class="nv">size</span><span class="o">=</span>916.31
	<span class="nv">mean_iou</span><span class="o">=</span>0.7668116746879406
	<span class="nv">mean_accuracy</span><span class="o">=</span>0.8328274292451897
	<span class="nv">overall_accuracy</span><span class="o">=</span>0.962108548572806
<span class="nt">---------------------</span>
orig
                  IoU       Acc
wall         0.001655  0.001681
floor        0.005059  0.005075
tree         0.000000  0.000000
ceiling      0.000000  0.000000
person       0.000000  0.000000
plant        0.000000  0.000000
seat         0.000000  0.000000
fence        0.001444  0.233936
column       0.000000  0.000000
signboard    0.000000  0.000000
streetlight  0.000000  0.000000
escalator    0.000000  0.000000
fountain     0.000000  0.000000
pot          0.000000  0.000000
ashcan       0.000000  0.000000
flag         0.000000  0.000000
<span class="nt">---------------------</span>
fined
                  IoU       Acc
wall         0.957004  0.982243
floor        0.920770  0.972491
tree         0.849617  0.883715
ceiling      0.987733  0.989995
person       0.613441  0.741150
plant        0.932384  0.952668
seat         0.869103  0.912430
fence        0.594569  0.637550
column       0.867747  0.957961
signboard    0.893319  0.915011
streetlight  0.970320  0.970320
escalator    0.963257  0.974874
fountain     0.955383  0.980829
pot          0.789116  0.879899
ashcan       0.834835  0.948805
flag         0.815054  0.829322
<span class="nt">---------------------</span>
fined_half
                  IoU       Acc
wall         0.927242  0.969644
floor        0.884865  0.929167
tree         0.800613  0.872910
ceiling      0.982292  0.984718
person       0.535153  0.836947
plant        0.896391  0.934735
seat         0.814371  0.892096
fence        0.379102  0.440763
column       0.813051  0.950291
signboard    0.766822  0.846946
streetlight  0.855530  0.865297
escalator    0.882820  0.893467
fountain     0.932883  0.960371
pot          0.649888  0.734513
ashcan       0.543210  0.600683
flag         0.604752  0.612691
</code></pre></div></div>
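<p>The per-class IoU and Acc columns above follow the standard definitions. A minimal sketch with hypothetical helper names (not the <code>evaluate</code> library's implementation) that also shows where the divide warnings reported under Encountered problems come from:</p>

```python
def iou_and_acc(intersect: int, union: int, label_area: int):
    """Per-class IoU = |pred ∩ label| / |pred ∪ label|; Acc = |pred ∩ label| / |label|.

    Classes absent from both prediction and label have union == 0, which is
    what triggers the 'invalid value encountered in divide' RuntimeWarning
    in the mean_iou metric; here we return None for such classes instead.
    """
    iou = intersect / union if union else None
    acc = intersect / label_area if label_area else None
    return iou, acc


# e.g. a class covering 100 px in the label, 80 px predicted correctly,
# and 110 px in the union of prediction and label:
iou, acc = iou_and_acc(80, 110, 100)
```

<p>Classes like <code>fence</code> score low on both columns because the intersection is small relative to both the union and the label area; the two metrics diverge when the prediction over- or under-segments a class.</p>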

<h2 id="encountered-problems">Encountered problems</h2>

<h3 id="imports-while-on-gpu">Imports while on GPU</h3>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory <span class="k">for </span>plugin cuDNN when one has already been registered
E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory <span class="k">for </span>plugin cuFFT when one has already been registered
E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory <span class="k">for </span>plugin cuBLAS when one has already been registered
</code></pre></div></div>

<h3 id="warning-segformerimageprocessordo_reduce_labels">Warning <code class="language-plaintext highlighter-rouge">SegformerImageProcessor(do_reduce_labels)</code></h3>

<p><code class="language-plaintext highlighter-rouge">/opt/conda/lib/python3.10/site-packages/transformers/models/segformer/image_processing_segformer.py:103: FutureWarning: The </code>reduce_labels<code class="language-plaintext highlighter-rouge"> parameter is deprecated and will be removed in a future version. Please use </code>do_reduce_labels<code class="language-plaintext highlighter-rouge"> instead.</code></p>

<h3 id="warning-tsegformerforsemanticsegmentationfrom_pretrained">Warning <code class="language-plaintext highlighter-rouge">SegformerForSemanticSegmentation.from_pretrained()</code></h3>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Some weights of SegformerForSemanticSegmentation were not initialized from the model checkpoint at nvidia/mit-b0 and are newly initialized: <span class="o">[</span><span class="s1">'decode_head.batch_norm.bias'</span>, <span class="s1">'decode_head.batch_norm.num_batches_tracked'</span>, <span class="s1">'decode_head.batch_norm.running_mean'</span>, <span class="s1">'decode_head.batch_norm.running_var'</span>, <span class="s1">'decode_head.batch_norm.weight'</span>, <span class="s1">'decode_head.classifier.bias'</span>, <span class="s1">'decode_head.classifier.weight'</span>, <span class="s1">'decode_head.linear_c.0.proj.bias'</span>, <span class="s1">'decode_head.linear_c.0.proj.weight'</span>, <span class="s1">'decode_head.linear_c.1.proj.bias'</span>, <span class="s1">'decode_head.linear_c.1.proj.weight'</span>, <span class="s1">'decode_head.linear_c.2.proj.bias'</span>, <span class="s1">'decode_head.linear_c.2.proj.weight'</span>, <span class="s1">'decode_head.linear_c.3.proj.bias'</span>, <span class="s1">'decode_head.linear_c.3.proj.weight'</span>, <span class="s1">'decode_head.linear_fuse.weight'</span><span class="o">]</span>
You should probably TRAIN this model on a down-stream task to be able to use it <span class="k">for </span>predictions and inference.
</code></pre></div></div>

<h3 id="warning-model_fined_halfpixel_valuespixel_values">Error <code class="language-plaintext highlighter-rouge">model_fined_half(pixel_values=pixel_values)</code></h3>

<p><code class="language-plaintext highlighter-rouge">RuntimeError: Input type (float) and bias type (c10::Half) should be the same</code></p>

<p>Solution: <code class="language-plaintext highlighter-rouge">model_fined_half(pixel_values=pixel_values.half())</code></p>

<h3 id="warning-metric">Warning <code class="language-plaintext highlighter-rouge">metric</code></h3>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/opt/conda/lib/python3.10/site-packages/datasets/features/image.py:341: UserWarning: Downcasting array dtype int64 to int32 to be compatible with <span class="s1">'Pillow'</span>
  warnings.warn<span class="o">(</span>f<span class="s2">"Downcasting array dtype {dtype} to {dest_dtype} to be compatible with 'Pillow'"</span><span class="o">)</span>
/root/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--mean_iou/9e450724f21f05592bfb0255fe2fa576df8171fa060d11121d8aecfff0db80d0/mean_iou.py:259: RuntimeWarning: invalid value encountered <span class="k">in </span>divide
  iou <span class="o">=</span> total_area_intersect / total_area_union
/root/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--mean_iou/9e450724f21f05592bfb0255fe2fa576df8171fa060d11121d8aecfff0db80d0/mean_iou.py:260: RuntimeWarning: invalid value encountered <span class="k">in </span>divide
  acc <span class="o">=</span> total_area_intersect / total_area_label
</code></pre></div></div>]]></content><author><name>qte77</name></author><summary type="html"><![CDATA[Results of fine-tuning a pre-trained SegFormer]]></summary></entry><entry><title type="html">Collection of Tools for ML</title><link href="https://qte77.github.io/ML-Tooling/" rel="alternate" type="text/html" title="Collection of Tools for ML" /><published>2024-05-27T00:00:00+00:00</published><updated>2024-05-27T00:00:00+00:00</updated><id>https://qte77.github.io/ML-Tooling</id><content type="html" xml:base="https://qte77.github.io/ML-Tooling/"><![CDATA[<h1 id="e2e-automated-ml-tools-amlt">E2E Automated ML Tools (AMLT)</h1>

<ul>
  <li><a href="https://h2o.ai/platform/ai-cloud/make/h2o-driverless-ai/">H2O Driverless AI</a></li>
  <li><a href="https://github.com/keras-team/autokeras">Auto-Keras</a></li>
  <li><a href="https://github.com/automl">AutoML.org</a>
    <ul>
      <li><a href="https://github.com/automl/Auto-PyTorch">Auto-PyTorch</a></li>
      <li><a href="https://github.com/automl/auto-sklearn">Auto-Sklearn</a></li>
    </ul>
  </li>
  <li><a href="https://github.com/autogluon/autogluon">AutoGluon</a></li>
  <li><a href="https://github.com/EpistasisLab/tpot">TPOT</a></li>
  <li><a href="https://microsoft.github.io/FLAML/">FLAML</a></li>
  <li><a href="https://github.com/sberbank-ai-lab/lightautoml">LightAutoML</a></li>
  <li><a href="https://github.com/alteryx/evalml">EvalML</a></li>
  <li><a href="https://github.com/pycaret/pycaret">pycaret</a></li>
  <li><a href="https://github.com/ThomasMeissnerDS/BlueCast">BlueCast</a></li>
  <li><a href="https://learn.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/how-to-use-the-automl-api">Microsoft ML.NET AutoML</a></li>
  <li>Hyperscaler
    <ul>
      <li><a href="https://cloud.google.com/vertex-ai/docs/beginner/beginners-guide">Vertex AI - Google AI Tables</a></li>
      <li><a href="https://aws.amazon.com/machine-learning/automl/">AWS Sagemaker</a></li>
      <li><a href="https://azure.microsoft.com/products/machine-learning/automatedml">Azure AutoML</a></li>
    </ul>
  </li>
  <li>Also
    <ul>
      <li><a href="https://rdrr.io/cran/rminer">rminer</a></li>
      <li><a href="https://docs.transmogrif.ai/en/stable/developer-guide">TransmogrifAI</a></li>
    </ul>
  </li>
</ul>

<h1 id="eda">EDA</h1>

<ul>
  <li><a href="https://docs.dataprep.ai/index.html">DataPrep</a></li>
  <li><a href="https://github.com/pandas-profiling/pandas-profiling">pandas_profiling.ProfileReport</a></li>
  <li><a href="https://github.com/fbdesignpro/sweetviz">SweetViz</a></li>
  <li><a href="https://github.com/AutoViML/AutoViz.git">AutoViz</a></li>
  <li><a href="https://github.com/lux-org/lux/">Lux</a></li>
  <li><a href="https://github.com/vaexio/vaex">Vaex</a></li>
  <li><a href="https://github.com/man-group/dtale">D-Tale</a></li>
  <li><a href="https://github.com/DistrictDataLabs/yellowbrick">Yellowbrick</a></li>
</ul>

<h1 id="cleaning">Cleaning</h1>

<ul>
  <li><a href="https://github.com/akanz1/klib">klib</a></li>
  <li><a href="https://pyjanitor-devs.github.io/pyjanitor/devguide/">pyjanitor</a></li>
</ul>

<h1 id="fe">FE</h1>

<ul>
  <li><a href="https://www.featuretools.com/">Featuretools</a></li>
  <li><a href="https://github.com/IIIS-Li-Group/OpenFE">OpenFE</a>
    <ul>
      <li><a href="https://github.com/qte77/OpenFE">qte77/OpenFE fork</a></li>
    </ul>
  </li>
  <li><a href="https://auto.gluon.ai/stable/tutorials/tabular/tabular-feature-engineering.html">AutoGluon Tabular - Feature Engineering</a></li>
</ul>

<h1 id="pipeline">Pipeline</h1>

<ul>
  <li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">sklearn.pipeline.Pipeline</a></li>
  <li><a href="https://huggingface.co/docs/transformers/en/main_classes/pipelines">Pipelines - Hugging Face</a></li>
</ul>

<h1 id="tuning">Tuning</h1>

<ul>
  <li><a href="https://optuna.org/">Optuna - A hyperparameter optimization framework</a>, also Ensemble tuning</li>
</ul>

<h1 id="ensemblenas">Ensemble/NAS</h1>

<ul>
  <li><a href="https://scikit-learn.org/stable/auto_examples/ensemble/index.html">sklearn.ensemble Ensemble methods</a></li>
  <li><a href="https://auto.gluon.ai/stable/index.html">AutoGluon</a></li>
  <li><a href="https://github.com/deephyper/deephyper">deephyper</a></li>
</ul>

<h1 id="loggingtracking">Logging/Tracking</h1>

<ul>
  <li><a href="https://wandb.ai/site">Weights&amp;Biases</a></li>
  <li><a href="https://neptune.ai/">neptune.ai</a></li>
  <li><a href="https://www.tensorflow.org/tensorboard">TensorBoard - TensorFlow</a></li>
</ul>

<h1 id="gui">GUI</h1>

<ul>
  <li><a href="https://gradio.app/">Gradio</a></li>
</ul>

<h1 id="exploratory-runtimes">Exploratory Runtimes</h1>

<ul>
  <li><a href="https://colab.research.google.com">Google Colab</a></li>
  <li><a href="https://aws.amazon.com/sagemaker/studio/">AWS Sagemaker Studio</a></li>
  <li><a href="https://scikit-learn.org/stable/lite/lab/">sklearn lab - Jupyter Lite</a></li>
  <li><a href="https://jupyter.org/try">Try Jupyter</a></li>
  <li><a href="https://mybinder.org/">Binder</a></li>
  <li><a href="https://www.kaggle.com/docs/notebooks">Kaggle Notebooks</a></li>
</ul>

<h1 id="operationalize-notebooks">Operationalize Notebooks</h1>

<ul>
  <li><a href="https://jupytext.readthedocs.io/">Jupytext</a></li>
  <li><a href="https://papermill.readthedocs.io/">papermill</a></li>
</ul>]]></content><author><name>qte77</name></author><category term="ml" /><category term="tools" /><summary type="html"><![CDATA[E2E Automated ML Tools (AMLT)]]></summary></entry><entry><title type="html">SegFormer Part 1, Description</title><link href="https://qte77.github.io/SegFormer-Part1-Description/" rel="alternate" type="text/html" title="SegFormer Part 1, Description" /><published>2024-05-05T00:00:00+00:00</published><updated>2024-05-05T00:00:00+00:00</updated><id>https://qte77.github.io/SegFormer-Part1-Description</id><content type="html" xml:base="https://qte77.github.io/SegFormer-Part1-Description/"><![CDATA[<h1 id="description">Description</h1>

<h2 id="model">Model</h2>

<p>Using <a href="https://huggingface.co/nvidia/mit-b0">Nvidia SegFormer (b0-sized) encoder pre-trained-only</a></p>

<ul>
  <li>“hierarchical Transformer encoder”, “lightweight all-MLP decode head” (for segmentation)</li>
  <li>“pre-trained on ImageNet-1k, after which a decode head is added and fine-tuned altogether on a downstream dataset”</li>
  <li>“SegformerForSemanticSegmentation adds the all-MLP decoder head on top”</li>
  <li>Paper <a href="https://arxiv.org/abs/2105.15203">SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers</a></li>
  <li><a href="https://github.com/NVlabs/SegFormer">Paper Github</a></li>
  <li>SegFormer Model Architecture</li>
</ul>

<p><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/segformer_architecture.png" alt="SegFormer Model Architecture" /></p>

<h2 id="task">Task</h2>

<p>Using <code class="language-plaintext highlighter-rouge">scene-parsing</code> with dataset <a href="https://huggingface.co/datasets/scene_parse_150">scene_parse_150</a>, a subset of the <a href="https://paperswithcode.com/task/semantic-segmentation">semantic segmentation</a> dataset <a href="https://paperswithcode.com/sota/semantic-segmentation-on-ade20k">MIT ADE20k</a></p>

<ul>
  <li>“segment the whole image densely into semantic classes (image regions), where each pixel is assigned a class label”</li>
  <li>“mean of the pixel-wise accuracy and class-wise IoU as the final score”</li>
  <li>structure</li>
</ul>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="err">'image':</span><span class="w"> </span><span class="err">&lt;PIL.JpegImagePlugin.JpegImageFile</span><span class="w"> </span><span class="err">image</span><span class="w"> </span><span class="err">mode=RGB</span><span class="w"> </span><span class="err">size=</span><span class="mi">683</span><span class="err">x</span><span class="mi">512</span><span class="w"> </span><span class="err">at</span><span class="w"> </span><span class="mi">0</span><span class="err">x</span><span class="mi">1</span><span class="err">FF</span><span class="mi">32</span><span class="err">A</span><span class="mi">3</span><span class="err">EDA</span><span class="mi">0</span><span class="err">&gt;</span><span class="p">,</span><span class="w">
  </span><span class="err">'annotation':</span><span class="w"> </span><span class="err">&lt;PIL.PngImagePlugin.PngImageFile</span><span class="w"> </span><span class="err">image</span><span class="w"> </span><span class="err">mode=L</span><span class="w"> </span><span class="err">size=</span><span class="mi">683</span><span class="err">x</span><span class="mi">512</span><span class="w"> </span><span class="err">at</span><span class="w"> </span><span class="mi">0</span><span class="err">x</span><span class="mi">1</span><span class="err">FF</span><span class="mi">32E5</span><span class="err">B</span><span class="mi">978</span><span class="err">&gt;</span><span class="p">,</span><span class="w">
  </span><span class="err">'scene_category':</span><span class="w"> </span><span class="mi">0</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<h2 id="execution-order-for-model-trainer">Execution order for model <code class="language-plaintext highlighter-rouge">Trainer()</code></h2>

<ol>
  <li>Transform on-the-fly
    <ul>
      <li>Data gets batch-wise prepared and augmented (<code class="language-plaintext highlighter-rouge">&lt;dataset&gt;.set_transform(&lt;transform_fn&gt;)</code>)</li>
    </ul>
  </li>
  <li>Tokenize transformed data (<code class="language-plaintext highlighter-rouge">image_processor</code>)
    <ul>
      <li>Inputs <code class="language-plaintext highlighter-rouge">image</code>, <code class="language-plaintext highlighter-rouge">annotation</code> (segmentation mask) and <code class="language-plaintext highlighter-rouge">scene_category</code> (label)</li>
      <li>Outputs <code class="language-plaintext highlighter-rouge">pixel_values</code> and <code class="language-plaintext highlighter-rouge">labels</code> tensors</li>
    </ul>
  </li>
  <li>Collate tokenized batch data (<code class="language-plaintext highlighter-rouge">data_collator=collate_fn</code>)
    <ul>
      <li>Returns stacked tensor of tokenized data batches</li>
    </ul>
  </li>
  <li>Fine-tune model with prepared data
    <ul>
      <li>Also inputs <code class="language-plaintext highlighter-rouge">id2label</code> and <code class="language-plaintext highlighter-rouge">label2id</code></li>
      <li>Returns tensor of pixel-wise logits</li>
    </ul>
  </li>
  <li>Evaluate model output (<code class="language-plaintext highlighter-rouge">compute_metrics</code>)
    <ul>
      <li>Compares output logits to input segmentation mask</li>
    </ul>
  </li>
</ol>
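<p>The steps above can be sketched with dummy stand-ins, assuming PyTorch; <code class="language-plaintext highlighter-rouge">transform_fn</code> and the random tensors are illustrative placeholders for the real <code class="language-plaintext highlighter-rouge">image_processor</code> output, not the PoC code:</p>

```python
import torch

def transform_fn(example_batch):
    # 1) on-the-fly transform: placeholder for resize/augment per batch;
    # 2) "tokenization": an image processor would emit pixel_values/labels
    images = [torch.rand(3, 512, 512) for _ in example_batch["image"]]
    masks = [torch.randint(0, 151, (512, 512)) for _ in example_batch["annotation"]]
    return {"pixel_values": images, "labels": masks}

def collate_fn(batch):
    # 3) stack per-example tensors into one batch tensor each
    return {
        "pixel_values": torch.stack([b["pixel_values"] for b in batch]),
        "labels": torch.stack([b["labels"] for b in batch]),
    }

# simulate a batch of two examples flowing through steps 1-3
dummy = transform_fn({"image": [None] * 2, "annotation": [None] * 2})
examples = [{k: v[i] for k, v in dummy.items()} for i in range(2)]
batch = collate_fn(examples)
print(batch["pixel_values"].shape, batch["labels"].shape)
# torch.Size([2, 3, 512, 512]) torch.Size([2, 512, 512])
```

<p>Steps 4 and 5 then consume <code class="language-plaintext highlighter-rouge">batch</code> inside the fine-tuning loop.</p>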

<h2 id="pseudo-downstream-forward-run">Pseudo downstream forward run</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">torch</span> <span class="kn">import</span> <span class="n">no_grad</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="p">(</span>
  <span class="n">AutoModelForImageClassification</span><span class="p">,</span>
  <span class="n">AutoImageProcessor</span>
<span class="p">)</span>
<span class="n">image_processor</span> <span class="o">=</span> <span class="n">AutoImageProcessor</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">checkpoint</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForImageClassification</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">checkpoint</span><span class="p">)</span>
<span class="c1"># preprocess and tokenize, return PyTorch tensors
</span><span class="n">inputs</span> <span class="o">=</span> <span class="n">image_processor</span><span class="p">(</span><span class="n">image</span><span class="p">.</span><span class="n">convert</span><span class="p">(</span><span class="s">"RGB"</span><span class="p">),</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s">"pt"</span><span class="p">)</span>
<span class="c1"># forward only
</span><span class="k">with</span> <span class="n">no_grad</span><span class="p">():</span>
    <span class="n">outputs</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">inputs</span><span class="p">)</span>
    <span class="n">logits</span> <span class="o">=</span> <span class="n">outputs</span><span class="p">.</span><span class="n">logits</span>
<span class="n">pred_cls_idx</span> <span class="o">=</span> <span class="n">logits</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">).</span><span class="n">item</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">pred_cls_idx</span><span class="o">=</span><span class="si">}</span><span class="s">, </span><span class="si">{</span><span class="n">model</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">id2label</span><span class="p">[</span><span class="n">pred_cls_idx</span><span class="p">]</span><span class="o">=</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="some-weights-of-segformerforsemanticsegmentation-were-not-initialized">Some weights of SegformerForSemanticSegmentation were not initialized</h2>

<p>The following layers were newly initialized because they are meant to be fine-tuned on the downstream task.</p>

<ul>
  <li>‘decode_head.classifier.weight’</li>
  <li>‘decode_head.batch_norm.bias’</li>
  <li>‘decode_head.linear_c.3.proj.bias’</li>
  <li>‘decode_head.batch_norm.running_mean’</li>
  <li>‘decode_head.batch_norm.weight’</li>
  <li>‘decode_head.batch_norm.running_var’</li>
  <li>‘decode_head.linear_c.0.proj.weight’</li>
  <li>‘decode_head.linear_c.1.proj.weight’</li>
  <li>‘decode_head.classifier.bias’</li>
  <li>‘decode_head.linear_c.1.proj.bias’</li>
  <li>‘decode_head.linear_c.3.proj.weight’</li>
  <li>‘decode_head.linear_c.2.proj.bias’</li>
  <li>‘decode_head.linear_c.2.proj.weight’</li>
  <li>‘decode_head.linear_fuse.weight’</li>
  <li>‘decode_head.batch_norm.num_batches_tracked’</li>
  <li>‘decode_head.linear_c.0.proj.bias’</li>
</ul>

<p>In regards to the following warning:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Some weights of SegformerForSemanticSegmentation were not initialized from the model checkpoint at [...] are newly initialized because the shapes did not match:
- decode_head.classifier.weight: found shape torch.Size([150, 256, 1, 1]) in the checkpoint and torch.Size([151, 256, 1, 1]) in the model instantiated
- decode_head.classifier.bias: found shape torch.Size([150]) in the checkpoint and torch.Size([151]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
</code></pre></div></div>]]></content><author><name>qte77</name></author><category term="writeup" /><category term="transformer" /><category term="segformer" /><category term="description" /><summary type="html"><![CDATA[Description Model Using Nvidia SegFormer (b0-sized) encoder pre-trained-only “hierarchical Transformer encoder”, “lightweight all-MLP decode head” (for segmentation) “pre-trained on ImageNet-1k, after which a decode head is added and fine-tuned altogether on a downstream dataset” “SegformerForSemanticSegmentation adds the all-MLP decoder head on top” Paper SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers Paper Github SegFormer Model Architecture Task Using scene-parsing with Dataset scene_parse_150, a subset of semantic segmentation dataset MIT ADE20k “segment the whole image densely into semantic classes (image regions), where each pixel is assigned a class label” “mean of the pixel-wise accuracy and class-wise IoU as the final score” structure { 'image': &lt;PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=683x512 at 0x1FF32A3EDA0&gt;, 'annotation': &lt;PIL.PngImagePlugin.PngImageFile image mode=L size=683x512 at 0x1FF32E5B978&gt;, 'scene_category': 0 } Execution order for model Trainer() Transform on-the-fly * Data gets batch-wise prepared and augmented (&lt;dataset&gt;.set_transform(&lt;transform_fn&gt;)) Tokenize tansformed data (image_processor) * Inputs image, annotation (segmentation mask) and scene_category (label) * Outputs pixel_values and labels tensors Collate tokenized batch data (data_collator=collate_fn) * Returns stacked tensor of tokenized data batches Fine-tune model with prepared data * Also inputs id2label and label2id * Returns tensor of pixel-wise logits Evaluate model output (compute_metrics) * Compare output logits to input segmentation mask Pseudo downstream forward run from torch import no_grad from transformers import ( AutoModelForImageClassification, AutoImageProcessor ) 
image_processor = AutoImageProcessor.from_pretrained(checkpoint) model = AutoModelForImageClassification.from_pretrained(checkpoint) # preprocess and tokenize, return PyTorch tensors inputs = image_processor(image.convert("RGB"), return_tensors="pt") # forward only with no_grad(): outputs = model(**inputs) logits = outputs.logits pred_cls_idx = logits.argmax(-1).item() print(f"{pred_cls_idx=}, {model.config.id2label[pred_cls_idx]=}") Some weights of SegformerForSemanticSegmentation were not initialized The following layers were not initialized because they should be fine-tuned to down-stream task. ‘decode_head.classifier.weight’ ‘decode_head.batch_norm.bias’ ‘decode_head.linear_c.3.proj.bias’ ‘decode_head.batch_norm.running_mean’ ‘decode_head.batch_norm.weight’ ‘decode_head.batch_norm.running_var’ ‘decode_head.linear_c.0.proj.weight’ ‘decode_head.linear_c.1.proj.weight’ ‘decode_head.classifier.bias’ ‘decode_head.linear_c.1.proj.bias’ ‘decode_head.linear_c.3.proj.weight’ ‘decode_head.linear_c.2.proj.bias’ ‘decode_head.linear_c.2.proj.weight’ ‘decode_head.linear_fuse.weight’ac ‘decode_head.batch_norm.num_batches_tracked’ ‘decode_head.linear_c.0.proj.bias’ In regards to the following warning: Some weights of SegformerForSemanticSegmentation were not initialized from the model checkpoint at [...] 
are newly initialized because the shapes did not match: - decode_head.classifier.weight: found shape torch.Size([150, 256, 1, 1]) in the checkpoint and torch.Size([151, 256, 1, 1]) in the model instantiated - decode_head.classifier.bias: found shape torch.Size([150]) in the checkpoint and torch.Size([151]) in the model instantiated You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.]]></summary></entry><entry><title type="html">SegFormer Part 2, PoC Difficulties and Errors</title><link href="https://qte77.github.io/SegFormer-Part2-PoC-Difficulties/" rel="alternate" type="text/html" title="SegFormer Part 2, PoC Difficulties and Errors" /><published>2024-05-05T00:00:00+00:00</published><updated>2024-05-05T00:00:00+00:00</updated><id>https://qte77.github.io/SegFormer-Part2-PoC-Difficulties</id><content type="html" xml:base="https://qte77.github.io/SegFormer-Part2-PoC-Difficulties/"><![CDATA[<h1 id="difficulties-while-working-on-a-poc">Difficulties while working on a PoC</h1>

<p>This is a writeup of difficulties and errors encountered while working on a <a href="https://github.com/qte77/SegFormerQuantization/blob/main/PoC/hf_segformer_PoC.ipynb">SegFormer PoC workbook</a>.</p>

<h1 id="model">Model</h1>

<p><code class="language-plaintext highlighter-rouge">ValueError: You passed along num_labels=1055 with an incompatible id to label map:{}</code></p>

<ul>
  <li>Passing <code class="language-plaintext highlighter-rouge">train_ds.features["scene_category"].num_classes</code> to <code class="language-plaintext highlighter-rouge">num_labels</code> when <code class="language-plaintext highlighter-rouge">len(id2label)</code> was expected</li>
  <li>Solution: Use <code class="language-plaintext highlighter-rouge">len(id2label)</code></li>
</ul>
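<p>A minimal sketch of the fix; the three-entry <code class="language-plaintext highlighter-rouge">categories</code> list is hypothetical (the real ADE20k list has 150 entries):</p>

```python
# hypothetical category list standing in for the 150 ADE20k classes
categories = ["wall", "building", "sky"]

id2label = {i: name for i, name in enumerate(categories)}
label2id = {name: i for i, name in id2label.items()}

# num_labels must match the id2label mapping, not the dataset's
# scene_category feature (which counts scene classes, not pixel classes)
num_labels = len(id2label)
print(num_labels)  # 3
```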

<p><code class="language-plaintext highlighter-rouge">RuntimeError: Error(s) in loading state_dict for SegformerForSemanticSegmentation: size mismatch for decode_head.classifier.weight: copying a param with shape torch.Size([150, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([151, 256, 1, 1]). size mismatch for decode_head.classifier.bias: copying a param with shape torch.Size([150]) from checkpoint, the shape in current model is torch.Size([151]). You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method.</code></p>

<ul>
  <li>Solution: Use <code class="language-plaintext highlighter-rouge">ignore_mismatched_sizes=True</code></li>
  <li>New alert: <code class="language-plaintext highlighter-rouge">- decode_head.classifier.weight: found shape torch.Size([150, 256, 1, 1]) in the checkpoint and torch.Size([151, 256, 1, 1]) in the model instantiated - decode_head.classifier.bias: found shape torch.Size([150]) in the checkpoint and torch.Size([151]) in the model instantiated</code></li>
</ul>

<p><code class="language-plaintext highlighter-rouge">NotImplementedError: Cannot copy out of meta tensor; no data!</code></p>

<ul>
  <li>When using <code class="language-plaintext highlighter-rouge">device_map=dev</code> in <code class="language-plaintext highlighter-rouge">from_pretrained()</code>.</li>
  <li>Solution: Assign <code class="language-plaintext highlighter-rouge">accelerate.infer_auto_device_map(model)</code> to <code class="language-plaintext highlighter-rouge">model.hf_device_map</code> after the model is loaded</li>
</ul>

<h2 id="train">Train</h2>

<p>HuggingFace Dataloader <code class="language-plaintext highlighter-rouge">RuntimeError: cannot pin 'torch.cuda.FloatTensor' only dense CPU tensors can be pinned</code></p>

<ul>
  <li>The DataLoader tries to pin and transfer the batch to the model’s device, but the collator had already moved the tensors to ‘cuda’, and only dense CPU tensors can be pinned</li>
  <li>Solution: Not using <code class="language-plaintext highlighter-rouge">.to(cuda)</code> inside <code class="language-plaintext highlighter-rouge">collator_fn</code></li>
</ul>
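<p>A sketch of the corrected collator, assuming PyTorch: it stays device-agnostic and returns CPU tensors, so the DataLoader can pin them and the Trainer handles device placement:</p>

```python
import torch

def collate_fn(batch):
    # no .to("cuda") here: pinned-memory transfer requires dense CPU
    # tensors, so the collator must return CPU tensors
    return {
        "pixel_values": torch.stack([b["pixel_values"] for b in batch]),
        "labels": torch.stack([b["labels"] for b in batch]),
    }

# two small dummy examples; shapes are illustrative only
batch = collate_fn([
    {"pixel_values": torch.rand(3, 4, 4), "labels": torch.zeros(4, 4)}
    for _ in range(2)
])
print(batch["pixel_values"].device)  # cpu
```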

<p><code class="language-plaintext highlighter-rouge">OutOfMemoryError: CUDA out of memory. Tried to allocate 4.69 GiB (GPU 0; 14.75 GiB total capacity; 11.08 GiB already allocated; 2.48 GiB free; 11.23 GiB reserved in total by PyTorch) If reserved memory is &gt;&gt; allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF</code></p>

<ul>
  <li><a href="https://pytorch.org/docs/stable/notes/cuda.html#memory-management">PyTorch CUDA Memory management</a></li>
  <li>Solution in environment: <code class="language-plaintext highlighter-rouge">environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:256"</code></li>
  <li>Solution for training: <code class="language-plaintext highlighter-rouge">per_device_train_batch_size=batch_size</code> with <code class="language-plaintext highlighter-rouge">batch_size</code> from <code class="language-plaintext highlighter-rouge">32</code> to <code class="language-plaintext highlighter-rouge">8</code></li>
  <li>Solution for evaluation: <code class="language-plaintext highlighter-rouge">per_device_eval_batch_size=batch_size</code> with <code class="language-plaintext highlighter-rouge">batch_size</code> from <code class="language-plaintext highlighter-rouge">32</code> to <code class="language-plaintext highlighter-rouge">1</code></li>
</ul>
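<p>The combined settings as a sketch; the allocator env var must be set before the first CUDA allocation, and the batch sizes would be passed to <code class="language-plaintext highlighter-rouge">TrainingArguments</code> via <code class="language-plaintext highlighter-rouge">per_device_train_batch_size</code>/<code class="language-plaintext highlighter-rouge">per_device_eval_batch_size</code>:</p>

```python
from os import environ

# must be set before CUDA is initialized to take effect
environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:256"

# batch sizes that resolved the OOM in this PoC
train_batch_size = 8  # reduced from 32
eval_batch_size = 1   # reduced from 32
```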

<p><code class="language-plaintext highlighter-rouge">RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)</code></p>

<ul>
  <li>Solution: Lower <code class="language-plaintext highlighter-rouge">max_split_size_mb</code> in <code class="language-plaintext highlighter-rouge">environ["PYTORCH_CUDA_ALLOC_CONF"]</code> from <code class="language-plaintext highlighter-rouge">2048</code> to at most <code class="language-plaintext highlighter-rouge">1024</code></li>
</ul>

<p><code class="language-plaintext highlighter-rouge">RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.</code></p>

<ul>
  <li>Error occurs in cross entropy, maybe wrong number of labels or label indexing, <code class="language-plaintext highlighter-rouge">id2label</code> or <code class="language-plaintext highlighter-rouge">label2id</code>, See <a href="https://stackoverflow.com/questions/51691563/cuda-runtime-error-59-device-side-assert-triggered">CUDA runtime error (59) : device-side assert triggered</a></li>
  <li>Switch to CPU to get more meaningful error messages</li>
  <li>Result: Switching to CPU surfaces the underlying error, <code class="language-plaintext highlighter-rouge">IndexError: Target 150 is out of bounds.</code></li>
</ul>

<p><code class="language-plaintext highlighter-rouge">IndexError: Target 150 is out of bounds.</code></p>

<ul>
  <li>Occurs in <code class="language-plaintext highlighter-rouge">torch._C._nn.cross_entropy_loss</code>, See <a href="https://stackoverflow.com/questions/51691563/cuda-runtime-error-59-device-side-assert-triggered">CUDA runtime error (59) : device-side assert triggered</a>.</li>
  <li>Maybe because <code class="language-plaintext highlighter-rouge">len(categories)</code> (150) is smaller than <code class="language-plaintext highlighter-rouge">train_ds.features['scene_category'].num_classes</code> (1055) -&gt; No.</li>
  <li>Testing with <code class="language-plaintext highlighter-rouge">max([(i["labels"].min().item(), i["labels"].max().item()) for i in test_ds.shard(10, 0)])</code> yields <code class="language-plaintext highlighter-rouge">(0, 150)</code></li>
  <li>Solution: Prepend dummy class <code class="language-plaintext highlighter-rouge">id2label = {**{0:'NONE'}, **{k:v for k,v in enumerate(categories, 1)}}</code>. Has to be used with <code class="language-plaintext highlighter-rouge">ignore_mismatched_sizes=True</code> in <code class="language-plaintext highlighter-rouge">from_pretrained()</code>.</li>
</ul>
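<p>The prepended dummy class, sketched with a hypothetical three-entry category list (the real one has 150 entries):</p>

```python
# hypothetical short category list standing in for the 150 ADE20k classes
categories = ["wall", "building", "sky"]

# prepend a dummy class at index 0 so target indices up to
# len(categories) (the observed max label) stay in bounds for cross entropy
id2label = {**{0: "NONE"}, **{k: v for k, v in enumerate(categories, 1)}}
label2id = {v: k for k, v in id2label.items()}

print(len(id2label), max(id2label))  # 4 3
```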

<p><code class="language-plaintext highlighter-rouge">RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same</code></p>

<ul>
  <li>When trying to debug and trace <code class="language-plaintext highlighter-rouge">CUDA error: device-side assert triggered</code> with CPU instead of CUDA</li>
  <li>Solution: Do not use <code class="language-plaintext highlighter-rouge">device_map</code> for cpu</li>
</ul>

<p><code class="language-plaintext highlighter-rouge">ValueError: Unsupported number of image dimensions: 2</code></p>

<ul>
  <li>Occurring in random batches with
    <ul>
      <li><code class="language-plaintext highlighter-rouge">PIL.mode='RGB'</code> (<code class="language-plaintext highlighter-rouge">['RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB']</code>)</li>
      <li><code class="language-plaintext highlighter-rouge">'pixel_values'</code>:<code class="language-plaintext highlighter-rouge">torch.Size([&lt;batch_size=8&gt;, &lt;chn_dim=3&gt;, 512, 512])</code></li>
      <li><code class="language-plaintext highlighter-rouge">'labels'</code>:<code class="language-plaintext highlighter-rouge">torch.Size([&lt;batch_size=8&gt;, 512, 512])</code></li>
    </ul>
  </li>
  <li>Maybe false <code class="language-plaintext highlighter-rouge">PIL.mode</code> like <code class="language-plaintext highlighter-rouge">RGBA</code> with 4 channels instead of <code class="language-plaintext highlighter-rouge">RGB</code>, See <a href="https://stackoverflow.com/questions/75168665/unsupported-number-of-image-dimensions-while-using-image-utils-from-transforme">“Unsupported number of image dimensions” while using image_utils from Transformers</a></li>
  <li>Solution (workaround, not a root-cause fix): Call <code class="language-plaintext highlighter-rouge">image.convert("RGB")</code> on every image within the on-the-fly transform function <code class="language-plaintext highlighter-rouge">train_transforms(example_batch)</code></li>
</ul>]]></content><author><name>qte77</name></author><category term="writeup" /><category term="transformer" /><category term="segformer" /><category term="difficulties" /><category term="errors" /><summary type="html"><![CDATA[Difficulties while working on a PoC This is a writup to difficulties and errors encountered while working on a SegFormer PoC workbook. Model ValueError: You passed along num_labels=1055 with an incompatible id to label map:{} Passing train_ds.features["scene_category"].num_classesto num_labels when len(id2label) expected Solution: Use len(id2label) RuntimeError: Error(s) in loading state_dict for SegformerForSemanticSegmentation: size mismatch for decode_head.classifier.weight: copying a param with shape torch.Size([150, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([151, 256, 1, 1]). size mismatch for decode_head.classifier.bias: copying a param with shape torch.Size([150]) from checkpoint, the shape in current model is torch.Size([151]). You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method. Solution: Use ignore_mismatched_sizes=True New alert: - decode_head.classifier.weight: found shape torch.Size([150, 256, 1, 1]) in the checkpoint and torch.Size([151, 256, 1, 1]) in the model instantiated - decode_head.classifier.bias: found shape torch.Size([150]) in the checkpoint and torch.Size([151]) in the model instantiated NotImplementedError: Cannot copy out of meta tensor; no data! When using device_map=dev in from_pretrained(). Solution: Add accelerate.infer_auto_device_map(model) to model.hf_device_map after model is loaded Train HuggingFace Dataloader RuntimeError: cannot pin 'torch.cuda.FloatTensor' only dense CPU tensors can be pinned Dataloader loads data on device of model and tries loading data already loaded to ‘cuda’ into ‘cuda’ Solution: Not using .to(cuda) inside collator_fn OutOfMemoryError: CUDA out of memory. 
Tried to allocate 4.69 GiB (GPU 0; 14.75 GiB total capacity; 11.08 GiB already allocated; 2.48 GiB free; 11.23 GiB reserved in total by PyTorch) If reserved memory is &gt;&gt; allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF PyTorch CUDA Memory management Solution in environment: environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:256" Solution for training: per_device_train_batch_size=batch_size with batch_size from 32 to 8 Solution for evaluation: per_device_eval_batch_size=batch_size with batch_size from 32 to 1 RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle) Solution: Set environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:2048" to max 1024 RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. Error occurs in cross entropy, maybe wrong number of labels or label indexing, id2label or label2id, See CUDA runtime error (59) : device-side assert triggered Switch to CPU to get more meaningful error messages Solution: Switching to CPU leads to IndexError: Target 150 is out of bounds. IndexError: Target 150 is out of bounds. Occurs in torch._C._nn.cross_entropy_loss, See CUDA runtime error (59) : device-side assert triggered. Maybe because len(categories) (150) smaller than train_ds.features['scene_category'].num_classes (1055) -&gt; No. Testing with max([(i["labels"].min().item(), i["labels"].max().item()) for i in test_ds.shard(10, 0)]) yields (0, 150) Solution: Prepend dummy class id2label = {**{0:'NONE'}, **{k:v for k,v in enumerate(categories, 1)}}. Has to be used with ignore_mismatched_sizes=True in from_pretrained(). 
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same When trying to debug and trace CUDA error: device-side assert triggered with CPU instead of CUDA Solution: Do not use device_map for cpu ValueError: Unsupported number of image dimensions: 2 Occuring at random batches with PIL.mode='RGB' (['RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB']) 'pixel_values':torch.Size([&lt;batch_size=8&gt;, &lt;chn_dim=3&gt;, 512, 512]) 'labels':torch.Size([&lt;batch_size=8&gt;, 512, 512]) Maybe false PIL.mode like RGBA with 4 channels instead of RGB, See “Unsupported number of image dimensions” while using image_utils from Transformers Solution (bad one): Using image.convert("RGB") on every image within the on-the-fly transform function train_transforms(example_batch)]]></summary></entry></feed>