# AgentBeats Basics
## Overview
AgentX-AgentBeats Competition - Berkeley RDI (Oct 2025 - Jan 2026)
Competition Structure: Phase 1 (Green Agent) builds evaluation benchmarks; Phase 2 (Purple Agent) builds competing agents.
- Phase 1 (Green) deadline: January 15, 2026
- Phase 2 (Purple) deadline: February 22, 2026
## Strategic Context
AgentBeats Competition (Deadline: Jan 15, 2026):
- Outstanding tracks: Research Agent ($16k OpenAI), Multi-Agent, AAA
- Critical gap: A2A Protocol (2-3 days effort)
- Unique advantage: Graph-based coordination analysis (NOVEL)
## Why Agents-eval is an OUTSTANDING Competition Entry
### For AgentBeats
Fills critical evaluation gap: While 28 benchmarks exist in AgentBeats (SciCode, GAIA, TheAgentCompany, etc.), NONE evaluate multi-agent coordination quality through graph-based behavioral analysis. Agents-eval brings a category-defining approach that quantifies what others ignore: how agents collaborate, not just whether they succeed.
Addresses competition judging criteria perfectly:
- ✅ Innovation & Impact: Post-execution graph analysis is NOVEL - no existing benchmark measures coordination centrality, communication overhead, or task distribution balance
- ✅ Evaluation Methodology: Three-tier system (Traditional + LLM-as-Judge + Graph) provides multi-dimensional scoring vs. competitors’ binary pass/fail
- ✅ Benchmark Design: PeerRead uses real academic papers with ground truth reviews, not synthetic tasks
- ✅ Technical Quality: Production-ready with PydanticAI, comprehensive tests, type safety
- ✅ Reproducibility: Config-driven with deterministic metrics and Docker deployment
### How Agents-eval Stands Out
vs. Existing Benchmarks:
| Benchmark | What They Measure | What Agents-eval Adds |
|---|---|---|
| SciCode, CORE-Bench | Task completion (binary) | Multi-dimensional scoring + behavioral patterns |
| TheAgentCompany | Real-world task success | Coordination quality metrics |
| GAIA | Accuracy | Planning rationality, tool efficiency |
| All others | Whether agents succeed | How agents collaborate |
Unique differentiators NO competitor has:
- Graph-based coordination analysis - NetworkX betweenness centrality, communication overhead, path convergence (see the sketch below)
- Post-execution behavioral tracing - Agents operate autonomously, patterns analyzed retrospectively without interference
- Composite academic scoring - 6 balanced metrics mapping to accept/reject decisions (mirrors real peer review)
- Three-tier graceful degradation - Fast metrics (<1s) → LLM quality → Graph complexity, with fallback strategies
Bottom line: Agents-eval doesn’t just test if agents work — it reveals how well they work together, filling a gap that no existing benchmark addresses.
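To make the graph tier concrete, here is a minimal sketch of how such coordination metrics could be computed with NetworkX. The message-log shape, function name, and exact metric formulas are illustrative assumptions, not the Agents-eval implementation:

```python
# Illustrative sketch only: the log format and metric definitions are
# assumptions, not Agents-eval's actual implementation.
from collections import Counter

import networkx as nx

def coordination_metrics(messages: list[dict]) -> dict:
    """Derive coordination metrics from a post-execution message log.

    Each message is assumed to look like:
    {"sender": "planner", "receiver": "researcher"}
    """
    graph = nx.DiGraph()
    for msg in messages:
        src, dst = msg["sender"], msg["receiver"]
        # Accumulate message counts as edge weights.
        if graph.has_edge(src, dst):
            graph[src][dst]["weight"] += 1
        else:
            graph.add_edge(src, dst, weight=1)

    # Coordination centrality: which agents act as communication hubs.
    centrality = nx.betweenness_centrality(graph)

    # Communication overhead: messages exchanged per participating agent.
    overhead = len(messages) / max(graph.number_of_nodes(), 1)

    # Task distribution balance: near 1.0 when outgoing traffic is spread
    # evenly across agents, near 0.0 when one agent dominates.
    sends = Counter(msg["sender"] for msg in messages)
    total = sum(sends.values()) or 1
    shares = [sends.get(agent, 0) / total for agent in graph.nodes]
    balance = 1.0 - (max(shares) - min(shares)) if shares else 0.0

    return {
        "coordination_centrality": centrality,
        "communication_overhead": overhead,
        "task_distribution_balance": balance,
    }
```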
## OUTSTANDING Participation Tracks for Agents-eval
### 🏆 Research Agent Track (OpenAI-sponsored, $16k prizes)
Perfect fit: PeerRead benchmark IS research agent evaluation with ground truth reviews
USP: “First research agent benchmark with post-execution behavioral analysis measuring coordination quality, planning rationality, and tool efficiency beyond task completion”
Differentiator: Three-tier evaluation (Traditional + LLM-as-Judge + Graph Analysis) vs. single-metric competitors
### 🏆 Multi-Agent Track (Category-defining opportunity)
Unique position: NO existing benchmark evaluates multi-agent coordination with graph metrics
USP: “Only benchmark that quantifies multi-agent coordination quality through NetworkX graph analysis, enabling comparison of agent architectures on collaboration efficiency”
Novel metrics: Coordination centrality, communication overhead, task distribution balance, path convergence
### 🏆 AAA Track (Agentified Agent Assessment)
Natural alignment: Tier 2 LLM-as-Judge = agent evaluating agents
USP: Three-tier system inherently implements the “agents assess other agents” vision
- Goals:
    - Agentified evaluation
    - Standardization
    - Reproducibility
- Obstacles:
    1. System implementation complexity
    2. Lack of openness and adoption
## Key Competitive Advantages
- Three-Tier Evaluation - Combines fast traditional metrics (<1s), LLM-as-Judge quality assessment, and graph-based behavioral analysis (see the sketch after this list)
- Composite Scoring - 6 weighted metrics mapping to academic review decisions (accept/reject) vs. binary pass/fail
- Real Academic Domain - PeerRead provides ground truth scientific reviews vs. synthetic tasks
- Post-Execution Behavioral Analysis - NOVEL approach: agents operate autonomously, observability logs analyzed retrospectively
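As a rough illustration of the graceful-degradation flow plus composite scoring, the sketch below uses six hypothetical metric names, weights, and an accept threshold; none of these values are taken from the actual codebase:

```python
# Hedged sketch: metric names, weights, and the accept threshold are
# illustrative assumptions, not values from Agents-eval.
from typing import Callable

# Six hypothetical weighted metrics mapping to an accept/reject decision.
WEIGHTS = {
    "similarity": 0.25, "completeness": 0.15, "clarity": 0.15,
    "soundness": 0.20, "coordination": 0.15, "efficiency": 0.10,
}
ACCEPT_THRESHOLD = 0.6  # assumed cutoff

def run_tier(tier: Callable[[], dict], fallback: dict) -> dict:
    """Run one evaluation tier; degrade gracefully if it fails."""
    try:
        return tier()
    except Exception:
        return fallback  # e.g. judge model unreachable, graph tooling absent

def composite_score(scores: dict) -> tuple[float, str]:
    """Weighted composite in [0, 1] plus an accept/reject decision."""
    total = sum(w * scores.get(name, 0.0) for name, w in WEIGHTS.items())
    return total, ("accept" if total >= ACCEPT_THRESHOLD else "reject")

# Tier 1 (fast metrics) always runs; Tiers 2 and 3 fall back to empty
# results instead of failing the whole evaluation.
scores: dict = {}
scores.update(run_tier(lambda: {"similarity": 0.8, "completeness": 0.7}, {}))  # Tier 1
scores.update(run_tier(lambda: {"clarity": 0.6, "soundness": 0.7}, {}))        # Tier 2: LLM judge
scores.update(run_tier(lambda: {"coordination": 0.5, "efficiency": 0.9}, {}))  # Tier 3: graph
print(composite_score(scores))  # ≈ (0.70, 'accept')
```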
## Critical Gap: A2A Protocol Compliance
Required for all tracks:
- Implement Google A2A protocol wrapper for agents (a minimal sketch follows this list)
- Add MCP (Model Context Protocol) compliance for tool access
- AgentBeats SDK integration
- Estimated effort: 2-3 days using agentbeats/tutorial
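For orientation, here is a minimal, standard-library-only sketch of the wrapper idea: expose an agent card at the A2A well-known path and answer JSON-RPC-style POST requests. The card fields and response shape are simplified assumptions; an actual submission should follow agentbeats/tutorial and the official A2A SDK rather than a hand-rolled server. MCP compliance for tool access is a separate integration not shown here.

```python
# Minimal A2A-style wrapper sketch using only the standard library.
# The agent-card fields are simplified from the A2A spec and the POST
# handling is a placeholder, not a spec-complete implementation.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

AGENT_CARD = {
    "name": "agents-eval-green",
    "description": "Multi-agent evaluation benchmark (PeerRead + graph analysis)",
    "url": "http://localhost:9999",
    "capabilities": {"streaming": False},
    "skills": [{"id": "evaluate", "name": "Evaluate agent system"}],
}

class A2AHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A2A agents advertise themselves via a well-known agent card.
        if self.path == "/.well-known/agent.json":
            body = json.dumps(AGENT_CARD).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def do_POST(self):
        # Placeholder: a real wrapper would dispatch JSON-RPC task
        # requests to the orchestration layer here.
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        response = {"jsonrpc": "2.0", "id": request.get("id"),
                    "result": {"status": "accepted"}}
        body = json.dumps(response).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 9999), A2AHandler).serve_forever()
```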
## Implementation Files
### New files required
- `docker/Dockerfile` - Production containerization
- `docker/docker-compose.yml` - Multi-service orchestration
- `src/app/protocols/a2a_wrapper.py` - A2A protocol implementation
- `src/app/protocols/mcp_compliance.py` - MCP tool access
- `docs/agentbeats/README.md` - Competition-focused documentation
- `docs/agentbeats/demo_script.md` - 3-minute demo video script
### Files to modify
- `src/app/agents/orchestration.py` - A2A protocol integration
- `pyproject.toml` - AgentBeats SDK dependency
## Quick Win Prioritization
- fix_rate metric (30 min) - immediate value (see the sketch after this list)
- Ralph completion promise (1 hour) - proven pattern
- “Think first” Tier 2 (30 min) - 0.86 correlation
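One plausible reading of the fix_rate metric (an assumption; the metric is not defined in this section): the fraction of detected issues that the agent subsequently resolved.

```python
# Hypothetical definition of fix_rate: share of detected issues that
# were actually resolved. The semantics are an assumption, not a spec.
def fix_rate(issues_found: int, issues_fixed: int) -> float:
    """Return issues_fixed / issues_found, or 0.0 when nothing was found."""
    return issues_fixed / issues_found if issues_found else 0.0

assert fix_rate(4, 3) == 0.75
assert fix_rate(0, 0) == 0.0
```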
## Immediate Next Steps
- Register team on AgentBeats platform
- Fork agentbeats/tutorial repository
- Create A2A wrapper prototype
- Build production Dockerfile
- Join Discord for community support
## Recommended Strategy
Dual-track submission (Research Agent + Multi-Agent): same codebase, different positioning for each track, doubling the prize opportunity.