MAS Best Practices

Key Takeaways¶

Production requires infrastructure: 90% of effort is reliability/safety/observability, not core AI
Balance training approaches: Light Supervised Fine-Tuning (SFT) enables Reinforcement Learning (RL), which delivers diversity and exploration
Statistical rigor in evaluation: Small benchmarks are noisy; validate significance before claiming improvements
Diversity prevents fragility: Multi-agent leagues and varied training environments maintain robustness
Security by design: Contextual access control and channel separation are not optional for agentic systems
Real-world validation matters: Dynamic benchmarks and practical scenarios measure true capability

1. Production Infrastructure¶

Platform Requirements:

Adopt AI-native platforms unifying dev, training, and inference with elasticity, observability, and failure handling
Plan for sustained inference growth with utilization management, routing, and cost controls
Treat compute like supply chain: multi-cloud with portable abstractions and scheduling over heterogeneous accelerators
Modularize model interfaces to swap models and add inference-time techniques without rewriting agent logic

Reliability & Observability:

Long-running agent workflows require default reliability posture
Massive infrastructure needed beyond core AI: reliability, safety, observability
Non-deterministic agents require new testing methods (user simulation, τ-bench)
Standard metrics (e.g., Word Error Rate/WER) inadequate - need domain-specific quality measurements

Advanced Capabilities:

Agent Data Platforms provide long-term memory and integrate with Customer Data Platforms (CDPs) for proactive, context-aware engagement
Shift from transactional to relational agents via persistent memory systems

2. Training & Verification¶

Training Strategy:

Objective shift: Maximize verifiable rewards via environment/tool interaction (beyond human preference alone)
SFT + RL balance: Light SFT prevents meaningless attempts and enables tractable rollouts, then RL explores diverse tool-use trajectories
Data diversity: Prioritize diversity across environments, tools, and verifiers (environment & verifier diversity critical)

Verifier Design:

Minimize both false positives and false negatives
Reward all equivalent correct forms while enforcing stated constraints
Critical for post-training agentic model development

3. Evaluation & Benchmarking¶

Core Principles:

Holistic strategy: Evaluate many tasks and verifiers, vary harnesses and tool action spaces
Recognize benchmark suite defines operational notion of intelligence
“You can only improve what gets measured”

3.1 Design Checklist¶

Essential Criteria:

Outcome validity: High scores genuinely reflect successful task completion (most critical)
Real-world scenarios: Practical tasks (e.g., “book a flight”) over abstract puzzles
Contamination resistance: Dynamic benchmarks (DynaBench, LiveCodeBench) resist training data leakage and saturation
Appropriate difficulty: Stratified levels to differentiate capabilities
Baseline provision: Clear reference points for comparison
Reproducibility: Systematic measurement with ground truth and rigorous rubrics

Validation Methods:

Verifiable tasks: Exact matching, test execution, database state comparison
Non-verifiable tasks: Human evaluators or Large Language Model (LLM)-as-Judge with defined rubrics
Real artifacts: Use actual systems (e.g., 1,507 Common Vulnerabilities and Exposures/CVEs, 188 projects) not synthetic scenarios

Common Failures:

Task setup flaws overestimate performance by 100% [2507.02825] ABC
Insufficient tests (SWE-bench), degenerate solutions (TAU-bench empty responses)
Noisy/biased data, gaming via benchmark-specific optimization
Test/production environment mismatch

Statistical Requirements:

Small benchmarks have high noise (HumanEval N=164: 2.5% gains often insignificant)
Noise can follow Beta distribution based on model accuracy
Large multiple-choice benchmarks (MMLU, gsm8k) have better signal-to-noise than small code benchmarks

3.2 Trust & Validation¶

Trust Issues [2502.06559] Trust in Benchmarks:

Dataset biases from creation methodology
Data contamination in training sets
Gaming via benchmark-specific optimization
Over-focus on text-based one-time testing (ignores multimodal/human-AI interaction)
Misaligned incentives: State-of-the-Art (SOTA) pursuit over societal relevance

Real-World Validation [2506.02548] CyberGym:

Top agents: ~20% success on real tasks vs inflated benchmark scores
Discovered 35 zero-days, 17 incomplete patches in actual CVEs
Proof-of-concept generation validates genuine capability

3.3 Consistency Metrics¶

[2406.12045] τ-bench:

pass^k metric measures consistency across multiple trials
GPT-4o: <50% task success, <25% pass^8 in retail domain
Domain-specific rules critical for deployment

[2506.07982] τ²-bench Dual-Control:

Decentralized Partially Observable Markov Decision Process (Dec-POMDP) framework tests agent-user coordination
Performance drops significantly when users modify environment
Fine-grained ablations separate reasoning from communication errors

3.4 Ecosystem Gaps¶

Current State:

Lack of interoperability between evaluation frameworks
Limited reproducibility across implementations
Fragmented landscape with discovery challenges
LLM-centric evaluations, fixed harnesses, high overhead

4. Multi-Agent Systems¶

Coordination Patterns:

League of Exploiters: Prevents main policy over-specialization, maintains strategy diversity through adversarial training
Architecture: Auto-regressive action sequences (commands + arguments) used in LLM function calling

5. AI Safety & Security¶

Attack Surface:

Agentic AI has fundamentally larger attack surface than standalone LLMs
Three expansion factors: tools (code/API execution), memory (state persistence), autonomy (active systems)
Compounded vulnerabilities: all classic software vulnerabilities + new AI-specific vulnerabilities

Prompt Injection:

Root cause: Lack of separation between control channel (system instructions) and data channel (user input)
Direct attacks: Malicious user input treated as executable commands
Indirect attacks: Hidden instructions in documents/webpages (e.g., white text) cause data exfiltration

Retrieval-Augmented Generation (RAG)-Specific Threats:

Data poisoning: Small number of malicious documents in knowledge base triggered by specific keywords
Backdoor attacks: Targeted conditional behavior changes

Defense Strategies:

Layered defense: Guardrails and supervisors required (beyond single-layer protection)
Least privilege / Contextual security: Dynamically restrict available tools and data access based on workflow context/step
Separation of concerns: Isolate control and data channels where possible