MAS Best Practices
Key Takeaways
- Production requires infrastructure: 90% of effort is reliability/safety/observability, not core AI
- Balance training approaches: Light Supervised Fine-Tuning (SFT) enables Reinforcement Learning (RL), which delivers diversity and exploration
- Statistical rigor in evaluation: Small benchmarks are noisy; validate significance before claiming improvements
- Diversity prevents fragility: Multi-agent leagues and varied training environments maintain robustness
- Security by design: Contextual access control and channel separation are not optional for agentic systems
- Real-world validation matters: Dynamic benchmarks and practical scenarios measure true capability
1. Production Infrastructure
Platform Requirements:
- Adopt AI-native platforms unifying dev, training, and inference with elasticity, observability, and failure handling
- Plan for sustained inference growth with utilization management, routing, and cost controls
- Treat compute like supply chain: multi-cloud with portable abstractions and scheduling over heterogeneous accelerators
- Modularize model interfaces to swap models and add inference-time techniques without rewriting agent logic
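A minimal sketch of the modular-interface idea, assuming a hypothetical `ChatModel` protocol with a single `complete` method. `EchoModel`, `MajorityVote`, and `run_agent_step` are illustrative names, not part of any real framework. The point is that an inference-time technique (here, majority voting over samples) wraps the model behind the same interface, so the agent logic does not change:

```python
from collections import Counter
from typing import Protocol


class ChatModel(Protocol):
    """Hypothetical minimal interface that the agent logic depends on."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in backend for illustration; a real one would call an inference API."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


class MajorityVote:
    """Inference-time wrapper: sample n completions and return the most common one."""

    def __init__(self, inner: ChatModel, n: int = 3) -> None:
        self.inner = inner
        self.n = n

    def complete(self, prompt: str) -> str:
        samples = [self.inner.complete(prompt) for _ in range(self.n)]
        return Counter(samples).most_common(1)[0][0]


def run_agent_step(model: ChatModel, observation: str) -> str:
    """Agent logic sees only the ChatModel interface, never a concrete backend."""
    return model.complete(f"act on: {observation}")
```

Swapping `EchoModel()` for `MajorityVote(EchoModel(), n=5)` changes the inference strategy without touching `run_agent_step`.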
Reliability & Observability:
- Long-running agent workflows need reliability built in by default
- Substantial infrastructure is needed beyond core AI: reliability, safety, observability
- Non-deterministic agents require new testing methods (e.g., user simulation, τ-bench)
- Standard metrics (e.g., Word Error Rate/WER) are inadequate; domain-specific quality measurements are needed
Advanced Capabilities:
- Agent Data Platforms provide long-term memory and integrate with Customer Data Platforms (CDPs) for proactive, context-aware engagement
- Shift from transactional to relational agents via persistent memory systems
2. Training & Verification
Training Strategy:
- Objective shift: Maximize verifiable rewards via environment/tool interaction (beyond human preference alone)
- SFT + RL balance: Light SFT prevents meaningless attempts and enables tractable rollouts, then RL explores diverse tool-use trajectories
- Data diversity: Prioritize diversity across environments, tools, and verifiers (environment & verifier diversity critical)
Verifier Design:
- Minimize both false positives and false negatives
- Reward all equivalent correct forms while enforcing stated constraints
- Verifier quality is critical for post-training agentic models
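The verifier goals above can be sketched for a numeric-answer task. This is an illustration under assumptions: the function `verify` and the tool-call budget as the "stated constraint" are hypothetical. Parsing with `fractions.Fraction` accepts equivalent forms such as `1/2`, `0.5`, and `2/4` (avoiding false negatives), while the budget check rejects a correct value that breaks the task's rules (avoiding false positives):

```python
from fractions import Fraction


def verify(answer: str, expected: Fraction, tool_calls: int, max_tool_calls: int) -> bool:
    """Return True only if the answer is numerically equal AND the budget holds."""
    if tool_calls > max_tool_calls:
        return False  # value may be correct, but a stated constraint was violated
    try:
        value = Fraction(answer.strip())  # parses "1/2", "0.5", "2/4", "5e-1"
    except (ValueError, ZeroDivisionError):
        return False  # unparseable output is not rewarded
    return value == expected
```

For example, `verify("0.50", Fraction(1, 2), 2, 3)` is accepted, while `verify("1/2", Fraction(1, 2), 4, 3)` is rejected for exceeding the budget.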
3. Evaluation & Benchmarking
Core Principles:
- Holistic strategy: Evaluate many tasks and verifiers, vary harnesses and tool action spaces
- Recognize that the benchmark suite defines the operational notion of intelligence
- “You can only improve what gets measured”
3.1 Design Checklist
Essential Criteria:
- Outcome validity: High scores genuinely reflect successful task completion (most critical)
- Real-world scenarios: Practical tasks (e.g., “book a flight”) over abstract puzzles
- Contamination resistance: Dynamic benchmarks (DynaBench, LiveCodeBench) resist training data leakage and saturation
- Appropriate difficulty: Stratified levels to differentiate capabilities
- Baseline provision: Clear reference points for comparison
- Reproducibility: Systematic measurement with ground truth and rigorous rubrics
Validation Methods:
- Verifiable tasks: Exact matching, test execution, database state comparison
- Non-verifiable tasks: Human evaluators or Large Language Model (LLM)-as-Judge with defined rubrics
- Real artifacts: Use actual systems (e.g., 1,507 Common Vulnerabilities and Exposures/CVEs across 188 projects) rather than synthetic scenarios
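As one illustration of database state comparison for verifiable tasks, the sketch below replays the agent's statements and a reference solution against fresh in-memory SQLite databases and compares the final table contents. The `bookings` schema and the helper names (`snapshot`, `make_db`, `states_match`) are assumptions made for this example. Comparing end states rather than action sequences means any sequence that reaches the correct state is accepted:

```python
import sqlite3

# Hypothetical schema for a flight-booking task.
SCHEMA = "CREATE TABLE bookings (id INTEGER PRIMARY KEY, passenger TEXT, flight TEXT)"


def snapshot(conn: sqlite3.Connection) -> dict[str, list[tuple]]:
    """Return every table's rows in sorted order, so insertion order is ignored."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    return {t: sorted(conn.execute(f'SELECT * FROM "{t}"').fetchall()) for t in tables}


def make_db(statements: list[str]) -> sqlite3.Connection:
    """Create a fresh in-memory database and apply the given SQL statements."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    for stmt in statements:
        conn.execute(stmt)
    return conn


def states_match(agent_sql: list[str], reference_sql: list[str]) -> bool:
    """True if the agent's final database state equals the reference state."""
    return snapshot(make_db(agent_sql)) == snapshot(make_db(reference_sql))
```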
Common Failures:
- Task setup flaws can over- or underestimate performance by up to 100% in relative terms [2507.02825] ABC
- Insufficient test cases (SWE-bench) and rewarded degenerate solutions (TAU-bench accepting empty responses)
- Noisy/biased data, gaming via benchmark-specific optimization
- Test/production environment mismatch
Statistical Requirements:
- Small benchmarks have high noise (HumanEval N=164: 2.5% gains often insignificant)
- Score noise can be modeled with a Beta distribution whose shape depends on model accuracy
- Large benchmarks (e.g., multiple-choice MMLU, GSM8K) have better signal-to-noise than small code benchmarks
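A rough worked check of the noise claim, under two labeled assumptions: the two models' scores are treated as independent binomial samples (a paired test on the same problems would be tighter), and the accuracies 60% and 62.5% are hypothetical. The standard error of an accuracy p on N items is sqrt(p(1-p)/N), which is about 3.9 points at p = 0.5 and N = 164, so a 2.5-point gap sits well inside the noise:

```python
import math


def accuracy_se(p: float, n: int) -> float:
    """Binomial standard error of an accuracy estimate p measured on n items."""
    return math.sqrt(p * (1 - p) / n)


def two_model_z(p_a: float, p_b: float, n: int) -> float:
    """z-score for the gap between two accuracies, treating the runs as independent."""
    se_diff = math.sqrt(accuracy_se(p_a, n) ** 2 + accuracy_se(p_b, n) ** 2)
    return (p_b - p_a) / se_diff


def beta_posterior_std(successes: int, n: int) -> float:
    """Std. dev. of Beta(successes + 1, failures + 1): the posterior under a uniform prior."""
    a, b = successes + 1, n - successes + 1
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


print(round(two_model_z(0.60, 0.625, 164), 2))    # well below the usual 1.96 threshold
print(round(two_model_z(0.60, 0.625, 14000), 2))  # same gap, much larger benchmark
```

With N = 164 the z-score is roughly 0.46, far from significance; with a hypothetical N = 14,000 the same 2.5-point gap exceeds 4. `beta_posterior_std(98, 164)` gives a similar spread of about 3.8 points from the Beta view of the same data.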
3.2 Trust & Validation
Trust Issues [2502.06559] Trust in Benchmarks:
- Dataset biases from creation methodology
- Data contamination in training sets
- Gaming via benchmark-specific optimization
- Over-focus on text-based one-time testing (ignores multimodal/human-AI interaction)
- Misaligned incentives: State-of-the-Art (SOTA) pursuit over societal relevance
Real-World Validation [2506.02548] CyberGym:
- Top agents: ~20% success on real tasks vs inflated benchmark scores
- Discovered 35 zero-days, 17 incomplete patches in actual CVEs
- Proof-of-concept generation validates genuine capability
3.3 Consistency Metrics
[2406.12045] τ-bench:
- pass^k metric measures consistency across multiple trials
- GPT-4o: <50% task success, <25% pass^8 in retail domain
- Domain-specific rules critical for deployment
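A sketch of how pass^k can be estimated from repeated trials, assuming the usual unbiased form: for a task with c successes in n trials, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. The function name and the success counts in the usage note are hypothetical; check the exact estimator against [2406.12045] before relying on it:

```python
from math import comb


def pass_hat_k(successes_per_task: list[int], n: int, k: int) -> float:
    """Estimate pass^k: the chance that k trials of a task ALL succeed.

    For a task with c successes in n trials, comb(c, k) / comb(n, k) is the
    fraction of size-k trial subsets that contain only successes. The result
    is averaged over tasks.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    total = sum(comb(c, k) / comb(n, k) for c in successes_per_task)
    return total / len(successes_per_task)
```

With four tasks that succeed 8, 6, 4, and 0 times out of 8 trials, `pass_hat_k([8, 6, 4, 0], n=8, k=1)` is 0.5625 (the ordinary success rate), while `k=8` drops to 0.25 because only the fully consistent task counts. This is why pass^k exposes inconsistency that average success hides.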
[2506.07982] τ²-bench Dual-Control:
- Decentralized Partially Observable Markov Decision Process (Dec-POMDP) framework tests agent-user coordination
- Performance drops significantly when users can also modify the environment
- Fine-grained ablations separate reasoning from communication errors
3.4 Ecosystem Gaps
Current State:
- Lack of interoperability between evaluation frameworks
- Limited reproducibility across implementations
- Fragmented landscape with discovery challenges
- LLM-centric evaluations, fixed harnesses, high overhead
4. Multi-Agent Systems
Coordination Patterns:
- League of Exploiters: Prevents main policy over-specialization, maintains strategy diversity through adversarial training
- Architecture: Auto-regressive action sequences (a command followed by its arguments), the same structure used in LLM function calling
5. AI Safety & Security
Attack Surface:
- Agentic AI has fundamentally larger attack surface than standalone LLMs
- Three expansion factors: tools (code/API execution), memory (state persistence), autonomy (active systems)
- Compounded vulnerabilities: all classic software vulnerabilities + new AI-specific vulnerabilities
Prompt Injection:
- Root cause: Lack of separation between control channel (system instructions) and data channel (user input)
- Direct attacks: Malicious user input treated as executable commands
- Indirect attacks: Hidden instructions in documents/webpages (e.g., white text) cause data exfiltration
Retrieval-Augmented Generation (RAG)-Specific Threats:
- Data poisoning: Small number of malicious documents in knowledge base triggered by specific keywords
- Backdoor attacks: Targeted conditional behavior changes
Defense Strategies:
- Layered defense: Guardrails and supervisors required (beyond single-layer protection)
- Least privilege / Contextual security: Dynamically restrict available tools and data access based on workflow context/step
- Separation of concerns: Isolate control and data channels where possible
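The least-privilege and channel-separation points can be sketched as a gate in front of tool execution. Everything here is hypothetical: the step names, the tool names, and the `requested_by` tag, which assumes the runtime can label whether a tool request came from the control channel or from data such as a retrieved document. Establishing that provenance reliably is the hard part in practice, and this sketch does not solve it:

```python
from dataclasses import dataclass, field

# Hypothetical workflow steps and the tools each step is allowed to call.
STEP_TOOLS: dict[str, frozenset[str]] = {
    "lookup_order": frozenset({"db.read"}),
    "issue_refund": frozenset({"db.read", "payments.refund"}),
}


@dataclass
class ToolGate:
    """Allows a tool call only for the current step and only from the control channel."""

    step: str
    audit_log: list[str] = field(default_factory=list)

    def call(self, tool: str, requested_by: str) -> bool:
        """Return True if the call is permitted; record every decision for auditing."""
        in_allowlist = tool in STEP_TOOLS.get(self.step, frozenset())
        allowed = requested_by == "control" and in_allowlist
        verdict = "allow" if allowed else "deny"
        self.audit_log.append(f"{self.step}:{tool}:{requested_by}:{verdict}")
        return allowed
```

During `lookup_order`, a request for `payments.refund` is denied even from the control channel, and any request tagged as coming from the data channel is denied regardless of the tool. The audit log gives a supervisor layer something to inspect.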