Skip to content

MAS Best Practices

Key Takeaways

  1. Production requires infrastructure: 90% of effort is reliability/safety/observability, not core AI
  2. Balance training approaches: Light Supervised Fine-Tuning (SFT) enables Reinforcement Learning (RL), which delivers diversity and exploration
  3. Statistical rigor in evaluation: Small benchmarks are noisy; validate significance before claiming improvements
  4. Diversity prevents fragility: Multi-agent leagues and varied training environments maintain robustness
  5. Security by design: Contextual access control and channel separation are not optional for agentic systems
  6. Real-world validation matters: Dynamic benchmarks and practical scenarios measure true capability

1. Production Infrastructure

Platform Requirements:

  • Adopt AI-native platforms unifying dev, training, and inference with elasticity, observability, and failure handling
  • Plan for sustained inference growth with utilization management, routing, and cost controls
  • Treat compute like supply chain: multi-cloud with portable abstractions and scheduling over heterogeneous accelerators
  • Modularize model interfaces to swap models and add inference-time techniques without rewriting agent logic

Reliability & Observability:

  • Long-running agent workflows require default reliability posture
  • Massive infrastructure needed beyond core AI: reliability, safety, observability
  • Non-deterministic agents require new testing methods (user simulation, τ-bench)
  • Standard metrics (e.g., Word Error Rate/WER) inadequate - need domain-specific quality measurements

Advanced Capabilities:

  • Agent Data Platforms provide long-term memory and integrate with Customer Data Platforms (CDPs) for proactive, context-aware engagement
  • Shift from transactional to relational agents via persistent memory systems

2. Training & Verification

Training Strategy:

  • Objective shift: Maximize verifiable rewards via environment/tool interaction (beyond human preference alone)
  • SFT + RL balance: Light SFT prevents meaningless attempts and enables tractable rollouts, then RL explores diverse tool-use trajectories
  • Data diversity: Prioritize diversity across environments, tools, and verifiers (environment & verifier diversity critical)

Verifier Design:

  • Minimize both false positives and false negatives
  • Reward all equivalent correct forms while enforcing stated constraints
  • Critical for post-training agentic model development

3. Evaluation & Benchmarking

Core Principles:

  • Holistic strategy: Evaluate many tasks and verifiers, vary harnesses and tool action spaces
  • Recognize benchmark suite defines operational notion of intelligence
  • “You can only improve what gets measured”

3.1 Design Checklist

Essential Criteria:

  1. Outcome validity: High scores genuinely reflect successful task completion (most critical)
  2. Real-world scenarios: Practical tasks (e.g., “book a flight”) over abstract puzzles
  3. Contamination resistance: Dynamic benchmarks (DynaBench, LiveCodeBench) resist training data leakage and saturation
  4. Appropriate difficulty: Stratified levels to differentiate capabilities
  5. Baseline provision: Clear reference points for comparison
  6. Reproducibility: Systematic measurement with ground truth and rigorous rubrics

Validation Methods:

  • Verifiable tasks: Exact matching, test execution, database state comparison
  • Non-verifiable tasks: Human evaluators or Large Language Model (LLM)-as-Judge with defined rubrics
  • Real artifacts: Use actual systems (e.g., 1,507 Common Vulnerabilities and Exposures/CVEs, 188 projects) not synthetic scenarios

Common Failures:

  • Task setup flaws overestimate performance by 100% [2507.02825] ABC
  • Insufficient tests (SWE-bench), degenerate solutions (TAU-bench empty responses)
  • Noisy/biased data, gaming via benchmark-specific optimization
  • Test/production environment mismatch

Statistical Requirements:

  • Small benchmarks have high noise (HumanEval N=164: 2.5% gains often insignificant)
  • Noise can follow Beta distribution based on model accuracy
  • Large multiple-choice benchmarks (MMLU, gsm8k) have better signal-to-noise than small code benchmarks

3.2 Trust & Validation

Trust Issues [2502.06559] Trust in Benchmarks:

  • Dataset biases from creation methodology
  • Data contamination in training sets
  • Gaming via benchmark-specific optimization
  • Over-focus on text-based one-time testing (ignores multimodal/human-AI interaction)
  • Misaligned incentives: State-of-the-Art (SOTA) pursuit over societal relevance

Real-World Validation [2506.02548] CyberGym:

  • Top agents: ~20% success on real tasks vs inflated benchmark scores
  • Discovered 35 zero-days, 17 incomplete patches in actual CVEs
  • Proof-of-concept generation validates genuine capability

3.3 Consistency Metrics

[2406.12045] τ-bench:

  • pass^k metric measures consistency across multiple trials
  • GPT-4o: <50% task success, <25% pass^8 in retail domain
  • Domain-specific rules critical for deployment

[2506.07982] τ²-bench Dual-Control:

  • Decentralized Partially Observable Markov Decision Process (Dec-POMDP) framework tests agent-user coordination
  • Performance drops significantly when users modify environment
  • Fine-grained ablations separate reasoning from communication errors

3.4 Ecosystem Gaps

Current State:

  • Lack of interoperability between evaluation frameworks
  • Limited reproducibility across implementations
  • Fragmented landscape with discovery challenges
  • LLM-centric evaluations, fixed harnesses, high overhead

4. Multi-Agent Systems

Coordination Patterns:

  • League of Exploiters: Prevents main policy over-specialization, maintains strategy diversity through adversarial training
  • Architecture: Auto-regressive action sequences (commands + arguments) used in LLM function calling

5. AI Safety & Security

Attack Surface:

  • Agentic AI has fundamentally larger attack surface than standalone LLMs
  • Three expansion factors: tools (code/API execution), memory (state persistence), autonomy (active systems)
  • Compounded vulnerabilities: all classic software vulnerabilities + new AI-specific vulnerabilities

Prompt Injection:

  • Root cause: Lack of separation between control channel (system instructions) and data channel (user input)
  • Direct attacks: Malicious user input treated as executable commands
  • Indirect attacks: Hidden instructions in documents/webpages (e.g., white text) cause data exfiltration

Retrieval-Augmented Generation (RAG)-Specific Threats:

  • Data poisoning: Small number of malicious documents in knowledge base triggered by specific keywords
  • Backdoor attacks: Targeted conditional behavior changes

Defense Strategies:

  • Layered defense: Guardrails and supervisors required (beyond single-layer protection)
  • Least privilege / Contextual security: Dynamically restrict available tools and data access based on workflow context/step
  • Separation of concerns: Isolate control and data channels where possible