Multi-Agent System Evaluation: Macro-Eval Workflow Tutorial Released
Summary
- A new cookbook teaches macro-eval workflows for multi-agent AI systems at scale
- Framework separates per-agent evals from population-level pattern discovery across traces
- Synthetic EV order workflow with six specialist agents serves as the worked example
- Teams can move from thousands of agent events to a few actionable failure patterns
Details
Two-level eval architecture: lower-level and macro
Lower-level evals grade individual agents, handoffs, tools, and completed runs. Macro evals then operate across many lower-level findings to surface recurring population-level patterns, directing engineering effort toward the highest-leverage intervention point.
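The roll-up from local findings to population-level patterns can be illustrated with a small sketch. It is not the cookbook's implementation: the tuple layout, the finding names, and the `min_runs` threshold are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical lower-level findings: one (run_id, agent, finding) tuple
# per graded event. Names and values are illustrative only.
findings = [
    ("run-001", "pricing", "ignored_incentive"),
    ("run-002", "pricing", "ignored_incentive"),
    ("run-002", "supply", "missed_stockout"),
    ("run-003", "pricing", "ignored_incentive"),
    ("run-004", "orchestrator", "skipped_review"),
]


def macro_patterns(findings, min_runs=2):
    """Count distinct runs per (agent, finding) and keep the recurring ones."""
    runs_per_pattern = {}
    for run_id, agent, finding in findings:
        runs_per_pattern.setdefault((agent, finding), set()).add(run_id)
    counts = Counter({key: len(runs) for key, runs in runs_per_pattern.items()})
    # most_common() sorts by frequency, so the highest-leverage pattern comes first.
    return [(key, n) for key, n in counts.most_common() if n >= min_runs]


patterns = macro_patterns(findings)
print(patterns)
```

Counting distinct runs rather than raw events keeps one noisy trace from dominating the ranking, which is one plausible way to decide where engineering effort should go first.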
Synthetic EV workflow with six specialist agents
Agents handle pricing, compliance, supply, factory routing, scheduling, and release decisions. Market and operational conditions vary across scenarios, producing a realistic mix of case types including clean orders, validation blocks, supplier substitutions, and pricing exceptions.
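As a rough sketch of how varying conditions could yield that mix of case types, the snippet below enumerates three hypothetical boolean conditions and maps each combination to a case type. The condition names, the precedence order, and the case-type strings are assumptions, not the cookbook's scenario generator.

```python
from collections import Counter
from itertools import product


def case_type(validation_passed, supplier_in_stock, price_within_policy):
    """Map hypothetical scenario conditions to one case type (first match wins)."""
    if not validation_passed:
        return "validation_block"
    if not supplier_in_stock:
        return "supplier_substitution"
    if not price_within_policy:
        return "pricing_exception"
    return "clean_order"


# Enumerate every combination of the three conditions.
scenarios = [
    {"conditions": flags, "case_type": case_type(*flags)}
    for flags in product([True, False], repeat=3)
]

mix = Counter(s["case_type"] for s in scenarios)
print(dict(mix))
```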
Four structured labels organize the full analysis pipeline
case_type describes the business setup; run_outcome describes how the run ended (completed, awaiting review, blocked, or failed); eval_finding captures the local agent-level symptom; behavior_pattern is the recurring population-wide pattern discovered across many traces.
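One way to picture the four labels is as a single record per trace. The sketch below uses the field names from the description above; the class names, the enum spelling of the outcomes, and the example values are hypothetical, and the cookbook's actual schema may differ.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class RunOutcome(Enum):
    """The four ways a run can end, per the description above."""
    COMPLETED = "completed"
    AWAITING_REVIEW = "awaiting_review"
    BLOCKED = "blocked"
    FAILED = "failed"


@dataclass(frozen=True)
class TraceLabels:
    case_type: str                          # business setup of the scenario
    run_outcome: RunOutcome                 # how the run ended
    eval_finding: Optional[str]             # local agent-level symptom, if any
    behavior_pattern: Optional[str] = None  # assigned later, across many traces


# Illustrative values only.
example = TraceLabels(
    case_type="pricing_exception",
    run_outcome=RunOutcome.COMPLETED,
    eval_finding="pricing_agent_ignored_incentive",
)

# The population-level label is attached after macro analysis.
labeled = replace(example, behavior_pattern="incentives_dropped_on_exceptions")
print(labeled)
```

Leaving `behavior_pattern` empty until the macro step reflects the split described above: the first three labels are known per trace, while the fourth only emerges from the population.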
Promptfoo used as stand-in agent-level eval layer
Promptfoo stands in for the agent-level eval layer. The notebook ships precomputed synthetic traces and saved lower-level eval labels, so practitioners can run the full macro-eval workflow without a live OpenAI API key, lowering the barrier to experimentation.
Plausible final outputs can hide serious upstream failures
A release recommendation can appear correct while the underlying trace reveals that the pricing agent ignored an incentive, the supply agent missed a stockout, or the orchestrator bypassed a required review step. These failures are invisible without trace-level analysis.
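A minimal sketch of the bypassed-review case, assuming a trace is a list of step events: an output-only check passes, while a check of the trace against a list of required steps does not. The step names, the event shape, and `REQUIRED_STEPS` are hypothetical.

```python
# Hypothetical list of steps every released order is expected to pass through.
REQUIRED_STEPS = ["pricing", "compliance", "supply", "review", "release"]


def missing_steps(trace, required=REQUIRED_STEPS):
    """Return the required steps that never appear in the trace, in order."""
    seen = {event["step"] for event in trace}
    return [step for step in required if step not in seen]


# Illustrative trace in which the orchestrator skipped the review step.
trace = [
    {"step": "pricing", "status": "ok"},
    {"step": "compliance", "status": "ok"},
    {"step": "supply", "status": "ok"},
    {"step": "release", "status": "ok", "output": "approve"},
]

final_output_ok = trace[-1]["output"] == "approve"  # output-only eval passes
skipped = missing_steps(trace)                      # trace-level eval does not

print(final_output_ok, skipped)
```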
What This Means
AI engineering teams running multi-agent systems now have a concrete, runnable framework for moving from thousands of raw trace events to a prioritized shortlist of recurring failure patterns without needing live API access. This is practical infrastructure for any team trying to maintain reliability in agentic pipelines beyond simple input-output evals.
Sources
- Evaluating Multi-Agent Systems at Scale (Developers)
