Multi-Agent System Evaluation: Macro-Eval Workflow Tutorial Released

Products · 1 source · May 25

agents multi-agent agent-evaluation observability

Summary

• A new cookbook teaches macro-eval workflows for multi-agent AI systems at scale
• Framework separates per-agent evals from population-level pattern discovery across traces
• Synthetic EV order workflow with six specialist agents serves as the worked example
• Teams can move from thousands of agent events to a few actionable failure patterns

Details

#	Type	Key Point	Context
1	Tech Info	Two-level eval architecture: lower-level and macro	Lower-level evals grade individual agents, handoffs, tools, and completed runs. Macro evals then operate across many lower-level findings to surface recurring population-level patterns, directing engineering effort toward the highest-leverage intervention point.
2	Tech Info	Synthetic EV workflow with six specialist agents	Agents handle pricing, compliance, supply, factory routing, scheduling, and release decisions. Market and operational conditions vary across scenarios, producing a realistic mix of case types including clean orders, validation blocks, supplier substitutions, and pricing exceptions.
3	Research	Four structured labels organize the full analysis pipeline	case_type describes the business setup; run_outcome describes how the run ended (completed, awaiting review, blocked, or failed); eval_finding captures the local agent-level symptom; behavior_pattern is the recurring population-wide pattern discovered across many traces.
4	Tech Info	Promptfoo used as stand-in agent-level eval layer	The notebook uses precomputed synthetic traces and saved lower-level eval labels so practitioners can run the full macro-eval workflow without a live OpenAI API key, lowering the barrier to experimentation.
5	Insight	Plausible final outputs can hide serious upstream failures	A release recommendation can appear correct while the underlying trace reveals the pricing agent ignored an incentive, the supply agent missed a stockout, or the orchestrator bypassed a required review step — failures invisible without trace-level analysis.

Tech Info = technical specification or detail, Research = methodology or study finding, Insight = analytical observation derived from the content
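As a rough illustration of how the four labels could sit together on one run, and how a trace-level check might expose a failure hidden behind a plausible final output, here is a minimal Python sketch. The `LabeledRun` class, the `hidden_failures` helper, and all label values are hypothetical and chosen for this example; they are not taken from the cookbook.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LabeledRun:
    """One workflow run carrying the four labels (hypothetical schema)."""
    run_id: str
    case_type: str                      # business setup, e.g. "clean_order"
    run_outcome: str                    # completed, awaiting_review, blocked, failed
    eval_findings: List[str] = field(default_factory=list)  # local agent-level symptoms
    behavior_pattern: Optional[str] = None  # assigned later by the macro eval


def hidden_failures(runs: List[LabeledRun]) -> List[LabeledRun]:
    """Return runs that ended 'completed' yet carry at least one eval finding."""
    return [r for r in runs if r.run_outcome == "completed" and r.eval_findings]


# Illustrative data only; the finding names are invented for this sketch.
runs = [
    LabeledRun("run-001", "clean_order", "completed"),
    LabeledRun("run-002", "pricing_exception", "completed",
               ["pricing_ignored_incentive"]),
    LabeledRun("run-003", "supplier_substitution", "blocked",
               ["supply_missed_stockout"]),
]

flagged = hidden_failures(runs)
print([r.run_id for r in flagged])  # ['run-002']
```

Only `run-002` is flagged: it finished as completed, so its final output looks fine, but the saved lower-level finding shows an upstream problem. `run-003` also has a finding, but its blocked outcome already makes the problem visible.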

What This Means

AI engineering teams running multi-agent systems now have a concrete, runnable framework for moving from thousands of raw trace events to a prioritized shortlist of recurring failure patterns without needing live API access. This is practical infrastructure for any team trying to maintain reliability in agentic pipelines beyond simple input-output evals.
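The step from many raw findings to a short prioritized list can be sketched as a simple count of affected runs per finding. This is a simplified stand-in, assuming findings already carry consistent labels; the summary does not say how the cookbook itself discovers patterns. The `rank_patterns` function, the tuple layout, and the sample data are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# (run_id, agent, eval_finding): hypothetical flattened lower-level eval output.
Finding = Tuple[str, str, str]


def rank_patterns(findings: List[Finding], top_n: int = 3):
    """Rank (agent, finding) pairs by the number of distinct runs they affect."""
    runs_per_pattern: Dict[Tuple[str, str], Set[str]] = {}
    for run_id, agent, finding in findings:
        # A set deduplicates repeated events from the same run.
        runs_per_pattern.setdefault((agent, finding), set()).add(run_id)
    counts = Counter({key: len(ids) for key, ids in runs_per_pattern.items()})
    return counts.most_common(top_n)


# Illustrative data only.
findings = [
    ("run-1", "pricing", "incentive_not_applied"),
    ("run-1", "pricing", "incentive_not_applied"),   # duplicate event, same run
    ("run-2", "pricing", "incentive_not_applied"),
    ("run-3", "pricing", "incentive_not_applied"),
    ("run-3", "supply", "stockout_missed"),
    ("run-4", "supply", "stockout_missed"),
    ("run-5", "orchestrator", "review_step_skipped"),
]

shortlist = rank_patterns(findings, top_n=2)
for (agent, finding), n_runs in shortlist:
    print(f"{agent}: {finding} affects {n_runs} runs")
```

Counting distinct runs rather than raw events keeps one noisy run from dominating the ranking, so the shortlist reflects how widespread each symptom is across the population.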

Sources

Evaluating Multi-Agent Systems at Scale · Developers
