
Multi-Agent System Evaluation: Macro-Eval Workflow Tutorial Released

Products · 1 source · May 25

Summary

  • A new cookbook teaches macro-eval workflows for multi-agent AI systems at scale
  • Framework separates per-agent evals from population-level pattern discovery across traces
  • Synthetic EV order workflow with six specialist agents serves as the worked example
  • Teams can move from thousands of agent events to a few actionable failure patterns

Details

1. Tech Info

Two-level eval architecture: lower-level and macro

Lower-level evals grade individual agents, handoffs, tools, and completed runs. Macro evals then operate across many lower-level findings to surface recurring population-level patterns, directing engineering effort toward the highest-leverage intervention point.
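As a rough illustration of the macro step, the sketch below counts how many distinct traces share the same lower-level finding and keeps only the recurring ones. The record fields (`trace_id`, `agent`, `finding`), the sample values, and the `rank_patterns` helper are hypothetical and are not taken from the cookbook; the cookbook's own pattern discovery may work quite differently.

```python
from collections import Counter

# Hypothetical lower-level eval findings, one per flagged agent step.
findings = [
    {"trace_id": "t1", "agent": "pricing", "finding": "ignored_incentive"},
    {"trace_id": "t2", "agent": "pricing", "finding": "ignored_incentive"},
    {"trace_id": "t2", "agent": "supply", "finding": "missed_stockout"},
    {"trace_id": "t3", "agent": "pricing", "finding": "ignored_incentive"},
    {"trace_id": "t4", "agent": "supply", "finding": "missed_stockout"},
    {"trace_id": "t5", "agent": "scheduling", "finding": "slot_conflict"},
]


def rank_patterns(findings, min_traces=2):
    """Return (agent, finding) pairs seen in at least `min_traces` distinct
    traces, ordered from most to least frequent."""
    # Deduplicate so a finding repeated within one trace counts once.
    seen = {(f["trace_id"], f["agent"], f["finding"]) for f in findings}
    counts = Counter((agent, finding) for _, agent, finding in seen)
    return [(key, n) for key, n in counts.most_common() if n >= min_traces]


patterns = rank_patterns(findings)
for (agent, finding), n in patterns:
    print(f"{agent}: {finding} in {n} traces")
```

Counting distinct traces rather than raw events keeps one noisy run from dominating the ranking, which is one plausible way to point effort at the most widespread problem first.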

2. Tech Info

Synthetic EV workflow with six specialist agents

Agents handle pricing, compliance, supply, factory routing, scheduling, and release decisions. Market and operational conditions vary across scenarios, producing a realistic mix of case types including clean orders, validation blocks, supplier substitutions, and pricing exceptions.

3. Research

Four structured labels organize the full analysis pipeline

case_type describes the business setup; run_outcome describes how the run ended (completed, awaiting review, blocked, or failed); eval_finding captures the local agent-level symptom; behavior_pattern is the recurring population-wide pattern discovered across many traces.
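One way to picture the four labels is as fields on a single per-trace record. The sketch below is an assumption about shape only: the label names and the four run outcomes come from the summary above, while the `TraceLabels` and `RunOutcome` types and the sample values are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RunOutcome(Enum):
    """How the run ended, per the four outcomes named in the summary."""
    COMPLETED = "completed"
    AWAITING_REVIEW = "awaiting_review"
    BLOCKED = "blocked"
    FAILED = "failed"


@dataclass
class TraceLabels:
    case_type: str                          # business setup of the scenario
    run_outcome: RunOutcome                 # how the run ended
    eval_finding: Optional[str] = None      # local agent-level symptom, if any
    behavior_pattern: Optional[str] = None  # assigned later, across many traces


# Hypothetical example: a run that completed but carries a local finding.
example = TraceLabels(
    case_type="pricing_exception",
    run_outcome=RunOutcome.COMPLETED,
    eval_finding="ignored_incentive",
)
```

Leaving `behavior_pattern` empty by default reflects that it is discovered at the population level rather than graded per run.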

4. Tech Info

Promptfoo used as stand-in agent-level eval layer

The notebook uses precomputed synthetic traces and saved lower-level eval labels so practitioners can run the full macro-eval workflow without a live OpenAI API key, lowering the barrier to experimentation.

5. Insight

Plausible final outputs can hide serious upstream failures

A release recommendation can appear correct while the underlying trace reveals that the pricing agent ignored an incentive, the supply agent missed a stockout, or the orchestrator bypassed a required review step. Such failures are invisible without trace-level analysis.
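A minimal sketch of that check, assuming each trace carries its run outcome plus a list of lower-level findings: flag runs that ended as completed yet still have upstream findings attached. The trace structure, finding strings, and `hidden_failures` helper are hypothetical and only illustrate the idea.

```python
# Hypothetical traces with saved lower-level eval findings.
traces = [
    {"trace_id": "t1", "run_outcome": "completed", "eval_findings": []},
    {"trace_id": "t2", "run_outcome": "completed",
     "eval_findings": ["pricing: ignored_incentive"]},
    {"trace_id": "t3", "run_outcome": "blocked",
     "eval_findings": ["supply: missed_stockout"]},
    {"trace_id": "t4", "run_outcome": "completed",
     "eval_findings": ["orchestrator: skipped_review"]},
]


def hidden_failures(traces):
    """Return IDs of traces that completed but still have upstream findings."""
    return [
        t["trace_id"]
        for t in traces
        if t["run_outcome"] == "completed" and t["eval_findings"]
    ]


flagged = hidden_failures(traces)
print(flagged)
```

An eval that looks only at the final recommendation would pass every completed run here; the findings list is what separates the clean run from the ones with upstream problems.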

Tech Info = technical specification or detail, Research = methodology or study finding, Insight = analytical observation derived from the content

What This Means

AI engineering teams running multi-agent systems now have a concrete, runnable framework for moving from thousands of raw trace events to a prioritized shortlist of recurring failure patterns without needing live API access. This is practical infrastructure for any team trying to maintain reliability in agentic pipelines beyond simple input-output evals.
