Aggregate metrics are a blind spot in agent evaluation

Why aggregate eval metrics hide AI agent regressions, and how statistical testing catches what aggregates miss.