testing — orekhov.work

Why aggregate eval metrics hide AI agent regressions, and how statistical testing catches what aggregates miss.