Your model migration passed. Here's what the aggregate didn't show.
75% of AI agents break working behavior over time — including across model upgrades. Dashboards show the aggregate. Statistical comparison shows what moved underneath.
7 articles
75% of AI agents break working behavior over time — including across model upgrades. Dashboards show the aggregate. Statistical comparison shows what moved underneath.
When agent traces are trees, naive aggregation of cost, tokens, and step counts produces wrong numbers. Here's the problem, what major platforms do about it, and the concrete approaches that work.
Why aggregate eval metrics hide AI agent regressions, and how statistical testing catches what aggregates miss.
Exactly-once delivery is impossible at the transport layer. The pattern that gives you the semantics anyway: at-least-once delivery plus an idempotent writer.
A model scores 92% on MMLU — but did it learn the concepts or memorize the answers? Four detection strategies, from first principles.
LLM-as-a-judge from first principles — when to use it, how to design rubrics, the three biases that skew scores, and when to use something simpler.
Why ClickHouse queries billions of rows in milliseconds — columnar storage, compression, and the MergeTree engine explained from first principles.