All posts

7 articles

2026

Mar 23, 2026 4 min

Your model migration passed. Here's what the aggregate didn't show.

75% of AI agents break working behavior over time, including across model upgrades. Dashboards show the aggregate. Statistical comparison shows what moved underneath.

Mar 20, 2026 9 min

When agent trace metrics lie: the span tree double-counting problem

When agent traces are trees, naive aggregation of cost, tokens, and step counts produces wrong numbers. Here's the problem, what major platforms do about it, and the concrete approaches that work.

AI agents, observability, OpenTelemetry, OpenInference, evaluation

Mar 19, 2026 5 min

Aggregate metrics are a blind spot in agent evaluation

Why aggregate eval metrics hide AI agent regressions, and how statistical testing catches what aggregates miss.

AI agents, evaluation, open source, testing

Mar 12, 2026 5 min

Exactly-once semantics on at-least-once infrastructure

Exactly-once delivery is impossible at the transport layer. The pattern that gives you the semantics anyway: at-least-once delivery plus an idempotent writer.

system design, distributed systems, data pipelines

Mar 8, 2026 7 min