How to detect benchmark contamination in LLMs
A model scores 92% on MMLU — but did it learn the concepts or memorize the answers? Four detection strategies, from first principles.
LLM-as-a-judge from first principles — when to use it, how to design rubrics, the three biases that skew scores, and when to use something simpler.