How to detect benchmark contamination in LLMs
A model scores 92% on MMLU — but did it learn the concepts or memorize the answers? Four detection strategies, from first principles.