Carol Anderson, Ph.D.

Latest overblown AI claim: GPT-4 achieves a perfect score on questions representing the entire MIT Math, EE, and CS curriculum

This paper, released on arXiv two days ago, is getting a lot of attention.

But some issues with it were immediately apparent:

  • Answers were graded automatically by GPT-4 itself.

  • When the model gave a wrong answer, it was reprompted with different prompt formats until it produced the correct answer (see the sketch after this list for why that alone inflates scores).
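To see why retry-until-correct grading inflates scores, here is a minimal, runnable simulation. The probability of a correct attempt and the number of prompt formats are illustrative assumptions, not figures from the paper: a model that gets any single attempt right only half the time still "passes" nearly every question when it gets five tries.

```python
import random

random.seed(0)

# Toy numbers (assumptions, not from the paper): a single attempt is
# correct with probability P_CORRECT, and the pipeline retries with
# PROMPT_FORMATS different prompts until one attempt is graded correct.
P_CORRECT = 0.5
PROMPT_FORMATS = 5
N_QUESTIONS = 10_000

def one_attempt() -> bool:
    return random.random() < P_CORRECT

def retry_until_correct() -> bool:
    # A question counts as "solved" if ANY prompt format succeeds.
    return any(one_attempt() for _ in range(PROMPT_FORMATS))

single = sum(one_attempt() for _ in range(N_QUESTIONS)) / N_QUESTIONS
retried = sum(retry_until_correct() for _ in range(N_QUESTIONS)) / N_QUESTIONS
print(f"single-attempt accuracy: {single:.1%}")   # ~50%
print(f"retry-until-correct:     {retried:.1%}")  # ~97%, i.e. 1 - 0.5**5
```

And that assumes a reliable grader. When GPT-4 also grades its own answers, a lenient self-grade can stop the retry loop early even on a wrong answer, pushing the headline number toward 100%.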

The researchers didn’t make their data publicly available, to prevent contamination of future LLMs, but three MIT undergrads managed to dig the test set out of the authors' git history (a sketch of the general technique follows the list below). This led to some interesting discoveries:

  • 4% of the test set questions were unanswerable. Some of them weren’t even questions. How did GPT-4 get them right?

  • 5% of the test set questions were duplicates.

  • Many of the few-shot prompts provided to the model contained the answer, largely due to leakage of information between multi-part questions.
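For the curious, recovering a "deleted" file from a repository is straightforward, because git keeps every committed version. Here is a generic sketch of the technique; this is my illustration, not the students' actual script, and the repo path is a placeholder:

```python
import subprocess

def files_ever_committed(repo: str) -> set[tuple[str, str]]:
    """Return (commit, path) pairs for every file in any commit of `repo`."""
    commits = subprocess.run(
        ["git", "-C", repo, "rev-list", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    found = set()
    for commit in commits:
        # List every file present in this commit's tree.
        paths = subprocess.run(
            ["git", "-C", repo, "ls-tree", "-r", "--name-only", commit],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()
        found.update((commit, path) for path in paths)
    return found

# Any version spotted above can then be recovered with:
#   git -C <repo> show <commit>:<path>
```

Scrubbing a file from the latest commit isn't enough; it has to be purged from every commit that ever contained it.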

The incisive critique by MIT students Raunak Chowdhuri, Neil Deshmukh, and David Koplow is well worth reading.