Over the past few years, artificial intelligence has advanced in remarkable ways. Today, machines can not only solve complex problems but also develop their own proof strategies. But are they really that smart? In a new study, leading mathematicians put cutting-edge AI systems to the test. The paper, which has not yet been peer-reviewed, is available on the preprint server arXiv.
Ability of AI to solve problems
The ability of AI to solve problems such as those in the GSM8K set (about 8,500 grade-school math word problems that take multiple steps to solve) or at the International Mathematical Olympiad is impressive. Still, these problems sit at the level of an advanced school student, not at the limits of human mathematical knowledge.
There is also a shortage of fresh problems on which to test AI programs.
A significant problem when assessing large language models (LLMs) is data contamination, that is, the unintentional inclusion of test problems in the training data, the researchers write.

As a result, the models' success rates are inflated, much like the score of a student who knows the test answers in advance, and their true reasoning abilities are obscured.
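As a rough illustration of how such contamination might be spotted, the Python sketch below measures how many of a test problem's word n-grams also appear in a body of training text. The function names, the whitespace tokenization, and the default n-gram length are assumptions made for this example; this is not the procedure used in the study.

```python
def ngrams(text, n=8):
    """Return the set of word n-grams in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(problem, corpus_docs, n=8):
    """Fraction of the problem's n-grams that also occur in any corpus document.

    Hypothetical helper: a high value suggests the problem text may have
    leaked into the training data; 0.0 means no n-gram overlap was found.
    """
    problem_grams = ngrams(problem, n)
    if not problem_grams:
        # The problem is shorter than n words, so there is nothing to compare.
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(problem_grams & corpus_grams) / len(problem_grams)
```

A problem copied verbatim into the corpus scores 1.0, while a genuinely new problem should score close to 0.0. Real checks would need to be more robust, since a lightly paraphrased problem shares few exact n-grams with its source.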
This is no empty promise: the project involved Fields Medalists, including those who submitted problems to the data set, and mathematicians at the graduate level and above from universities around the world.
The proposed problems had to satisfy four criteria: be original, meaning that solving them requires genuine mathematical insight rather than matching them to known problems; have answers that cannot simply be guessed; be computationally solvable; and be quickly and automatically verifiable. Once the problems had been checked against all these criteria, they were peer-reviewed, given difficulty ratings, and submitted to the AI.
Could today’s programs handle them? Alas, no.
The solutions are so complex that they would require large amounts of training data that are simply not available, notes Fields Medalist Terry Tao. The authors consider this a temporary limitation, however, and expect the situation to change as AI systems improve.