AI passes doctoral exams but fails to count words. Ilya Sutskever (ex-OpenAI) breaks his silence and explains why scaling up models will not resolve this paradox.
Ilya Sutskever has just broken his silence. A co-founder of OpenAI and the company’s former chief scientist, this publicity-shy computer scientist made headlines at the end of 2023 by taking part in the brief ouster of CEO Sam Altman, before leaving the company in May 2024 to launch his own startup. Known for avoiding the spotlight and granting very few interviews, he recently spoke on Dwarkesh Patel’s podcast, delivering an analysis of the current state of artificial intelligence that is as surprising as it is unsettling.
His observation? Imagine a student capable of passing the entrance exam to École Polytechnique, yet unable to count the words in a sentence. That is exactly the paradox facing today’s most advanced artificial intelligences. Sutskever sums up the enigma with a troubling question:
How can we explain that these systems obtain excellent test results, while their real economic impact remains so limited?
Spectacular performances on difficult exercises
Recent artificial intelligence models are racking up impressive feats. Take OpenAI’s o1 model: it ranks among the top 11% of programmers in international coding competitions, and even outperforms doctoral students on advanced questions in physics, chemistry and biology.
These systems function as immense living libraries, able to draw on the astronomical quantities of information absorbed during training. They excel in particular on standardized university exams and professional qualification tests, where recalling knowledge and solving well-defined problems are what count.
However, this success hides a major problem: data contamination. Put plainly, some models have already “seen” the exam questions during their training. It’s like a student taking a test with the answer sheet in their pocket. Under these conditions, the AI merely reproduces memorized answers rather than truly reasoning.
To ensure an honest assessment, researchers are now developing tests guaranteed to be “never before seen” by the models, in order to measure their real capacity for understanding.
Surprising failures on elementary problems
This is where the problem lies. While it can solve complex equations, AI regularly stumbles on tasks that a ten-year-old child would master without difficulty.
Researchers have created an eye-opening test called Unpuzzles. The principle? Take famous logic puzzles and simplify them to the point of triviality. The result: models that brilliantly solve the difficult version fail on the easy one.
Why? Because they overthink. Faced with a simple problem, the AI automatically reaches for the sophisticated techniques it learned for complex ones, where plain common sense would suffice. It’s like using a jackhammer to drive in a nail.
You have surely noticed it yourself: artificial intelligence also struggles when a task requires precision and sustained attention. These limitations are all the more important to understand for anyone wishing to use generative AI in a professional context, where accuracy and reliability are essential:
- Counting words: as soon as the text exceeds a few lines, performance plummets. Some models fall to a 0% success rate when asked to count six specific words in a 150-word paragraph; errors accumulate as the count progresses.
- Planning a trip: faced with simple constraints (budget, number of cities, means of transport), the AI sometimes invents figures. In one test, the Claude 3.7 model claimed a trip cost $93 instead of $103, simply to make the result fit the requested budget.
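To see why the word-counting failure is so striking, note that the task is trivial for a deterministic program. A minimal Python sketch (the paragraph text below is purely illustrative):

```python
# Deterministic word counting -- the task that trips up large
# language models on longer passages.
from collections import Counter

paragraph = (
    "the cat sat on the mat and the dog watched "
    "the cat from the door"
)

# Split on whitespace and tally every word exactly once per occurrence.
counts = Counter(paragraph.lower().split())

print(counts["the"])  # prints 5
print(counts["cat"])  # prints 2
```

Unlike a language model generating the count token by token, this loop cannot drift: each occurrence is tallied exactly once, so accuracy does not degrade as the text grows.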
The AI is also thrown off by irrelevant details. Add an innocuous sentence to a math problem (for example, “Carlo is 35 years old and likes pizza”) and performance drops dramatically. Worse yet, some models use this superfluous information to “justify” their reasoning, going so far as to treat a character’s age as a mathematical clue.
Simple family relationships are also problematic. Ask the AI how many sisters Alice’s brother has, given that Alice has two sisters, and it can get lost in convoluted reasoning instead of simply adding Alice herself to the count.
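For reference, the puzzle reduces to a single addition, sketched here under the stated assumption that Alice counts as one of her brother’s sisters:

```python
# Alice has two sisters (and at least one brother).
alices_sisters = 2

# Her brother's sisters are Alice's two sisters plus Alice herself.
brothers_sisters = alices_sisters + 1

print(brothers_sisters)  # prints 3
```

The whole difficulty lies in the perspective shift (counting Alice from her brother’s point of view), which is exactly the kind of one-step common-sense move the models miss.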
Understanding the gap: human versus machine
Ilya Sutskever illustrates the difference with a telling analogy: “AI looks like a student who has spent 10,000 hours training solely for programming competitions. He has memorized all the techniques and become ultra-specialized, but he cannot adapt outside that narrow domain. Humans have this mysterious ability to generalize from far less data.”
The way AI is scored amplifies this illusion of performance. Current benchmarks often focus on isolated tasks and multiple-choice questions, where an AI can shine by recognizing superficial patterns rather than genuinely understanding.
Additionally, models display inordinate confidence in their wrong answers, accompanied by rationales that sound logical but are completely false. These hallucinations make their mistakes all the more misleading.
To assess true intelligence, researchers are developing new criteria: goal success rate, ability to work independently, and above all, resilience to errors in tasks that require several steps.
From the race for size to the era of fundamental research
The gap between AI’s academic achievements and its fragility in the face of basic problems teaches us a crucial lesson: increasing the size of models and the amount of data is not enough to create truly flexible and reliable intelligence.
Ilya Sutskever thus predicts the end of the era of the “race to gigantism” and the entry into the “age of research”. The challenge is no longer to build ever-larger models, but to discover the learning principle that will allow natural generalization, this capacity that we, humans, possess instinctively.
To truly evaluate the next generation of artificial intelligence, we need to change our perspective: stop asking what the hardest problem it can solve is, and instead ask what the simplest problem it struggles with is.
Today’s AI is like a prodigious but rigid student: capable of reciting complex theorems after swallowing entire libraries, yet completely thrown by a simple common-sense question as soon as the context shifts slightly. The path to true general intelligence lies in mastering this simple, adaptable wisdom: the kind that seems so natural to us, but which remains the Holy Grail of artificial intelligence.