Grok-4 now surpasses OpenAI's o3-pro on benchmarks, making it the leading LLM in terms of raw performance.
It is another success for xAI. The young start-up founded by Elon Musk in March 2023 has released a family of reasoning models at the forefront of the benchmarks. Presented on Wednesday, July 9 (Thursday, July 10, Paris time), it comes in two versions: Grok-4 and Grok-4 Heavy, which mobilizes several agents in parallel to solve complex problems. xAI claims performance superior to the best models from OpenAI, Anthropic, and Google DeepMind.
A focus on reasoning
xAI has concentrated its efforts on reasoning. Unlike general-purpose models that try to excel in every area, Grok-4 focuses on tasks requiring complex reasoning and advanced logic. Rather than simply scaling up training data, xAI prioritized reinforcement learning, reportedly mobilizing "10 times more compute than any existing model on reinforcement learning, an unprecedented scale", using all 200,000 GPUs of the Colossus supercomputer.
Like o3, Gemini 2.5 Pro, or Claude 4, Grok-4 methodically breaks complex problems down into several steps and identifies logical relationships (the chain-of-thought principle). Grok-4 Heavy goes even further, using several instances of the model that approach a problem from different angles, compare their approaches, and converge on the best answer. The model has a 256,000-token context window.
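The multi-agent idea described above can be illustrated as a simple ensemble: several independent "agents" attempt the same problem, and the system converges on the most common answer. The sketch below is purely illustrative and is not xAI's actual implementation; the agents here are placeholder functions standing in for separate model instances.

```python
from collections import Counter

def solve_with_agents(problem, agents):
    """Run several independent 'agents' on the same problem and
    converge on the most frequent answer (majority vote)."""
    answers = [agent(problem) for agent in agents]
    best_answer, votes = Counter(answers).most_common(1)[0]
    return best_answer, votes

# Placeholder agents: in a real system, each would be a separate
# model instance reasoning about the problem from a different angle.
agents = [
    lambda p: sum(p),        # agent 1
    lambda p: sum(p),        # agent 2 agrees
    lambda p: sum(p) + 1,    # agent 3 is an outlier
]

answer, votes = solve_with_agents([1, 2, 3], agents)
```

Real systems compare full reasoning traces rather than just final answers, but majority voting over parallel samples is the basic principle.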
A very strong model on benchmarks
As expected, Grok-4 sets new records on several reference benchmarks. On Humanity's Last Exam (2,500 PhD-level problems), Grok-4 solves 26.9% of the questions in standard mode and more than 45% with the Heavy version. These results place it at the post-doctoral level "in all subjects, without exception", according to Elon Musk, who points out that a human would score "perhaps only 5%" on this test. In mathematics, it achieved a perfect 100% on AIME25 versus 98.4% for o3, and 96.7% on HMMT25 versus 82.5% for Claude 4 Opus.
Even more remarkable, Grok-4 becomes the first public model to cross the 10% mark on ARC-AGI, reaching 15.9% accuracy. Greg Kamradt, president of the ARC Prize, confirmed this performance after independent validation on a semi-private dataset. "Grok-4 shows non-zero levels of fluid intelligence," he said, adding that the previous best score was around 8%, held by Claude Opus 4.
We got a call from @xai 24 hours ago

"We want to test Grok 4 on ARC-AGI"

We heard the rumors. We knew it would be good. We didn't know it would become the #1 public model on ARC-AGI.

Here's the testing story and what the results mean:

Yesterday, we chatted with Jimmy from the https://t.co/3HH6EDZ9BX

– Greg Kamradt (@gregkamradt) July 10, 2025
Finally, the Artificial Analysis Intelligence Index, which aggregates seven different evaluations, places Grok-4 at the top with a score of 73 points. This score gives a good idea of its overall standing on benchmarks compared to competing models.
However, the model has notable limitations outside pure reasoning. Its multimodal capabilities remain rudimentary: Elon Musk acknowledges that Grok-4 is "partially blind" and that "its understanding of images must get much better". Even more disappointing, the model's performance in programming is mixed. On LiveCodeBench, which evaluates coding ability on recent problems, Grok-4 reaches 79.4%, level with Gemini 2.5 Pro (79.3%) and slightly behind o3. xAI also announced that a specialized coding model is in development and will be "both fast and intelligent", with availability scheduled "in a few weeks".
Steep pricing
For consumers, Grok-4 is accessible via the SuperGrok subscription at 30 dollars per month, while the SuperGrok Heavy subscription at 300 dollars per month gives access to Grok-4 Heavy and its multi-agent capabilities. This pricing grid makes xAI one of the most expensive AI providers. The model is also available via the Grok API, whose official pricing has not yet been announced.
With Grok-4, xAI temporarily takes the top spot among reasoning models, but this dominance could be short-lived. The company has set out an ambitious release calendar, with a specialized coding model in August, a multimodal agent in September, and a video generation model in October. The competition is not standing still, however: new versions of Claude have been spotted in web tests, Google is preparing Gemini 3.0, and OpenAI is expected to launch GPT-5 in the coming weeks.