Which one to choose: GPT-5.4 for the price or Claude Opus 4.6 for the precision?

On one side, GPT-5.4, which OpenAI positions as its flagship. On the other, Claude Opus 4.6, established in this segment for months. We compared their actual performance.

While OpenAI is going through a crisis of confidence linked to its partnership with the Pentagon, over which Sam Altman himself claims to have little sway regarding the military body's decisions, the San Francisco lab is mounting a product counter-offensive. To retain ChatGPT users who have been migrating to Claude in recent days, OpenAI is stepping up model releases. After GPT-5.3 Instant, dedicated to everyday tasks and deployed on March 3, the lab followed up three days later with GPT-5.4, its new model aimed at white-collar workers: a model positioned in the same segment Anthropic targets with Claude Opus 4.6, use cases applied to the enterprise. Here is our comparison.

Near parity on the benchmarks

On the benchmarks, Claude Opus 4.6 and GPT-5.4 are very often neck and neck. In web search, Opus 4.6 does slightly better than GPT-5.4 on BrowseComp (finding hard-to-find information online): 84% against 82.7%. Same micro-advantage for Anthropic on Humanity's Last Exam without tools (expert-level multidisciplinary reasoning), at 40% versus 39.8%. On pure tool use, the two models cancel each other out on τ2-bench Telecom (resolving customer-service tasks with tools), at 99.3% and 98.9% respectively; at this level the benchmark is generally considered saturated.

On the other hand, GPT-5.4 widens the gap on MCP use. On MCP Atlas (using tools at scale via MCP servers), GPT-5.4 scores 67.2% versus 59.5%, a significant advantage for configurations involving many connectors. In vision and visual reasoning, GPT-5.4 dominates MMMU Pro (visual understanding and reasoning) at 81.2% versus 73.9% for Claude Opus 4.6. In coding, the match is essentially a draw on SWE-bench Verified (resolving real bugs): 80.8% for Opus 4.6, 80% for GPT-5.4.

| Benchmark | Measures | GPT-5.4 | Opus 4.6 |
|---|---|---|---|
| BrowseComp | Finding hard-to-find information online | 82.7% | 84% |
| Humanity's Last Exam (without tools) | Expert multidisciplinary reasoning | 39.8% | 40% |
| τ2-bench Telecom | Solving customer-service tasks with tools | 98.9% | 99.3% |
| MCP Atlas | Using tools at scale via MCP | 67.2% | 59.5% |
| MMMU Pro | Visual understanding and reasoning | 81.2% | 73.9% |
| SWE-bench Verified | Resolving real bugs | 80% | 80.8% |

On paper, then, GPT-5.4 puts OpenAI genuinely back level with Anthropic. The LLM looks cut out for environments with many MCP connectors, typically the case for an agent. Opus 4.6 shines on pure reasoning and persistence, with better performance on long contexts and deep web search. In code, finally, the two are level. Note, however, that GPT-5.4 is not specifically designed for development; OpenAI is likely to release a Codex version, optimized for that field, in the coming weeks.

The JDN comparison

To give an idea of the performance of the two models, we subjected them to three different use cases: summarizing a research paper in 100 words maximum, generating a complete Excel sheet from the last four quarterly reports of a listed company, and producing the SVG image of an iPhone. Three exercises mobilizing distinct skills: synthesis under a strict constraint, extraction and structuring of real financial data, and generation of complex visual code.

Summarize a research paper in 100 words: GPT-5.4 ahead

The aim here is to test the models' ability to analyze long, complex documents (tables, charts, etc.), distill their essence, and scrupulously respect a numerical constraint (100 words).

Prompt: From this research paper, generate a summary of exactly 100 words (count each word and check before answering). The summary must cover: (1) the methodology used, (2) the main results. Factual and precise tone, with no introductory phrases. After your answer, give the total word count in parentheses.

Result: neither model strictly respects the 100-word limit. GPT-5.4 produces 109 words; Opus 4.6 runs to 116. On substance, GPT-5.4 generates the clearer text, laying out the methodology step by step before delivering the key figures. Opus 4.6 is denser, stacking more numerical data. The point here goes to GPT-5.4.
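The word-count constraint the prompt imposes is trivial to verify programmatically; a minimal sketch of the check we applied to both outputs:

```python
# Count words the way the prompt's constraint is checked:
# split on whitespace and count the resulting tokens.
def word_count(text: str) -> int:
    return len(text.split())

def respects_limit(text: str, limit: int = 100) -> bool:
    return word_count(text) <= limit

# A 109-word dummy summary, the length GPT-5.4 actually produced.
sample = "word " * 109
print(word_count(sample), respects_limit(sample))  # → 109 False
```

Both models fail this check, which is why the point is awarded on clarity rather than compliance.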

Produce an Excel sheet from the financial results of a listed company: Opus 4.6 wins

The objective here is to test the ability of GPT-5.4 and Claude Opus 4.6 to ingest raw financial data and render it as a structured, usable Excel file. We use Tesla's results for Q1 through Q4 2025.

Prompt: From the four attached Tesla quarterly reports (Q1, Q2, Q3 and Q4 2025), generate a complete Excel file including: a "Data" tab with a structured table containing, for each quarter, revenue, cost of revenue, gross profit, gross margin as a percentage, operating income, net income, EPS, operating cash flow, capex and free cash flow; a "Dashboard" tab with charts showing revenue by quarter as bars, gross margin as a line, a grouped-bar comparison of net income vs. operating cash flow, and free cash flow as a line; an "Analysis" tab with 5 bullet points summarizing the key trends over the year. Format the file professionally, with colored headers, formatted numbers and sources indicated.

Claude Opus 4.6 generates the file in around 4 minutes, whereas GPT-5.4 takes more than 21, having had to retry several times after file-generation errors. On results, Claude follows the instructions to the letter: every tab, every requested chart, every metric calculated. GPT-5.4 delivers a visually more polished but incomplete file: some metrics are not calculated, such as the net income vs. operating cash flow comparison.
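The derived columns the prompt asks for follow directly from the reported line items (gross margin is gross profit over revenue; free cash flow is operating cash flow minus capex). A minimal sketch of those calculations, using placeholder figures rather than actual Tesla results:

```python
# Derive the computed columns requested in the prompt from raw report lines.
# The figures passed below are illustrative placeholders, not Tesla data.
def derived_metrics(revenue, cost_of_revenue, operating_cash_flow, capex):
    gross_profit = revenue - cost_of_revenue
    gross_margin_pct = round(100 * gross_profit / revenue, 1)
    free_cash_flow = operating_cash_flow - capex
    return gross_profit, gross_margin_pct, free_cash_flow

print(derived_metrics(25_000, 20_000, 4_000, 2_500))
# → (5000, 20.0, 1500)  i.e. $5,000M gross profit, 20.0% margin, $1,500M FCF
```

These are exactly the kinds of cells GPT-5.4 left uncomputed in its output file.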

Generate the SVG of an iPhone: Opus 4.6 the clear winner

This is the task that best compares a model's raw performance in coding and spatial reasoning. We had already put GPT-5.1 and Gemini 3 through this exercise in a previous article. The interest: directly visualizing the differences in reasoning between models, and measuring progress from one update to the next.

Prompt: Generate the complete, standalone SVG code of an iPhone 16 Pro with maximum detail. Faithfully reproduce the model's exact proportions, its characteristic rounded curves, the triple camera module in its triangular layout, the Action button, the volume buttons, the USB-C port and the Dynamic Island cutout. Pay particular attention to the color gradients of the titanium, the reflections on the screen, the drop shadows and the details of the camera module. The SVG must be complete, ready to use and visually realistic, with professional finishes worthy of an Apple product render.

On SVG generation, Claude clearly wins. The rendering is visually coherent: proportions respected, the camera module's three lenses correctly arranged, and a treatment of reflections and gradients that yields a realistic result. GPT-5.4 produces a longer SVG in lines of code, but the visual result is less convincing, with poorly positioned elements.
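To give a sense of what the task involves, here is a deliberately minimal sketch of the kind of building blocks the prompt demands (a rounded body and a titanium-style gradient), nowhere near the full detail requested and not either model's output:

```python
# Build a minimal standalone SVG: a rounded "phone body" rectangle
# filled with a linear gradient. A toy illustration of the task's
# building blocks, not a reproduction of either model's answer.
def phone_svg(width: int = 300, height: int = 600) -> str:
    return f"""<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">
  <defs>
    <linearGradient id="titanium" x1="0" y1="0" x2="1" y2="1">
      <stop offset="0%" stop-color="#d8d8dc"/>
      <stop offset="100%" stop-color="#8e8e93"/>
    </linearGradient>
  </defs>
  <rect x="10" y="10" width="{width - 20}" height="{height - 20}"
        rx="48" fill="url(#titanium)" stroke="#555"/>
</svg>"""

print(phone_svg()[:48])
```

A faithful iPhone render multiplies this pattern across dozens of shapes, which is exactly where spatial reasoning errors (the "poorly positioned elements" seen in GPT-5.4's output) creep in.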

Two very similar models, different pricing

On context, both models offer a 1M-token window in beta. But the price difference is significant: GPT-5.4 starts at $2.50 per million input tokens and $15 per million output tokens, versus $5 and $25 for Opus 4.6 below 200K tokens. Beyond that, the gap narrows ($5/$22.50 for GPT-5.4 versus $10/$37.50 for Opus 4.6), but OpenAI remains systematically cheaper. For a company deploying agents at scale, the bill can quickly tip the balance.

| Per million tokens | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Input (< 200K tokens) | $2.50 | $5 |
| Output (< 200K tokens) | $15 | $25 |
| Input (> 200K tokens) | $5 | $10 |
| Output (> 200K tokens) | $22.50 | $37.50 |

Results: Opus 4.6 remains ahead

At the end of this comparison, Claude Opus 4.6 keeps a step ahead. It is more reliable on complex tasks, faster in execution, and very precise on financial verticals. In visual code, Opus 4.6 produces more consistent results. But GPT-5.4 reasons differently: in our tests, the OpenAI model fetched the iPhone's exact dimensions from the web before generating its SVG, an agentic behavior Claude did not adopt. With this update and an entry price half Anthropic's, OpenAI is clearly closing in on Anthropic in the business segment. In short: favor Claude for enterprise use cases where reliability is paramount, and GPT-5.4 for agentic configurations where cost at scale matters.

Jake Thompson
Growing up in Seattle, I've always been intrigued by the ever-evolving digital landscape and its impacts on our world. With a background in computer science and business from MIT, I've spent the last decade working with tech companies and writing about technological advancements. I'm passionate about uncovering how innovation and digitalization are reshaping industries, and I feel privileged to share these insights through MeshedSociety.com.
