o3, Gemini 2.5 Pro, R1… Which reasoning model is best for building an AI agent?


Reasoning LLMs are excellent models for orchestrating agentic workflows.

They are the most advanced models in generative AI. Reasoning models deliver state-of-the-art performance on the main STEM benchmarks. As OpenAI confirmed to us at the start of the year, reasoning was the missing building block for developing truly reliable and effective AI agents. Within a few months, many publishers, from Google to China's DeepSeek and xAI, have unveiled their own reasoning models.

15 reasoning models, 6 criteria

To help you see things more clearly, JDN has compared the main reasoning models currently on the market. We retained six criteria that matter when developing agents: inference latency, supported input modalities, availability of the weights (for a local deployment, for example), context window size and price, together with each model's results on the SWE-bench Verified benchmark (autonomous resolution of real-world bugs). This last score gives an idea of overall performance on reasoning, planning, iteration and validation tasks, which are key capabilities for an agent orchestration engine.

| Model | Publisher | Latency | Input modalities | Open weights | Context window (tokens) | SWE-bench Verified (%) |
|---|---|---|---|---|---|---|
| Claude 3.7 Sonnet Thinking | Anthropic | High | Text / Image | No | 200,000 | 62.3 |
| Gemini-2.0-Flash-Thinking | DeepMind (Google) | Low | Text / Image | No | 1,000,000 | N/D |
| Gemini-2.5-Flash | DeepMind (Google) | Low | Text / Image | No | 1,000,000 | N/D |
| Gemini-2.5-Pro | DeepMind (Google) | High | Text / Image / Video / Audio | No | 1,000,000 | 63.8 |
| Grok 3 (Think) | xAI | High | Text / Image | No | 131,000 | N/D |
| Grok 3 Mini (Think) | xAI | Medium | Text / Image | No | 131,000 | N/D |
| o1 | OpenAI | Medium | Text / Image | No | 200,000 | 48.9 |
| o1-mini | OpenAI | Low | Text | No | 128,000 | N/D |
| o1-pro | OpenAI | High | Text / Image | No | 200,000 | N/D |
| o3 | OpenAI | High | Text / Image | No | 200,000 | 69.1 |
| o3-mini (high) | OpenAI | Low | Text | No | 200,000 | 49.3 |
| o4-mini | OpenAI | Low | Text / Image | No | 200,000 | 68.1 |
| QwQ-32B | Qwen (Alibaba) | Medium | Text | Yes | 131,000 | N/D |
| R1 | DeepSeek | High | Text | Yes | 128,000 | 49.2 |
| Seed-Thinking-V1.5 | ByteDance | Medium | Text | Announced | N/D | N/D |

N/D: not disclosed.

For projects requiring deep reasoning and multimodal processing, Gemini 2.5 Pro, Claude 3.7 Sonnet Thinking and o3 emerge as the champions. They excel in fields requiring in-depth analysis, such as solving complex technical problems, advanced programming or scientific research.

Conversely, for applications that need fast responses and minimal resource consumption, models like o4-mini, Grok 3 Mini or Gemini 2.5 Flash offer remarkable performance. They are well suited to agentic workflows where speed is a key criterion, for example a voice assistant (with additional TTS and STT models), a threat detection agent or a trading agent. Finally, professionals who want a local model have three options: DeepSeek R1 (a reference, although you will need to opt for the uncensored version on Hugging Face), Alibaba's QwQ-32B, or ByteDance's recent Seed-Thinking-V1.5, which is promising on paper (the weights should be released soon).
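The selection logic described above (depth versus speed, multimodal versus text only, hosted versus local) can be sketched as a simple filter. This is an illustrative sketch only: the `Model` class, the `shortlist` function and the lowercase latency labels are hypothetical names introduced here, and the entries are a subset of the comparison table transcribed by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    latency: str            # "low", "medium" or "high" (labels assumed for this sketch)
    modalities: frozenset   # supported input modalities
    open_weights: bool      # True if the weights can be downloaded for local use


# Subset of the comparison table, transcribed by hand for illustration.
MODELS = [
    Model("Gemini-2.5-Pro", "high", frozenset({"text", "image", "video", "audio"}), False),
    Model("Gemini-2.5-Flash", "low", frozenset({"text", "image"}), False),
    Model("o3", "high", frozenset({"text", "image"}), False),
    Model("o4-mini", "low", frozenset({"text", "image"}), False),
    Model("QwQ-32B", "medium", frozenset({"text"}), True),
    Model("R1", "high", frozenset({"text"}), True),
]

LATENCY_RANK = {"low": 0, "medium": 1, "high": 2}


def shortlist(models, max_latency="high", required_modalities=("text",), local_only=False):
    """Return the names of models that satisfy every constraint."""
    required = set(required_modalities)
    return [
        m.name
        for m in models
        if LATENCY_RANK[m.latency] <= LATENCY_RANK[max_latency]
        and required <= m.modalities
        and (m.open_weights or not local_only)
    ]


if __name__ == "__main__":
    # Fast agent that must read images
    print(shortlist(MODELS, max_latency="low", required_modalities=("text", "image")))
    # Local deployment only
    print(shortlist(MODELS, local_only=True))
```

A filter like this only narrows the field; the remaining candidates still need to be tested on the actual use case.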

Price differences of up to 120,000%

| Model | Input price ($ per 1M tokens) | Output price ($ per 1M tokens) |
|---|---|---|
| Claude 3.7 Sonnet Thinking | 3 | 15 |
| Gemini-2.0-Flash-Thinking | N/D | N/D |
| Gemini-2.5-Flash | 0.15 | 3.5 |
| Gemini-2.5-Pro | 1.25 (requests ≤ 200k tokens), 2.50 (requests > 200k tokens) | 10.00 (requests ≤ 200k tokens), 15.00 (requests > 200k tokens) |
| Grok 3 (Think) | 3 | 15 |
| Grok 3 Mini (Think) | 0.3 | 0.5 |
| o1 | 15 | 60 |
| o1-mini | 1.1 | 4.4 |
| o1-pro | 150 | 600 |
| o3 | 10 | 40 |
| o3-mini (high) | 1.1 | 4.4 |
| o4-mini | 1.1 | 4.4 |
| QwQ-32B | N/A | N/A |
| R1 | N/A | N/A |
| Seed-Thinking-V1.5 | N/A | N/A |

N/D: not disclosed. N/A: not applicable.

Price differences between models can reach 120,000%. Premium models such as o1-pro carry very high prices (they consume a lot of compute), up to $150 per million input tokens and $600 per million output tokens, while far more affordable options such as Grok 3 Mini (Think) start at only $0.30 per million input tokens and $0.50 per million output tokens. On output, that is a factor of 1,200, or 120,000%. Between these two extremes, models like Claude 3.7 Sonnet Thinking, Gemini 2.5 Pro or o3 offer a sensible balance between performance and cost.
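To make the arithmetic behind the 120,000% figure concrete, here is a minimal sketch that computes the price ratio and the cost of a single agent run. The prices come from the table above; the function names and the example workload (2 million input tokens and 500,000 output tokens) are assumptions chosen purely for illustration.

```python
# Prices in $ per 1M tokens (input, output), copied from the price table.
PRICES = {
    "o1-pro": (150.0, 600.0),
    "Grok 3 Mini (Think)": (0.3, 0.5),
}


def ratio_percent(expensive: float, cheap: float) -> float:
    """Express the expensive price as a percentage of the cheap one."""
    return expensive / cheap * 100


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars of one run, given token counts for input and output."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    top_in, top_out = PRICES["o1-pro"]
    low_in, low_out = PRICES["Grok 3 Mini (Think)"]
    print(f"Output price ratio: {ratio_percent(top_out, low_out):,.0f}%")
    print(f"Input price ratio:  {ratio_percent(top_in, low_in):,.0f}%")

    # Hypothetical workload: 2M input tokens, 500k output tokens.
    for name in PRICES:
        print(f"{name}: ${run_cost(name, 2_000_000, 500_000):,.2f}")
```

Note that the gap is widest on output tokens. On input tokens the same two models differ by a factor of 500, so the real difference for a given agent depends on how much text it reads compared with how much it generates.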

When the models are judged on value for money, three names clearly stand out: o4-mini, Gemini 2.5 Flash and Grok 3 Mini (Think). OpenAI's o4-mini stands out thanks to its excellent results on SWE-bench Verified (68.1%) combined with an accessible price ($1.10 per million input tokens and $4.40 per million output tokens). For its part, Gemini 2.5 Pro offers good-quality multimodal capabilities at a very competitive mid-range price, ideal for ambitious projects that need visual modalities. Finally, for cost-sensitive projects, Grok 3 Mini (Think) is an unbeatable option (beware of its limited guardrails). But the best model remains the one that fits your own use case. You will therefore need to test several before finding the right one.
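One possible way to quantify value for money is to divide the SWE-bench Verified score by a blended token price. The sketch below does this for the models that have both a published score and a published price in the tables. The 3:1 input-to-output token mix and the function names are assumptions made for illustration, not a method used by JDN, and Gemini 2.5 Pro is priced at its ≤ 200k-token rate.

```python
# (SWE-bench Verified %, input $ per 1M tokens, output $ per 1M tokens),
# limited to models with both a score and a price in the tables.
MODELS = {
    "Claude 3.7 Sonnet Thinking": (62.3, 3.0, 15.0),
    "Gemini-2.5-Pro": (63.8, 1.25, 10.0),   # rate for requests <= 200k tokens
    "o1": (48.9, 15.0, 60.0),
    "o3": (69.1, 10.0, 40.0),
    "o3-mini (high)": (49.3, 1.1, 4.4),
    "o4-mini": (68.1, 1.1, 4.4),
}


def blended_price(input_price: float, output_price: float, input_share: float = 0.75) -> float:
    """Average $ per 1M tokens, assuming a fixed share of input tokens (assumed 3:1 mix)."""
    return input_share * input_price + (1 - input_share) * output_price


def rank_by_value(models: dict) -> list:
    """Sort models by benchmark points per blended dollar, best first."""
    scored = [
        (name, score / blended_price(inp, out))
        for name, (score, inp, out) in models.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, value in rank_by_value(MODELS):
        print(f"{name:<28} {value:6.2f} points per $")
```

Under these assumptions o4-mini comes out on top, which is consistent with the article's assessment. A different token mix or a different benchmark would change the ordering, so the ranking should be read as indicative only.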

Jake Thompson
Growing up in Seattle, I've always been intrigued by the ever-evolving digital landscape and its impacts on our world. With a background in computer science and business from MIT, I've spent the last decade working with tech companies and writing about technological advancements. I'm passionate about uncovering how innovation and digitalization are reshaping industries, and I feel privileged to share these insights through MeshedSociety.com.
