Reasoning LLMs make excellent AI models for orchestrating agentic workflows.
They are the most advanced models in generative AI: reasoning models deliver state-of-the-art performance on the main STEM benchmarks. As OpenAI confirmed to us at the start of the year, reasoning models were the missing building block for developing truly reliable and effective AI agents. Within a few months, numerous vendors, from Google to China's DeepSeek by way of xAI, unveiled their own reasoning models.
15 reasoning models, 6 criteria
To help you see things more clearly, the JDN has compared the main reasoning models currently on the market. For the occasion we retained six essential criteria for building agents: inference latency, supported input modalities, availability of the weights (for local deployment, for example), context window size, and price. We have also added the models' results on the SWE-bench Verified benchmark (autonomous resolution of real-world bugs). The latter gives an idea of overall performance on reasoning, planning, iteration and validation tasks, key capabilities for an agent orchestration engine (a minimal sketch of such a loop follows the table).
Model | Vendor | Latency | Modalities | Open weights | Context window (tokens) | SWE-bench Verified (%) |
---|---|---|---|---|---|---|
Claude 3.7 Sonnet Thinking | Anthropic | High | Text / Image | ❌ | 200,000 | 62.3 |
Gemini 2.0 Flash Thinking | DeepMind (Google) | Low | Text / Image | ❌ | 1,000,000 | Not disclosed |
Gemini 2.5 Flash | DeepMind (Google) | Low | Text / Image | ❌ | 1,000,000 | Not disclosed |
Gemini 2.5 Pro | DeepMind (Google) | High | Text / Image / Video / Audio | ❌ | 1,000,000 | 63.8 |
Grok 3 (Think) | xAI | High | Text / Image | ❌ | 131,000 | Not disclosed |
Grok 3 Mini (Think) | xAI | Medium | Text / Image | ❌ | 131,000 | Not disclosed |
o1 | OpenAI | Medium | Text / Image | ❌ | 200,000 | 48.9 |
o1-mini | OpenAI | Low | Text | ❌ | 128,000 | Not disclosed |
o1-pro | OpenAI | High | Text / Image | ❌ | 200,000 | Not disclosed |
o3 | OpenAI | High | Text / Image | ❌ | 200,000 | 69.1 |
o3-mini (high) | OpenAI | Low | Text | ❌ | 200,000 | 49.3 |
o4-mini | OpenAI | Low | Text / Image | ❌ | 200,000 | 68.1 |
QwQ-32B | Qwen (Alibaba) | Medium | Text | ✅ | 131,000 | Not disclosed |
R1 | DeepSeek | High | Text | ✅ | 128,000 | 49.2 |
Seed-Thinking-v1.5 | ByteDance | Medium | Text | ✅ | Not disclosed | Not disclosed |
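To make the "agent orchestration engine" role concrete, here is a minimal sketch of a plan / act / validate loop driven by a reasoning model. It assumes an OpenAI-compatible API; the o4-mini model name, the toy task and the placeholder run_tests check are illustrative assumptions, not part of the JDN comparison.

```python
# Minimal sketch of an agent loop: the reasoning model plans and proposes,
# the harness validates, and failures are fed back for another iteration.
# Assumptions: an OpenAI-compatible API, the "o4-mini" model name, and a
# placeholder run_tests() check chosen purely for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_tests(patch: str) -> bool:
    """Hypothetical validation step (in practice, run the project's test suite)."""
    return "def " in patch  # placeholder criterion, not a real test run

task = "Fix the off-by-one error in pagination.py"
history = [{"role": "user", "content": f"Plan, then propose a patch for: {task}"}]

for step in range(5):  # bounded iterations keep cost and latency under control
    response = client.chat.completions.create(model="o4-mini", messages=history)
    patch = response.choices[0].message.content
    if run_tests(patch):  # validation
        print("Validated patch:\n", patch)
        break
    # Iteration: feed the failure back so the model can revise its plan.
    history.append({"role": "assistant", "content": patch})
    history.append({"role": "user", "content": "The tests failed, revise the patch."})
```

Latency, context window and price all weigh directly on such a loop, since each iteration replays the whole conversation.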
For projects requiring great depth of reasoning and multimodal processing capabilities, Gemini 2.5 Pro, Claude 3.7 Sonnet Thinking and o3 emerge as the champions. They excel in particular in fields requiring in-depth analysis, such as solving complex technical problems, advanced programming, or scientific research.
Conversely, for applications requiring fast responses and minimal resource consumption, models like o4-mini, Grok 3 Mini or Gemini 2.5 Flash offer remarkable performance. They are tailored to agentic workflows where speed is a key criterion: voice assistants (paired with TTS and STT models), threat detection agents, trading agents… Finally, for professionals who want a local model, there are three options: DeepSeek R1 (a reference, though you will need to opt for the uncensored version available on Hugging Face), Alibaba's QwQ-32B, or ByteDance's recent Seed-Thinking-v1.5, promising on paper (its weights should be released soon).
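For the local route, the sketch below shows how such an open-weights model can be loaded from Hugging Face with the transformers library. The Qwen/QwQ-32B repository name and the generation settings are assumptions to adapt to your own hardware (a 32B model requires substantial GPU memory); the same pattern would apply to DeepSeek R1, or to Seed-Thinking-v1.5 once its weights are published.

```python
# Minimal local-inference sketch for an open-weights reasoning model.
# Assumptions: the Qwen/QwQ-32B checkpoint on Hugging Face and enough GPU
# memory for a 32B model (tens of GB in bf16, less with quantization).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # spread layers across the available GPUs / CPU
    torch_dtype="auto",  # keep the dtype stored in the checkpoint
)

messages = [{"role": "user", "content": "Plan the steps to reproduce a flaky test."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens and keep only the model's answer.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```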
Price differences of around 120,000%
Model | Input price ($ / 1M tokens) | Output price ($ / 1M tokens) |
---|---|---|
Claude 3.7 Sonnet Thinking | 3 | 15 |
Gemini 2.0 Flash Thinking | Not disclosed | Not disclosed |
Gemini 2.5 Flash | 0.15 | 3.5 |
Gemini 2.5 Pro | 1.25 (requests ≤ 200k tokens) | 10.00 (requests ≤ 200k tokens) |
Grok 3 (Think) | 3 | 15 |
Grok 3 Mini (Think) | 0.3 | 0.5 |
o1 | 15 | 60 |
o1-mini | 1.1 | 4.4 |
o1-pro | 150 | 600 |
o3 | 10 | 40 |
o3-mini (high) | 1.1 | 4.4 |
o4-mini | 1.1 | 4.4 |
QwQ-32B | N/A (open weights) | N/A (open weights) |
R1 | N/A (open weights) | N/A (open weights) |
Seed-Thinking-v1.5 | N/A (open weights) | N/A (open weights) |
Price differences between models can reach 120,000%. Premium models like o1-pro are very expensive (they consume a lot of compute), at up to $150 per million input tokens and $600 per million output tokens, while much more affordable options like Grok 3 Mini (Think) start at only $0.30 per million input tokens and $0.50 per million output tokens, a roughly 1,200-fold gap on output pricing. Between these two extremes, models like Claude 3.7 Sonnet Thinking, Gemini 2.5 Pro or o3 offer a relevant balance between performance and cost.
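To make the spread concrete, here is a small worked calculation based on the prices in the table above; the 200,000-input / 50,000-output token agent run is a hypothetical workload chosen purely for illustration.

```python
# Worked example of the price spread, using the figures from the table above.
cheap_in, cheap_out = 0.30, 0.50        # Grok 3 Mini (Think), $ per 1M tokens
premium_in, premium_out = 150.0, 600.0  # o1-pro, $ per 1M tokens

output_spread = premium_out / cheap_out  # 1,200x, i.e. roughly 120,000 %
input_spread = premium_in / cheap_in     # 500x on input pricing

# Hypothetical agent run: 200k input tokens and 50k output tokens.
tokens_in, tokens_out = 200_000, 50_000
cost_cheap = (tokens_in * cheap_in + tokens_out * cheap_out) / 1_000_000
cost_premium = (tokens_in * premium_in + tokens_out * premium_out) / 1_000_000
print(f"Output-price spread: {output_spread:.0f}x (~{output_spread * 100:,.0f} %)")
print(f"Same agent run: ${cost_cheap:.2f} with Grok 3 Mini vs ${cost_premium:.2f} with o1-pro")
```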
When the models are evaluated on value for money, three names clearly stand out: o4-mini, Gemini 2.5 Flash and Grok 3 Mini (Think). OpenAI's o4-mini is distinguished by its excellent results on SWE-bench Verified (68.1%) combined with an accessible rate ($1.10 per million input tokens and $4.40 per million output tokens). For its part, Gemini 2.5 Flash offers good-quality multimodal capabilities at a very competitive rate, ideal for ambitious projects involving visual modalities. Finally, for cost-sensitive projects, Grok 3 Mini (Think) is an unbeatable option (beware of its limited guardrails). But the best model remains the one that is 100% suited to your own use case. You will therefore need to test several before finding the right one.