Devstral 2 and Devstral Small 2: does Mistral’s on-device AI keep its promises?

Devstral 2 and Devstral Small 2 offer developers real alternatives to American proprietary solutions as well as open source models from China.

Is Mistral back in the AI race? Unveiled at the beginning of December, Devstral 2 and its on-device version, Devstral Small 2, mark a notable advance in open source AI specialized in code generation, a field until now largely dominated by the big Chinese labs. But beyond the flattering benchmarks, one question remains: can you really use the new Mistral model to code locally on your PC? We put it to the test.

Excellent scores on SWE-bench Verified

This is the main selling point of the new Mistral code model. Devstral 2 posts industry-leading performance on SWE-bench Verified, the current benchmark for evaluating the ability of LLMs to autonomously solve real-world, reproducible code problems. The model reaches 72.2%, surpassing Kimi K2 Thinking, a reference in the open source ecosystem, and placing itself at the level of DeepSeek V3.2. The gap with the current leader, Claude Opus 4.5, is only 8.7 points, even though Devstral 2 has “only” 123 billion parameters, compared to several hundred billion (estimated) for the Anthropic model.

But the real breakthrough comes with Devstral Small 2. Mistral has anticipated developers’ appetite for models that can really be used locally and is unveiling a lightweight variant that runs on a single GPU. With only 24 billion parameters, Devstral Small 2 reaches 68% on SWE-bench Verified, trailing Kimi K2 Thinking by only 3 points despite being roughly 41 times smaller. For the community, this is a major step forward: an on-device model offering performance previously reserved for much heavier architectures.

Note, however, that developers will not be able to reproduce the announced raw performance identically at home. Even on recent hardware (Apple’s M4 chips in particular), a 24-billion-parameter model requires a quantized version to run smoothly. Despite this, performance remains significantly higher than that of models of equivalent size currently available on the market.

The other breakthrough: the price

Model                            Input (1M tokens)    Output (1M tokens)
Devstral Small 2                 $0.10                $0.30
Devstral 2                       $0.40                $2
Gemini 3 Pro (context ≤ 200k)    $2                   $12
Gemini 3 Pro (context > 200k)    $4                   $18
Opus 4.5                         $5                   $25
GPT-5.1-codex-max                $1.25                $10

Price is the other strong point of Devstral 2. Thanks to its efforts on model size, Mistral is able to offer a truly affordable code model for developers. Even if it does not outperform GPT-5.1-codex-max (OpenAI’s star model for code), Devstral 2 is around three times cheaper on input and five times cheaper on output. Compared to Claude Opus 4.5, the model is 12.5 times cheaper, on both input and output.
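To make the per-token gap concrete, here is a quick back-of-the-envelope calculation: a sketch in Python, where the monthly workload (5M input tokens, 1M output tokens) is an arbitrary assumption and the per-million-token rates come from the table above.

```python
# $ per 1M tokens (input, output), taken from the pricing table above.
RATES = {
    "Devstral Small 2":  (0.10, 0.30),
    "Devstral 2":        (0.40, 2.00),
    "GPT-5.1-codex-max": (1.25, 10.00),
    "Opus 4.5":          (5.00, 25.00),
}

def monthly_cost(model, input_m=5, output_m=1):
    """Dollar cost for input_m / output_m million tokens (assumed workload)."""
    rate_in, rate_out = RATES[model]
    return input_m * rate_in + output_m * rate_out

for model, _ in RATES.items():
    print(f"{model}: ${monthly_cost(model):.2f}/month")
```

On this hypothetical workload, Devstral 2 comes out at $4 versus $50 for Opus 4.5, which is exactly the 12.5x ratio visible in the table.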

Another strategic option: self-hosted inference. Devstral 2, although relatively compact for its level of performance, is sized to run on four H100 GPUs or equivalent computing capacity. The investment remains significant, but it becomes relevant for mid-sized companies seeking to control their costs and their sovereignty. Finally, the real bargain for independent developers and small structures remains Devstral Small 2. Designed to run on a single GPU, it can even run on CPU in its most heavily quantized versions, though we should clearly not expect miracles there.

What is Devstral Small 2 worth with Cline?

To evaluate the performance of Devstral Small 2 on our machine, a Mac mini equipped with an M4 chip and 24 GB of RAM, we use local inference via LM Studio, connected to Cline in VS Code. The model used is a 4-bit quantized version (devstral-small-2:24b-instruct-2512-q4_K_M), downloaded directly from the LM Studio library. In this configuration, the model requires approximately 15 to 20 GB of RAM to run smoothly and stably on a consumer machine.
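For readers who want to reproduce this setup, the principle is simply to point any OpenAI-compatible client at LM Studio’s local server (http://localhost:1234/v1 by default; Cline is configured the same way). Here is a minimal Python sketch of such a request, assuming LM Studio is running with the quantized model loaded; the model identifier is the one shown in our test above.

```python
import json
import urllib.request

# LM Studio exposes an OpenAI-compatible endpoint on localhost
# (port 1234 by default; adjust if you changed it in LM Studio's settings).
BASE_URL = "http://localhost:1234/v1"
MODEL_ID = "devstral-small-2:24b-instruct-2512-q4_K_M"

def build_request(prompt, max_tokens=512):
    """Build an OpenAI-style chat completion request for the local server."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# To actually send it (requires LM Studio running locally):
# with urllib.request.urlopen(build_request("Write a JS calculator")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```

This is the same endpoint Cline talks to when you select LM Studio as its API provider.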

We ask the model to design a minimalist calculator in JavaScript, integrated into a web page and without heavy dependencies. Devstral Small 2 runs without any particular difficulty. The whole thing is, however, significantly slower than what we might be used to with Claude Code: it takes around 3 to 5 minutes to generate around a hundred lines of code, which is still faster than writing the same code by hand. The code produced is clean and functional, and the calculator has a polished design, very close to the aesthetic of the iPhone calculator. Devstral Small 2 also impresses with its operational robustness: where many models, including larger ones, multiply tool-calling errors when run locally, the Mistral model did not produce a single one over our entire development sequence. That is notable.

The only real limit is context size. Although Devstral Small 2 theoretically supports a maximum context of 256,000 tokens, we voluntarily capped the window at 16,000 tokens during inference to avoid saturating our machine’s RAM. Such a constraint greatly reduces the possible uses: it does not allow you to work comfortably on long-running projects, nor to handle medium-sized code bases. For daily use, Devstral Small 2 therefore needs a machine with significantly more RAM, ideally 32 to 64 GB, to work comfortably. Otherwise, the model remains useful for occasional tasks: quick code edits, on-the-fly generation, or assistance when no internet connection is available.
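The RAM pressure from long contexts comes mostly from the KV cache, which grows linearly with the context window on top of the model weights. A rough estimate in Python; note that every architecture number below (layer count, KV heads, head dimension) is an illustrative assumption for a ~24B dense model, not Devstral Small 2’s published specification.

```python
def kv_cache_gb(context_tokens, n_layers=40, n_kv_heads=8,
                head_dim=128, bytes_per_value=2):
    """Approximate KV-cache size in GB: 2 tensors (K and V) * layers
    * KV heads * head dimension * bytes per value, per context token.
    All architecture parameters are assumed, not published figures."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token / 1e9

# Why capping the window matters on a 24 GB machine:
print(f"16k context:  {kv_cache_gb(16_000):.1f} GB")
print(f"256k context: {kv_cache_gb(256_000):.1f} GB")
```

Under these assumptions, a 16k window costs a few GB on top of the ~15 GB of quantized weights, while the full 256k window would by itself exceed the total RAM of our test machine, which is why we capped it.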

Devstral Small 2 and especially Devstral 2 constitute real progress for developers who wish to free themselves from proprietary solutions such as Claude Code, Gemini CLI, Kiro or Codex CLI, whether for reasons of cost or sovereignty. For everyday use, we recommend Devstral 2, which is clearly at the top of the list of current open source code agents. Devstral Small 2, for its part, finds its relevance in more specific contexts: travel, environments with limited connectivity, or quick offline editing needs. In our opinion, this combination represents today the most balanced configuration for coding effectively with AI, at lower cost and without dependence on the American giants.

Jake Thompson
Growing up in Seattle, I've always been intrigued by the ever-evolving digital landscape and its impacts on our world. With a background in computer science and business from MIT, I've spent the last decade working with tech companies and writing about technological advancements. I'm passionate about uncovering how innovation and digitalization are reshaping industries, and I feel privileged to share these insights through MeshedSociety.com.
