Google publishes Gemma 3n, a multimodal SLM suited to on-device inference. The model is built on a brand-new architecture developed by DeepMind.
At Google I/O 2025, the team behind the open-source Gemma family at DeepMind presented Gemma 3n, its new reference SLM and quite possibly the market's. The fully multimodal model (text, audio, video, image) was designed to run inference on CPU.
A new architecture made at DeepMind
The Gemma family benefits from the improvements made to its big brother, Gemini. For Gemma 3n, DeepMind engineers went further still and developed a new architecture optimized for on-device inference on limited hardware. The main innovation, called Per-Layer Embeddings (PLE), drastically reduces the model's RAM consumption. The model has 5 or 8 billion parameters depending on the version used, but with PLE, Gemma 3n runs with a memory footprint comparable to that of 2- and 4-billion-parameter models. Technically, Per-Layer Embeddings dynamically reduce RAM usage by optimizing the representations of each layer of the model.
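To make the idea concrete, here is a minimal, hypothetical sketch of the principle behind per-layer embeddings: each layer's embedding table lives in slow storage (CPU RAM or flash) and only the table for the layer currently executing is streamed into fast accelerator memory, so peak resident memory stays near one table rather than all of them. All class and function names are illustrative, not Gemma 3n's actual implementation.

```python
import numpy as np

class PerLayerEmbeddingStore:
    """Illustrative sketch: per-layer embedding tables are kept in
    'slow' storage (plain host arrays standing in for flash/CPU RAM);
    only one table at a time is resident in 'fast' memory."""

    def __init__(self, num_layers, vocab_size, dim, rng):
        # All tables live in slow storage.
        self.slow_storage = [
            rng.standard_normal((vocab_size, dim)).astype(np.float32)
            for _ in range(num_layers)
        ]
        self.resident = None        # the single table in fast memory
        self.resident_layer = None

    def fetch(self, layer_idx):
        # Stream in this layer's table, evicting the previous one.
        if self.resident_layer != layer_idx:
            self.resident = self.slow_storage[layer_idx]
            self.resident_layer = layer_idx
        return self.resident

def forward(store, token_ids, num_layers):
    # Toy forward pass: accumulate each layer's embeddings for the
    # tokens. At any point, only one table is resident in fast memory.
    h = 0.0
    for layer in range(num_layers):
        table = store.fetch(layer)
        h = h + table[token_ids].sum(axis=-2)
    return h

rng = np.random.default_rng(0)
store = PerLayerEmbeddingStore(num_layers=4, vocab_size=10, dim=3, rng=rng)
out = forward(store, np.array([1, 2]), num_layers=4)
```

The trade-off is extra data movement per layer in exchange for a much smaller resident footprint, which is what lets a 5B- or 8B-parameter model behave like a 2B or 4B one in terms of RAM.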
DeepMind's architecture also introduces MatFormer, which natively nests a 2-billion-parameter sub-model inside the larger one. The goal? Use a sub-model sized to the complexity of the task at hand, and thus reduce resource needs. Developers can also carve out sub-models of several sizes from Gemma 3n according to their needs.
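The nested (Matryoshka-style) idea behind MatFormer can be sketched on a single feed-forward block: the sub-model simply uses a leading slice of the full model's weights, so a smaller, cheaper network is obtained with no extra parameters. This is a simplified illustration of the nesting principle, not Gemma 3n's actual code; the function names and dimensions are assumptions.

```python
import numpy as np

def ffn(x, w_in, w_out):
    # Full feed-forward block: expand, ReLU, project back down.
    return np.maximum(x @ w_in, 0.0) @ w_out

def sub_ffn(x, w_in, w_out, hidden):
    # Nested sub-model: keep only the first `hidden` neurons of the
    # expansion, i.e. a leading slice of the full model's weights.
    return np.maximum(x @ w_in[:, :hidden], 0.0) @ w_out[:hidden, :]

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32
w_in = rng.standard_normal((d_model, d_hidden))
w_out = rng.standard_normal((d_hidden, d_model))
x = rng.standard_normal((1, d_model))

full = ffn(x, w_in, w_out)            # full-capacity path
small = sub_ffn(x, w_in, w_out, 8)    # sub-model: 1/4 of the FLOPs
```

Because the sub-model's weights are a slice of the full ones, extracting it costs nothing at deployment time; choosing `hidden` trades quality for compute, which is how a task-appropriate sub-model size reduces resource needs.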
Gemma 3n, close to Claude 3.7 Sonnet in the Chatbot Arena
Gemma 3n performs very well on benchmarks, especially given its small size. On the Chatbot Arena, which measures anonymized user preferences, Gemma 3n obtains an Elo score of 1269, placing it just behind Claude 3.7 Sonnet (1289) and ahead of GPT-4.1 and Llama-4-Maverick-17B. A small feat.
On more classic benchmarks, Gemma 3n posts solid results: 64.9% on MMLU, 63.6% on MBPP, and 75.0% on HumanEval. Compared with models of equivalent size, Gemma 3n is state of the art on the majority of benchmarks, a feat for a model that is also inexpensive to run. For example, while Microsoft's Phi-3 (14B) needs nearly 14 billion parameters to reach an MMLU score of around 62%, Gemma 3n uses only 4 billion active parameters to reach 64.9%.
Local or cloud inference
Gemma 3n is already available in Google AI Studio, like all Gemma models. The model's weights can also be downloaded from Hugging Face. The only limitation: the currently deployed version handles only the text and image modalities. Google, however, plans to update the weights with all modalities in the coming weeks.
For the open-source version, the weights are covered by the general license of the Gemma models. The model can be used for commercial purposes without license fees or royalties owed to Google. However, Google reserves the right to restrict use of the model if it reasonably considers that it violates the terms of use. It is notably forbidden to use the model to generate copyright-infringing or illegal content. More restrictively, it is also prohibited to use Gemma 3n for “making automated decisions” in fields that “affect material or individual rights or well-being” such as “finance, legal, employment, health care, housing, insurance or social assistance.”
Gemma 3n stands out as the new reference open-source SLM. Google notably recommends it for text generation, chatbot use, summarization, visual analysis, and audio transcription or analysis. Its optimization for mobile inference (as little as 3,924 MB of RAM) makes it a perfect model for experimenting with new uses.