Llama 4 arrives in two versions, with 400 billion and 109 billion parameters respectively. Meta also introduces a context window of up to 10 million tokens.
Meta has just unveiled a major update to its LLM family, one of the most widely used open source model lines in business. Two new models are published under an open source license with specific conditions of use: Llama 4 Maverick and Llama 4 Scout. For this update, Meta is focusing on multimodality and efficiency. The Menlo Park firm is also calling the relevance of RAG into question. Here is what you need to know.
Efficient architecture
With Llama 4, Meta does not simply iterate on Llama 3.3 but changes architecture. Formerly based on a classic dense transformer architecture, the Llama 4 models now rely on a mixture of experts (MoE). Popularized a year ago by Mistral AI with Mixtral 8x22B and more recently by DeepSeek R1, MoE improves efficiency. The model is built from a set of experts, and a router activates only a subset of them for each input token. This approach makes it possible to use only a fraction of the model's weights at inference, and therefore to considerably reduce compute and energy needs.
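As an illustration, here is a minimal sketch of top-k expert routing in PyTorch. This is not Meta's implementation; the dimensions, the number of experts and the routing scheme are all assumptions chosen for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy mixture-of-experts layer: each token only runs through top_k experts."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)   # scores every expert for each token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # mixing weights of the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                  # tokens routed to expert e in slot k
                if mask.any():                         # only selected experts do any work
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

x = torch.randn(10, 64)                                # 10 tokens
print(TinyMoE()(x).shape)                              # torch.Size([10, 64])
```

Here only 2 of the 8 expert networks run per token, which is exactly the total-versus-active parameter distinction described below.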
Llama 4 Scout has 109 billion parameters, of which 17 billion are active at inference, while Llama 4 Maverick is based on 400 billion parameters, also with 17 billion active at inference. Meta should later publish Llama 4 Behemoth, a teacher model with 2,000 billion parameters, of which 288 billion are active. It is Behemoth that was used to train Maverick and Scout by distillation.
The three models share a common multimodal design, each able to take text, images and even video as input. Finally, Scout and Maverick arrive with large context windows of 10 million and 1 million tokens respectively. A context size that opens the way to new uses, which we will come back to.
Strong performance on benchmarks
Llama 4 Scout, despite its relatively modest size, surpasses all previous Llama models and outperforms Gemma 3, Gemini 2.0 Flash-Lite and Mistral 3.1 on a wide range of benchmarks. Particularly impressive at image understanding and long-context reasoning, Scout sets a new standard for models that can run on a single GPU.
Llama 4 Maverick redefines expectations for multimodal models by beating GPT-4o and Gemini 2.0 Flash on many benchmarks, while competing with DeepSeek V3 on reasoning and coding, with less than half the active parameters. Its experimental fine-tuned chat version obtains an Elo score of 1417 on LMArena, confirming its strong conversational abilities.
The real tour de force, however, comes from Llama 4 Behemoth, which, although not publicly available, demonstrates notable capabilities on mathematical benchmarks. The model surpasses GPT-4.5, Claude Sonnet 3.7 and even Gemini 2.0 Pro on several demanding scientific tests such as MATH-500 and GPQA Diamond. Meta does not, however, compare its new flagship to the latest Google DeepMind model, Gemini 2.5 Pro.
The end of RAG?
The real shift with Llama 4 arguably comes from the context sizes offered by the two models already published (Maverick and Scout). With a context of 10 million tokens for Scout and 1 million for Maverick, the question of RAG's relevance is back on the table. Is a retrieval-augmented generation system really necessary when the model itself can take that much data as input? The question deserves to be asked. Tests must still be carried out to ensure that the model remains accurate at large context sizes in an operational setting.
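To make the trade-off concrete, here is a hedged sketch of the decision a developer now faces: if the whole corpus fits in the window, simply pass it all as context; otherwise fall back on retrieval. The window sizes come from the article; the scoring function is a naive stand-in for a real retriever.

```python
# Deciding between full-context "stuffing" and retrieval (illustrative sketch).
SCOUT_WINDOW = 10_000_000      # tokens, per the article
MAVERICK_WINDOW = 1_000_000

def score(doc: str, question: str) -> int:
    # Naive lexical overlap, standing in for an embedding-based retriever
    return len(set(question.lower().split()) & set(doc.lower().split()))

def build_prompt(question: str, documents: list[str], window: int) -> str:
    # ~1.5 tokens per word, the estimate used in the article
    corpus_tokens = sum(int(len(d.split()) * 1.5) for d in documents)
    if corpus_tokens < window * 0.9:   # keep headroom for the question and answer
        # Long-context path: no retrieval pipeline, the model sees everything
        return "\n\n".join(documents) + "\n\nQuestion: " + question
    # RAG path: the corpus exceeds the window, keep only the best-matching documents
    ranked = sorted(documents, key=lambda d: score(d, question), reverse=True)
    return "\n\n".join(ranked[:5]) + "\n\nQuestion: " + question
```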
If we consider that a French word represents approximately 1.5 tokens, Maverick could handle contexts of around 666,666 words (roughly 6 to 7 books) and Scout up to 6,666,666 words (roughly 60 to 80 books). Enough to pass in fairly complete document bases. Such a context size also makes it possible to process fairly long videos, again opening the way to new uses (analyzing training videos to extract key points, analyzing customer video calls…).
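The arithmetic, as a quick sanity check (the figure of 100,000 words per book is our own rough assumption):

```python
# Token windows converted to words at ~1.5 tokens per word, then to books
for name, window in [("Maverick", 1_000_000), ("Scout", 10_000_000)]:
    words = window / 1.5
    print(f"{name}: ~{words:,.0f} words, ~{words / 100_000:.0f} books")
# Maverick: ~666,667 words, ~7 books
# Scout: ~6,666,667 words, ~67 books
```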
An (almost) open source license
With the release of Llama 4, Meta publishes the models under a new semi-open license. Professionals can use the models commercially, reproduce them, retrain them and even create derivative models. Meta, however, limits usage and re-exploitation rights to companies with fewer than 700 million monthly active users. Beyond this threshold, a specific license must be requested directly from Meta. The goal here is most certainly to control access to the models and prevent competing AI companies from using Llama 4 to train their own models. Small bonus: Meta also specifies that anyone who sues it in court over intellectual property issues relating to Llama would see their Llama 4 license revoked.
In the coming months, it is a safe bet that many companies will switch to one of the Llama 4 versions to reduce inference costs or develop new multimodal use cases. The still substantial size of the models could, however, initially limit on-premises deployment: even with the MoE architecture, all of the model's weights must be loaded into GPU memory. Llama 4 Behemoth and its 2,000 billion parameters should therefore be served mainly by cloud providers when it is released.
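A rough estimate shows the constraint: weight memory scales with total parameters, not active ones. The bytes-per-parameter values below are standard precisions, not figures from Meta.

```python
# GPU memory needed just for the weights: total parameters x bytes per parameter
MODELS = {"Scout": 109e9, "Maverick": 400e9, "Behemoth": 2_000e9}
for name, params in MODELS.items():
    for precision, bytes_per_param in [("bf16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{name:9s} {precision}: ~{params * bytes_per_param / 1e9:,.0f} GB")
```

In bf16, Maverick alone needs roughly 800 GB of weights, far beyond a single GPU, while a 4-bit quantization of Scout (about 55 GB) fits in one 80 GB H100, consistent with Meta positioning Scout as a single-GPU model.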