Diffusion large language models are beginning to make a serious mark on the generative AI landscape. They notably promise record generation speeds and a drastic reduction in hallucinations.
After the LLM, a new family of generative artificial intelligence models is gradually emerging. DLLMs, built on a diffusion architecture, deliver generation speeds far higher in practice than current LLMs, and the overall quality of the generated text or code is also significantly better. With Gemini Diffusion, DeepMind became in mid-May the first major player in the AI sector to unveil a stable model based on this architecture.
DLLM: an architecture inspired by image generation models
The DLLM approach was popularized by 5 Stanford researchers in 2022, who were looking for a technique to better control the output of LLMs. The DLLM architecture is directly inspired by research on diffusion models for visual content generation. Just as an image diffusion model starts from pure noise (pixels colored completely at random) and gradually sculpts the desired image, a DLLM begins by turning text into unstructured random noise, then a transformer (the model's neural architecture) learns to remove the noise step by step.
At each iteration (denoising cycle), the model works from the partially denoised version and from the instructions it was given. During training, it observes pairs of original text and noised text and learns to predict the optimal correction at each denoising stage. The final result is thus refined step by step until the output is coherent across the entire text. Unlike an autoregressive LLM (as with Llama, GPT-4, Gemini, etc.), which generates text token by token, a DLLM generates the whole text at once, or at least in large segments.
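The denoising loop described above can be sketched in a few lines. This is a deliberately toy illustration, not a real DLLM: the "denoiser" is faked with a fixed target sentence so the loop runs end to end, and all names here (`TARGET`, `denoise_step`, `generate`) are illustrative assumptions, not taken from any actual implementation.

```python
import random

MASK = "<mask>"
# Stand-in for what a trained transformer would predict. A real DLLM scores
# every masked position in parallel; here we hard-code the answer so the
# sketch is runnable without a model.
TARGET = "diffusion models denoise whole blocks of text in parallel".split()

def denoise_step(tokens):
    """Return a (token, confidence) prediction for every masked position."""
    predictions = {}
    for i, tok in enumerate(tokens):
        if tok == MASK:
            # In a real model this would be the argmax of a softmax over
            # the vocabulary; the confidence here is random for illustration.
            predictions[i] = (TARGET[i], random.random())
    return predictions

def generate(length, steps=4):
    # Inference starts from pure noise: every position is masked.
    tokens = [MASK] * length
    for _ in range(steps):
        preds = denoise_step(tokens)
        if not preds:
            break
        # Unmask the most confident half of the positions each iteration,
        # so the whole text sharpens progressively rather than token by token.
        budget = max(1, len(preds) // 2)
        ranked = sorted(preds.items(), key=lambda kv: -kv[1][1])
        for i, (tok, _) in ranked[:budget]:
            tokens[i] = tok
    # Fill any remaining masks in a final pass.
    for i, (tok, _) in denoise_step(tokens).items():
        tokens[i] = tok
    return " ".join(tokens)

print(generate(len(TARGET)))
# → diffusion models denoise whole blocks of text in parallel
```

The key point the sketch captures is that every pass operates on the full sequence at once, and the number of passes is fixed rather than proportional to the output length.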
Four major advantages
DLLMs have four major advantages, starting with overall generation speed. Because text is generated not token by token but block by block in parallel, the answer to a prompt is produced much faster. Generation speeds exceed 1,000 tokens per second, 3 to 10 times faster than classic transformer-based LLMs (excluding diffusion). This sharp reduction in generation time directly affects hardware consumption: since the model ties up resources such as VRAM and GPU compute for less time, energy expenditure also falls.
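The source of the speedup can be illustrated with a back-of-the-envelope comparison of forward passes. The step counts below are assumptions chosen for illustration, not measured figures for any real model.

```python
# Illustrative only: an autoregressive LLM needs one forward pass per
# generated token, while a diffusion model runs a fixed number of
# denoising passes, each refining every token of the block in parallel.
def autoregressive_passes(num_tokens: int) -> int:
    return num_tokens  # one pass per generated token

def diffusion_passes(num_tokens: int, denoising_steps: int = 64) -> int:
    return denoising_steps  # independent of block length

tokens = 1024
speedup = autoregressive_passes(tokens) / diffusion_passes(tokens)
print(speedup)  # → 16.0
```

In practice each denoising pass processes the whole block, so the per-pass cost is higher than a single autoregressive step; the net gain reported in the article is the 3 to 10x range rather than the raw pass ratio.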
Block-by-block generation also yields much more stable overall coherence across the generated text. Instead of focusing only on the preceding tokens, as a standard LLM does, the model's attention covers the entire block being produced, and the whole context is gradually refined over the denoising process. Concretely, the generated text or code is of better quality.
In addition, building the text step by step through denoising makes it possible to follow all the instructions in the initial prompt more faithfully: each iteration produces a version closer to the initial request.
Finally, diffusion also lets the model generalize better (understand and use relationships regardless of their order in a text), again thanks to block processing. For example, a classic LLM struggles with inverted relationships (if A is equal to B, then B is also equal to A).
Gemini Diffusion: the state of the art of the DLLM
No doubt aware of the many advantages of the diffusion approach, DeepMind has developed a diffusion-based version of Gemini: Gemini Diffusion. The model generates text at a speed of 1,479 tokens per second, and the first demonstrations, particularly in code generation, are impressive.
The model performs particularly well on development tasks, achieving results close to, and sometimes better than, Gemini 2.0 Flash-Lite. With a few months of additional training and the kind of careful optimization DeepMind is known for, the model looks truly promising for developers. For now, Gemini Diffusion is still at the research stage and accessible by invitation only. We were able to try it in preview: the result is genuinely stunning. In a few seconds, Gemini Diffusion generates whole blocks of code or text. Quite surprising.
Other projects are starting to reach the market, especially in open source, with for example MMaDA (8 billion parameters), a multimodal DLLM with reasoning, capable of understanding and generating both text and images.
DLLMs will certainly multiply in the coming months as research evolves. Their near-instantaneous generation speed, their main asset, could become key first for code generation and, in a second phase, for agentic systems, which require low latency to become even more effective. A web browser controlled by a DLLM, for instance, could potentially perform tasks much faster than the current generation of agents (Operator, for example).