Language models: what are the alternatives to transformers?

State Space Models, Mixture-of-Experts, RWKV… An overview of the main alternatives to the technology on which ChatGPT and Gemini rest.

Since the publication in 2017 of the paper "Attention Is All You Need", transformer-based models have become the backbone of modern generative AI. ChatGPT, Claude, Gemini, Llama and Mistral all derive from it. Thanks to their attention mechanism, these architectures can capture relationships between tokens across long textual sequences, enabling exceptional performance on language, vision, code and computational biology tasks.

But as models grow and the contexts they must handle get longer, the limits of transformers become more and more visible: soaring training costs, slow inference, memory consumption that grows quadratically with input length. These technical, energy and economic constraints have pushed research toward new approaches, and several credible alternatives are now emerging. From Mamba to RWKV, by way of MoE and hybrid architectures, here is an overview of the new directions that could supplant transformers.

Mamba: the linear efficiency of State Space Models

The Mamba architecture, published in 2023 by researchers at Carnegie Mellon and Princeton, is one of the most promising alternatives. It belongs to the family of State Space Models (SSMs), architectures inspired by the dynamical systems used in signal processing and control theory. Whereas the memory complexity of transformers is quadratic in sequence length, Mamba's is linear: computation time grows in proportion to the number of tokens. This property is made possible by replacing the attention mechanism with a structured recurrence, which encodes the evolution of the model's internal state over the input without having to compare every pair of tokens.
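
To make the complexity argument concrete, here is a minimal, illustrative sketch of a (non-selective) state space recurrence in Python with NumPy. Shapes and parameter names are assumptions chosen for readability, not Mamba's actual implementation; the point is that the loop visits each token once and never builds a token-by-token comparison matrix.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Toy state space scan.
    x: (seq_len, d_in) inputs, A: (d_state,) diagonal decay factors,
    B: (d_state, d_in) input projection, C: (d_out, d_state) output projection."""
    h = np.zeros(A.shape[0])           # fixed-size hidden state, reused at every step
    outputs = []
    for x_t in x:                      # single pass over the sequence: O(seq_len)
        h = A * h + B @ x_t            # next state depends only on the previous state
        outputs.append(C @ h)          # output is read from the current state
    return np.stack(outputs)

# Unlike attention, no (seq_len x seq_len) score matrix is ever materialized,
# so time and memory grow linearly with the number of tokens.
```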

Mamba rests on a key innovation: a selective recurrence mechanism, able to "forget" useless information and focus on relevant signals at each step. Thanks to this, Mamba models can process very long sequences (up to one million tokens) while maintaining competitive quality. On a variety of tasks (code prediction, text modeling, audio and genomic data), Mamba-3B outperforms transformers of equivalent size and even rivals models twice as large, with inference up to five times faster.
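
Selectivity is what distinguishes Mamba from earlier SSMs: the recurrence coefficients depend on the current input, so the state can discard irrelevant tokens and retain salient ones. Below is a hedged sketch of that idea, closer in spirit to a gated RNN than to Mamba's exact parameterization; the gate weights and shapes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_scan(x, W_gate, B, C):
    """Like ssm_scan above, but the retention factor is computed from each input,
    letting the model 'forget' unneeded information on the fly.
    W_gate: (d_state, d_in), B: (d_state, d_in), C: (d_out, d_state)."""
    h = np.zeros(B.shape[0])
    outputs = []
    for x_t in x:
        a_t = sigmoid(W_gate @ x_t)             # input-dependent retention in (0, 1)
        h = a_t * h + (1.0 - a_t) * (B @ x_t)   # blend previous state with the new input
        outputs.append(C @ h)
    return np.stack(outputs)
```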

Several companies have started to exploit this paradigm. Mistral AI, the French startup behind Mixtral and Mistral-7B, released Codestral Mamba, a code-generation model built on the second version of the Mamba architecture. It offers smoother handling of long sequences and much better inference efficiency than conventional transformers. AI21 Labs, for its part, unveiled Jamba, a hybrid model combining Mamba blocks, transformer blocks and Mixture-of-Experts (MoE) technology.

Mixture-of-Experts: dynamic specialization to reduce costs

The principle behind Mixture-of-Experts architectures is simple on paper: instead of activating all of the model's parameters for every token, a routing system activates only a small part of the network, i.e. the most relevant "experts". This makes it possible to build very large models with tens of billions of parameters while considerably reducing inference cost. When a 40-billion-parameter MoE model activates only 10% of its weights at each step, its inference cost approaches that of a classic 4-billion-parameter model, but with much higher output quality.
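
A minimal sketch of top-k routing illustrates the idea; the expert modules and router weights here are placeholders, not any specific library's API.

```python
import numpy as np

def moe_forward(x, experts, router_W, top_k=2):
    """x: (d_model,) hidden vector for one token; experts: list of callables
    mapping (d_model,) -> (d_model,); router_W: (n_experts, d_model)."""
    logits = router_W @ x                       # router scores every expert
    chosen = np.argsort(logits)[-top_k:]        # keep only the k most relevant experts
    weights = np.exp(logits[chosen] - logits[chosen].max())
    weights /= weights.sum()                    # softmax over the selected experts only
    # Only the chosen experts run; the others cost nothing for this token.
    return sum(w * experts[i](x) for w, i in zip(weights, chosen))
```

With top_k set to 2 out of eight experts, most parameters sit idle for any given token, which is where the inference savings come from.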

Mistral was one of the first players to demonstrate the industrial viability of this approach. Its Mixtral 8x7B model relies on eight experts of 7 billion parameters each, two of which are activated for every token. It competes with far more massive models such as GPT-3.5 or Claude 1, while offering much better memory and compute efficiency. AI21 Labs also uses MoE in its Jamba model, strengthening its adaptability without weighing down inference. Mixture-of-Experts models are particularly promising in industrial contexts where the load varies with the task or the type of user.

RWKV: the return of RNNs, modernized

In contrast to transformers and Mamba, the RWKV architecture offers an original path: reinventing the RNN. The acronym stands for Receptance Weighted Key Value and reflects its half-recurrent, half-attention nature. RWKV behaves like a transformer during training, parallelizing computation on GPUs, but like an RNN during inference. This property allows sequential generation with a single memory state, without having to reload or recompute the history at each step, which makes RWKV extremely fast at inference, even on modest machines.
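
Here is a hedged sketch of what constant-cost recurrent generation looks like; step_fn and the state layout are hypothetical stand-ins for RWKV's actual time-mixing and channel-mixing blocks, but they show why each new token costs the same regardless of how long the history already is.

```python
import numpy as np

def recurrent_generate(step_fn, init_state, first_token, n_tokens):
    """step_fn(token, state) -> (logits, new_state) performs one recurrent update.
    The fixed-size state replaces the ever-growing key/value cache of a transformer."""
    token, state = first_token, init_state
    generated = [token]
    for _ in range(n_tokens):
        logits, state = step_fn(token, state)   # past tokens are never re-read
        token = int(np.argmax(logits))          # greedy decoding for simplicity
        generated.append(token)
    return generated
```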

RWKV is an open-source community project that has grown quickly thanks to its lightness. Recent versions, such as RWKV-5 World, can run on CPUs with less than 3 GB of RAM and are used in embedded chatbots, offline assistants and local AI tools. Some educational or medical applications in areas with poor connectivity also rely on it. Unlike other approaches, RWKV explicitly targets accessibility and energy frugality, while remaining competitive on standard text-generation tasks.

Although no major company has yet publicly bet on RWKV at large scale, projects such as LM Studio, Ollama and LocalAI integrate it alongside Llama or Mistral. This testifies to growing interest in this alternative architecture, especially in environments where latency, power consumption or confidentiality are critical.

And tomorrow? Towards hybrid and modular architectures

If Mamba, MoE and RWKV each embody an answer to the limits of transformers, their convergence now seems inevitable. AI21 Labs' Jamba model is a good example: it selectively combines transformer, Mamba and MoE blocks in order to exploit the best of each world. Some researchers even evoke the idea of a "transformer 2.0" that would integrate mechanisms inspired by Mamba (linear scaling), RWKV (streaming inference) and MoE (dynamic adaptation). Tomorrow's artificial intelligence could be composite, chosen according to need: transformers for short sequences, Mamba for long texts, RWKV for edge devices, MoE to adjust capacity dynamically.
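
As an illustration of what such modularity could look like in code, here is a hypothetical layer-stacking sketch loosely in the spirit of Jamba; the block constructors and the interleaving ratios below are assumptions for illustration, not AI21's published configuration.

```python
def build_hybrid_stack(n_layers, make_attention, make_mamba, make_moe_ffn, attn_every=8):
    """Interleave block types behind a common hidden_states -> hidden_states interface:
    mostly Mamba layers for linear scaling, occasional attention for precise recall,
    and sparse MoE feed-forwards to add capacity without adding per-token cost."""
    layers = []
    for i in range(n_layers):
        layers.append(make_attention() if i % attn_every == 0 else make_mamba())
        if i % 2 == 1:
            layers.append(make_moe_ffn())   # replace some dense feed-forwards with experts
    return layers
```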

As language models are called upon to integrate into billions of heterogeneous objects, services and environments, this architectural diversity looks like a necessity more than a fashion. The monopoly of transformers is probably nearing its end, not because the architecture is obsolete, but because the contemporary challenges of AI demand more frugality, modularity and scalability. In this new era, Mamba, RWKV and MoE models are not exceptions: they announce a new standard.

Jake Thompson
Growing up in Seattle, I've always been intrigued by the ever-evolving digital landscape and its impacts on our world. With a background in computer science and business from MIT, I've spent the last decade working with tech companies and writing about technological advancements. I'm passionate about uncovering how innovation and digitalization are reshaping industries, and I feel privileged to share these insights through MeshedSociety.com.
