Model size, hosting mode, prompt compression… A set of choices and good practices makes it possible to reduce the particularly high energy bill of large language models.
What if generative AI's biggest problem was its environmental footprint? Particularly energy-hungry, large language models (LLMs) drive up the electricity consumption of the data centers that host them, especially during their training phase. Each request to ChatGPT, Gemini or Claude also goes through a server that performs thousands of calculations to generate a text or an image.
Working with researchers from the University of California, Riverside, the Washington Post established that a 100-word email generated by GPT-4 represents the consumption of slightly more than a small bottle of water (519 ml) and the energy needed to power 14 LED bulbs for an hour. According to a McKinsey study, the development of AI is changing “the dynamics of the global energy market”. At the current rate of adoption of artificial intelligence, the energy consumption of data centers in Europe should almost triple, from around 62 TWh to more than 150 TWh by the end of the decade.
By publishing its environmental report in early July, Google acknowledged that its CO2 emissions had jumped 13% in 2023 and 48% since 2019. The culprit: the rise of the latest generation of chatbots. A prompt sent to a generative AI model consumes about ten times more energy than a query on a traditional search engine.
To absorb this exponential growth of the IT load, data center operators rely on virtuous cooling techniques such as free cooling, which consists in using outside air at night or in winter, or turn to renewable energy to lower their PUE (Power Usage Effectiveness), the reference indicator of a data center's energy efficiency.
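To make the metric concrete, here is a minimal sketch, with made-up figures, of how PUE is computed: the ratio of the facility's total energy draw to the energy consumed by the IT equipment alone.

```python
# Toy PUE calculation with illustrative, made-up figures.
# PUE = total facility energy / IT equipment energy; 1.0 is the theoretical ideal.
it_energy_kwh = 1_000_000        # servers, storage, network
cooling_energy_kwh = 350_000     # cooling overhead, reduced by free cooling
other_overhead_kwh = 80_000      # lighting, power distribution losses, etc.

total_facility_kwh = it_energy_kwh + cooling_energy_kwh + other_overhead_kwh
pue = total_facility_kwh / it_energy_kwh
print(f"PUE = {pue:.2f}")        # 1.43 here; efficient hyperscale sites report around 1.1
```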
Arbitrate according to the environmental cost
At its own level, a company can put in place a number of good practices to reduce the environmental footprint of the models it uses or puts into production. The first question to ask: should we really embark on a generative AI project at all? “Within the six Bouygues business lines, arbitration committees made up of business experts and data science specialists evaluate a project's interest from an operational point of view, its level of maturity and its associated cost, whether economic or environmental”, explains Christophe Lienard, President of Impact AI and Central Director of Innovation of the Bouygues group.
This “by design” approach makes it possible to stop, before it even starts, a project that would turn out to be an environmental black hole. Less energy-hungry alternative technologies should also be considered. “Using generative AI systems just because they are fashionable is of little interest,” judges Sergio Winter, AWS ML Engineer at Devoteam Revolve. “Some can be advantageously replaced by simple business rules or more traditional AI tools.”
Measure the environmental cost of a model
To control your environmental footprint, you first have to measure it. Yet not all generative AI models are equal when it comes to energy. According to Stanford University's AI Index, training GPT-3 took the equivalent of 502 tonnes of carbon dioxide emissions and 1,287 megawatt-hours of energy. For comparable performance, training BLOOM required 25 tonnes of CO2 equivalent and 433 MWh. The two models have almost the same number of parameters, 175 and 176 billion respectively.
Likewise, models do not have the same carbon footprint during the inference phase. EcoLogits offers a piece of code to track the energy consumption and environmental impacts of generative AI models called via API, including those of OpenAI, Anthropic, Google Gemini and Mistral AI. Building on EcoLogits data, Hugging Face published a calculator that converts these metrics into Paris-New York round trips or kilometers traveled by an electric car. Telling examples that help educate internal users about a reasoned use of LLMs.
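As an illustration, here is a minimal sketch of this kind of per-request tracking, following the usage pattern documented by the EcoLogits project: once initialized, the library hooks into the provider's client and attaches impact estimates to each response. The model name is a placeholder, and the exact attribute names may vary between library versions.

```python
# Minimal sketch of per-request impact tracking with EcoLogits (pip install ecologits).
# Attribute names follow the project's documentation and may differ across versions.
from ecologits import EcoLogits
from openai import OpenAI

EcoLogits.init()  # patches supported provider clients (OpenAI, Anthropic, Mistral AI...)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[{"role": "user", "content": "Summarize our Q3 report in 100 words."}],
)

# EcoLogits attaches an `impacts` object with estimated energy use (kWh)
# and global warming potential (kgCO2eq) for this single call.
print(response.impacts.energy.value, "kWh")
print(response.impacts.gwp.value, "kgCO2eq")
```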
On the developer side, according to Sergio Winter, the model should be monitored with tracking indicators. “A version change or the addition of a feature can make its carbon footprint explode,” he notes.
LLM vs SLM: size matters
For the past few months, LLMs have been challenged by SLMs (Small Language Models). These reduced models have from a few million to 10 billion parameters, against several tens or even hundreds of billions of parameters for large foundation models. While their performance is lower, they handle specific tasks with reduced latency and offer the possibility of running locally.
A smaller size means, above all, a smaller environmental footprint. “Being less general-purpose, SLMs may require retraining and fine-tuning, which will raise the energy bill,” tempers Sergio Winter. “Likewise, it may be necessary to combine several specialized models for the same project when a single LLM could have been enough.”
Between SLM and LLM, Mistral AI offers an intermediate approach. Its SMoE (Sparse Mixture-of-Experts) concept consists in activating only part of the parameters (39 out of 141 billion in the case of Mixtral 8x22B) to offer the best economic and environmental cost ratio.
Sergio Winter also mentions the quantization technique, which consists in reducing the precision of numeric values, going from 32-bit floating-point numbers to 16 bits, or even 8 bits, without losing too much quality. This approach allows “a time saving during inference, a reduction in the memory and computing power needed, and therefore a reduction in the energy consumed”, notes Sergio Winter.
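As a hedged illustration of the idea, the sketch below loads an open-weights model with 8-bit quantized weights through the Hugging Face transformers and bitsandbytes integration; the model name is a placeholder and the actual memory savings depend on the hardware.

```python
# Sketch: loading a causal LM with 8-bit quantized weights to cut memory and energy use.
# Requires transformers, accelerate and bitsandbytes, plus a CUDA-capable GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder open-weights model

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # 8-bit weights instead of fp16/fp32

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",          # spread layers over the available GPU(s)
    torch_dtype=torch.float16,  # compute dtype for the non-quantized operations
)

# Roughly half the memory of an fp16 load (a quarter of fp32), at a small quality cost.
print(model.get_memory_footprint() / 1e9, "GB")
```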
On-premise versus Cloud
Then comes the question of where to host the model. A company can decide to self-host an open source model. “It will then have to invest in its own infrastructure, used only occasionally, and acquire graphics processors, a scarce and particularly energy-hungry commodity,” notes Céline Albi, Data Science Manager at Axionable and member of the AI and Environment working group at Impact AI.
“Running an LLM on existing infrastructure will be complicated,” confirms Nicolas Cavallo, Head of Generative AI at Octo Technology. “If an organization has to replace its fleet of servers and terminals and acquire under-used and quickly obsolete graphics cards, this can weigh down its carbon footprint.” For Sergio Winter, self-hosting makes sense for batch processing, such as automatically translating a large set of articles, which avoids keeping servers running constantly.
Another possibility: calling a model hosted in the cloud through an API. With this “as a service” approach, the company benefits from a shared infrastructure and from its provider's optimization techniques, such as cache management. On a pay-per-use basis, it consumes resources only occasionally, lowering its energy bill.
Cloud providers are also the first to get the newest GPUs, both more efficient and less energy-consuming. Presented as the most powerful in the world, the future Nvidia Blackwell chip could consume up to 25 times less energy than its current equivalents.
The only problem is cloud providers' lack of transparency regarding their emission factors. Based on self-declaration, the data they publish lacks precision and is hard to compare from one cloud supplier to another. “We must put pressure on providers so that they communicate clear calculation elements,” says Céline Albi.
Failing that, a whole investigation has to be undertaken to assess the environmental cost of a cloud service from the chosen configuration (processor, graphics card, RAM, etc.) and the chosen region (the energy mix of the data center's country).
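Pending better provider data, a rough estimate can be assembled from public figures: the instance's power draw, an assumed data center PUE and the local grid's carbon intensity. The sketch below uses purely illustrative numbers, not measured values.

```python
# Back-of-the-envelope estimate of a cloud inference workload's footprint.
# All figures are illustrative assumptions, not provider-published data.
gpu_power_kw = 0.4        # average draw of one GPU under load (~400 W)
instance_hours = 200      # hours of inference per month
pue = 1.2                 # assumed data-center PUE
grid_intensity = 0.06     # kgCO2eq per kWh (low-carbon mix; ~0.4 for a carbon-heavy one)

energy_kwh = gpu_power_kw * instance_hours * pue
emissions_kg = energy_kwh * grid_intensity
print(f"{energy_kwh:.0f} kWh per month, {emissions_kg:.1f} kgCO2eq")
```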
The art of prompt engineering
A company can also reduce the number and size of its inferences. Now well known, the RAG (retrieval-augmented generation) technique consists in providing the model with content deemed reliable and asking it to base its answers primarily on the information contained in these documents. Prompt engineering, which aims to optimize the way we “speak” to the LLM, also makes it possible to reduce the number of tokens consumed with one's cloud provider.
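As a minimal illustration of the RAG idea (leaving the retrieval index itself aside), the sketch below simply injects retrieved passages into the prompt and instructs the model to answer from them; the document snippets and wording are hypothetical.

```python
# Toy sketch of RAG prompt assembly: ground the answer in retrieved passages.
# The retrieval step (vector or keyword search) is assumed to have already run.
retrieved_passages = [
    "Extract from the 2023 sustainability report: data-center PUE reached 1.25.",
    "Internal memo: GPU clusters are covered 80% by renewable energy contracts.",
]
question = "What share of our GPU clusters runs on renewable energy?"

context = "\n\n".join(retrieved_passages)
prompt = (
    "Answer using only the documents below. "
    "If the answer is not in the documents, say you do not know.\n\n"
    f"Documents:\n{context}\n\nQuestion: {question}"
)
print(prompt)  # this string is then sent to the LLM of your choice
```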
Nicolas Cavallo advocates the prompt compression technique, which consists in reducing the number of tokens fed into the models without compromising the quality of the responses generated.
By hunting down filler words and other redundant terms, the goal is to reach the most concise formulation possible. In short, express an idea with the minimum of words. “Being polite has a cost,” laughs Nicolas Cavallo. “Saying thank you is an additional call to the model.” Specific algorithms automate this prompt compression.
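To give a flavor of the principle, here is a deliberately naive sketch that strips politeness formulas and filler phrases before a prompt is sent; the dedicated compression algorithms mentioned above are far more sophisticated, dropping low-information tokens while preserving meaning.

```python
# Toy illustration of prompt compression: remove politeness and filler phrases
# to cut the number of tokens billed by the provider. Purely illustrative.
import re

FILLERS = [
    r"\bhello\b,?",
    r"\bplease\b",
    r"\bcould you\b",
    r"\bif possible\b,?",
    r"\bthank you( very much)?\b[.!]?",
]

def compress(prompt: str) -> str:
    out = prompt
    for pattern in FILLERS:
        out = re.sub(pattern, "", out, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", out).strip()

verbose = ("Hello, could you please summarize the attached meeting notes "
           "in three bullet points? Thank you very much!")
print(compress(verbose))
# -> "summarize the attached meeting notes in three bullet points?"
```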




