Generative AI: 9 levers to reduce the bill

How do you optimize the cost of a generative AI system? Several techniques can limit the expense, whether you use open source or proprietary models.

After the big wave of deployments to production comes the time of optimization. Companies that have integrated generative AI into their processes are now looking to streamline existing systems, and the first variable in their sights is cost. Even though the price of inference has fallen over the past two years, thanks in particular to progress by model providers, generative AI still represents a significant share of IT budgets. Below, we lay out the main cost levers, grouped by the type of model used: proprietary or open source.

The main levers of action on proprietary models

To optimize the cost of using proprietary models called via an API, four main levers stand out.

1. Use the right model

This is the main cost lever. Choosing the right model for your specific use case is essential, because the price differences between the latest reasoning models and small LLMs are enormous. At OpenAI, for example, the Nano version of GPT-5 is priced at $0.05 per million input tokens and $0.40 per million output tokens, while GPT-5 Pro costs $15 for input and $120 for output: a 300-fold difference on output tokens.

It is therefore worth benchmarking your use case precisely (against one or more metrics), starting with small models. Recent research points the same way: small models fine-tuned for a given task sometimes match, or even beat, general-purpose models with several hundred billion parameters.
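As an illustration, a minimal benchmarking loop might look like the following Python sketch using OpenAI's client. The eval set, the exact-match metric and the model names are placeholders to adapt to your own use case:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative labeled eval set: replace with examples from your own use case.
EVAL_SET = [
    {"prompt": "Sentiment of: 'Great product, fast delivery!'", "expected": "positive"},
    {"prompt": "Sentiment of: 'Broke after two days.'", "expected": "negative"},
]

def accuracy(model: str) -> float:
    """Crude exact-match metric; swap in whatever metric fits your task."""
    hits = 0
    for example in EVAL_SET:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": example["prompt"]}],
        )
        hits += example["expected"] in response.choices[0].message.content.lower()
    return hits / len(EVAL_SET)

# Start with the cheapest tier and only move up if the score falls short.
for candidate in ["gpt-5-nano", "gpt-5-mini", "gpt-5"]:
    print(candidate, accuracy(candidate))
```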

2. Control reasoning

Reasoning models are increasingly used in production, particularly for agentic AI, but the way they work makes them more expensive: on top of the tokens for the final output, the chain of thought (CoT) consumes a large quantity of tokens. To limit the verbosity of thought chains, the main vendors expose control variables in the API call. At OpenAI or xAI (Grok), for example, you can set the reasoning effort parameter to "low", "medium" (OpenAI API only) or "high". Anthropic and Google offer an even more precise approach with a thinking-budget parameter (budget_tokens at Anthropic, thinking_budget at Google) that caps the number of tokens dedicated solely to the CoT.
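For example, with OpenAI's Responses API, capping the reasoning effort boils down to one parameter. A minimal sketch (model name and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# "low" keeps the chain of thought short and cheap; raise it only if the
# task genuinely needs deeper reasoning.
response = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "low"},  # "low" | "medium" | "high"
    input="List three risks in this contract clause: ...",
)
print(response.output_text)
```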

3. Use prompt caching mechanisms

To cut the cost of generative AI drastically, there is also prompt caching. Available from the main providers for several months now, it caches the repetitive portions of a prompt so that internal states already computed can be reused, which translates into cost savings. To get the most out of it, place the invariant instructions and the fixed context at the start of the prompt sent to the API: the savings become real on numerous, similar calls. At OpenAI, the input cost per million tokens for GPT-5 is reduced by roughly 90% for cached tokens.
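In practice, this mostly means structuring your calls so that the invariant part always comes first. A minimal sketch, in which the system prompt and the documentation file are illustrative:

```python
from openai import OpenAI

client = OpenAI()

# Invariant instructions and fixed context go first: repeated calls then share
# the same prefix, which the provider can serve from its prompt cache.
STATIC_PREFIX = (
    "You are a support assistant for ACME Corp. Answer only from the "
    "documentation below.\n\n" + open("product_docs.txt").read()
)

def answer(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": STATIC_PREFIX},  # cached across calls
            {"role": "user", "content": question},         # only this part varies
        ],
    )
    return response.choices[0].message.content
```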

4. Use batch mode

Finally, do not hesitate to use the batch mode of the APIs when your processing does not require an immediate response. This mode lets you send batches of requests to be executed in a deferred manner, generally within 24 hours, in exchange for significant discounts on the cost of tokens (often around 50%).
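With OpenAI's Batch API, for instance, a deferred job boils down to uploading a JSONL file of requests and creating a batch. A minimal sketch, with illustrative file contents:

```python
from openai import OpenAI

client = OpenAI()

# Each line of requests.jsonl is one request, for example:
# {"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-5-mini", "messages": [{"role": "user", "content": "..."}]}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # results within 24 hours, at a discounted rate
)
print(batch.id, batch.status)  # poll later with client.batches.retrieve(batch.id)
```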

The main levers of action with open source models

For a company, using open source models has the advantage of near-total control over the chain. The levers for reducing their weight in the overall budget are numerous, but often technical and individually of limited effect: cutting costs drastically means stacking up micro-optimizations. We have therefore kept the two main levers here, the ones with the most significant impact on the final bill.

1. Use the right size and quantization

To choose the right open source model, it is important, once again, to benchmark your use case with models of increasing size. Start with an SLM (or even a classic NLP model for certain use cases) and only move up to a model with several dozen billion parameters if the results are not there. There is no point in running DeepSeek V3 when a Phi-4 is enough.

Likewise, make sure you use the appropriate quantization for your model. Benchmark at progressively lower (or higher) levels of quantization to identify the best trade-off between performance and cost. Moving from a 16-bit model to a 4-bit format, for instance, divides the required memory by four while causing only a moderate degradation in quality. Here again, the key words are testing and iteration.
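As an illustration, loading a model in 4-bit with Hugging Face Transformers and bitsandbytes takes only a quantization config. A sketch, in which the model name is one plausible choice among many:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization: roughly a quarter of the fp16 memory footprint,
# usually with only moderate quality loss. Benchmark to confirm on your task.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-4",               # illustrative model
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
```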

2. Use a production-optimized inference engine

The inference engine is arguably the second most impactful lever for optimizing the cost of an open source model. A good engine can multiply the number of queries processed with the same resources. For production, choose vLLM, TensorRT-LLM or Text Generation Inference (TGI): they optimize memory management, parallel query processing (batching) and the use of quantized models. Conversely, engines designed for local development (such as Ollama) quickly show their limits, for lack of optimized parallelization and advanced load management.
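By way of illustration, a minimal vLLM setup relies on continuous batching to process many prompts on the same hardware. A sketch, where the model and prompts are placeholders:

```python
from vllm import LLM, SamplingParams

# vLLM batches concurrent requests and manages KV-cache memory (PagedAttention),
# which is where most of the throughput gain comes from.
llm = LLM(model="microsoft/phi-4")  # illustrative model
params = SamplingParams(temperature=0.2, max_tokens=256)

prompts = [
    "Summarize in one sentence: ...",
    "Translate to English: ...",
    "Classify the following ticket: ...",
]
outputs = llm.generate(prompts, params)  # processed as a single batch
for output in outputs:
    print(output.outputs[0].text)
```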

Three tips for open source and proprietary models alike

Finally, three other tips can apply to all models, whether open source or proprietary:

  1. Explicitly require a response in a structured format such as XML or JSON in the instructions given to the model. The model then answers using only tags or keys. Beyond the efficiency gain, the structure requirement curbs free-text generation by constraining the model to produce only the required elements (see the sketch after this list).
  2. Optimize context size by summarizing previous exchanges. When a model is used in chat mode (a corporate copilot, for example), each new request re-sends the complete exchange history. Summarizing or condensing previous messages (using a smaller model with a summarization prompt) keeps only the elements relevant to the current task, which reduces the number of input tokens and therefore the bill.
  3. Finally, the last tip, and certainly the most important: make vendors compete. Continuously monitor the market, new models and the latest optimizations, then compare their performance regularly. On the proprietary side, publishers frequently adjust their prices downward as efficiency improves. On the open source side, progress is just as rapid: from one month to the next, some models reach performance equivalent to yours at a much smaller size. Benchmark regularly and do not hesitate to change models as long as your quality indicators stay within the acceptable range.
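To illustrate the first tip, here is a minimal sketch of a JSON-constrained request. The schema and prompt are illustrative, and many providers also offer a native structured-output mode that enforces a schema server-side:

```python
import json
from openai import OpenAI

client = OpenAI()

# Asking for a strict JSON object keeps the model from padding its answer
# with free text, so fewer output tokens are billed.
prompt = (
    "Extract the fields from the email below and respond ONLY with a JSON "
    'object of the form {"sender": str, "topic": str, "urgency": "low"|"high"}.\n\n'
    "Email: ..."
)
response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},  # nudges the model to emit pure JSON
)
print(json.loads(response.choices[0].message.content))
```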
Jake Thompson
Growing up in Seattle, I've always been intrigued by the ever-evolving digital landscape and its impacts on our world. With a background in computer science and business from MIT, I've spent the last decade working with tech companies and writing about technological advancements. I'm passionate about uncovering how innovation and digitalization are reshaping industries, and I feel privileged to share these insights through MeshedSociety.com.
