This underutilized technique can significantly reduce API calls to AI models for similar prompts.
LLMs remain very expensive in 2026. Whether accessed through an API or run on-device, each call to a large model carries a real cost, so reducing the number of API calls, or the tokens consumed, is a priority for companies that have deployed an enterprise chatbot or customer service assistant. These applications are heavy token consumers: every user question triggers a model call, with costs that scale with the load.
To limit the cost of fixed prompts, prompt caching is now widely available from model providers. But an even more effective technique remains underexploited: semantic caching. It promises a considerable reduction in latency and a drastic cut in the number of calls to the model. Here is how it works.
The limit of prompt caching
Massively deployed by model providers since the end of 2024 (OpenAI, Google, Anthropic, but oddly not Mistral AI), prompt caching already delivers a significant reduction in token costs. The principle: the fixed parts of a prompt, those that never change from one request to the next, are cached so the model does not reprocess them on each call, which saves around 90% of the cost of those input tokens. Concretely, you pay full price only for the variable tokens, not for the stable context already in the cache, and this from the second request onward.
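A back-of-the-envelope calculation makes the savings concrete. The price and the exact discount below are hypothetical assumptions for illustration (actual provider pricing varies), but the shape of the result holds:

```python
# Hypothetical pricing assumptions, for illustration only:
PRICE_PER_TOKEN = 2.50 / 1_000_000  # $2.50 per million input tokens (made-up rate)
CACHE_DISCOUNT = 0.90               # ~90% off cached input tokens

def input_cost(fixed_tokens: int, variable_tokens: int, cached: bool) -> float:
    """Input-token cost of one request, with or without a cache hit on the fixed part."""
    fixed_rate = PRICE_PER_TOKEN * (1 - CACHE_DISCOUNT) if cached else PRICE_PER_TOKEN
    return fixed_tokens * fixed_rate + variable_tokens * PRICE_PER_TOKEN

# A 10,000-token system prompt plus a 200-token user question, over 1,000 requests.
# Without caching, every request pays full price; with caching, only the first does.
no_cache = 1000 * input_cost(10_000, 200, cached=False)
with_cache = input_cost(10_000, 200, cached=False) \
           + 999 * input_cost(10_000, 200, cached=True)

print(f"without caching: ${no_cache:.2f}, with caching: ${with_cache:.2f}")
```

Under these assumptions the bill drops from about $25.50 to about $3.02, roughly the 90% saving on the stable prefix that providers advertise, diluted slightly by the variable tokens that always pay full price.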
But this system has its limits. It only caches the stable input portion of the prompt, not the response, which the model regenerates in full every time; no savings are possible on the output. Likewise, if a prompt is semantically close to another but not worded identically, prompt caching does not apply. In a corporate or customer service chatbot, where employees or customers often ask the same questions with different wording, prompt caching therefore cannot deliver its full benefit.
Semantic caching, for semantically similar prompts
To overcome this structural limitation, semantic caching has emerged over the past several months. Rather than reducing the number of tokens the model processes, it drastically cuts the number of API calls themselves, whether the model is deployed locally or behind a cloud API. The goal: a request with the same intent (and therefore semantically very close to a previous one) triggers an API call only once. On subsequent requests, if the query is close enough to a previous one, the system returns the stored response directly without calling the model again.
Example of three semantically very close prompts where the same response can be proposed:
Prompt 1: “My printer no longer works, what should I do?”
Prompt 2: “The printer is broken, how do I repair it?”
Prompt 3: “I have a problem with my printer, it’s not working”
The principle is simple, but how does it work technically? The system vectorizes each user request with an embeddings model, stores the vector in a semantic index, and stores the response generated by the API alongside it. When a new request arrives, its vector is compared against the stored ones using a similarity measure (cosine similarity, for the more mathematically inclined).
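A minimal sketch of that loop in Python. The bag-of-words `embed` below is a toy stand-in for a real embeddings model, and the linear scan over entries stands in for a proper vector index; both names and the `answer` helper are illustrative, not part of any library:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a word-count vector. A real system would call an
    # embeddings model here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold
        self.entries = []  # list of (vector, response) pairs

    def lookup(self, prompt: str):
        v = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(v, e[0]), default=None)
        if best and cosine(v, best[0]) >= self.threshold:
            return best[1]  # cache hit: the model is never called
        return None

    def store(self, prompt: str, response: str):
        self.entries.append((embed(prompt), response))

def answer(cache: SemanticCache, prompt: str, call_model):
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached
    response = call_model(prompt)  # the only place the API is hit
    cache.store(prompt, response)
    return response
```

In use, the first question pays for a model call; a close rewording of it is answered straight from the cache, which is exactly the behavior described for the three printer prompts above.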
The similarity threshold can be tuned to the desired level of precision: above 80% (0.8) for basic tasks, and 95 to 98% (0.95 to 0.98) when reliability matters. The higher the threshold, the fewer queries are treated as similar and the fewer cache hits you get; a threshold of 85% (0.85) is therefore a reasonable middle ground, cautious without being too restrictive. If the similarity is greater than or equal to the threshold you set, the system directly returns the already-generated response associated with the matching vector. Easy as pie, right? The gains are clear, though they vary by study: semantic caching saves on average, depending on the deployment context, between 40 and 70% of API calls. Latency also drops drastically, since the model no longer computes the response: the system simply returns the one stored in cache.
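The effect of the threshold can be seen with a small numeric example. The three-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

cached_vec = [0.9, 0.1, 0.3]  # embedding of the prompt already in cache
rewording = [0.6, 0.5, 0.3]   # a paraphrase of that prompt (made-up vector)
unrelated = [0.1, 0.9, 0.2]   # an unrelated question (made-up vector)

for threshold in (0.80, 0.95):
    for name, query in (("rewording", rewording), ("unrelated", unrelated)):
        verdict = "HIT" if cosine(cached_vec, query) >= threshold else "MISS"
        print(f"threshold {threshold}: {name} -> {verdict}")
```

At 0.80 the paraphrase is a cache hit and the unrelated question a miss, as intended; at 0.95 even the paraphrase becomes a miss. That is the trade-off: a stricter threshold buys reliability at the cost of fewer saved API calls.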
Implement semantic caching in your projects
Ready-to-use frameworks have flourished on GitHub in recent months. One of the best known, GPTCache, has been integrated into the LangChain framework; it is easy to use and well documented. Cloud providers also offer turnkey solutions. Microsoft publishes numerous examples of building a chatbot with semantic caching on its Azure Cosmos DB vector database or via its Azure OpenAI service. Google provides a complete guide to building a semantic caching system on Vertex AI, combining its Text Embeddings model with Vector Search. AWS offers a solution based on its Titan embeddings model and MemoryDB. In short: there is no shortage of options.
When does semantic caching make sense?
To be clear, semantic caching does not replace prompt caching: it adds an extra layer that avoids API calls altogether. It is particularly relevant when your users ask the same questions with different wording. This is typically the case for an internal chatbot (the same HR questions asked by different employees) or a customer service chatbot where the same answers come up regularly (return policy, questions about a specific product, and so on). In these cases the technique comes into its own and spares the model.
Prompt caching, for its part, remains fully effective for general-purpose chatbots used mainly as utilities (generate a text, summarize this, do that). There, questions differ drastically from one user to the next, but the system prompt and usage context are often identical, so there is little point investing time and energy in semantic caching.




