Dynamic speculative planning: the method that cuts the cost of AI agents by 30%

Researchers present DSP, a framework for optimizing the cost and latency of LLM-based agents.

Despite marketing campaigns galore, AI agents are struggling to make it into production in companies. The reasons are many, but two factors come up regularly: the cost and the response time of generative models. Agentic systems indeed rely on heavy reasoning models that, by design, consume more tokens and are slower at inference.

To optimize how the core model of an agentic workflow operates, researchers from several North American universities, Microsoft and DeepMind have developed a framework that speeds up the workflow while cutting costs by roughly 30%.

Speculative execution at the heart of the method

The researchers' technique improves on speculative execution (ISP, Independent Speculative Planning). Rather than using a single model and waiting for it to plan and execute each step sequentially, ISP predicts the next n steps in parallel. More precisely, two LLMs are used: a lightweight model (A) that predicts the next n tasks, and a heavier model (B) that checks, and corrects where necessary, the tasks produced by the lightweight model. Everything is parallelized to maximize response speed.

Concretely, model A executes the steps of the agentic workflow (the actions needed to complete a task) ahead of time, while the more robust model B checks those steps in parallel. If B detects an error, the invalid predictions are discarded and execution resumes from the last correct step. The parameter K defines the maximum number of steps A can speculate before validation: a small K limits the share entrusted to the fast model (favoring B's reliability), while a large K leans more heavily on A to speed up execution, at the risk of wasting tokens if an error occurs. With ISP, K is set manually according to the desired trade-off between speed and accuracy. Fixing K to an arbitrary value, however, is rarely optimal.
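To make the mechanism concrete, here is a minimal Python (asyncio) sketch of a speculative planning loop with a fixed depth K. The callables draft_llm and verify_llm, their signatures and the loop structure are assumptions made for illustration, not the researchers' actual implementation.

```python
import asyncio

async def speculative_plan(draft_llm, verify_llm, state, k, max_steps=20):
    """Run one agentic task with fixed-depth speculation.

    draft_llm(state, prefix)  -> next step proposed by the small model A
    verify_llm(state, prefix) -> next step recomputed by the large model B
    Both are async callables; their signatures are assumptions of this sketch.
    """
    plan = []  # steps accepted so far
    while len(plan) < max_steps:
        drafts, checks = [], []
        for i in range(k):
            # Model A drafts step i cheaply, conditioned on the accepted plan
            # plus its own previous drafts.
            drafts.append(await draft_llm(state, plan + drafts))
            # Model B's check of that step is launched immediately (not awaited),
            # so verification overlaps with A drafting the following steps.
            checks.append(asyncio.create_task(verify_llm(state, plan + drafts[:i])))
        references = await asyncio.gather(*checks)

        # Keep A's drafts up to the first disagreement; on a mismatch, fall back
        # to B's answer and discard the remaining speculated steps.
        for draft, reference in zip(drafts, references):
            if draft == reference:
                plan.append(draft)
            else:
                plan.append(reference)
                break
        # A real implementation would also stop once the plan signals completion.
    return plan
```

A larger k lets the loop accept several cheap steps per round when A is right, while a wrong draft only costs the speculated steps after the first mismatch, which is exactly the speed/waste trade-off described above.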

The researchers' goal was to automate the choice of K intelligently, adapting it dynamically to each task and each reasoning step in order to achieve the best trade-off between execution speed and result accuracy. To do so, they developed an online reinforcement learning system that trains a small prediction model (DistilBERT) continuously and asynchronously during execution. The model learns to predict the optimal value of K for each step, without requiring a pre-training phase. This is what the researchers call Dynamic Speculative Planning, or DSP.
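The sketch below shows one way such a per-step K predictor could look, using a DistilBERT classification head over K values. The names predict_k and online_update, the K_MAX bound, and the supervised-style update (using the number of steps B actually accepted as the target) are simplifications for illustration; the paper's asynchronous online reinforcement learning setup is more involved.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

K_MAX = 8  # illustrative upper bound on the speculation depth

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=K_MAX
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def predict_k(step_description: str) -> int:
    """Choose a speculation depth K (1..K_MAX) for the upcoming step."""
    inputs = tokenizer(step_description, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1)) + 1

def online_update(step_description: str, accepted_steps: int) -> None:
    """After execution, take one gradient step toward the depth that was actually
    useful (how many speculated steps B accepted). The paper trains this predictor
    with an online RL objective; this supervised update is a deliberate shortcut."""
    target = torch.tensor([min(max(accepted_steps, 1), K_MAX) - 1])
    inputs = tokenizer(step_description, return_tensors="pt", truncation=True)
    loss = model(**inputs, labels=target).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```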

Convincing results

To evaluate DSP performance and measure its impact on costs and accuracy, the researchers integrated their method into realistic agentic workflows. They compared three configurations: classic sequential execution, ISP with manually set K values, and DSP with dynamic adjustment. At each execution, they measured the total cost (number of tokens generated), the wasted cost (tokens invalidated by the verifying model), the response latency and the final quality of the results.
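For readers who want to reproduce this kind of comparison, here is a minimal sketch of how those metrics could be aggregated per run. The record fields and function names are illustrative assumptions, not the researchers' exact accounting.

```python
from dataclasses import dataclass

@dataclass
class RoundRecord:
    """One speculation round of a run; field names are illustrative."""
    draft_tokens: int      # tokens generated by model A this round
    verify_tokens: int     # tokens generated by model B this round
    wasted_tokens: int     # tokens spent on steps that B ultimately rejected
    wall_clock_s: float    # measured latency of the round (drafting and checking overlap)

def summarize(rounds: list[RoundRecord]) -> dict[str, float]:
    """Aggregate total cost, wasted cost and latency for one run."""
    return {
        "total_tokens": sum(r.draft_tokens + r.verify_tokens for r in rounds),
        "wasted_tokens": sum(r.wasted_tokens for r in rounds),
        "latency_s": sum(r.wall_clock_s for r in rounds),
    }
```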

The results live up to the promise. With DSP, the researchers observed a cost reduction of around 30% compared with sequential agentic execution (currently used in most use cases on the market), with no degradation in accuracy relative to standard execution. In theory, DSP therefore saves 30% of the cost on average over a classic agentic workflow, without reducing the final accuracy.

How to set up DSP

The method can be reproduced and adapted to agentic business cases. The researchers published the project code on GitHub so that interested teams can experiment with DSP in their own workflows. The repository contains the framework and the test scripts used in their experiments (notably on the OpenAGI and TravelPlanner benchmarks), which will need to be adapted to each company's real use cases.

For their evaluations, the researchers used as model A a lightweight, fast and inexpensive LLM, such as GPT-4.1-mini or DeepSeek-chat, responsible for generating the next steps. The heavier model B, such as GPT-4.1-mini in ReAct (reasoning) mode or DeepSeek-reasoner, handled verification and correction of the first model's outputs. Finally, for the classification part, a small separate model (of the DistilBERT type) served as predictor to dynamically estimate the optimal value of K.
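A configuration mirroring that setup might look like the following. The key names and values are hypothetical and only illustrate the three roles; the actual repository uses its own configuration format.

```python
# Hypothetical DSP configuration: one drafting agent, one verifying agent,
# and a small classifier that predicts the speculation depth K per step.
DSP_CONFIG = {
    "draft_agent": {                  # model A: fast, cheap step generation
        "model": "gpt-4.1-mini",      # or "deepseek-chat"
        "temperature": 0.0,
    },
    "verify_agent": {                 # model B: verification and correction
        "model": "gpt-4.1-mini",      # run in ReAct mode, or "deepseek-reasoner"
        "temperature": 0.0,
    },
    "k_predictor": {                  # small classifier estimating K per step
        "model": "distilbert-base-uncased",
        "k_max": 8,                   # illustrative upper bound on speculation depth
        "online_training": True,      # updated asynchronously during execution
    },
}
```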

Note that the implementation requires machine-learning skills and a fairly complex technical architecture.

Jake Thompson
Growing up in Seattle, I've always been intrigued by the ever-evolving digital landscape and its impacts on our world. With a background in computer science and business from MIT, I've spent the last decade working with tech companies and writing about technological advancements. I'm passionate about uncovering how innovation and digitalization are reshaping industries, and I feel privileged to share these insights through MeshedSociety.com.
