Behind the technical scenes of Alexa+, “one of the most complex AIs in the world”

Alexa+ relies on a (very) complex orchestration of more than 70 specialized models, both in the cloud and on the device.

“Transform your requests into actions.” That is the ambition Amazon keeps hammering home for the new generation of Alexa+. The voice assistant, integrated across the entire Echo range of connected devices, is about to undergo its biggest transformation since its launch in 2014. Recent advances in artificial intelligence now allow Alexa+ to understand user context, hold natural conversations and carry out concrete actions on the user’s behalf. The system, which can fairly be described as agentic, rests on a (very) complex orchestration of specialized AI models.

Alexa+ should give users a truly proactive personal assistant. Amazon’s AI will be able to orchestrate complete tasks in the real world, going well beyond answering factual questions. For example, the system will be able to plan a family dinner that accounts for each household member’s dietary restrictions, compose a suitable menu, order the necessary ingredients and organize delivery, all while keeping the user’s entire personal context in mind (location, number of children, food preferences, calendar, etc.).

How does Alexa+ work technically?

The architecture of Alexa+ is, according to Tom Butler, principal scientist at Amazon, “one of the most complex AI applications currently in production in the world”. This complexity stems directly from its ability to simultaneously process multimodal inputs (voice, text, image and video), orchestrate real-world actions via hundreds of third-party APIs, and render responses through visual, textual or voice output channels. The system relies on several central LLMs. Amazon remains discreet about the precise identity of the LLMs behind Alexa+, but confirms that the architecture mainly relies on models available via Amazon Bedrock, its platform for accessing third-party foundation models.

These are augmented by more than 70 additional specialized models, each fine-tuned to accomplish a specific task. “Our specialized models are calibrated to perform precise operations such as content synthesis, understanding visual content, or even retrieval on specialized content,” explains Tom Butler. The system is thus capable of determining in real time which models to activate depending on the nature of the user request, executing the computations in parallel across the selected models, then aggregating the results into a coherent, contextualized response.
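The route-then-aggregate pattern described above can be sketched in a few lines. This is a hedged illustration only: the model names, routing rules and aggregation step are invented, and the real system selects among 70+ models with far richer logic.

```python
# Toy sketch of routing a request to specialized models, running them in
# parallel, and aggregating the outputs. All names and rules are invented.
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for specialized models (the real system has more than 70).
SPECIALIZED_MODELS = {
    "summarizer": lambda req: f"summary of: {req}",
    "vision": lambda req: f"visual analysis of: {req}",
    "retriever": lambda req: f"retrieved docs for: {req}",
}

def route(request: str) -> list[str]:
    """Decide which specialized models a request needs (invented rules)."""
    selected = ["summarizer"]
    if "photo" in request or "image" in request:
        selected.append("vision")
    if "find" in request or "search" in request:
        selected.append("retriever")
    return selected

def answer(request: str) -> str:
    """Run the selected models in parallel, then aggregate their outputs."""
    selected = route(request)
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(SPECIALIZED_MODELS[name], request) for name in selected]
        partials = [f.result() for f in futures]
    # A central LLM would normally fuse these; here we simply join them.
    return " | ".join(partials)
```

The key design point is that routing happens before any expensive computation, so only the relevant models consume compute for a given request.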

Finally, for the action side in the real world, Alexa+ relies on more than a hundred third-party APIs. To keep latency consistent with natural conversation, Amazon uses “speculative execution”: in short, the system anticipates the user’s request and preemptively launches the necessary computations. Alexa+ also uses a prompt-caching system, in which common prompt portions are precomputed, to gain further speed and save compute.

The difficult internationalization of Alexa+

While it allows for more natural conversation, generative AI also brings new problems. “Despite all the progress in generative AI, LLMs remain mainly trained on data in English,” Tom Butler points out. This asymmetry degrades performance as soon as the system operates in other languages. The family-dinner planning example illustrates the difficulty well: “even the most powerful LLMs have more difficulty with this task in Spanish than in English,” explains the Amazon scientist. In more underrepresented languages, the problem is even greater.

To overcome this limitation, Amazon’s teams turned to transfer learning techniques. “Very concretely, we explicitly show Alexa how to transform a request into a series of concrete actions that it can execute in the real world, in the user’s language,” explains Tom Butler. The models thus learn to decompose the request, select the relevant APIs, generate the necessary arguments and chain these calls with great precision. And when Alexa+ improves in a language like Italian, “these improvements partially carry over to other related languages, such as Portuguese and Spanish.” This cross-transfer phenomenon allows Alexa+ to gain reliability without completely retraining the models for each language.
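The decompose-select-chain pattern Tom Butler describes can be sketched as follows. The plan format, API names and arguments here are entirely hypothetical; in Alexa+ an LLM generates the plan, whereas this sketch hard-codes one for the dinner example.

```python
# Hypothetical sketch of turning a request into a chained series of API
# calls: decompose, select APIs, generate arguments, execute in order.
# Every API name and field below is invented for illustration.

def plan_dinner(request: dict) -> list[dict]:
    """Decompose a high-level request into an ordered list of API calls."""
    restrictions = request.get("dietary_restrictions", [])
    return [
        {"api": "recipes.search", "args": {"exclude": restrictions}},
        {"api": "grocery.order", "args": {"items_from": "recipes.search"}},
        {"api": "delivery.schedule", "args": {"when": request["date"]}},
    ]

def execute(plan: list[dict]) -> list[str]:
    """Chain the calls in order, each step feeding the next (stubbed here)."""
    results = []
    for step in plan:
        results.append(f"called {step['api']} with {step['args']}")
    return results

plan = plan_dinner({"dietary_restrictions": ["gluten"], "date": "saturday"})
```

Because the plan is an explicit, language-independent structure, teaching the model to produce it in one language gives the related-language transfer the article describes a concrete target to carry over.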

Amazon has also begun its international deployment, starting in Mexico, without communicating a precise timetable for other markets. According to our information, Alexa+ should not arrive in France before 2026, giving Amazon’s teams time to finalize the linguistic adaptations and integrate French cultural specificities into the system.

Jake Thompson
Growing up in Seattle, I've always been intrigued by the ever-evolving digital landscape and its impacts on our world. With a background in computer science and business from MIT, I've spent the last decade working with tech companies and writing about technological advancements. I'm passionate about uncovering how innovation and digitalization are reshaping industries, and I feel privileged to share these insights through MeshedSociety.com.
