By simulating real environments, world models make it possible to increase the capabilities of humanoid robots, helping them understand the laws of physics.
World models, which make it possible to train AI agents by simulating reality with an unprecedented degree of precision, are at the heart of every discussion in the tech ecosystem. The sums invested in the sector since the start of the year speak for themselves: the start-up World Labs, founded by AI pioneer Fei-Fei Li, has raised a billion dollars. AMI Labs, co-founded by another artificial intelligence legend, Yann LeCun, has joined the club of French unicorns with a $1.03 billion funding round. And the American start-up Runway has raised $315 million. The giants Google, Meta and NVIDIA have also identified the potential of this emerging technology, as has Amazon founder Jeff Bezos, who has staked out a position by co-founding Project Prometheus, focused on physical AI.
While the potential applications of world models are numerous, from autonomous vehicles to scientific research to video games, the humanoid robotics sector could be the main beneficiary. Applied to robotics, this type of AI model is called a robotic world model (RWM). It promises to significantly improve the capabilities of humanoids by letting them internalize the dynamics of the physical world.
A new era for humanoid robotics
“If AI wants to be truly useful, it must understand worlds, not just words,” Fei-Fei Li said when announcing the World Labs fundraising in February. Her point was that LLMs like ChatGPT or Claude operate on text by predicting the next word, while world models predict the next “state” of the world that results from an action taken by the agent operating in it. To do this, these models rely on immense amounts of multimodal data (videos, images, audio, depth data and robotic sensor readings), which allows them to learn the characteristics of the physical world: gravity, friction, interactions with objects and the laws governing these interactions.
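The difference between predicting a next word and predicting a next state can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions: a real world model is a large network learned from multimodal data, whereas here the "model" is a hand-written rule with gravity as its only law of physics. The names `State`, `predict_next_state`, `GRAVITY` and `DT` are hypothetical and do not come from any product mentioned in this article.

```python
from dataclasses import dataclass

GRAVITY = -9.81  # m/s^2; the only physical law in this toy world (assumption)
DT = 0.1         # seconds simulated per prediction step (assumption)


@dataclass(frozen=True)
class State:
    """Hypothetical state of a single object: height above the floor and vertical speed."""
    height: float
    velocity: float


def predict_next_state(state: State, thrust: float) -> State:
    """Toy stand-in for a learned world model: (state, action) -> next state.

    `thrust` is the agent's action, expressed as an upward acceleration in m/s^2.
    """
    acceleration = GRAVITY + thrust
    velocity = state.velocity + acceleration * DT
    height = max(0.0, state.height + velocity * DT)
    if height == 0.0:
        velocity = 0.0  # the object has reached the floor and stops
    return State(height, velocity)
```

With no thrust, an object released at a height of one meter is predicted to be lower and moving downward one step later; with a thrust that exactly cancels gravity, the predicted state is unchanged. A language model answers "which word comes next?", while this function answers "what will the world look like if I do this?".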
Until now, training the AI that powers modern humanoid robots has relied on LLMs, video models, vision-language-action (VLA) models, 3D simulations, or even teleoperated sessions in real conditions. But each of these methods, effective when robots are deployed in predictable environments, shows its limits when robots are confronted with more complex situations. For example, these methods make it possible to teach robots to move, but struggle to teach them to manipulate objects. By aggregating these different types of data and integrating the laws of physics, world models offer an answer to these gaps.
They thus allow robots to learn through experience, by anticipating and then evaluating the consequences of their actions. They can perform thousands of iterations within the simulation, receive feedback, and adjust their behavior accordingly, without ever breaking a real object or hurting anyone. This brings us closer to the way animals and humans learn.
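The loop described above (imagine an action, evaluate its consequences, adjust) can be sketched in a few lines. This is a minimal illustration, assuming a one-dimensional world and a simple "random shooting" planner; it is not the method of any company named in this article. The function names `world_model`, `imagined_cost` and `plan`, and the 0.1 step size, are hypothetical.

```python
import random


def world_model(position: float, action: float) -> float:
    """Hypothetical stand-in for a learned model: predicts the next position."""
    return position + 0.1 * action


def imagined_cost(start: float, actions: list[float], target: float) -> float:
    """Roll a candidate plan forward inside the model and score the outcome.

    The cost is the remaining distance to the target, so lower is better.
    """
    position = start
    for action in actions:
        position = world_model(position, action)
    return abs(target - position)


def plan(start: float, target: float, horizon: int = 10,
         candidates: int = 2000, seed: int = 0) -> tuple[list[float], float]:
    """Try many random action sequences in imagination and keep the best one."""
    rng = random.Random(seed)
    best_actions: list[float] = []
    best_cost = float("inf")
    for _ in range(candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        cost = imagined_cost(start, actions, target)  # feedback from the simulation
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost
```

Calling `plan(start=0.0, target=0.2)` evaluates two thousand imagined attempts before a single real movement is made. Only the winning sequence would be sent to the robot, which is why nothing gets broken along the way.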
World models could thus allow the emergence of humanoid robots with capabilities similar to ours. “Over the next few months, we’re going to experience a ‘ChatGPT moment’ in robotics,” Andy Chen, head of special projects at Runway, told Journal du Net. “As world models and world simulators grow in scale, companies like Runway will develop ever larger and more capable models. This will pave the way for greater generalization, allowing robots to begin to act like humans, being able to perform a wide variety of tasks rather than being limited to specific functions.”
Data-intensive models
Before they can establish themselves as a truly effective solution, however, world models face certain obstacles. To capture reality in all its nuances, they require even greater quantities of data than LLMs. Even tasks that are simple for a human, like opening a door or grabbing a glass, involve a multitude of micro-variations that are sometimes difficult to capture.
Furthermore, unlike text or images, there is currently little “action – consequence” data. Videos alone, for example, are not enough, because they show what is happening, not why. Finally, physical interactions are expensive to record. This explains why many players in robotics and embodied AI (including 1X, Agility, Figure and NEURA Robotics) use the world model platform launched by NVIDIA, Cosmos, trained on more than 20 million hours of data from the real world.
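To see why video alone falls short, it helps to look at the shape of the data a world model learns from. The sketch below is an assumption about a typical layout, not the format used by Cosmos or any other platform: each training example pairs a state with the action taken and the state that followed. The names `Transition` and `to_transitions` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """One hypothetical "action - consequence" example."""
    state: tuple[float, ...]       # what the sensors saw before acting
    action: tuple[float, ...]      # the command that was issued
    next_state: tuple[float, ...]  # what the sensors saw afterwards


def to_transitions(states: list[tuple[float, ...]],
                   actions: list[tuple[float, ...]]) -> list[Transition]:
    """Pair each logged action with the state before it and the state after it."""
    if len(states) != len(actions) + 1:
        raise ValueError("expected exactly one more state than actions")
    return [
        Transition(states[i], actions[i], states[i + 1])
        for i in range(len(actions))
    ]
```

A plain video supplies only the `states` list. Without the matching `actions`, no transition can be built, which is the sense in which footage shows what happened but not why.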
As with LLMs, another major challenge concerns the relevance of the data used to train world models. “At Runway, we prioritize data quality over quantity,” says Andy Chen. “This includes, for example, collaborations with players in the film and creative industries, including Lucasfilm. The goal is to have truly high-quality data, not just to increase scale with random videos from the Internet.”
World models, the key to AGI?
World models promise to propel robotics into a new era, and according to some they could even be the missing link on the path to artificial general intelligence (AGI). While Sam Altman and the creators of ChatGPT remain convinced that LLMs can bring about such an entity, many specialists believe that text alone will not be enough. “A chatbot can pass a law exam with flying colors, but it cannot understand physical space the way a cat does, the kind with whiskers,” explained Yann LeCun last February; he also prefers the term “Advanced Machine Intelligence” (AMI). “It is no longer a question of generating the most probable sequence, as in language, but of constructing an abstract representation of the world, one that knows how to ignore unpredictable elements and retain the useful structure.”
Will world models, by allowing AI agents to perceive the physical world in all its subtleties and to interact with it, open the way towards a form of artificial consciousness? It remains difficult to say, but what is certain is that humanoid robots are preparing to become a little more “human”.




