Andrew Wooten (Rhoda AI): “Rhoda AI has developed the fastest and most accurate video model ever”

Rhoda AI, one of Silicon Valley’s most prominent AI and robotics startups, has just raised $450 million. Its co-founder, Andrew Wooten, presents his “Direct Video-to-Action” artificial intelligence model.

JDN. Can you introduce Rhoda AI and the specific problem you want to solve?

Andrew Wooten. The world of robotics is divided in two. On the one hand, there are traditional industrial robots: they are widespread in factories but remain rigid, pre-programmed to tirelessly repeat the same task. On the other, there are so-called “intelligent” robots, capable of learning from data to accomplish variable tasks in unknown environments. The problem is that when we looked at this second category, we observed that almost all of them got stuck in the laboratory. None were deployed in real conditions. The real bottleneck is not the hardware, but the absence of a sufficiently general AI model. Rhoda AI was created to free these robots and finally bring them into the real world.

How does your approach differ from current models?

The industry has reached a dead end with current models, called VLAs (Vision-Language-Action models), which attempt to graft robotic data onto ChatGPT-like language models. The problem is simple: language does not teach physics. This has pushed the industry into a data race where thousands of contractors are hired to teleoperate machines to create robotic datasets. This is the fundamental error of current robotics. To make OpenAI’s models intelligent, they were trained on Internet-scale data. If OpenAI had simply had its own engineers write books, ChatGPT would never have left the laboratory. At Rhoda, we had a different intuition: to teach a robot to move, we must train it on what the Internet already has in abundance: video.

Why do you think video is the best teacher for a robot?

Because video is the richest source of data for understanding movement, dynamics and the laws of physics. By observing billions of hours of content, video generation models like Sora or Luma have proven that AI can simulate coherent physics through simple observation. Our flagship innovation consists of using this lever to “solve” physics in the digital domain before translating it into the mechanical domain. We have built a bridge between these two worlds: our model is capable of generating a video of an action in real time and then instantly converting it into motor commands. This is the Direct Video-to-Action (DVA) model.

How does this translate into practice? Can you give us an example?

Let’s take an earbud case. To open it, a traditional robot would need to be shown the gesture thousands of times. Rhoda’s robot observes the object and instantly generates an internal video showing its own arm opening the case. Once this video is produced by the AI, the system translates it directly into physical movements. The robotics problem then takes on a different dimension: we no longer look for unobtainable robotic data, we exploit an infinite stock of video data.

What obstacles did you have to overcome to arrive at the DVA model?

The concept came up against two technical obstacles: the slowness of classic video models and their lack of precision. Our team, made up of people from NVIDIA, World Labs and OpenAI, spent 18 months developing the fastest and most precise video model ever designed. By predicting the immediate future through images and converting it directly into action, we are making a total paradigm shift. Where traditional industry requires 1,000 hours of data to automate a factory task, our model can master a complex action in just 10 hours, while remaining capable of adapting to a change in environment without retraining.

There’s a lot of talk about World Models at the moment. Is your model one?

The term “World Models” is thrown around quite loosely these days. I define a World Model as a model that takes actions as input and predicts the state of the world as output. It’s a bit like a simulator based on generative AI. What we do is slightly different: we observe a state in order to generate actions. There are similarities: both are trained on internet-scale video. But we wanted to specifically describe our approach, “Direct Video-to-Action”, rather than slotting ourselves into a trendy category.
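The distinction drawn above can be sketched as two function signatures. This is only an illustration under stated assumptions: the toy types, the functions `toy_world_model` and `toy_policy`, and the target position are hypothetical and do not come from Rhoda AI.

```python
from typing import Callable

State = float   # toy stand-in for an observed world state
Action = float  # toy stand-in for a motor command

# World model, as defined in the answer above: state and action in, predicted state out.
WorldModel = Callable[[State, Action], State]

# The approach described in the interview: observed state in, action out.
Policy = Callable[[State], Action]


def toy_world_model(state: State, action: Action) -> State:
    """Hypothetical simulator: the action is simply added to the state."""
    return state + action


def toy_policy(state: State) -> Action:
    """Hypothetical policy: move toward an assumed target position of 1.0."""
    return 1.0 - state


# A world model can be used to check what a policy's action would lead to.
predicted = toy_world_model(0.25, toy_policy(0.25))
```

In this toy setting the two pieces are complementary: the policy proposes an action from an observation, and the world model predicts where that action leads.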

How do you deal with the potential gap between training and reality when it comes to physical execution and movement?

What you’re describing is the Sim-to-Real gap, which explains why most policies trained on simulated data fail in the real world. The big difference is that we operate in a “closed loop” on the robot. We do not use synthetic data from a simulator to train the robot. The model is trained on web videos and refined using robotic data. In operation, the robot observes the situation, generates a video prediction several hundred times per second and converts this prediction into action. If the robot misses its gesture or if an object falls, it immediately observes it, and the following video prediction adjusts accordingly. This allows us to handle tasks that are traditionally impossible to simulate, such as handling plastic bags, deformable materials or t-shirts.
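The closed loop described above (observe, predict, convert to action, observe again) can be sketched with a deliberately tiny example. Everything here is an assumption made for illustration: a one-dimensional position stands in for a camera frame, and `predict_next_frame`, `frame_to_action` and `run_closed_loop` are hypothetical names, not Rhoda AI’s model or API.

```python
from typing import List

State = float    # toy stand-in for a camera frame: a 1-D gripper position
Action = float   # toy stand-in for a motor command: a position delta

TARGET = 1.0     # assumed goal position for the toy task


def predict_next_frame(observed: State) -> State:
    """Hypothetical 'video model': imagines the next frame, halfway to the target."""
    return observed + 0.5 * (TARGET - observed)


def frame_to_action(observed: State, predicted: State) -> Action:
    """Hypothetical decoder: the command that moves from the observed to the predicted frame."""
    return predicted - observed


def run_closed_loop(start: State, steps: int, disturbance_at: int = -1) -> List[State]:
    """Repeat observe -> predict -> act; an optional disturbance resets the position."""
    position = start
    history = [position]
    for t in range(steps):
        if t == disturbance_at:
            position = 0.0                                # simulated slip or dropped object
        observed = position                               # 1. observe the current situation
        predicted = predict_next_frame(observed)          # 2. predict the next frame
        action = frame_to_action(observed, predicted)     # 3. convert the prediction to an action
        position = observed + action                      # 4. execute the action
        history.append(position)
    return history


history = run_closed_loop(start=0.0, steps=20, disturbance_at=5)
```

Because each prediction starts from a fresh observation, the disturbance at step 5 is absorbed by the following iterations rather than accumulating as an error.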

Have you already integrated your model into production environments?

We started by interacting with nearly 100 major global companies from the logistics and manufacturing sectors. We found that they all had the same requirements: the robot must be certified for safety, it must be reliable, and it must support a substantial payload – around 22 kg to meet industry standards. We reviewed over 120 hardware options on the market. And none of today’s humanoid robots check these boxes.

How are you adapting to this situation?

First, since no one is building a robot that meets these industrial requirements, we decided to design our own with an in-house team of experts from the humanoid and automotive sectors. Second, since our AI model is already up and running, we don’t want to wait until our own robot is ready. For now, we use industrial arms with seven degrees of freedom, already deployed in factories and much more reliable than any humanoid available for purchase today.

What is your business model?

Customers don’t necessarily want to buy a robot or AI model: they want their problems solved reliably and at an acceptable price. This is why we favor a Robot-as-a-Service (RaaS) model. They submit a set of use cases to us, we ship the robots, we update the models and we manage the execution of the tasks. They pay an annual fee, and we relieve them of these missions. This saves them from heavy capital investments or worrying about maintenance.

You mentioned deployments in logistics, but how far are we from robots capable of acting autonomously at home?

We will deploy our robots in factories this year. In terms of economic impact, improving the way things are made and moved in production and logistics has a far greater impact on human prosperity than a domestic robot. If you unlock a 10x efficiency gain in this sector, the benefits ripple across society. However, the future belongs to domestic robots, and our model could very well be used for them. But the home is a much more complex environment: every home is different, every task is different. Additionally, hardware maintenance is a major obstacle. In a warehouse, you have technicians on site; at home, you do not. And there are safety issues.

You recently raised $450 million. How will this contribute to the development of the company?

Almost everything we do costs a lot of money. Pre-training models from scratch requires tens of millions of dollars in computing resources, GPUs and so on. The second area of expenditure is the development of our humanoid robot. And the third is structuring the business to scale. We are not here to provide simple demonstrations on YouTube. We want to see tens of thousands of robots deployed in real factories. This requires deployment teams, support teams and supply chain management. We will therefore use this capital to secure our computing power, develop our robot and attract the best global talent in order to build a sustainable business.

Jake Thompson
Growing up in Seattle, I've always been intrigued by the ever-evolving digital landscape and its impacts on our world. With a background in computer science and business from MIT, I've spent the last decade working with tech companies and writing about technological advancements. I'm passionate about uncovering how innovation and digitalization are reshaping industries, and I feel privileged to share these insights through MeshedSociety.com.