Sora 2 vs Veo 3.1: which is the best AI for video generation?

We compared the latest video generation models from OpenAI and Google DeepMind on the quality and accuracy of the videos they generate. The two turn out to be almost evenly matched, each with its own strengths.

Which of Google DeepMind or OpenAI offers the better video generator? The two American companies publish two of the best video generation models on the market (alongside Kling, according to benchmarks). OpenAI presented its Sora 2 model on September 30, while Google unveiled the 3.1 update of its Veo model on October 15. We tested both models on video generation in four different scenarios.

Sora 2, Veo 3.1: two models at the cutting edge of realism

Sora 2 now excels in realism. OpenAI researchers specifically trained the model to understand the world and the physical forces at work on Earth, in an attempt to produce videos as faithful to reality as possible. Version 2 adds synchronized sound effects and dialogue. All videos produced by Sora carry an invisible watermark and are also labeled in their C2PA metadata. The model can generate 4K (3840×2160) videos of up to 25 seconds.

For its part, with Veo 3.1, a latent diffusion model, Google DeepMind is banking on both realism and adherence to the prompt. In theory, it is one of the best models at following instructions. Like Sora 2, it can generate audio effects and dialogue, and with version 3.1 the audio is richer and more detailed. The videos produced carry the (invisible) SynthID watermark. Veo 3.1 produces 8-second videos at up to 1080p; an already generated video can then be extended by up to 7 seconds at a time, up to 20 times, for a maximum duration of 148 seconds.
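The 148-second ceiling follows directly from the figures quoted above: one 8-second base clip plus 20 extensions of 7 seconds each. As a quick check, here is a minimal sketch; the function name and parameters are our own illustration and are not part of any Veo API.

```python
def max_extended_duration(base_s: int, extension_s: int, max_extensions: int) -> int:
    """Total clip length, in seconds, once every allowed extension is applied.

    Hypothetical helper for illustration only; it encodes the arithmetic
    described in the article, not an official API.
    """
    return base_s + extension_s * max_extensions


# Figures quoted in the article for Veo 3.1: 8 s base, 7 s per extension, 20 extensions
veo_max = max_extended_duration(base_s=8, extension_s=7, max_extensions=20)
print(veo_max)  # 148
```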

The JDN test

In this article, we only test the text-to-video capabilities of Veo and Sora. We do not test the video editing features, which remain incomplete and inaccessible to most users. We also look at the sound design accompanying the videos, although this feature is still in its infancy and remains very imperfect.

A tracking shot

For the first test, we ask the AIs to generate a simple cinematic scene: a golden retriever running on a wet sandy beach at sunset. The camera follows the dog with a tracking shot. The goal is to closely compare the realism of the camera movement, the dog and the reflections on the water.

Prompt:

A realistic cinematic video of a golden retriever running along a wet sandy beach at sunset.

The dog leaves footprints in the sand and splashes water as waves reach the shore.

Reflections of the orange sky shimmer on the wet sand.

The camera follows with a smooth handheld tracking shot at low angle, 24 fps, natural lighting, 4K.

The video produced by Veo:

The video produced by Sora:

On instruction following, both models reproduce the requested elements quite faithfully. In terms of pure realism, however, the dog's movement in Sora's video is unnatural, as if the scene were filmed in slow motion. The overall texture of the materials (sand, sea) is not realistic either. Veo 3.1 delivers a completely realistic rendering that meets our expectations. On the quality of reflections, the two models are equal: the sun is reflected convincingly in the sea. On pure physics, Veo wins again: the splashes of the dog's paws hitting the water are perfectly reproduced. On sound, Veo wins hands down, as Sora produces a metallic, very unconvincing sound for the dog's panting.

A POV shot

For this second test, we have Veo and Sora generate a dashcam-style video showing a car driving on a foggy forest road at sunset. As the car rounds a gentle curve, a deer briefly appears on the side of the road, calmly watching before wandering off into the trees. The goal is to test how the models handle rapid movement and the arrival of a new element.

Prompt:

A cinematic dash-cam style video showing a car driving through a misty forest road at sunset.

The headlights illuminate the fog and reflect on the wet asphalt.

As the car rounds a gentle curve, a deer appears briefly on the roadside, watching calmly before walking away into the trees. The car slows slightly, maintaining a smooth motion as golden light filters through the mist. Realistic lighting, motion blur, dynamic reflections, 24 fps, 4K.

The video produced by Veo:

The video produced by Sora:

The result is particularly interesting. On instruction following, Sora has the advantage: OpenAI's AI follows around 80% of the instructions given. Veo, by contrast, takes some liberties by generating two deer, one of which runs towards the road rather than into the trees as requested. The car also stops on the shoulder, which we did not ask for. Google's AI therefore chooses to reinterpret the scene in its own way.

In terms of realism, the two models deliver convincing renderings, but with distinct approaches: Veo offers a perfectly clear road and a scene that stays consistent throughout, including the driving, while Sora opts for a more dramatic atmosphere, with a winding road and dense fog. Sora stands out for a subtle visual detail: the reflection of the dashboard on the windshield, which reinforces the credibility of the scene.

On the texture and rendering of the elements, Sora plays the realism card by letting us make out the deer in the mist rather than showing it in full. Veo, on the other hand, offers an image that is too clear and detailed, with almost artificial contours: the deer are so perfect that they look computer-generated.

Finally, the soundscape is generally credible in both scenes. Note, however, that in our tests Sora tends to add background music even when the prompt does not ask for it.

A zero-gravity shot

We then ask the AIs to generate a creative video of an astronaut brewing coffee in zero gravity, with coffee droplets floating in the air. The goal is to measure how realistic the models are and how well they understand the forces at play.

Prompt:

Inside a space station, an astronaut in a white spacesuit prepares coffee in zero gravity.

Coffee droplets float and merge in the air, tools and cups drift slowly around. Soft morning light enters through the window showing Earth below. Smooth camera movement, high-detail textures, realistic lighting and reflections, 24 fps, 4K.

The video produced by Veo:

The video produced by Sora:

Here again, Sora and Veo offer two very different visions. On a physical level, Sora's version is the more credible. The coffee appears as a cluster of compact particles, consistent with what one would observe in weightlessness. Veo shows an astronaut pouring coffee from a container. The problem: some of the liquid escapes from the glass, but most of it still seems to flow into it. Without gravity, however, the coffee cannot "flow" downwards; it should float freely in the air.

Sora gets around this difficulty by choosing a scene where the coffee is no longer in its container, thus avoiding any physical inconsistency.

Finally, on aesthetics, Veo has the edge. Its colors are harmonious and the photorealistic rendering could already be used as is. Sora, for its part, offers images that are physically credible rather than photorealistic, with a texture more reminiscent of computer-generated imagery.

An animation

For the last test, we ask the AIs to generate a Studio Ghibli-style anime featuring a little fox playing happily in freshly fallen snow. The aim here is to judge the models' capacity for stylization and coherence.

Prompt:

An animated short film in Studio Ghibli style of a small fox joyfully playing in freshly fallen snow.

Soft pastel colors, gentle snowflakes drifting, visible breath in cold air, painterly textures. Warm sunlight filters through trees as the fox jumps and rolls in the snow. Soft depth of field, 24 fps, 4K.

The video produced by Veo:

The video produced by Sora:

Aesthetically, the video generated by Sora is closer to the style of Studio Ghibli, with a soft and painterly atmosphere, while Veo is clearly in a register close to Pixar, focusing on fluid and impeccably rendered 3D. In both cases, the result can be used as is: the images are coherent, credible and physically realistic. The choice between the two therefore depends mainly on artistic taste. For our part, we lean towards Veo's version, which is more detailed, brighter and more expressive.

On the soundtrack, Veo has the edge. The ambient noises and footsteps in the snow are convincing, despite slightly intrusive background music. Sora, by contrast, offers a poorer sound environment, with (once again) metallic tones. Classic animation could well be the first audiovisual sector to adopt AI on a massive scale, as video generation has now reached a remarkable level of maturity on most models.

The road is still long

Despite spectacular advances in just a few months, there is still a long way to go before a video can be generated on the first try and used without retouching. Instruction following, although generally satisfactory, still has room for improvement: the AIs continue to take liberties with the prompt or miss certain details. As for physics, it remains a weakness.

In the end, the two models turn out to be almost equal, each excelling in its own area. Sora 2 prevails in overall realism, delivering convincing videos in which natural forces are better respected. Veo 3.1 shines with its stylistic finesse and its production-ready rendering: its images are more polished, more detailed and ready to be used immediately without major retouching. The choice between the two will therefore depend mainly on your creative priorities. Meanwhile, traditional animation could well be the big winner of this revolution. Both models deliver genuinely usable results, paving the way for massive and rapid adoption of AI.

Jake Thompson
Growing up in Seattle, I've always been intrigued by the ever-evolving digital landscape and its impacts on our world. With a background in computer science and business from MIT, I've spent the last decade working with tech companies and writing about technological advancements. I'm passionate about uncovering how innovation and digitalization are reshaping industries, and I feel privileged to share these insights through MeshedSociety.com.
