Veo 3 vs Sora: Comparison of video generation models

As of July 2025, the market still regards Veo and Sora as the two most accomplished video generation solutions.

Google has arrived in a market where it was not expected: generative AI video. Presented at Google I/O 2025, Veo 3 delivers photorealistic renderings of a quality not reached before. Enough to compete with OpenAI's Sora? Not so sure. Sora's aesthetic still has its charm on prompts that call for a more creative touch. Here is a comparison.

Veo 3 in 4K, Sora in 1080p

Feature                      Veo 3                       Sora
Resolution                   4K, 1080p                   1080p
Max. duration                8 s in 4K, 2 min+ in HD     20 s
Audio generation             Yes                         No
Supported input modalities   Text, image                 Text, image
Watermark                    SynthID                     C2PA
API                          Yes                         No

Veo 3 takes a step ahead on the technical side: the model can generate 4K video, whereas Sora is limited to 1080p. Veo 3 generates at most 8 seconds in 4K, but more than 2 minutes in 1080p for longer productions. Google also builds in native audio generation, so complete videos come with a synchronized soundtrack. By comparison, Sora produces silent videos of at most 20 seconds, forcing creators to add the audio in post-production.
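To make these limits easier to check against a given project, here is a minimal Python sketch that encodes the figures quoted above. The names `ModelSpec`, `SPECS` and `candidates` are illustrative assumptions, not part of any Google or OpenAI tooling, and the "more than 2 minutes" figure for Veo 3 in 1080p is stored as a conservative 120 seconds.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    name: str
    # Maximum clip length in seconds, keyed by vertical resolution
    # (2160 = 4K, 1080 = 1080p). Only resolutions quoted in the article appear.
    max_seconds: dict
    audio: bool
    api: bool


# Figures taken from the comparison above. 120 s is a lower bound standing in
# for "more than 2 minutes" (assumption made for this sketch).
SPECS = [
    ModelSpec("Veo 3", {2160: 8, 1080: 120}, audio=True, api=True),
    ModelSpec("Sora", {1080: 20}, audio=False, api=False),
]


def candidates(height, seconds, need_audio=False, need_api=False):
    """Return the names of models whose quoted limits cover the request."""
    result = []
    for spec in SPECS:
        limit = spec.max_seconds.get(height)
        if limit is None or seconds > limit:
            continue  # resolution not listed, or clip too long
        if need_audio and not spec.audio:
            continue
        if need_api and not spec.api:
            continue
        result.append(spec.name)
    return result
```

For example, `candidates(1080, 15)` returns both models, while `candidates(1080, 15, need_audio=True)` or `candidates(2160, 8)` leaves only Veo 3.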

Premium pricing at Google

The two models follow radically different pricing strategies. Google offers Veo 3 from 20 €/month with the AI Pro plan, which gives access to Veo 3 Fast, limited to 1080p and without audio. To unlock the model's full power, with audio generation and 4K resolution, you need the AI Ultra plan at 250 €/month. For its part, OpenAI lets ChatGPT Plus subscribers generate 50 videos for 20 €/month, while ChatGPT Pro at 200 €/month allows unlimited use. Veo 3 therefore charges a premium for its technical superiority.

Google offers an official API via Vertex AI that lets developers integrate Veo 3 directly into their applications, at 0.75 € per second with audio. OpenAI still does not offer an API for Sora, which drastically limits access for developers.
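To put these prices side by side, here is a rough Python sketch of the arithmetic. It uses only the figures quoted in this article. The function names are illustrative, and reading the ChatGPT Plus allowance as 50 videos per month is an assumption.

```python
# Prices quoted in the article, in euros.
API_EUR_PER_SECOND = 0.75          # Veo 3 via Vertex AI, with audio
AI_ULTRA_EUR_PER_MONTH = 250.0     # Google AI Ultra plan
CHATGPT_PLUS_EUR_PER_MONTH = 20.0  # ChatGPT Plus
CHATGPT_PLUS_VIDEOS = 50           # assumed to be a monthly allowance


def veo_api_cost(seconds):
    """Cost in euros of generating `seconds` of video through the API."""
    return seconds * API_EUR_PER_SECOND


def api_seconds_for(budget_eur):
    """Seconds of API video that a given budget in euros would buy."""
    return budget_eur / API_EUR_PER_SECOND


def plus_cost_per_video():
    """Average cost per Sora video if the full Plus allowance is used."""
    return CHATGPT_PLUS_EUR_PER_MONTH / CHATGPT_PLUS_VIDEOS


print(veo_api_cost(8))                          # one 8-second clip
print(api_seconds_for(AI_ULTRA_EUR_PER_MONTH))  # API seconds for the Ultra fee
print(plus_cost_per_video())                    # per-video cost on Plus
```

On these figures, a single 8-second clip through the API costs 6 €, the monthly Ultra fee corresponds to roughly 333 seconds of API video, and a Sora video on the Plus plan comes to 0.40 € if all 50 are used.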

In this comparison we only try to compare the quality of the generated videos, their overall physical consistency and their fidelity to the prompt. It is indeed impossible to compare audio generation and lip synchronization, since Sora cannot produce them. To generate optimal prompts, we use a GPT assistant: it takes the description of the expected scene as input and outputs an optimized prompt.

An astronaut on a horse

We start with a fairly complex prompt that also calls on notions of physics. We ask the AI to generate a video of an astronaut riding a galloping horse in the desert.
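The prompts used in this comparison share a recognizable layout: a scene description, then labeled lines for camera movement, lighting and style, then a line of keywords. The Python sketch below assembles a prompt in that layout. It is not the GPT assistant used for the article, whose instructions are not published, and `build_prompt` is a hypothetical helper.

```python
def build_prompt(scene, camera=None, lighting=None, style=None, keywords=()):
    """Assemble a video prompt: scene text, labeled lines, then keywords.

    Hypothetical helper mirroring the layout of the prompts in this article.
    Sections left empty are simply skipped.
    """
    parts = [scene.strip()]
    labeled = (
        ("Camera movement", camera),
        ("Lighting", lighting),
        ("Style", style),
    )
    for label, value in labeled:
        if value:
            parts.append(f"{label}: {value.strip()}")
    if keywords:
        # Keywords go on one comma-separated line at the end.
        parts.append(", ".join(k.strip() for k in keywords))
    return "\n".join(parts)


prompt = build_prompt(
    "A single droplet of water falling into a glass of water in extreme slow motion.",
    camera="static close-up shot",
    lighting="high-key lighting with soft shadows",
    style="hyper-realistic, macro photography aesthetic",
    keywords=["cinematic", "macro", "slow motion"],
)
print(prompt)
```

Keeping the sections separate makes it easy to change one variable, such as the camera movement, while holding the rest of the prompt constant across both models.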

Prompt:

A silver mylar-clad astronaut riding a galloping horse through a vast desert at sunset.
The astronaut's suit is highly reflective, catching warm tones from the golden hour light. The horse is muscular and in full sprint, kicking up clouds of sand with each powerful stride. The desert environment is arid and expansive, featuring rolling dunes and distant rocky outcrops under a dramatic sky tinged with orange and purple. The scene has a cinematic depth of field, with soft foreground blur and crisp focus on the astronaut and horse.
Camera movement: smooth tracking shot from the side, slightly low angle to emphasize speed and heroism, dust particles trailing in slow motion.
Lighting: natural golden hour with high contrast and lens flares from the sun.
Style: cinematic, science-fiction surrealism, high-definition, ultra-detailed textures.
cinematic, high quality, ultra-detailed, golden hour lighting, desert landscape, slow motion, surreal, science fiction, mylar suit, heroic action, dramatic scenery.

Video generated by Sora:

Video generated by Gemini:

The video generated by Sora is graphically stylized, but the sand kicked up by the horse is not very convincing, physically speaking. Likewise, the appearance of a second rider (not requested in the prompt) is problematic. Veo 3 generates a video that matches the prompt perfectly, with a high degree of realism, though it is slightly less visually stylized.

A macro shot of a water drop in slow motion

We then ask the models to generate a macro shot of a drop of water falling into a glass of water in slow motion. This prompt precisely measures a model's ability to handle a physically coherent scene.

Video generated by Sora:

Video generated by Gemini:

Prompt:

A single droplet of water falling into a glass of water in extreme slow motion.

The scene is tightly framed, macro-level, focusing on the precise moment the droplet makes contact with the water surface. The impact creates concentric ripples and a high crown-shaped splash, with individual droplets suspended mid-air. The glass is crystal clear, filled halfway, placed on a reflective surface. The background is softly blurred with a minimalistic, studio-like setup.

Camera movement: static close-up shot with ultra-smooth focus pull to capture depth.

Lighting: high-key lighting with soft shadows and subtle highlights on the splash, enhancing the transparency and clarity of the water.

Style: hyper-realistic, macro photography aesthetic, slow-motion physics study. cinematic, high quality, ultra-detailed, macro, slow motion, transparent water, ripple effect, splash crown, realistic lighting, physics simulation.

The result produced by Sora is physically disappointing. The model does not seem to understand the law of universal gravitation: the drop is held in the air by an invisible force before falling into a liquid that looks more like molten tin than water. Veo 3, on the other hand, delivers a physically realistic video. Two minor problems, however, spoil the final rendering. First, the video has a prompt adherence problem: several drops are generated, not just one as requested. Second, and more problematic, the focus seems to be set in the wrong place, so the scene is slightly blurry.

A forward tracking shot combined with a time-lapse

To raise the difficulty, we now ask Veo 3 and Sora to generate a complex shot that nevertheless appears in many films: a forward tracking shot on a stationary character, with the surroundings sped up.

Video generated by Sora:

Video generated by Gemini:

Prompt:

Forward dolly shot of a 30-year-old woman standing still at the center of Times Square during daytime. The environment around her is in fast motion: people walking, running, cycling in all directions, creating a time-lapse effect. The woman remains calm and sharply in focus, wearing modern urban clothing. Neon signs, giant billboards, and taxis contribute to the dynamic atmosphere. The lighting is natural daylight with strong shadows and reflections from the glass surfaces. Cinematic depth of field with bokeh in the background. High-resolution image with realistic textures, detailed crowd movement, and smooth camera motion. cinematic, time-lapse background, realistic crowd dynamics, shallow depth of field, high quality, forward dolly, hyperrealism, urban scene, motion blur on background, dynamic environment

The video generated by Sora is quite close to the original prompt. However, several elements make the scene rather incoherent. The main character seems to fidget rather than stand still, as if the AI had not managed to keep her fixed amid the moving crowd. In addition, the crowd and the vehicles in the background all move in the same direction, whereas the prompt asked for movement in all directions. The scene produced by Veo 3 is, once again, the most physically consistent and the most realistic, and it respects the prompt perfectly. Its colors are, however, less rich than Sora's.

A complex scene

For the fourth and last video of this comparison, we ask the models to generate a video in three sequences. The goal? Push things as far as possible to test prompt adherence. The AI has to generate a video in which a man taps a small spoon against a bowl of cat pâté in a modern kitchen, then the cat dashes down the hallway before pouncing on its bowl to eat.

Prompt:

Scene 1: Inside a sleek, modern kitchen with minimalist white cabinetry, matte black fixtures, and soft natural lighting, a man gently taps a small silver spoon against a ceramic bowl filled with cat pâté. The kitchen is quiet and pristine, with light bouncing softly off the polished surfaces. The camera is at countertop level, focusing on the bowl and the man's hand, creating a shallow depth of field with the background slightly blurred.

Scene 2: From a connected hallway in the same contemporary home—same flooring, consistent lighting style—a domestic short-haired tabby cat, with distinctive gray and white markings, suddenly darts into frame. The camera uses a low tracking shot to follow the cat dynamically as it runs toward the kitchen, preserving spatial continuity and orientation.

Scene 3: The same cat eagerly pounces on its bowl of pâté in the kitchen. The camera cuts to a close-up from the side, showing detailed textures of the cat's fur, the movement of its head as it eats, and the sheen of the pâté. Lighting remains soft and natural, with sunlight from a nearby window casting warm highlights across the scene. The environment and cat remain exactly the same as in previous scenes for narrative and visual continuity.

Technical specifications: cinematic lighting, consistent environment, 4K resolution, photorealistic rendering, shallow depth of field, smooth realistic motion

Motion and dynamics: gentle tapping and static framing in Scene 1, fast-paced camera tracking in Scene 2, close-up energetic detail in Scene 3

cinematic, photorealistic, high quality, consistent character design, detailed environment, dynamic camera, realistic lighting, smooth animation

Video generated by Sora:

Video generated by Gemini:

On this even more complex video, Sora derails. The OpenAI model fails to render the first scene (a man tapping a small spoon against a bowl). The cat in the hallway is then, surprisingly, able to pass through walls, and in the third part it has real difficulty eating its pâté. Overall, the scene is not very consistent and not usable. For its part, Veo 3 renders the first scene as a single still frame, for no apparent reason. The rest of the scene is then quite consistent.

Veo 3, a more versatile model

Veo 3 stands out as the most accomplished video generator. Google has succeeded in developing a model that excels at respecting physical laws and narrative coherence, and Veo 3 shows a better grasp of physical reality.

Sora nevertheless retains a significant asset: its particularly neat and attractive aesthetic. The OpenAI model offers a more polished visual rendering, which can appeal for simple creations that favor visual impact over physical accuracy. For professionals seeking to create realistic and technically flawless video content, Veo 3 stands out as the obvious choice, despite its higher prices.

Jake Thompson
Growing up in Seattle, I've always been intrigued by the ever-evolving digital landscape and its impacts on our world. With a background in computer science and business from MIT, I've spent the last decade working with tech companies and writing about technological advancements. I'm passionate about uncovering how innovation and digitalization are reshaping industries, and I feel privileged to share these insights through MeshedSociety.com.