Kuaishou unveils Kling 2.0, a promising new video generation model offering, on paper, unprecedented realism.
According to its creator, its primary mission is to let everyone tell “beautiful stories” with AI. Kuaishou, ByteDance's main rival in China, presented a new flagship video generation model on April 15. The Kling team, in charge of the group's generative AI models, says that Kling AI 2.0 Master addresses the most common problems faced by developers who want to use generative AI for video: poor prompt adherence and, more generally, a lack of realism.
Kling AI 2.0 Master promises genuine realism
This is the main novelty of Kling AI 2.0 Master. According to its creators, the Kuaishou AI offers some of the best prompt adherence on the market. Concretely, in text-to-video mode, the model is said to follow the initial instructions very faithfully, whether facial expressions, camera movements or sequences of actions. Kling also claims that its model produces more fluid and natural human movements on screen. Finally, the AI is said to deliver richer details and overall better-quality cinematography. On the image-to-video side, Kling AI 2.0 Master is also claimed to outperform Veo 2 and Runway Gen-4, reproducing the overall style of the source image more faithfully.
From a technical point of view, Kling AI introduces a new concept called multimodal visual language (MVL), which allows more than plain text to be fed into the video generation process. Users can now combine textual instructions with image references, video clips, and even camera-movement indications. The model incorporates a multimodal chain of thought that analyzes the different inputs (text, image, style reference) simultaneously to generate a video with maximum semantic adherence.
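To make the idea more concrete, here is a minimal, purely illustrative Python sketch of how such a multimodal input bundle could be organized. The field names and the generate_video function are hypothetical and do not correspond to Kling's actual API; they only show the kind of combined inputs MVL is meant to handle.

    # Hypothetical illustration of an MVL-style multimodal input bundle.
    # Field names and generate_video() are invented for this example; they are not Kling's real API.
    mvl_request = {
        "text": "A Persian cat walks slowly across a stone bench, tail raised.",
        "image_reference": "garden_cat.png",         # visual identity / style anchor
        "video_clip_reference": "gait_example.mp4",  # motion reference
        "camera_motion": "slow dolly-in, low angle", # camera-movement indication
    }

    def generate_video(request: dict) -> str:
        # Placeholder: a real system would fuse all inputs before generating frames.
        return f"video generated from {len(request)} combined inputs"

    print(generate_video(mvl_request))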
How do you prompt Kling effectively?
To prompt its model, Kling recommends a precise structure: first the main subject, then the movements, then the general description of the scene, and finally any cinematographic details (light, atmosphere, focal length, etc.). The company advises being both descriptive and concise, providing enough detail to guide the AI without drowning it in information. For example, instead of simply writing “a cat in a garden”, prefer “a Persian cat with blue eyes, sitting elegantly on a stone bench, in a green English garden with rose bushes in the background, lit by soft, diffuse late-afternoon light”. A minimal sketch of this structure follows below.
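As an illustration, this small Python helper, which is not an official Kling tool, assembles a text-to-video prompt in the recommended order: subject, movements, scene description, then cinematographic details.

    # Illustrative helper following Kling's recommended prompt order:
    # subject -> movements -> scene description -> cinematographic details.
    # Not an official Kling utility, just one way to keep prompts structured.
    def build_prompt(subject: str, movements: str, scene: str, cinema: str = "") -> str:
        parts = [subject, movements, scene]
        if cinema:
            parts.append(cinema)
        return ", ".join(parts) + "."

    print(build_prompt(
        subject="A Persian cat with blue eyes",
        movements="sitting elegantly on a stone bench",
        scene="in a green English garden with rose bushes in the background",
        cinema="lit by soft, diffuse late-afternoon light",
    ))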
To prompt with a reference image, Kling recommends starting with the subject, then the movements, and finally the description of the background, the subject and the movements being the most important elements. Kling stresses that the key is to clearly identify the subject to animate, so that the model understands which element it should animate as a priority. In general, the group recommends favoring the image-to-video mode to obtain more coherent and realistic results (a short sketch follows below).
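Again purely as an illustration, and assuming the source image is sent alongside the text (the reference_image field below is hypothetical, not Kling's actual API), an image-to-video prompt can follow the recommended order: subject, movements, background.

    # Image-to-video illustration: subject -> movements -> background description.
    # The payload structure and "reference_image" field are hypothetical, not Kling's API.
    i2v_prompt = (
        "The cat in the center of the table "                        # subject
        "knocks over the water glass on the left with its paw, "     # movements
        "the water then spills across the wooden living-room table." # background
    )
    i2v_request = {"prompt": i2v_prompt, "reference_image": "cat_on_table.png"}
    print(i2v_request)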
Kling: the JDN test
To assess the general capabilities of Kling AI 2.0, we test the model in text-to-video mode and in image-to-video mode, with moderately complex prompts and human subjects in both cases.
For the first test, we ask the AI to generate the four horsemen of the Apocalypse in a lunar setting.
Prompt:
"The Four Horsemen of the Apocalypse, silhouetted against the stark lunar landscape, gallop across the cratered surface of the moon. Their ethereal steeds kick up swirls of moon dust that hang suspended in the low gravity, creating haunting trails behind each rider. The barren, desolate moonscape stretches endlessly beneath a pitch-black sky filled with distant stars and the looming blue Earth. Shot in cinematic 4K with a 24mm wide-angle lens at f/8, featuring dramatic high-contrast lighting with sharp shadows, cold blue undertones, and an otherworldly atmospheric haze."
The result is generally disappointing. The model only partially understands our prompt: only two of the four riders are generated. While the dust kicked up by the horses is well rendered, the scene is not realistic overall. Likewise, the moon itself appears in the sky, even though the scene is supposed to take place on its surface. A possible lack of video data on the lunar environment in the training dataset may partly explain this last problem.
For the second test, we ask Kling 2.0 to generate a video of a helicopter landing on an aircraft carrier deck in the middle of the ocean.
Prompt:
"Military helicopter hovering and descending onto an aircraft carrier deck, blades whipping against dense fog, massive waves crashing against the naval vessel's hull, storm-tossed ocean stretching to the horizon, dramatic low-angle perspective, cinematic lighting with high contrast shadows, moody atmosphere with desaturated blue-gray color palette, shot with anamorphic lens, 4K ultra-high definition."
The result here is much more coherent overall. The helicopter is faithfully reproduced and the general look of the scene is well respected. However, the helicopter's movement does not match our request: it appears to gain altitude above the ship, whereas the initial prompt asks for a gradual landing.
For our third test, we give the AI an image of a cat on a living-room table with two glasses of water (generated with GPT-4o). In a very simple prompt, we ask the model to generate a video of the cat knocking over the glass of water on the left with its paw, with the water then spilling onto the table.
Prompt:
"The cat in the center of the table knocks over the water glass on the left. The water then spills onto the table."
Disappointingly, Kling seems to ignore our main instruction. In the generated video, the cat does not knock over the glass of water but spits water instead! The scene is generally realistic but does not meet our initial expectations.
Finally, for our fourth and last test, we submit to Kling 2.0 a (fictional) image of Albert Einstein meeting Steve Jobs. The goal is for the two men to greet each other with a handshake. Will Kling manage it?
Prompt:
"Albert Einstein and Steve Jobs shaking hands firmly in a historical meeting. Dignified expressions on both faces as they share a moment of connection. Background features a subtle academic setting with bookshelves and scientific instruments slightly blurred."
This time, Kling clearly identifies our request. The generated video fully matches our instructions, and the result is quite credible overall. It therefore seems that the AI is better at identifying a subject's movements when that subject is well known and widely represented in general culture (and therefore, most likely, in its training data).
On paper, Kling AI 2.0 appears to be a promising but still perfectible model. Unlike Veo or OpenAI's Sora, it requires fairly advanced prompting skills and an iterative approach. Our tests show that the results vary significantly with the complexity and precision of the prompt, and that the first attempt is not always satisfactory. However, it is worth remembering that the results observed today are only the most primitive version of what we will see tomorrow: the AI video generation market is still only a few months old.