As part of a performance test of AI video generation models, we tried to reproduce "The Sprinkler Sprinkled", the first fictional film in history.
Is it already possible to reproduce with artificial intelligence the first fiction film in history, Louis Lumière's famous "The Sprinkler Sprinkled", and thus demonstrate that a new cinematic era has begun? Intrigued by the question, the JDN got down to work and spared no effort, testing in turn OpenAI's Sora, Runway's Gen-4, Google's Veo 2 and Kuaishou's Kling!
A 45-second film in 4 sequences
Directed by Louis Lumière and unveiled in 1895, "The Sprinkler Sprinkled" is a comic film of around 45 seconds.
To try to reproduce the film, we cut it into 4 main sequences, which serve as prompts for all the tests that follow (a short sketch comes after the list):
- The gardener waters his garden with a hose.
- The boy cuts off the water by stepping on the hose, which leads the gardener to inspect the nozzle by pointing it towards his face.
- The boy then takes his foot off the hose; the water comes back at full pressure and sprays the gardener in the face.
- The gardener tries to catch the boy to give him a spanking.
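Since these four descriptions are reused as prompts throughout the tests below, they can be kept as plain text; a minimal Python sketch, where the wording and variable names are ours rather than the exact prompts used:

```python
# The four sequences of "The Sprinkler Sprinkled", kept as reusable prompt text.
# Wording is illustrative; each test below pairs one description with a reference image.
SEQUENCES = [
    "A gardener waters his garden with a hose, 1890s countryside, static camera.",
    "A boy sneaks up and steps on the hose; the water stops and the puzzled "
    "gardener inspects the nozzle, pointing it towards his face.",
    "The boy lifts his foot; the water comes back at full pressure and sprays "
    "the gardener in the face.",
    "The drenched gardener chases the boy across the garden and catches him.",
]

for i, prompt in enumerate(SEQUENCES, start=1):
    print(f"Sequence {i}: {prompt}")
```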
We first try a classic approach: using a text-to-video model to produce each sequence individually. To do this, we use Sora's storyboard feature (OpenAI), which generates a video from several distinct sequences described in natural language. The result is completely off the mark and bears no resemblance to the expected video.
We therefore change tack and use Sora's image-to-video capabilities, the goal being to give the model a visual starting point for more coherent results. We use GPT-4o (OpenAI) to generate a photorealistic (and therefore still) image of each scene. Here too the results are incoherent: for example, the model fails to generate the boy with his foot on the hose. We then resort to capturing frames from the original film (yes, it's a bit of cheating), frames that we ask Gemini 2.0 Flash Exp (Google) to colorize. Gemini performs perfectly and gives us beautiful images, faithful to what a real color capture could have produced.
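For readers who want to script the colorization step, here is a minimal sketch using the google-genai Python SDK and the gemini-2.0-flash-exp model's native image output; the file names and the prompt wording are ours, not those used in the test.

```python
# Minimal sketch: colorize a captured frame with Gemini 2.0 Flash Exp
# (google-genai SDK; file names and prompt wording are illustrative).
from io import BytesIO

from google import genai
from google.genai import types
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")

frame = Image.open("sequence_1_bw.png")  # black-and-white capture from the 1895 film

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[
        "Colorize this black-and-white film frame as a realistic color photograph, "
        "keeping the composition, clothing and garden setting unchanged.",
        frame,
    ],
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

# Save the first image part returned by the model.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save("sequence_1_color.png")
        break
```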
New test with Sora, then Runway
Once the four colorized images of the four sequences are ready, we add them to the Sora storyboard to generate the video again. The overall style is already closer to the original film, but none of the requested sequences is properly rendered in the final video. Disappointment.
We persevere. Exit OpenAI's Sora, enter Runway's Gen-4, an image-to-video model. We start the process over, trying to generate the 4 sequences using our 4 images as the starting point of each one. The result is slightly more relevant (and even that is a stretch) but still falls far short of our expectations, as illustrated by the second sequence, where the boy is supposed to step on the hose.
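For those who prefer to reproduce this step outside the web interface, here is a minimal sketch of an image-to-video call with the runwayml Python SDK; the gen4_turbo model id, the parameters and the prompt text are our assumptions, not the exact settings used in the test.

```python
# Minimal sketch: image-to-video with Runway's Python SDK (runwayml package).
# Model id, parameters and prompt wording are assumptions for illustration.
import base64
import time

from runwayml import RunwayML

client = RunwayML()  # reads RUNWAYML_API_SECRET from the environment

with open("sequence_2_color.png", "rb") as f:
    data_uri = "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

task = client.image_to_video.create(
    model="gen4_turbo",
    prompt_image=data_uri,
    prompt_text="A boy steps on the garden hose; the water stops and the gardener "
                "inspects the nozzle. 1890s garden, static camera.",
    ratio="1280:720",
    duration=5,
)

# Generation is asynchronous: poll the task until it succeeds or fails.
while True:
    task = client.tasks.retrieve(task.id)
    if task.status in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(10)

if task.status == "SUCCEEDED":
    print(task.output)  # list of URLs to the generated video
```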
For our third attempt, we change AI again, this time turning to Veo 2, the latest video generation model from Google DeepMind. Once again, we submit our 4 images as starting frames and describe each of the sequences. Surprise: Veo 2 manages to generate the sequences quite faithfully! The end result is certainly far from perfect, but the spatio-temporal consistency is credible. The main problem is above all the consistency of the two subjects on screen: their faces and outfits change from one sequence to the next…
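This step can also be scripted; here is a minimal sketch with the google-genai SDK, where the veo-2.0-generate-001 model id, the duration, the prompt text and the file names are our assumptions for illustration.

```python
# Minimal sketch: image-to-video with Veo 2 via the google-genai SDK.
# Model id, duration and prompt wording are assumptions for illustration.
import time

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("sequence_1_color.png", "rb") as f:
    image_bytes = f.read()

operation = client.models.generate_videos(
    model="veo-2.0-generate-001",
    prompt="A gardener in a straw hat waters his garden with a hose, "
           "1890s countryside, static camera.",
    image=types.Image(image_bytes=image_bytes, mime_type="image/png"),
    config=types.GenerateVideosConfig(number_of_videos=1, duration_seconds=8),
)

# Video generation is asynchronous: poll the long-running operation.
while not operation.done:
    time.sleep(20)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0].video
client.files.download(file=video)
video.save("sequence_1_veo2.mp4")
```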
Undeterred and not yet desperate, we try a radically different approach… and a fourth model. First, we ask GPT-4o to rework the images generated by Gemini to make them smoother and more detailed. Then we have GPT-4o write the description of each of the 4 sequences. Finally, this time, we submit the images and descriptions to the Kling 2.6 model from China's Kuaishou.
Prompt used to obtain the description of the sequences from GPT-4o:
Act as a prompting expert specialized in optimizing requests for text-to-video models. Your mission is to transform the natural-language description I provide into a perfectly structured and highly effective prompt for Kling AI's text-to-video model, in English only. First, meticulously analyze the reference image I will share, identifying the key visual elements, the composition, the lighting, the mood, the perspective, the potential movements and any significant visual detail. Then incorporate these visual observations into your wording of the prompt, using precise and evocative vocabulary that will allow the model to generate a video faithful to the original intent. Structure your prompt by starting with the central elements of the scene, then detailing the environment, the atmosphere, and finally the desired technical aspects (framing, visual style, special effects). Include relevant style modifiers and temporal cues if necessary. Your final prompt must be concise but complete, with an optimal length of 75 to 125 words, built from direct declarative sentences, without using negative phrasing that could confuse the algorithm. Present your final result as a text in quotation marks, ready to be copied and used directly with Kling AI.
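This meta-prompt can be sent to GPT-4o together with the reference image through the standard OpenAI chat completions API; a minimal sketch follows, in which the file name and the one-line scene description are ours, not the exact inputs used in the test.

```python
# Minimal sketch: ask GPT-4o to turn a sequence description plus a reference image
# into a Kling-ready prompt (OpenAI chat completions API with image input).
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

META_PROMPT = "Act as a prompting expert specialized in optimizing requests for text-to-video models. ..."  # full text quoted above

with open("sequence_2_color.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": META_PROMPT},
                {"type": "text", "text": "Sequence 2: the boy steps on the hose and the water stops flowing."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)

kling_prompt = response.choices[0].message.content
print(kling_prompt)  # quoted prompt, ready to paste into Kling
```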
So, the verdict? The result is more aesthetically pleasing, but the scenario is less faithfully followed. Likewise, the whole setting changes from one sequence to the next, as does the look of the main characters. The result is different, more photorealistic, but with many artifacts that still limit its overall credibility.
After a good few hours devoted to the task and some brain-racking, we have our answer: no, it is not yet possible to use consumer AI tools to make films, including the very first of them. But video AI, too, is only in its infancy…