Google's Gemini AI model can now generate images that are more faithful to the prompt. Above all, it lets you edit existing images with great precision.
It was a long-awaited feature. Reserved for Google's trusted testers since December, native image generation in Gemini is now available to all Google AI Studio users. Native handling of images as input, and now as output, lets the model produce images more faithful to the user's original request and opens the door to advanced image editing. Our first tests are quite conclusive.
Native generation: what's the difference?
The integration of native image generation into the Gemini 2.0 Flash model represents a fairly major technical shift. Unlike two-model systems, where an LLM generates a textual description that is then passed to a separate diffusion model (such as the ChatGPT/DALL-E pairing or previous Gemini/Imagen versions), Gemini 2.0 Flash uses a unified transformer architecture capable of generating visual and textual tokens directly. This unified approach allows a better grasp of context and nuance, because there is no longer an interpretation or translation step between two distinct models. In practice, this translates into greater fidelity to prompts and better overall coherence in the generated images. The native architecture also offers advantages for image editing, allowing precise incremental changes without having to regenerate the whole image.
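To make "native" output concrete at the API level, here is a minimal sketch of a direct REST call to the Gemini API as exposed through Google AI Studio. The endpoint path, the model identifier (`gemini-2.0-flash-exp`) and the `responseModalities` field reflect the public v1beta API at the time of writing and should be treated as assumptions; they may change as the feature leaves the experimental stage.

```python
import json
import os
import urllib.request

# Assumed v1beta endpoint and experimental model name.
API_URL = ("https://generativelanguage.googleapis.com/v1beta/"
           "models/gemini-2.0-flash-exp:generateContent")

def build_request(prompt: str) -> dict:
    """Build a generateContent body that asks the same model for both
    text and image tokens -- the hallmark of native generation."""
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        # Requesting both modalities triggers native image output.
        "generationConfig": {"responseModalities": ["TEXT", "IMAGE"]},
    }

body = build_request("A red rescue boat crossing the Seine at night")

# Only hit the network when an API key is actually configured.
api_key = os.environ.get("GEMINI_API_KEY")
if api_key:
    req = urllib.request.Request(
        f"{API_URL}?key={api_key}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The response interleaves text parts and base64-encoded image parts.
        print(json.load(resp))
```

With a two-model pipeline, the image request would instead be forwarded to a separate endpoint; here a single call returns both modalities.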
With this deployment, Google for once takes a step ahead of OpenAI, whose GPT-4o model had demonstrated similar native image capabilities in a May 2024 demo, but without ever deploying them publicly.
In terms of image generation, however, Gemini does not outdo the Flux models or even Midjourney, despite being more recent. To obtain coherent, realistic results, Gemini requires much more detailed prompts, probably because no system automatically rewrites the prompt in the background, unlike DALL-E in ChatGPT or Imagen in previous versions of Gemini. A slightly more sophisticated architecture, or simple fine-tuning, would certainly let the model gain relevance and efficiency, reducing the complexity demanded of user instructions.
Prompt:
A red rescue boat crosses the Seine at high speed in Paris. The scene takes place at night, and flashing blue lights illuminate part of the river. In the background, fireworks burst from the Eiffel Tower. A cinematic action scene in the style of Michael Bay.
Not all the details specified in the prompt appear faithfully in the generated image. More surprising, the model apparently applied an artificial blur to some images, such as the one shown above, compromising the expected clarity and precision of the final render.
But the real strength of Gemini 2.0 Flash Experimental, to give it its full name, is revealed in image editing. In our tests, the model excels at modifying, through successive iterations, most of the image styles submitted to it. Its particularity lies in its approach of targeted retouching rather than wholesale transformation of the visual, preserving the fundamental identity of the original image while applying the requested modifications.
It is possible, for example, to colorize black-and-white photographs. Gemini performs the operation perfectly, without altering the elements present in the image.
The model also offers other features, such as changing the color of a garment or even replacing it entirely with another. In these situations, the AI executes the request with remarkable precision, retaining the authenticity of the image while perfectly incorporating the required changes.
Another very useful use case: it is possible to merge several images to create new ones, for example a main subject and a background. Note that Gemini can also generate an entirely new background.
Prompt:
Superimpose the robot from image 2 onto the setting of image 1, taking care to respect the proportions and to integrate the robot harmoniously into the environment. Make sure the robot's lighting and shadows match those of the setting for a realistic result.
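The editing and compositing workflows described above boil down to the same request shape: one or more input images plus a textual instruction, with image output requested again. The sketch below builds such a payload; the field names (`inlineData`, `mimeType`, `responseModalities`) follow the v1beta REST API and, like the PNG mime type, should be treated as assumptions.

```python
import base64

def build_edit_request(images: list[bytes], instruction: str) -> dict:
    """Build a generateContent body pairing input image(s) with an edit
    or compositing instruction: one image for a targeted retouch, two
    for a subject/background merge as in the prompt above."""
    parts = [
        {"inlineData": {
            "mimeType": "image/png",  # assumed format of the source images
            "data": base64.b64encode(img).decode("ascii"),
        }}
        for img in images
    ]
    parts.append({"text": instruction})
    return {
        "contents": [{"parts": parts}],
        # Image output must still be requested explicitly.
        "generationConfig": {"responseModalities": ["TEXT", "IMAGE"]},
    }

# Example: merge a background (image 1) and a subject (image 2).
body = build_edit_request(
    [b"<png bytes of the setting>", b"<png bytes of the robot>"],
    "Superimpose the robot from image 2 onto the setting of image 1.",
)
```

Because the model edits incrementally, the returned image can be fed back in as the sole input of the next request, which matches the iterative retouching behavior observed in our tests.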
Finally, a last use case: it is possible to change the point of view within an image. Gemini manages to generate a different perspective that is fictitious yet surprisingly faithful to reality.
A new paradigm for image editing?
The image-editing possibilities offered by Gemini go far beyond the few use cases tested here. The model's unified architecture enables visual manipulations that still seem almost magical: targeted retouching, preservation of the image's original identity, integration of nearly imperceptible modifications. The boundary between original image and generated image becomes increasingly blurred, revealing an almost unlimited potential for visual transformation.
This first version of Gemini offers only a timid preview of what is coming. The next versions will probably upend image editing entirely, making current tools obsolete. We can safely bet that platforms like Canva will quickly integrate this type of technology. OpenAI could also soon unveil similar capabilities in ChatGPT.