Imagen 2

Our most advanced text-to-image technology

Imagen 2 is our most advanced text-to-image diffusion technology, delivering high-quality, photorealistic outputs that are closely aligned and consistent with the user’s prompt. It can generate more lifelike images by using the natural distribution of its training data, instead of adopting a pre-programmed style.

Imagen 2’s powerful text-to-image technology is available in Gemini, Search Generative Experience and a Google Labs experiment called ImageFX. ImageFX offers an innovative interface that allows users to quickly explore alternative prompts and expand the bounds of their creativity.

The Google Arts and Culture team is also deploying our Imagen 2 technology in their Cultural Icons experiment, allowing users to explore, learn and test their cultural knowledge with the help of Google AI.

Developers and Cloud customers can access Imagen 2 via the Imagen API in Google Cloud Vertex AI.
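As a rough illustration of what API access can look like, the sketch below builds, but does not send, an HTTP request in the general shape used by Vertex AI prediction endpoints. The project ID, region, access token and the model version string `imagegeneration@005` are assumptions for illustration only; the request fields (`instances`, `prompt`, `parameters`, `sampleCount`) should be checked against the current Vertex AI documentation before use.

```python
import json
import urllib.request


def build_imagen_request(project_id, location, model_version, prompt,
                         access_token, sample_count=1):
    """Build (but do not send) a predict request for an image generation model.

    All argument values are supplied by the caller; nothing here is
    specific to a real project or a guaranteed model version.
    """
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/"
        f"publishers/google/models/{model_version}:predict"
    )
    body = {
        "instances": [{"prompt": prompt}],
        "parameters": {"sampleCount": sample_count},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


request = build_imagen_request(
    project_id="my-project",               # hypothetical project ID
    location="us-central1",                # assumed region
    model_version="imagegeneration@005",   # assumed model version string
    prompt="A long haired miniature dachshund on a couch",
    access_token="ACCESS_TOKEN",           # placeholder, not a real token
)
# Sending the request (for example with urllib.request.urlopen) is left out
# on purpose, because it requires real credentials and an enabled project.
```

In practice most developers would use an official client library rather than hand-built requests; the raw request is shown only because it makes the moving parts visible.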

A long-haired, black and brown miniature dachshund. Its head is resting on the top of a couch. — Prompt: A long haired miniature dachshund on a couch

Three orange and white jellyfish swimming against a black background — Prompt: A jellyfish on a black background

Painting of orange slices on a wooden board — Prompt: Small canvas oil painting of an orange on a chopping board. Light is passing through orange segments, casting an orange light across part of the chopping board. There is a blue and white cloth in the background. Caustics, bounce light, expressive brush strokes

Improved image-caption understanding

Text-to-image models learn to generate images that match a user’s prompt from details in their training datasets’ images and captions. But the quality of detail and accuracy in these pairings can vary widely for each image and caption.

To help create higher-quality and more accurate images that better align to a user’s prompt, we added further descriptions to image captions in Imagen 2’s training dataset, helping Imagen 2 learn different captioning styles and generalize to better understand a broad range of user prompts.

These enhanced image-caption pairings help Imagen 2 better understand the relationship between images and words — increasing its understanding of context and nuance.

Here are examples of Imagen 2’s prompt understanding:

Prompt: “Soft purl the streams, the birds renew their notes, And through the air their mingled music floats.” (A Hymn to the Evening by Phillis Wheatley)

Painting of green, blue waters with shapes of whales visible — Prompt: “Consider the subtleness of the sea; how its most dreaded creatures glide under water, unapparent for the most part, and treacherously hidden beneath the loveliest tints of azure.” (Moby-Dick by Herman Melville)

Robin on a wall, beak open — Prompt: “The robin flew from his swinging spray of ivy on to the top of the wall and he opened his beak and sang a loud, lovely trill, merely to show off. Nothing in the world is quite as adorably lovely as a robin when he shows off - and they are nearly always doing it.” (The Secret Garden by Frances Hodgson Burnett)

More realistic image generation

Imagen 2’s dataset and model advances have delivered improvements in many of the areas that text-to-image tools often struggle with, including rendering realistic hands and human faces and minimizing distracting visual artifacts.

A grid of hands and faces — Examples of Imagen 2 generating realistic hands and human faces.

We trained a specialized image aesthetics model based on human preferences for qualities like good lighting, framing, exposure, sharpness, and more. Each image was given an aesthetics score which helped condition Imagen 2 to give more weight to images in its training dataset that align with qualities humans prefer. This technique improves Imagen 2’s ability to generate higher-quality images.

AI-generated images of flowers with different aesthetics scores — AI-generated images using the prompt “Flower”, ordered from lower aesthetics scores (left) to higher scores (right).
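The text does not specify how the aesthetics score is fed into training, so the following is only a toy sketch of one simple way to "give more weight" to preferred images: sampling training examples with probability proportional to a softmax over their scores. The file names, scores and the `temperature` parameter are hypothetical, and this is not presented as Imagen 2's actual conditioning mechanism.

```python
import math
import random


def aesthetic_sampling_weights(scores, temperature=1.0):
    """Turn aesthetics scores into sampling probabilities with a softmax.

    A lower temperature concentrates probability on the highest-scoring
    images; a higher temperature flattens the distribution.
    """
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical (image, aesthetics score) pairs.
dataset = [
    ("flower_blurry.png", 2.1),
    ("flower_ok.png", 5.4),
    ("flower_sharp.png", 8.7),
]

weights = aesthetic_sampling_weights(
    [score for _, score in dataset], temperature=2.0
)

# Draw a small training batch; higher-scoring images are picked more often.
rng = random.Random(0)
batch = rng.choices([name for name, _ in dataset], weights=weights, k=8)
```

An alternative with a similar effect would be to keep uniform sampling and pass the score to the model as an extra conditioning input; the sampling version is shown because it is the shortest to express.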

Fluid style conditioning

Imagen 2’s diffusion-based techniques provide a high degree of flexibility, making it easier to control and adjust the style of an image. By providing reference style images in combination with a text prompt, we can condition Imagen 2 to generate new imagery that follows the same style.

A visualization of how Imagen 2 makes it easier to control the output style by using reference images alongside a text prompt.

Advanced inpainting and outpainting

Imagen 2 also enables image editing capabilities like ‘inpainting’ and ‘outpainting’. By providing a reference image and an image mask, users can generate new content directly into the original image with a technique called inpainting, or extend the original image beyond its borders with outpainting. These capabilities are now available in Google Cloud’s Vertex AI, together with an expanded list of aspect ratio options: 16:9, 9:16, 4:3 and 3:4.

Imagen 2 can generate new content directly into the original image with inpainting.

Imagen 2 can extend the original image beyond its borders with outpainting.
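The role of the image mask in both operations can be sketched with plain nested lists standing in for a grayscale image. This is a simplified illustration, not Imagen 2's implementation: in a real system the model generates the new pixels conditioned on the image, the mask and the prompt, whereas here the "generated" pixels are just placeholder values. The function names are hypothetical.

```python
def composite_inpaint(original, generated, mask):
    """Keep original pixels where mask is 0, use generated pixels where mask is 1."""
    return [
        [g if m else o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


def outpaint_canvas(original, pad, fill=0):
    """Centre the image on a larger canvas and mark the new border for generation.

    Returns the padded canvas and a mask that is 1 on the added border
    (to be filled in) and 0 over the original image (to be preserved).
    """
    height, width = len(original), len(original[0])
    new_height, new_width = height + 2 * pad, width + 2 * pad
    canvas = [[fill] * new_width for _ in range(new_height)]
    mask = [[1] * new_width for _ in range(new_height)]
    for y in range(height):
        for x in range(width):
            canvas[y + pad][x + pad] = original[y][x]
            mask[y + pad][x + pad] = 0
    return canvas, mask


original = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
generated = [[99, 99, 99], [99, 99, 99], [99, 99, 99]]  # placeholder model output
mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]                # edit only the centre pixel

edited = composite_inpaint(original, generated, mask)
canvas, border_mask = outpaint_canvas(original, pad=1)
```

Seen this way, outpainting is inpainting on an enlarged canvas: the original image is preserved and the mask covers only the newly added border.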

Responsible by design

To help mitigate the potential risks and challenges of our text-to-image generative technology, we put robust guardrails in place, from design and development to deployment in our products.

Imagen 2 is integrated with SynthID, our cutting-edge toolkit for watermarking and identifying AI-generated content, enabling allowlisted Google Cloud customers to add an imperceptible digital watermark directly into the pixels of the image, without compromising image quality. The watermark remains detectable by SynthID even after modifications such as filters, cropping, or saving with lossy compression schemes.

Before we release capabilities to users, we conduct robust safety testing to minimize the risk of harm. From the outset, we invested in training data safety for Imagen 2, and added technical guardrails to limit problematic outputs like violent, offensive, or sexually explicit content. We apply safety checks to training data, input prompts, and system-generated outputs at generation time. For example, we’re applying comprehensive safety filters to avoid generating potentially problematic content, such as images of named individuals. As we are expanding the capabilities and launches of Imagen 2, we are also continuously evaluating them for safety.

Acknowledgements

This work was made possible by key research and engineering contributions from:

Aäron van den Oord, Ali Razavi, Benigno Uria, Çağlar Ünlü, Charlie Nash, Chris Wolff, Conor Durkan, David Ding, Dawid Górny, Evgeny Gladchenko, Felix Riedel, Hang Qi, Jacob Kelly, Jakob Bauer, Jeff Donahue, Junlin Zhang, Mateusz Malinowski, Mikołaj Bińkowski, Pauline Luc, Robert Riachi, Robin Strudel, Sander Dieleman, Tobenna Peter Igwe, Yaroslav Ganin, Zach Eaton-Rosen.

Thanks to: Ben Bariach, Dawn Bloxwich, Ed Hirst, Elspeth White, Gemma Jennings, Jenny Brennan, Komal Singh, Luis C. Cobo, Miaosen Wang, Nick Pezzotti, Nicole Brichtova, Nidhi Vyas, Nina Anderson, Norman Casagrande, Sasha Brown, Sven Gowal, Tulsee Doshi, Will Hawkins, Yelin Kim, Zahra Ahmed for driving delivery; Douglas Eck, Nando de Freitas, Oriol Vinyals, Eli Collins, Demis Hassabis for their advice.

Thanks also to many others who contributed across Google DeepMind, including our partners in Google.