Let's take a look at Google's StyleDrop - text-to-image generation in a custom style.
StyleDrop can generate images faithful to a specific style and is powered by Muse, a generative vision transformer for text-to-image synthesis. It is highly flexible, capturing the nuances and details of a user-provided style, such as color schemes, shading, design patterns, and both local and global effects. StyleDrop learns a new style efficiently by fine-tuning very few trainable parameters (less than 1% of the total model parameters) and improves quality through iterative training with human or automated feedback. Even when the user provides only a single image specifying the style, StyleDrop delivers impressive results.
Examples
Single-image stylized text-to-image generation
StyleDrop can generate high-quality text-prompted images based on a single reference image. Style descriptors are appended in natural language form (e.g., "in a melted gold 3D render style") to content descriptors during both training and generation.
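As a purely illustrative example of this prompt convention, the snippet below composes training and generation prompts. The helper and the content captions are hypothetical stand-ins; only the style descriptor text comes from the example above.

```python
# Sketch of the prompt convention described above: a natural-language style
# descriptor is appended to each content descriptor, both when fine-tuning on
# the reference image and when generating new images.
# The helper below is hypothetical; it is not part of any StyleDrop code.

STYLE = "in a melted gold 3D render style"  # style descriptor from the example above

def build_prompt(content: str, style: str = STYLE) -> str:
    """Append the style descriptor to a content descriptor."""
    return f"{content} {style}"

# Prompt paired with the single reference image during fine-tuning
# (the content caption here is made up for illustration).
training_prompt = build_prompt("a woman")

# Prompts used at generation time to render new content in the learned style.
generation_prompts = [build_prompt(c) for c in ("a cat", "a house", "a bicycle")]

print(training_prompt)
for p in generation_prompts:
    print(p)
```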
Stylized character rendering
StyleDrop can generate alphabet images that are stylistically consistent with a single reference image. The style descriptor is attached to the content descriptor in natural language form (e.g., "abstract rainbow flowing smoke wave design") during both training and generation.
Collaborate with style assistants
StyleDrop is easy to train with your own brand assets, helping you quickly prototype creative designs within your own style. The style descriptor is attached to the content descriptor in natural language form during both training and generation.
Comparison
Style tuning with StyleDrop on Muse (a discrete-token vision transformer) performs significantly better than style tuning on diffusion-based models such as Imagen and Stable Diffusion.
Figure: reference image and a comparison of results across the different technologies.
Technology
StyleDrop is built upon Muse, which is an advanced text-to-image synthesis model based on MaskGIT, a masked image generation transformer.
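The post does not reproduce Muse's training objective; as a rough sketch in my own notation (not taken from the post or paper), the standard masked visual-token prediction loss used by MaskGIT-style models has the form:

$$
\mathcal{L}(W) \;=\; \mathbb{E}_{(x,\,t),\,M}\!\left[-\sum_{i \in M} \log p_{W}\!\left(x_i \mid x_{\setminus M},\, t\right)\right]
$$

where $x$ are the discrete visual tokens of an image, $t$ is the text prompt, $M$ is a randomly sampled set of masked token positions, and $p_W$ is the transformer's predicted distribution over the token codebook.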
StyleDrop relies on two key techniques: parameter-efficient fine-tuning of the generative vision transformer, and iterative training with feedback.
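"Parameter-efficient" here plausibly means keeping the Muse backbone weights $W$ frozen and training only a small set of adapter weights $\theta$ (the "less than 1%" mentioned earlier) on the style reference(s). A sketch of such an objective, in the same notation as the loss above and not the paper's exact equation:

$$
\min_{\theta}\; \mathbb{E}_{(x,\,t)\sim\mathcal{D}_s,\,M}\!\left[-\sum_{i \in M} \log p_{W,\theta}\!\left(x_i \mid x_{\setminus M},\, t\right)\right]
$$

where $\mathcal{D}_s$ contains the style reference image(s) paired with prompts that include the style descriptor.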
Finally, images can be synthesized by combining two fine-tuned models (for example, one tuned for content and one for style).
Honestly, I can't quite follow the paper's formula for this combination step either.
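Setting that equation aside, here is a rough, purely illustrative sketch of the workflow this section describes. None of the class or function names below come from StyleDrop or Muse; they are hypothetical stand-ins meant only to show the control flow: single-image style tuning, iterative training with feedback, and sampling from two fine-tuned models.

```python
"""Illustrative scaffolding only: no real Muse/StyleDrop API is used."""
from dataclasses import dataclass
from typing import List


@dataclass
class Adapter:
    """Stand-in for the small set of trainable adapter weights (<1% of Muse)."""
    name: str
    steps_trained: int = 0


def finetune_adapter(name: str, images: List[str], prompts: List[str],
                     steps: int = 1000) -> Adapter:
    # Hypothetical: the Muse backbone stays frozen; only the adapter is
    # updated on (image, "content descriptor + style descriptor") pairs.
    return Adapter(name, steps_trained=steps)


def generate(prompt: str, adapters: List[Adapter]) -> str:
    # Hypothetical: sample an image conditioned on the prompt, plugging the
    # fine-tuned adapter(s) into the frozen backbone.
    used = "+".join(a.name for a in adapters)
    return f"<image for '{prompt}' using adapters [{used}]>"


def score(image: str) -> float:
    # Hypothetical automatic feedback signal; human feedback would replace
    # this with manual selection of the best samples.
    return (len(image) % 10) / 10.0


# Round 1: fine-tune a style adapter on a single reference image.
style = finetune_adapter("style", ["reference.png"],
                         ["a woman in a melted gold 3D render style"])

# Iterative training with feedback: generate candidates, keep the best ones,
# and fine-tune a second round on that curated synthetic set.
candidates = [generate(p, [style]) for p in
              ("a cat in a melted gold 3D render style",
               "a house in a melted gold 3D render style")]
curated = [img for img in candidates if score(img) > 0.5]
style_round2 = finetune_adapter("style_round2", curated,
                                ["(captions for curated images)"])

# Finally, synthesize images from two fine-tuned models, e.g. a content
# (subject) adapter combined with the style adapter.
content = finetune_adapter("content", ["my_dog.png"], ["a photo of my dog"])
print(generate("my dog in a melted gold 3D render style",
               [content, style_round2]))
```

The feedback round in the middle is what the post credits for improving quality when only one style image is available.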