Let's take a look at Google's StyleDrop - text-to-image generation in a custom style.
StyleDrop can generate images faithful to a specific style and is powered by Muse, a generative vision transformer for text-to-image synthesis. It is highly flexible, capturing the nuances and details of a user-provided style, such as color schemes, shading, design patterns, and both local and global effects. StyleDrop learns a new style efficiently by fine-tuning very few trainable parameters (less than 1% of the total model parameters) and improves quality through iterative training with human or automated feedback. Even when the user provides only a single image specifying the style, StyleDrop delivers impressive results.
Examples
Single-image stylized text-to-image generation
StyleDrop can generate high-quality text-prompted images based on a single reference image. Style descriptors are appended in natural language form (e.g., "in a melted gold 3D render style") to content descriptors during both training and generation.
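As a purely illustrative example of this prompt convention, the snippet below composes training and generation prompts. The helper and the content captions are hypothetical stand-ins; only the style descriptor text comes from the example above.

```python
# Sketch of the prompt convention described above: a natural-language style
# descriptor is appended to each content descriptor, both when fine-tuning on
# the reference image and when generating new images.
# The helper below is hypothetical; it is not part of any StyleDrop code.

STYLE = "in a melted gold 3D render style"  # style descriptor from the example above

def build_prompt(content: str, style: str = STYLE) -> str:
    """Append the style descriptor to a content descriptor."""
    return f"{content} {style}"

# Prompt paired with the single reference image during fine-tuning
# (the content caption here is made up for illustration).
training_prompt = build_prompt("a woman")

# Prompts used at generation time to render new content in the learned style.
generation_prompts = [build_prompt(c) for c in ("a cat", "a house", "a bicycle")]

print(training_prompt)
for p in generation_prompts:
    print(p)
```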
Stylized character rendering
StyleDrop can generate alphabet images that are stylistically consistent with a single reference image. The style descriptor is attached to the content descriptor in natural language form (e.g., "abstract rainbow flowing smoke wave design") during both training and generation.
Collaborate with style assistants
StyleDrop is easy to train with your own brand assets, helping you quickly prototype creative designs within your own style. The style descriptor is attached to the content descriptor in natural language form during both training and generation.
Comparison
Style tuning with StyleDrop on Muse (a discrete-token vision transformer) performs significantly better than style tuning on diffusion-based models such as Imagen and Stable Diffusion.
Figure: reference image and a comparison of results across the different technologies.
Technology
StyleDrop is built upon Muse, which is an advanced text-to-image synthesis model based on MaskGIT, a masked image generation transformer.
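The post does not reproduce Muse's training objective; as a rough sketch in my own notation (not taken from the post or paper), the standard masked visual-token prediction loss used by MaskGIT-style models has the form:

$$
\mathcal{L}(W) \;=\; \mathbb{E}_{(x,\,t),\,M}\!\left[-\sum_{i \in M} \log p_{W}\!\left(x_i \mid x_{\setminus M},\, t\right)\right]
$$

where $x$ are the discrete visual tokens of an image, $t$ is the text prompt, $M$ is a randomly sampled set of masked token positions, and $p_W$ is the transformer's predicted distribution over the token codebook.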
StyleDrop relies on two key techniques: parameter-efficient fine-tuning of the generative vision transformer, and iterative training with feedback.
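"Parameter-efficient" here plausibly means keeping the Muse backbone weights $W$ frozen and training only a small set of adapter weights $\theta$ (the "less than 1%" mentioned earlier) on the style reference(s). A sketch of such an objective, in the same notation as the loss above and not the paper's exact equation:

$$
\min_{\theta}\; \mathbb{E}_{(x,\,t)\sim\mathcal{D}_s,\,M}\!\left[-\sum_{i \in M} \log p_{W,\theta}\!\left(x_i \mid x_{\setminus M},\, t\right)\right]
$$

where $\mathcal{D}_s$ contains the style reference image(s) paired with prompts that include the style descriptor.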
Finally, images can be synthesized by combining two fine-tuned models (for example, one tuned for content and one for style).
Honestly, I can't quite follow the paper's formula for this combination step either.
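Setting that equation aside, here is a rough, purely illustrative sketch of the workflow this section describes. None of the class or function names below come from StyleDrop or Muse; they are hypothetical stand-ins meant only to show the control flow: single-image style tuning, iterative training with feedback, and sampling from two fine-tuned models.

```python
"""Illustrative scaffolding only: no real Muse/StyleDrop API is used."""
from dataclasses import dataclass
from typing import List


@dataclass
class Adapter:
    """Stand-in for the small set of trainable adapter weights (<1% of Muse)."""
    name: str
    steps_trained: int = 0


def finetune_adapter(name: str, images: List[str], prompts: List[str],
                     steps: int = 1000) -> Adapter:
    # Hypothetical: the Muse backbone stays frozen; only the adapter is
    # updated on (image, "content descriptor + style descriptor") pairs.
    return Adapter(name, steps_trained=steps)


def generate(prompt: str, adapters: List[Adapter]) -> str:
    # Hypothetical: sample an image conditioned on the prompt, plugging the
    # fine-tuned adapter(s) into the frozen backbone.
    used = "+".join(a.name for a in adapters)
    return f"<image for '{prompt}' using adapters [{used}]>"


def score(image: str) -> float:
    # Hypothetical automatic feedback signal; human feedback would replace
    # this with manual selection of the best samples.
    return (len(image) % 10) / 10.0


# Round 1: fine-tune a style adapter on a single reference image.
style = finetune_adapter("style", ["reference.png"],
                         ["a woman in a melted gold 3D render style"])

# Iterative training with feedback: generate candidates, keep the best ones,
# and fine-tune a second round on that curated synthetic set.
candidates = [generate(p, [style]) for p in
              ("a cat in a melted gold 3D render style",
               "a house in a melted gold 3D render style")]
curated = [img for img in candidates if score(img) > 0.5]
style_round2 = finetune_adapter("style_round2", curated,
                                ["(captions for curated images)"])

# Finally, synthesize images from two fine-tuned models, e.g. a content
# (subject) adapter combined with the style adapter.
content = finetune_adapter("content", ["my_dog.png"], ["a photo of my dog"])
print(generate("my dog in a melted gold 3D render style",
               [content, style_round2]))
```

The feedback round in the middle is what the post credits for improving quality when only one style image is available.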