
Google's Still-Moving: Generating personalized video content from a few static reference images

It feels like everything I've seen this week is from Google 😓. Most of these Google projects only have papers and demos: the code isn't open-sourced, and there are no models to download. Still, we can get a sense of how well they work. Below is a paper Google released last week, Still-Moving, which generates personalized video content from a small number of static reference images.

Video example

Personalized video generation

Given a text-to-video (T2V) model built on top of a text-to-image (T2I) model, Still-Moving can adapt any customized T2I weights to align with the T2V model. The adaptation requires only a few static reference images and preserves the motion prior of the T2V model. Below is an example of personalized video generation achieved by adapting a personalized T2I model (e.g., DreamBooth [Ruiz et al. 2022]).
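To make the idea more concrete, here is a minimal PyTorch-style sketch of what injecting customized T2I weights into the spatial layers of a T2V model could look like. The paper's code is not released, so everything below (the LoRALinear wrapper, inject_customized_weights, and the assumption that the customization is stored as low-rank deltas) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a low-rank customized delta (DreamBooth/LoRA-style).
    The base weight comes from the T2V model's spatial layers; the delta comes from
    the customized T2I model. Names and structure here are illustrative assumptions."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # keep the T2V spatial prior frozen
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)  # start as an identity-preserving delta

    def forward(self, x):
        return self.base(x) + self.lora_up(self.lora_down(x))

def inject_customized_weights(t2v_spatial_layers: nn.ModuleDict,
                              custom_t2i_deltas: dict):
    """Wrap each spatial linear layer of the T2V model with the customized T2I
    low-rank delta, leaving the temporal (motion) layers untouched.
    custom_t2i_deltas maps layer name -> {"down": tensor, "up": tensor};
    shapes are assumed to match the wrapped layers."""
    for name, layer in list(t2v_spatial_layers.items()):
        wrapped = LoRALinear(layer)
        if name in custom_t2i_deltas:
            wrapped.lora_down.weight.data.copy_(custom_t2i_deltas[name]["down"])
            wrapped.lora_up.weight.data.copy_(custom_t2i_deltas[name]["up"])
        t2v_spatial_layers[name] = wrapped
    return t2v_spatial_layers
```

The point of this sketch is simply that only the spatial layers see the customized weights; the temporal layers, which carry the motion prior, are left untouched.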

Stylized video generation

Still-Moving can also be used with pre-trained stylized T2I models (e.g., StyleDrop [Sohn et al. 2023]) to generate videos in a consistent style. Each row contains a set of diverse videos that follow the style of the reference image on the left while showcasing the natural motion of the T2V model.

ControlNet + Personalized Video Generation

The video below is generated by combining the fine-grained control and structure-preserving capabilities of ControlNet with the personalization abilities of Still-Moving.

ControlNet + Stylized Video Generation

The customized Still-Moving model can also be combined with ControlNet [Zhang et al. 2023] to generate videos that conform to the style of a given T2I model while taking their structure and dynamics from a given reference video.
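As a rough sketch of how such a combination could be wired up at sampling time, the snippet below conditions each denoising step of the customized T2V model on per-frame structure signals extracted from the reference video. None of these names come from Still-Moving or from ControlNet's released code; extract_condition, timesteps, predict_noise, denoise_step, and decode are hypothetical placeholders for a generic diffusion-sampling API.

```python
import torch

@torch.no_grad()
def controlled_stylized_sampling(t2v_model, controlnet, reference_video_frames,
                                 prompt: str, num_steps: int = 50):
    """Hedged sketch: per-frame structure signals (e.g., edges or depth) from a
    reference video steer the denoising of the customized T2V model, while the
    injected T2I style weights determine appearance.
    reference_video_frames is assumed to be a tensor of shape (T, C, H, W)."""
    conditions = torch.stack([controlnet.extract_condition(f)      # hypothetical call
                              for f in reference_video_frames])
    latents = torch.randn(1, reference_video_frames.shape[0], 4, 64, 64)

    for t in t2v_model.timesteps(num_steps):                       # hypothetical call
        residuals = controlnet(latents, t, conditions)             # structure guidance
        noise_pred = t2v_model.predict_noise(latents, t, prompt,
                                             controlnet_residuals=residuals)
        latents = t2v_model.denoise_step(latents, noise_pred, t)   # hypothetical call
    return t2v_model.decode(latents)
```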

Research Method

The method seamlessly integrates the spatial prior of the customized T2I model with the motion prior provided by the T2V model: the personalized or stylized appearance is learned from a few static reference images, while the motion continues to come from the T2V model.
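One plausible way to read this is as a light fine-tuning stage: since only static reference images are available, they can be repeated along the time axis into "still videos" and used to train a small set of adapter parameters with the ordinary diffusion denoising loss, while the temporal layers stay frozen. The sketch below illustrates that reading; train_spatial_adapters, add_noise, and predict_noise are hypothetical names, and the details are assumptions rather than the paper's actual training procedure.

```python
import torch
import torch.nn.functional as F

def train_spatial_adapters(t2v_model, adapter_params, reference_images,
                           num_frames: int = 16, steps: int = 500, lr: float = 1e-4):
    """Hedged sketch: fine-tune only the lightweight adapter parameters on
    'still videos' built by repeating each static reference image across time,
    so the frozen temporal layers (the motion prior) are never updated.
    t2v_model is assumed to expose a standard diffusion API:
    add_noise(x, noise, t) and predict_noise(x_t, t, prompt) are hypothetical."""
    optimizer = torch.optim.AdamW(adapter_params, lr=lr)
    for step in range(steps):
        image = reference_images[step % len(reference_images)]        # (C, H, W)
        still_video = image.unsqueeze(0).repeat(num_frames, 1, 1, 1)  # (T, C, H, W)
        still_video = still_video.unsqueeze(0)                        # add batch dim

        noise = torch.randn_like(still_video)
        t = torch.randint(0, 1000, (1,), device=still_video.device)
        noisy = t2v_model.add_noise(still_video, noise, t)            # hypothetical call
        pred = t2v_model.predict_noise(noisy, t,
                                       prompt="a photo of a sks subject")  # illustrative prompt

        loss = F.mse_loss(pred, noise)                                # standard diffusion loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return adapter_params
```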

Comparison

Comparison with AnimateDiff

The examples below show the results of applying Still-Moving to the AnimateDiff T2V model (introduced earlier), using the same seed and prompt, to demonstrate the robustness of the approach. Naive injection of the customized T2I weights often fails to follow the custom data well or introduces significant artifacts. For example, the "melted gold" style (top row) shows distorted backgrounds and lacks the characteristic dripping effect of the style, and the features of the hamster (bottom row) are not captured accurately (e.g., the colors of the cheeks and forehead); the hamster's identity also changes between frames. With Still-Moving, in contrast, the "melted gold" background matches the reference image and the model generates dripping motion, and the hamster maintains a consistent identity that aligns with the reference image.

Qualitative comparison with baseline methods

Below is a qualitative comparison of Still-Moving with the baseline methods.