SMooDi - AI Generates Realistic and Stylized Human Motions

SMooDi's source code has not yet been made public. The research was completed jointly by teams from Northeastern University, Stability AI, and Google Research.

SMooDi is a diffusion model for generating stylized motions driven by content text and style motion sequences. Unlike existing methods that can only generate motions of various contents or transfer styles from one sequence to another, SMooDi can quickly generate motions with various contents and multiple styles.

It can generate realistic and stylized human motions based on content text and style motion sequences. It also accepts motion sequences as content input. The darker parts in the video indicate later frames in the sequence. To better demonstrate stylized motion generation, the SMooDi team added style tags to each style motion sequence. Please note that these style tags are not used as model inputs but only for visualization purposes.

Method

Overview of SMooDi

The SMooDi model generates stylized human motion from content text and a style motion sequence. In denoising step t, SMooDi takes the content text c, the style motion s, and the noisy latent variable z_t as inputs and predicts the noise ε_t, from which z_{t-1} is obtained. This denoising step is repeated T times to produce the noise-free motion latent variable z_0, which is fed into the motion decoder D to generate the stylized motion.
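The denoising loop above can be sketched as follows. This is a toy illustration, not SMooDi's actual implementation (which is unreleased): `ToyDenoiser`, `predict_noise`, `step`, and the latent shape are all hypothetical stand-ins for the real conditioned diffusion model.

```python
import numpy as np

class ToyDenoiser:
    """Stand-in for SMooDi's denoiser (hypothetical; the real model is a
    learned network over motion latents, conditioned on text and style)."""
    latent_shape = (16, 8)  # (frames, latent dim) - illustrative only

    def predict_noise(self, z, t, content_text, style_motion):
        return 0.1 * z  # placeholder for the learned prediction of eps_t

    def step(self, z, eps, t):
        return z - eps  # placeholder update producing z_{t-1} from z_t

def generate_stylized_motion(content_text, style_motion, denoiser, decoder,
                             T=50, seed=0):
    """Run T denoising steps from Gaussian noise z_T down to z_0, then decode."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(denoiser.latent_shape)  # z_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = denoiser.predict_noise(z, t, content_text, style_motion)  # eps_t
        z = denoiser.step(z, eps, t)  # z_{t-1}
    return decoder(z)  # motion decoder D maps z_0 to a motion sequence

motion = generate_stylized_motion("a person walks", style_motion=None,
                                  denoiser=ToyDenoiser(), decoder=lambda z: z)
```

The point of the sketch is the control flow: both conditions (content text and style motion) enter at every denoising step, which is what lets a single sampling pass produce motion that satisfies both.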

Detailed explanation of the style adapter

SMooDi's style adapter connects to the motion diffusion model through zero linear layers (linear layers whose weights are initialized to zero). The adapter adds its output at each Transformer encoder layer to the corresponding output of the motion diffusion model, steering the predicted noise toward the target style.
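The "zero linear layer" idea (popularized by ControlNet) can be shown in a few lines. This NumPy sketch is illustrative only; class and variable names are assumptions, and the real adapter would be a trained network in a deep-learning framework.

```python
import numpy as np

class ZeroLinear:
    """Linear layer initialized to all zeros. At the start of training its
    output is exactly zero, so the style adapter initially leaves the
    pretrained motion diffusion model's behavior unchanged; style guidance
    is blended in gradually as the weights move away from zero."""
    def __init__(self, dim_in, dim_out):
        self.W = np.zeros((dim_in, dim_out))
        self.b = np.zeros(dim_out)

    def __call__(self, x):
        return x @ self.W + self.b

# The adapter's projected features are added to the diffusion model's
# encoder-layer output (dimensions here are arbitrary for illustration):
zero_proj = ZeroLinear(64, 64)
encoder_out = np.random.default_rng(1).standard_normal((10, 64))
adapter_feat = np.random.default_rng(2).standard_normal((10, 64))
guided = encoder_out + zero_proj(adapter_feat)  # equals encoder_out at init
```

Because the residual addition contributes nothing at initialization, fine-tuning the adapter cannot catastrophically disturb the pretrained model at step zero; this is the design motivation for using zero-initialized layers as the connection point.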

Visual illustration of classifier-free and classifier-guided style guidance

  • The first panels show classifier-free content guidance and classifier-free style guidance, respectively;
  • the final panel shows the stylized motion after correction by classifier-guided style guidance.
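Classifier-free guidance with two conditions is commonly implemented by combining noise predictions with separate guidance scales. The function below is the generic two-condition CFG recipe, not SMooDi's confirmed formula; the weights and the exact composition are assumptions for illustration.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_content, eps_content_style, w_c=7.5, w_s=1.5):
    """Generic classifier-free guidance extended to two conditions.

    eps_uncond:        noise predicted with no conditioning
    eps_content:       noise predicted with the content text only
    eps_content_style: noise predicted with content text and style motion
    w_c pushes the prediction toward the content; w_s further toward the style.
    (Illustrative; SMooDi's exact weighting scheme may differ.)
    """
    return (eps_uncond
            + w_c * (eps_content - eps_uncond)
            + w_s * (eps_content_style - eps_content))
```

With w_c = 1 and w_s = 0 this reduces to the plain content-conditioned prediction, and raising w_s trades content fidelity for stronger stylization; the classifier-guided correction mentioned above is a separate, gradient-based refinement applied on top of this combination.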

Comparison

A quantitative comparison between SMooDi and baseline methods on stylized motion generation driven by content text, using the 100STYLE dataset (providing styles) and the HumanML3D dataset (providing content).

Below is a qualitative comparison between SMooDi and the baseline methods on two stylized motion generation tasks.