Today, let's take a look at their paper from April, VASA. VASA is a framework for generating lifelike talking faces with appealing visual affective skills (VAS), given a single static image and a speech audio clip. Its premiere model, VASA-1, not only achieves precise audio-lip synchronization but also captures a wide range of facial nuances and natural head movements, enhancing realism and vividness.
Demos:
Realism and Vividness
An example with one minute of audio input. More short examples with diverse audio inputs.
Controllability of the generation
Generation results with different primary gaze directions (front, left, right, and upward). Generation results with different head distance ratios. Generation results with different emotional offsets (neutral, happy, angry, and surprised).
Out-of-distribution generalization ability
Decoupling capability
Results of different motion sequences under the same input photo. Results of different photos under the same motion sequence.
Pose and Expression Editing
Real-time efficiency
VASA-1 generates video frames of size 512x512 at 45 fps in offline batch mode; in online streaming mode, it reaches 40 fps with a startup latency of only 170 ms. These results were measured on a desktop computer equipped with a single NVIDIA RTX 4090 GPU.
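The streaming numbers imply a tight per-frame compute budget, which a quick back-of-envelope check makes concrete. The interpretation of the 170 ms figure as a look-ahead buffer is our assumption, not a claim from the paper:

```python
# Back-of-envelope check of the reported streaming numbers.
# Assumption: the 170 ms figure is the one-time delay before the
# first frame appears; sustained throughput is 40 fps.

streaming_fps = 40
startup_latency_ms = 170

# At 40 fps, each frame must be produced within a 25 ms budget.
per_frame_budget_ms = 1000 / streaming_fps

# The startup latency spans roughly 6-7 frame intervals, which
# would be consistent with a small look-ahead window over the audio.
frames_of_latency = startup_latency_ms / per_frame_budget_ms

print(per_frame_budget_ms)  # 25.0
print(frames_of_latency)    # 6.8
```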
Overall Framework
VASA-1 does not generate video frames directly; instead, it generates holistic facial dynamics and head motion in a latent space, conditioned on the audio and other signals. From these motion latent codes, a face decoder produces the video frames, also taking as input appearance and identity features extracted from the input image.
To achieve this, the authors first build a facial latent space by training a face encoder and decoder: an expressive and disentangled facial latent learning framework, carefully designed and trained on real-life face videos. They then train a simple yet powerful diffusion transformer to model the motion distribution and, at test time, generate motion latent codes conditioned on the audio and other signals.
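The two-stage pipeline above can be sketched end to end. Everything here is an illustrative stand-in: the function names, shapes, and toy "models" are our assumptions, while the real system uses a learned encoder/decoder and an iterative diffusion transformer:

```python
import numpy as np

# Minimal sketch of the two-stage pipeline: (1) generate motion latents
# from audio, (2) decode frames from appearance + motion latents.
rng = np.random.default_rng(0)

def encode_appearance(image):
    """Stand-in for the face encoder: image -> appearance/identity features."""
    return image.mean(axis=(0, 1))  # (C,) toy feature vector

def diffusion_transformer(audio_features, num_frames, latent_dim=8):
    """Stand-in for the motion generator: audio -> one motion latent per frame."""
    # A real diffusion model would iteratively denoise; here we just emit
    # audio-conditioned noise as a placeholder.
    return rng.standard_normal((num_frames, latent_dim)) + audio_features.mean()

def face_decoder(appearance, motion_latents, size=64):
    """Stand-in for the face decoder: (appearance, motion) -> video frames."""
    num_frames = motion_latents.shape[0]
    return np.zeros((num_frames, size, size, appearance.shape[0])) + appearance

# One source image and a short "audio clip" (e.g. mel-spectrogram features).
image = rng.random((64, 64, 3))
audio_features = rng.random((100, 80))

appearance = encode_appearance(image)  # extracted once per identity
motion = diffusion_transformer(audio_features, num_frames=25)
video = face_decoder(appearance, motion)

print(video.shape)  # (25, 64, 64, 3)
```

Note how the appearance features are extracted once and reused for every frame, which is what allows motion and identity to be swapped independently, as in the decoupling demos.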
Main Features and Advantages of the Model
Lip-audio synchronization: the generated mouth movements are precisely synchronized with the audio.
Expressive dynamics: captures subtle facial changes and natural head movements, enhancing realism and vividness.
Video quality: generates high-quality videos with realistic facial and head dynamics.
Real-time efficiency: supports online generation of 512x512 video at up to 40 FPS with extremely low startup latency.
Evaluation: in extensive experiments with new metrics, VASA-1 significantly outperforms previous methods across all dimensions.