"2024 Artificial Intelligence Index Report" - 2.5 Video

This chapter introduces two models, LDM and Emu.

LDM vs. LVG

A research team from LMU Munich and NVIDIA applied the latent diffusion model (LDM), traditionally used for generating high-quality images, to successfully create high-resolution videos.

In resolution quality, LDM significantly outperforms other state-of-the-art methods released in 2022, such as the Long Video Generative Adversarial Network (Long Video GAN, LVG).

LDM paper: https://arxiv.org/pdf/2304.08818.pdf

Emu Video

At the end of 2023, Meta researchers developed Emu Video, a new diffusion-based video generation model.

In head-to-head comparisons judged by human evaluators' preference ratios, Emu Video surpasses previously released state-of-the-art video generation methods in both image quality and fidelity to the text prompt.

Emu supports many generative AI experiences, including AI image editing tools on Instagram that let users take a photo and change its visual style or background. In addition, the Imagine feature of Meta AI lets users generate realistic images directly in conversations with the assistant, or in group chats across Meta's family of apps.

Emu Video factorizes generation into two steps: it first generates an image I from a text prompt, then generates a video V using stronger conditioning on both the generated image and the text. To incorporate the image as conditioning for the model F, the image is zero-padded over time and concatenated with a binary mask indicating which frames are zero-padded, along with the noise input.
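The conditioning construction above can be sketched as follows. This is an illustrative NumPy mock-up, not the paper's actual code: the function name `build_conditioning`, the channel ordering, and the tensor shapes are assumptions made for clarity.

```python
import numpy as np

def build_conditioning(image, T, noise):
    """Sketch of Emu Video-style image conditioning (illustrative, not official code).

    image: (C, H, W) starting image generated from the text prompt
    noise: (T, C, H, W) noise input for the T video frames
    Returns a (T, 2*C + 1, H, W) tensor: noise, the time-padded image,
    and a binary mask marking which frames are zero-padded.
    """
    C, H, W = image.shape
    # Zero-pad the image over time: only frame 0 carries the real image.
    padded = np.zeros((T, C, H, W), dtype=noise.dtype)
    padded[0] = image
    # Binary mask: 1 where the frame is zero-padded, 0 where it holds the image.
    mask = np.zeros((T, 1, H, W), dtype=noise.dtype)
    mask[1:] = 1.0
    # Concatenate along the channel axis to form the model input.
    return np.concatenate([noise, padded, mask], axis=1)

T, C, H, W = 4, 3, 8, 8
cond = build_conditioning(np.ones((C, H, W)), T, np.random.randn(T, C, H, W))
print(cond.shape)  # (4, 7, 8, 8)
```

The mask lets the model distinguish genuine image content from padding, so the same architecture can be conditioned on a single frame while denoising the full clip.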

Emu Video paper: https://arxiv.org/pdf/2311.10709.pdf