RERENDER A VIDEO - Maintaining temporal coherence of global style and local texture in videos.

The paper I read today is "Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation".

https://arxiv.org/abs/2306.07954

The code has not been open-sourced yet, but you can take a look at the paper first.

Large-scale text-to-image diffusion models such as Stable Diffusion have become remarkably good at generating high-quality images and are seeing increasingly widespread use. However, when these models are applied to video, keeping the generated frames temporally consistent remains a significant challenge. The paper "Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation" proposes a novel zero-shot text-guided video-to-video translation framework that adapts image models to videos.

The framework consists of two parts, keyframe translation followed by full video translation:

  1. An adapted diffusion model is used to generate keyframes, applying hierarchical cross-frame constraints to enforce coherence in shape, texture, and color.
  2. Keyframes are propagated to the remaining frames through temporal-aware patch matching and frame blending (a simplified sketch of this propagation step follows the list).
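
To make the second stage concrete, here is a minimal sketch of keyframe selection and propagation. The function names, the keyframe interval, and the distance-weighted blend are my own stand-ins: the actual method fills in non-keyframes with temporal-aware patch matching and frame blending, which this toy linear blend does not reproduce; the sketch only shows where the rendered keyframes enter and how the remaining frames get filled.

```python
import numpy as np

def select_keyframes(num_frames: int, interval: int = 10) -> list[int]:
    """Sample keyframes uniformly; the interval is an assumed default."""
    idx = list(range(0, num_frames, interval))
    if idx[-1] != num_frames - 1:
        idx.append(num_frames - 1)           # always keep the last frame
    return idx

def propagate(rendered: dict[int, np.ndarray], num_frames: int) -> list[np.ndarray]:
    """Fill non-keyframes from the two nearest rendered keyframes.
    A distance-weighted blend stands in for the paper's temporal-aware
    patch matching and frame blending."""
    keys = sorted(rendered)
    out = []
    for i in range(num_frames):
        prev = max(k for k in keys if k <= i)
        nxt = min(k for k in keys if k >= i)
        if prev == nxt:                       # frame i is itself a keyframe
            out.append(rendered[prev])
        else:
            w = (i - prev) / (nxt - prev)
            out.append((1 - w) * rendered[prev] + w * rendered[nxt])
    return out

# Usage: render keyframes with the diffusion model (random images stand in here),
# then propagate them to the full clip.
key_idx = select_keyframes(num_frames=25)
rendered = {i: np.random.rand(64, 64, 3) for i in key_idx}
video = propagate(rendered, num_frames=25)
```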

The framework achieves temporal consistency of both global style and local texture at low cost, without any retraining or per-video optimization. Because the adaptation leaves the underlying image diffusion model intact, it remains compatible with existing image diffusion techniques, such as LoRA for customizing specific subjects and ControlNet for additional structural control. Extensive experimental results show that the proposed framework renders high-quality, temporally coherent videos more effectively than existing methods.
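
As a point of reference, combining ControlNet and a LoRA with Stable Diffusion looks roughly like the snippet below when using the diffusers library. This is a generic single-image pipeline, not the paper's video framework, and the model identifiers, the LoRA path, and the blank control image are placeholders.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# ControlNet branch that conditions generation on Canny edge maps.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Optional: a LoRA that customizes the subject or style (path is a placeholder).
pipe.load_lora_weights("path/to/your-lora")

# In practice the control image would be the edge map of a video frame;
# a blank image is used here only so the snippet runs end to end.
control_image = Image.new("RGB", (512, 512))
result = pipe("a watercolor portrait", image=control_image).images[0]
result.save("frame.png")
```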

The paper introduces a novel hierarchical cross-frame constraint for pre-trained image diffusion models to produce coherent video frames. The key idea is to apply dense cross-frame constraints using optical flow: previously rendered frames serve as low-level references for the current frame, while the first rendered frame acts as an anchor that regulates the rendering process and prevents deviation from the initial appearance. The hierarchical cross-frame constraints are applied at different stages of diffusion sampling. Beyond global style consistency (cross-frame attention), the method enforces consistency in shape (shape-aware cross-frame latent fusion), texture (pixel-aware cross-frame latent fusion), and color (color-aware adaptive latent adjustment) at the early, middle, and late stages of sampling, respectively. This lightweight modification achieves both global and local temporal consistency.
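
To give a rough sense of two of these constraints, here are minimal sketches under my own assumptions about tensor shapes and module layout; they illustrate the general ideas rather than the released implementation. The first shows cross-frame attention, where the current frame's queries attend to keys and values computed from the anchor and previous frames; the second shows a color adjustment in the spirit of the color-aware step, implemented as an AdaIN-style match of per-channel latent statistics to the anchor frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFrameAttention(nn.Module):
    """Sketch of cross-frame attention: the current frame's queries attend to
    keys/values built from the anchor (first) frame and the previous frame,
    which ties the global style of the frames together."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x_cur, x_anchor, x_prev):
        # x_*: (batch, tokens, dim) features of the current/anchor/previous frame
        ref = torch.cat([x_anchor, x_prev], dim=1)       # shared reference tokens
        q, k, v = self.to_q(x_cur), self.to_k(ref), self.to_v(ref)

        def split_heads(t):
            b, n, d = t.shape
            return t.view(b, n, self.num_heads, d // self.num_heads).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)    # standard attention math
        b, h, n, dh = out.shape
        return self.to_out(out.transpose(1, 2).reshape(b, n, h * dh))
```

```python
import torch

def color_aware_adjust(latent_cur: torch.Tensor, latent_anchor: torch.Tensor,
                       eps: float = 1e-5) -> torch.Tensor:
    """AdaIN-style color adjustment: shift the current frame's latent so its
    per-channel mean/std match the anchor frame, curbing color drift."""
    # latents: (batch, channels, height, width)
    c_mean = latent_cur.mean(dim=(2, 3), keepdim=True)
    c_std = latent_cur.std(dim=(2, 3), keepdim=True) + eps
    a_mean = latent_anchor.mean(dim=(2, 3), keepdim=True)
    a_std = latent_anchor.std(dim=(2, 3), keepdim=True) + eps
    return (latent_cur - c_mean) / c_std * a_std + a_mean
```

In the actual sampling loop these pieces would be applied at different denoising steps, alongside the flow-guided shape-aware and pixel-aware latent fusion, which are not sketched here.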