
A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing, CVPR 2024

Maomao Li, Yu Li, Tianyu Yang, Yunfei Liu, Dongxu Yue, Zhihui Lin, Dong Xu


Teaser: Reconstruction comparison between DDIM and STEM inversion. DDIM inversion in existing video editing methods usually exploits only 1-frame or 2-frame context to invert each frame, so we also design a more radical inflated DDIM inversion that uses all-frame context as its reference. Here, we use the typical DDIM reconstruction procedure to compare video reconstructions: both our STEM inversion and the radical inflated DDIM inversion can explore context from the entire video, yet the latter is resource-consuming and yields inferior results.

🦴 Abstract

TL;DR: STEM inversion is an efficient video inversion method for text-guided video editing.

Click for the full abstract

We present a video inversion approach for zero-shot video editing, which aims to model the input video with a low-rank representation during the inversion process. Existing video editing methods usually apply the typical 2D DDIM inversion or a naive spatial-temporal DDIM inversion before editing, which leverages a time-varying representation for each frame to derive its noisy latent. Unlike most existing approaches, we propose a Spatial-Temporal Expectation-Maximization (STEM) inversion, which formulates the dense video feature in an expectation-maximization manner and iteratively estimates a more compact basis set to represent the whole video. Each frame then uses this fixed, global representation for inversion, which is more friendly to temporal consistency during reconstruction and editing. Extensive qualitative and quantitative experiments demonstrate that our STEM inversion achieves consistent improvements on two state-of-the-art video editing methods.

🚀 Method Overview

Illustration of the proposed STEM inversion method. We estimate a more compact representation (bases $\mathbf{\mu}$) of the input video via the EM algorithm. The ST-E step and ST-M step are executed alternately for $R$ iterations until convergence. The self-attention (SA) in our STEM inversion is denoted STEM-SA, where the $\rm{Key}$ and $\rm{Value}$ embeddings are derived by projecting the converged $\mathbf{\mu}$.
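To make the alternating steps concrete, here is a minimal PyTorch-style sketch of the idea (not the authors' implementation; the tensor shapes, inverse-temperature factor, normalization, and function names are assumptions for illustration). The ST-E step computes soft responsibilities of every spatial-temporal token with respect to the bases, the ST-M step refits each basis as the responsibility-weighted mean of the tokens, and the STEM-SA step lets each frame's queries attend to the converged bases instead of all pixels:

import torch
import torch.nn.functional as F

def stem_em(X, mu, n_iters=5, inv_temp=1.0):
    # X:  (N, C) flattened spatial-temporal features of the whole video, N = T*H*W
    # mu: (K, C) initial bases, e.g. K = 256
    for _ in range(n_iters):
        # ST-E step: soft assignment of every token to each basis
        z = torch.softmax(inv_temp * (X @ mu.t()), dim=1)          # (N, K)
        # ST-M step: refit each basis as the responsibility-weighted mean of the tokens
        mu = (z.t() @ X) / (z.sum(dim=0).unsqueeze(1) + 1e-6)      # (K, C)
        mu = F.normalize(mu, dim=1)                                # keep bases well-conditioned
    return mu

def stem_self_attention(frame_feats, mu, W_q, W_k, W_v):
    # frame_feats: (HW, C) features of one frame; mu: (K, C) converged bases
    q = frame_feats @ W_q                                          # (HW, d)
    k, v = mu @ W_k, mu @ W_v                                      # (K, d) each
    attn = torch.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)   # (HW, K)
    return attn @ v                                                # (HW, d)

Because the Key and Value matrices have only K rows (256 by default) rather than T*H*W, every frame attends to the same fixed, global bases, which is what makes the inversion both cheaper than an all-frame inflated DDIM inversion and friendlier to temporal consistency.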

📋 Changelog

  • 2023.12.11 Paper is released!
  • 2024.05.01 The code based on TokenFlow editing is released!

🏗️ Todo

  • Release the STEM inversion code
  • Release the code based on FateZero editing

▶️ Quick Start for TokenFlow video editing using STEM inversion

Environment

Prepare the Conda environment using the following commands:

git clone https://github.com/STEM-Inv/stem-inv
cd stem-inv
cd TokenFlow-Edit
conda create -n stem-tf python=3.9
conda activate stem-tf
pip install -r requirements.txt

Video Editing

We provide demo source videos in the data folder. The corresponding configs for STEM inversion and editing are in the configs folder. Below are the instructions for performing video editing on the provided source videos. You can run the following command to perform inversion and editing in one pass:

bash run_editing.sh

The inversion results are saved in Stem_Inv_Latents/base_256_iter_5, and the editing results are saved in STEM_TF_results.

If you are only interested in the reconstruction results of STEM inversion, please run:

bash run_inversion.sh

Note that our default setting uses 256 bases to represent the whole input video, with 5 iterations for EM algorithm convergence. You can also try other configurations by modifying the values of "num_bases" and "n_iters" in line 88 of tokenflow_utils.py.
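For reference, the default output folder name base_256_iter_5 reflects exactly these two knobs; a hypothetical illustration of the values involved (parameter names taken from the README, not copied from the code):

num_bases = 256  # size of the compact basis set representing the whole video
n_iters = 5      # number of ST-E/ST-M alternations before the bases are treated as converged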

📰 Editing result

Source prompt: A man is playing tennis.     Target prompt: Spider-Man is playing tennis.

📎 Citation

@article{li2023video,
  title={A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing},
  author={Li, Maomao and Li, Yu and Yang, Tianyu and Liu, Yunfei and Yue, Dongxu and Lin, Zhihui and Xu, Dong},
  journal={arXiv preprint arXiv:2312.05856},
  year={2023}
}

📣 Disclaimer

This is the official code of STEM Inversion. The copyrights of the demo images and audio belong to community users. Feel free to contact us if you would like them removed.

💞 Acknowledgements

The code is built upon open-source repositories such as TokenFlow; we thank all the contributors for open-sourcing their work.
