[ICLR2024] The official implementation of paper "VDT: General-purpose Video Diffusion Transformers via Mask Modeling", by Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, Mingyu Ding.
Is the model architecture in the released inference code consistent with the one used during training? Are there any inconsistencies between the training and inference code?
Great job and looking forward to your reply, thanks in advance.
Fantastic work! Since a few months have passed and the paper has been accepted by ICLR (congrats!), would you please release the training code? Some instructions on how to prepare the dataset would also be great!
Congratulations on the impressive paper. When I tried running inference with your pre-trained Physion model, the results degraded significantly as the number of condition frames was reduced. For example, using 4 condition frames (rather than the default of 8) produces only noise - see attached image.
Does this match your expectation? It seems at odds with the paper's discussion, which states "our VDT can still take any length of conditional frame as input and output consistent predicted features".
Thank you!
Edit: I see in Figure 8 that you tried using more than 8 conditional frames, but not fewer. Do you have a sense of how well forward prediction can perform with only 1 conditioning frame using VDT? Would the model need to be trained with only 1 conditioning frame?
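For context on what "any length of conditional frame" means mechanically, here is a minimal sketch of the per-frame conditioning mask idea from the paper's mask-modeling setup. This is a hypothetical illustration, not the repository's actual code; the function name `build_condition_mask` and the 0/1 convention are my assumptions:

```python
import numpy as np

def build_condition_mask(total_frames: int, num_cond: int) -> np.ndarray:
    """Hypothetical per-frame conditioning mask: 1 = frame is given as a
    clean condition frame, 0 = frame is to be predicted (denoised)."""
    mask = np.zeros(total_frames, dtype=np.int64)
    mask[:num_cond] = 1
    return mask

# Default setting: 8 condition frames out of 16 -> first 8 entries are 1.
print(build_condition_mask(16, 8))
# Reducing to 4 condition frames only changes the mask, not the network,
# which is why the architecture can in principle accept any length.
print(build_condition_mask(16, 4))
```

If the model was only ever trained with masks like the 8-of-16 case, a 4-of-16 or 1-of-16 mask at inference time would be out of distribution, which could explain the noise you are seeing.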
Thanks for the great work! One question I have is about the evaluation results on the Physion data. In both the paper and the code, there seem to be results and a model checkpoint only for the Collision scenario. I'm wondering whether VDT was evaluated on the 7 other Physion scenarios? If so, it would be great if both the evaluation results and the checkpoints could be shared. Thanks in advance.