
Comments (7)

ExponentialML commented on August 15, 2024

> I want to finetune the model using multiple videos, same prompt for each video. Which .yaml file should I use?

This will be a bit different in the next version release, but for now use video_folder.yaml.

You should be good to go by using a folder containing .mp4 files, each paired with a `.txt` caption file, same as image training.
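To make the pairing concrete, here is a minimal sketch of that layout as a sanity check. The file names and caption text are purely illustrative, not values from the repo.

```python
import os
import tempfile

# Hypothetical dataset layout: each clip.mp4 sits next to a clip.txt holding its prompt.
root = tempfile.mkdtemp()
for name in ("clip_000", "clip_001"):
    open(os.path.join(root, name + ".mp4"), "wb").close()    # placeholder video file
    with open(os.path.join(root, name + ".txt"), "w") as f:  # paired caption file
        f.write("a car driving down the road")

# Sanity check: every video must have a caption file with the same stem.
videos = [f for f in os.listdir(root) if f.endswith(".mp4")]
missing = [v for v in videos if not os.path.exists(os.path.join(root, v[:-4] + ".txt"))]
print(f"{len(videos)} videos, {len(missing)} missing captions")
```

A check like this is worth running before training, since a silently unpaired video can be skipped or mis-captioned depending on how the loader handles it.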

from text-to-video-finetuning.

ImBadAtNames2019 commented on August 15, 2024

> I want to finetune the model using multiple videos, same prompt for each video. Which .yaml file should I use?
>
> This will be a bit different in the next version release, but for now use video_folder.yaml.
>
> You should be good to go by using a folder containing .mp4 files, each paired with a `.txt` caption file, same as image training.

Great, thank you. Do you think 10 short videos (around 10-30 seconds each) is decent?


ImBadAtNames2019 commented on August 15, 2024

Also, the steps are set to 50k by default; this will take ages even on an A100. Isn't this overkill? Is 2,500 enough?


ExponentialML commented on August 15, 2024

@ImBadAtNames2019 You can choose an arbitrary number of steps for your specific use case (it totally depends on your training data).

Clip length should be relevant to the data you want to capture. Here's an example of how you should be looking at your training set.

If you have a 10-second clip of a car driving down a road, a cat crosses the street 5 seconds into the video, and your prompt is "a car driving down the road", you just told the model that this clip is entirely about a car 🙂.

Make sure your prompts are relevant to the clip. The training script will handle the rest.
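On the earlier question of how many steps are "enough", a rough back-of-envelope relating steps to passes over the dataset can help. All numbers below are illustrative assumptions, not values from the training script.

```python
# How many passes over the dataset do N optimizer steps give you?
# These numbers are illustrative assumptions, not defaults from the repo.
num_videos = 10   # e.g. the ten short clips mentioned above
batch_size = 1
steps = 2500

steps_per_epoch = num_videos // batch_size
epochs = steps // steps_per_epoch
print(f"{steps} steps = {epochs} passes over a {num_videos}-clip dataset")
```

Hundreds of passes over ten clips is well into overfitting territory for many setups, which is one reason to watch sample outputs rather than trusting a fixed step count.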


ImBadAtNames2019 commented on August 15, 2024

> @ImBadAtNames2019 You can choose an arbitrary number of steps for your specific use case (it totally depends on your training data).
>
> Clip length should be relevant to the data you want to capture. Here's an example of how you should be looking at your training set.
>
> If you have a 10-second clip of a car driving down a road, a cat crosses the street 5 seconds into the video, and your prompt is "a car driving down the road", you just told the model that this clip is entirely about a car 🙂.
>
> Make sure your prompts are relevant to the clip. The training script will handle the rest.

Thanks for your help and amazing work. Do you have other general advice on this?


JCBrouwer commented on August 15, 2024

I'll chime in, another way of prompting is just to use a single fixed prompt for all the videos (similar to textual inversion/dreambooth). Then you can invoke your style at inference time by adding the prompt you finetuned with. This strategy can definitely overfit though, so it's probably best to keep the learning rate low.

Some other advice from my experiments:

  • If you change the resolution or fps from their base values (256 and 8, respectively), you'll need to unfreeze more of the relevant parts of the model.
  • Higher n_sample_frames leads to better temporal consistency.
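The overrides discussed above can be sketched as a plain config dict. The key names here are hypothetical and may not match the repo's actual YAML schema; the values are examples, not recommendations.

```python
# Sketch of the kinds of overrides discussed above, as a plain dict.
# Key names are hypothetical and may not match the repo's actual YAML schema.
base = {"resolution": 256, "fps": 8, "n_sample_frames": 8, "learning_rate": 5e-6}
overrides = {
    "n_sample_frames": 16,  # more frames -> better temporal consistency
    "learning_rate": 1e-6,  # keep LR low to avoid overfitting a single fixed prompt
}
config = {**base, **overrides}  # overrides take precedence over base values
print(config)
```

Keeping the base values and overrides separate like this makes it easy to see at a glance which knobs you've actually changed from the defaults.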


ImBadAtNames2019 commented on August 15, 2024

> I'll chime in, another way of prompting is just to use a single fixed prompt for all the videos (similar to textual inversion/dreambooth). Then you can invoke your style at inference time by adding the prompt you finetuned with. This strategy can definitely overfit though, so it's probably best to keep the learning rate low.
>
> Some other advice from my experiments:
>
>   • If you change the resolution or fps from their base values (256 and 8, respectively), you'll need to unfreeze more of the relevant parts of the model.
>   • Higher n_sample_frames leads to better temporal consistency.

Amazing, thank you.


