
Comments (7)

ExponentialML commented on August 15, 2024

> I want to finetune the model using multiple videos, same prompt for each video. Which .yaml file should I use?

This will be a bit different in the next version release, but for now use video_folder.yaml.

You should be good to go by using a folder containing .mp4 files, each paired with a `.txt` caption file, same as image training.
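To make the pairing concrete, here is a minimal sketch of that layout as a sanity check. The file names and caption text are purely illustrative, not values from the repo.

```python
import os
import tempfile

# Hypothetical dataset layout: each clip.mp4 sits next to a clip.txt holding its prompt.
root = tempfile.mkdtemp()
for name in ("clip_000", "clip_001"):
    open(os.path.join(root, name + ".mp4"), "wb").close()    # placeholder video file
    with open(os.path.join(root, name + ".txt"), "w") as f:  # paired caption file
        f.write("a car driving down the road")

# Sanity check: every video must have a caption file with the same stem.
videos = [f for f in os.listdir(root) if f.endswith(".mp4")]
missing = [v for v in videos if not os.path.exists(os.path.join(root, v[:-4] + ".txt"))]
print(f"{len(videos)} videos, {len(missing)} missing captions")
```

A check like this is worth running before training, since a silently unpaired video can be skipped or mis-captioned depending on how the loader handles it.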

from text-to-video-finetuning.

ImBadAtNames2019 commented on August 15, 2024

> I want to finetune the model using multiple videos, same prompt for each video. Which .yaml file should I use?
>
> This will be a bit different in the next version release, but for now use video_folder.yaml.
>
> You should be good to go by using a folder containing .mp4 files, each paired with a `.txt` caption file, same as image training.

Great, thank you. Do you think 10 short videos (around 10-30 seconds each) is decent?


ImBadAtNames2019 commented on August 15, 2024

Also, the steps are set to 50k by default; this will take ages even on an A100. Isn't this overkill? Is 2,500 enough?


ExponentialML commented on August 15, 2024

@ImBadAtNames2019 You can choose an arbitrary number of steps for your specific use case (it totally depends on your training data).

Clip length should be relevant to the data you want to capture. Here's an example of how you should be looking at your training set.

If you have a 10-second clip of a car driving down a road, a cat crosses the street 5 seconds into the video, and your prompt is "a car driving down the road", you just told the model that this clip is entirely about a car 🙂.

Make sure your prompts are relevant to the clip. The training script will handle the rest.
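On the earlier question of how many steps are "enough", a rough back-of-envelope relating steps to passes over the dataset can help. All numbers below are illustrative assumptions, not values from the training script.

```python
# How many passes over the dataset do N optimizer steps give you?
# These numbers are illustrative assumptions, not defaults from the repo.
num_videos = 10   # e.g. the ten short clips mentioned above
batch_size = 1
steps = 2500

steps_per_epoch = num_videos // batch_size
epochs = steps // steps_per_epoch
print(f"{steps} steps = {epochs} passes over a {num_videos}-clip dataset")
```

Hundreds of passes over ten clips is well into overfitting territory for many setups, which is one reason to watch sample outputs rather than trusting a fixed step count.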


ImBadAtNames2019 commented on August 15, 2024

> @ImBadAtNames2019 You can choose an arbitrary number of steps for your specific use case (it totally depends on your training data).
>
> Clip length should be relevant to the data you want to capture. Here's an example of how you should be looking at your training set.
>
> If you have a 10-second clip of a car driving down a road, a cat crosses the street 5 seconds into the video, and your prompt is "a car driving down the road", you just told the model that this clip is entirely about a car 🙂.
>
> Make sure your prompts are relevant to the clip. The training script will handle the rest.

Thanks for your help and amazing work. Do you have other general advice on this?


JCBrouwer commented on August 15, 2024

I'll chime in, another way of prompting is just to use a single fixed prompt for all the videos (similar to textual inversion/dreambooth). Then you can invoke your style at inference time by adding the prompt you finetuned with. This strategy can definitely overfit though, so it's probably best to keep the learning rate low.

Some other advice from my experiments:

  • If you change the resolution or fps from their base values (256 and 8, respectively), you'll need to unfreeze more of the relevant parts of the model.
  • Higher n_sample_frames leads to better temporal consistency.
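The overrides discussed above can be sketched as a plain config dict. The key names here are hypothetical and may not match the repo's actual YAML schema; the values are examples, not recommendations.

```python
# Sketch of the kinds of overrides discussed above, as a plain dict.
# Key names are hypothetical and may not match the repo's actual YAML schema.
base = {"resolution": 256, "fps": 8, "n_sample_frames": 8, "learning_rate": 5e-6}
overrides = {
    "n_sample_frames": 16,  # more frames -> better temporal consistency
    "learning_rate": 1e-6,  # keep LR low to avoid overfitting a single fixed prompt
}
config = {**base, **overrides}  # overrides take precedence over base values
print(config)
```

Keeping the base values and overrides separate like this makes it easy to see at a glance which knobs you've actually changed from the defaults.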


ImBadAtNames2019 commented on August 15, 2024

> I'll chime in, another way of prompting is just to use a single fixed prompt for all the videos (similar to textual inversion/dreambooth). Then you can invoke your style at inference time by adding the prompt you finetuned with. This strategy can definitely overfit though, so it's probably best to keep the learning rate low.
>
> Some other advice from my experiments:
>
>   • If you change the resolution or fps from their base values (256 and 8, respectively), you'll need to unfreeze more of the relevant parts of the model.
>   • Higher n_sample_frames leads to better temporal consistency.

Amazing, thank you.


