exponentialml / text-to-video-finetuning
Finetune ModelScope's Text To Video model using Diffusers 🧨
License: MIT License
I see the following after the Requirements and Installation step:
Encountered 2 file(s) that may not have been copied correctly on Windows:
unet/diffusion_pytorch_model.safetensors
unet/diffusion_pytorch_model.bin
Could this break something during finetuning?
In which folder should I put the result of git clone https://huggingface.co/damo-vilab/text-to-video-ms-1.7b? The models folder?
Is there a rough count on how many images are needed to train a concept if not using a video? I know for LoRA it can be as few as 9-10, but for DreamBooth it's usually 2-3x that amount.
So, I put all the .mp4 videos in a folder, and each video needs to be paired with a .txt file in the same folder, named like the video and containing the prompt. Is this right?
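(That is the usual folder-caption convention. For intuition, here is a minimal sketch of how such a loader pairs the files; the function name and fallback behavior are assumptions, not necessarily this repo's exact code.)

```python
from glob import glob
from pathlib import Path

def collect_video_prompt_pairs(folder: str, fallback_prompt: str = ""):
    """Pair each .mp4 in `folder` with a same-named .txt prompt file."""
    pairs = []
    for video in sorted(glob(f"{folder}/*.mp4")):
        caption = Path(video).with_suffix(".txt")  # my_clip.mp4 -> my_clip.txt
        prompt = caption.read_text().strip() if caption.exists() else fallback_prompt
        pairs.append((video, prompt))
    return pairs
```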
I want to finetune the model using multiple videos, with the same prompt for each video. Which .yaml file should I use?
> will it be difficult to modify the code to support multi-gpu training?
I've never tried multiple GPU training, but you may be able to do it naively with accelerate.
accelerate config
You should be prompted to configure your setup, including multiple GPU training.
Then it should be as simple as:
accelerate launch train.py --config ./configs/my_config_hq.yaml
Let us know how it goes if you decide to try! If it doesn't work, I could try to implement it, but I don't have multiple GPUs and would probably need to rent a server to do so.
Originally posted by @ExponentialML in #14 (comment)
Is it possible?
Is there information that I should refer to anywhere?
More of a question really, but do you know why num_attention_heads and attention_head_dim are swapped when initialising Transformer2D blocks?
They are passed in the opposite order in unet_2d_blocks.py:
https://github.com/huggingface/diffusers/blob/5439e917cacc885c0ac39dda1b8af12258e6e16d/src/diffusers/models/unet_2d_blocks.py#L872
Hi, I am trying to run your script but it always shows me this error.
Another thing is that it's not possible for me to install triton; it's as if the package doesn't exist anymore.
Error caught was: No module named 'triton'
╭───────────────────────── Traceback (most recent call last) ─────────────────────────╮
│ C:\Users\Life\Text-To-Video-Finetuning\train.py:43 in <module>
│
│   40 # Will error if the minimal version of diffusers is not installed. Remove at your own ri
│   41 check_min_version("0.10.0.dev0")
│   42
│ ❱ 43 logger = get_logger(__name__, log_level="INFO")
│   44
│   45 def create_logging(logging, logger, accelerator):
│   46     logging.basicConfig(
╰──────────────────────────────────────────────────────────────────────────────────────╯
TypeError: get_logger() got an unexpected keyword argument 'log_level'
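(For context: the log_level keyword appears to have been added to accelerate's get_logger in a later release, so an older accelerate raises this TypeError. Upgrading accelerate is the straightforward fix; a version-tolerant sketch, assuming only that get_logger exists in accelerate.logging:)

```python
import logging
from accelerate.logging import get_logger

try:
    # Newer accelerate releases accept a log_level keyword.
    logger = get_logger(__name__, log_level="INFO")
except TypeError:
    # Older releases only take the name; set the level on the wrapped logger instead.
    logger = get_logger(__name__)
    logger.logger.setLevel(logging.INFO)
```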
Is there a specific version of accelerate that will work? I recently had to reinstall my requirements, and what worked before doesn't anymore. I think accelerate changed something on their end that caused this error. I am using a fresh install at the moment, and everything works up until saving the first checkpoint:
Configuration saved in ./outputs\train_2023-07-02T07-13-31\checkpoint-100\model_index.json
Traceback (most recent call last):
File "F:\AI\Text-to-Video-Finetuning\train.py", line 994, in <module>
main(**OmegaConf.load(args.config))
File "F:\AI\Text-to-Video-Finetuning\train.py", line 899, in main
save_pipe(
File "F:\AI\Text-to-Video-Finetuning\train.py", line 506, in save_pipe
unet, text_encoder = accelerator.prepare(unet, text_encoder)
File "C:\Users\niver\anaconda3\envs\text2video-finetune\lib\site-packages\accelerate\accelerator.py", line 1182, in prepare
result = tuple(
File "C:\Users\niver\anaconda3\envs\text2video-finetune\lib\site-packages\accelerate\accelerator.py", line 1183, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "C:\Users\niver\anaconda3\envs\text2video-finetune\lib\site-packages\accelerate\accelerator.py", line 1022, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "C:\Users\niver\anaconda3\envs\text2video-finetune\lib\site-packages\accelerate\accelerator.py", line 1308, in prepare_model
model.forward = MethodType(torch.cuda.amp.autocast(dtype=torch.float16)(model.forward.__func__), model)
AttributeError: 'function' object has no attribute '__func__'
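(The crash appears to come from save_pipe calling accelerator.prepare on models that were already prepared once; newer accelerate wraps model.forward in a way that cannot be wrapped twice. A sketch of the usual save-time pattern instead; the helper name is mine:)

```python
from accelerate import Accelerator

def unwrap_for_save(accelerator: Accelerator, *models):
    # Prepared models come back wrapped (e.g. for mixed precision); unwrap
    # them before building the pipeline to save, rather than calling
    # accelerator.prepare() a second time.
    return tuple(accelerator.unwrap_model(m) for m in models)
```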
I don't think there is anything like this out there; this has a ton of potential.
What is the effect of the offset noise in training?
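(For background: offset noise adds a small per-channel constant shift to the Gaussian noise used during training, which lets the model learn outputs whose overall brightness sits further from the mean. A minimal sketch; the 0.1 strength is a common choice in write-ups, not necessarily this repo's value:)

```python
import torch

def sample_offset_noise(latents: torch.Tensor, offset_strength: float = 0.1) -> torch.Tensor:
    noise = torch.randn_like(latents)
    # Add a constant offset per (batch, channel), broadcast over the remaining
    # frame/spatial dims, which biases each channel's mean brightness.
    offset_shape = latents.shape[:2] + (1,) * (latents.dim() - 2)
    return noise + offset_strength * torch.randn(offset_shape, device=latents.device, dtype=latents.dtype)
```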
Hi, I hit the following error when I run finetune training.
(aigc) cwh8szh@SZH-C-006RW:/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning$ CUDA_VISIBLE_DEVICES=1 python train.py --config ./configs/v2/train_config_caixukun.yaml
Traceback (most recent call last):
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 1035, in
main(**OmegaConf.load(args.config))
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 617, in main
create_logging(logging, logger, accelerator)
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 64, in create_logging
logger.info(accelerator.state, main_process_only=False)
AttributeError: 'str' object has no attribute 'info'
I have tried different versions of accelerate, but cannot solve this error.
Here is the relevant part of my pip list:
Package Version
accelerate 0.20.3
tensorboard 2.10.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tokenizers 0.13.3
torch 2.0.1
torchaudio 2.0.2
torchvision 0.15.2
transformers 4.30.2
triton 2.0.0
Link:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cached_latent = torch.load(self.cached_data_list[index], map_location=device)
Otherwise, in multi-GPU distributed training, the first GPU may occupy excessive VRAM compared to the other GPUs.
Is there any possible way to have the same NVIDIA approach of using the SD models / DreamBooth models as a base for a txt2vid model?
https://research.nvidia.com/labs/toronto-ai/VideoLDM/
I saw this unofficial implementation, but I'm not sure how it fits in:
https://github.com/srpkdyy/VideoLDM
Is there no way to use the ModelScope model or ZeroScope model and merge them together, or do some training or fine-tuning on top of a DreamBooth model?
Some weights of the model checkpoint were not used when initializing UNet3DConditionModel:
- This IS expected if you are initializing CLIPTextModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CLIPTextModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Has anyone else had similar issues? I believe it has to do with the LoRA training, because I only notice such behavior on models created while also training the new webui LoRA. The most recent model did not use the LoRAs and had no such issues.
I fine-tuned the model via single-video fine-tuning, but the watermark is still there. I'd like to know the fine-tuning details that can remove the watermark.
Many thanks.
Code question: looking at the UNet3D network, will attn3 do nothing?
I am a lazy person.
Has anyone managed to run the finetune on Colab?
I get this using both my finetuned model and the original 1.7b model:
╭───────────────────────── Traceback (most recent call last) ─────────────────────────╮
│ /content/drive/MyDrive/Text-To-Video-Finetuning/inference.py:192 in <module>
│
│   189         init = interpolate(init, size=(args["num_frames"], args["heigh
│   190         args["init_video"] = init
│   191
│ ❱ 192     videos = inference(**args)
│   193
│   194     os.makedirs(output_dir, exist_ok=True)
│   195     out_stem = f"{output_dir}/"
│
│ /usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py:115 in decorate_context
│
│   112     @functools.wraps(func)
│   113     def decorate_context(*args, **kwargs):
│   114         with ctx_factory():
│ ❱ 115             return func(*args, **kwargs)
│   116
│   117     return decorate_context
│   118
│
│ /content/drive/MyDrive/Text-To-Video-Finetuning/inference.py:120 in inference
│
│   117     lora_rank=64
│   118 ):
│   119     with torch.autocast(device, dtype=torch.half):
│ ❱ 120         pipeline = initialize_pipeline(model, device, xformers, sdp)
│   121         inject_inferable_lora(pipeline, lora_path, r=lora_rank)
│   122         prompt = [prompt] * batch_size
│   123         negative_prompt = ([negative_prompt] * batch_size) if negative
│
│ /content/drive/MyDrive/Text-To-Video-Finetuning/inference.py:33 in initialize_pipeline
│
│   30         unet=unet.to(device=device, dtype=torch.half),
│   31     )
│   32     pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipel
│ ❱ 33     unet._set_gradient_checkpointing(value=False)
│   34     handle_memory_attention(xformers, sdp, unet)
│   35     vae.enable_slicing()
│   36     return pipeline
╰──────────────────────────────────────────────────────────────────────────────────────╯
TypeError: UNet3DConditionModel._set_gradient_checkpointing() missing 1 required positional argument: 'module'
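(This usually means the model's _set_gradient_checkpointing and its caller disagree about the module argument that diffusers' ModelMixin.apply(...) passes in; compare the related report further down, where the same method instead receives multiple values for 'value'. A hedged sketch of the signature diffusers expects, with a generic body:)

```python
def _set_gradient_checkpointing(self, module, value=False):
    # diffusers toggles checkpointing via self.apply(partial(
    # self._set_gradient_checkpointing, value=...)), so every submodule
    # arrives here as `module`; flip the flag only where it is supported.
    if hasattr(module, "gradient_checkpointing"):
        module.gradient_checkpointing = value
```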
Do you have any knowledge of VideoLDM, and is it possible to integrate its algorithms to further enhance the capabilities of current models, such as generating longer videos?
There was a new version of ModelScope released recently; it was trained for a month longer and can generate better videos. Is this repo using the new model or the old one?
After finetuning, the output video doesn't move; it just stays still. It looks good, but there is no movement.
How about sharing text2video fine-tuned weights here?
The two working weights I have found so far are these:
damo-vilab/text-to-video-ms-1.7b
strangeman3107/animov-0.1.1
Could using a lot of data, or unfreezing more layers, make it work?
Hello, after training the model, how do I test it by giving text as input? Please help me with this issue.
The validation output during training seems good, though. Are there any bugs in the inference code, or is it due to a different diffusers version?
During the training of version 2, the step loss easily becomes NaN, even if the learning rate is lowered. Have you encountered this issue before?
After running train.py I get this RuntimeError:
C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\accelerator.py:249: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
warnings.warn(
03/24/2023 08:52:48 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cpu
Mixed precision type: fp16
{'variance_type'} was not found in config. Values will be initialized to default values.
C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\safetensors\torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
{'downsample_padding', 'mid_block_scale_factor'} was not found in config. Values will be initialized to default values.
03/24/2023 08:52:50 - INFO - __main__ - ***** Running training *****
03/24/2023 08:52:50 - INFO - __main__ - Num examples = 1
03/24/2023 08:52:50 - INFO - __main__ - Num Epochs = 1200
03/24/2023 08:52:50 - INFO - __main__ - Instantaneous batch size per device = 1
03/24/2023 08:52:50 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
03/24/2023 08:52:50 - INFO - __main__ - Gradient Accumulation steps = 1
03/24/2023 08:52:50 - INFO - __main__ - Total optimization steps = 1200
Steps: 0%| | 0/1200 [00:00<?, ?it/s]Traceback (most recent call last):
File "D:\Art\Text-To-Video-Finetuning\train.py", line 498, in
main(**OmegaConf.load(args.config))
File "D:\Art\Text-To-Video-Finetuning\train.py", line 394, in main
loss, latents = finetune_unet(batch, train_encoder=train_text_encoder)
File "D:\Art\Text-To-Video-Finetuning\train.py", line 339, in finetune_unet
latents = tensor_to_vae_latent(pixel_values, vae)
File "D:\Art\Text-To-Video-Finetuning\train.py", line 157, in tensor_to_vae_latent
latents = vae.encode(t).latent_dist.sample()
File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\diffusers\utils\accelerate_utils.py", line 46, in wrapper
return method(self, *args, **kwargs)
File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\diffusers\models\autoencoder_kl.py", line 164, in encode
h = self.encoder(x)
File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\diffusers\models\vae.py", line 109, in forward
sample = self.conv_in(sample)
File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
Steps: 0%| | 0/1200 [00:00<?, ?it/s]
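(Note the header above: Device: cpu together with fp16 mixed precision. Half-precision conv2d has no CPU kernel, which is exactly what the RuntimeError says, so training has fallen back to the CPU. Re-running accelerate config to select the GPU, or disabling fp16, are the usual fixes; a quick sanity check:)

```python
import torch

# "slow_conv2d_cpu" not implemented for 'Half' is the symptom of running a
# half-precision model on the CPU; fp16 training needs a CUDA device.
if not torch.cuda.is_available():
    raise RuntimeError("No CUDA device visible: install a CUDA build of PyTorch or disable fp16")
```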
I have 16 GB available to begin with, and under 12 GB is used during training. Then I hit OOM when saving the weights. This feels a bit insane...
Is there a possibility of having custom resolutions for training/inference?
It seems the models fine-tuned for Diffusers refer to the latest beta version rather than the latest official release of Diffusers, making them error out when loaded with the official version. Could they be changed to refer to official releases instead?
After several unsuccessful attempts at fine-tuning where the output was a still frame of noise or a green field, I followed instructions and skipped to the inference to test the base model. It reacted the same way.
Am I not pointing to the model directory correctly?
!cd /content/Text-To-Video-Finetuning && python inference.py --model /content/Text-To-Video-Finetuning/models/model_scope_diffusers --prompt "cat in a space suit"
I cloned this model with git clone https://huggingface.co/camenduru/potat1 and ran the command:
python inference.py -m "F:\potat text to video\potat1" -p "fast moving fancy sports car" -W 1024 -H 576 -o "F:\potat text to video" -d cuda -x -s 50 -g 23 -f 24 -T 48
(venv) F:\potat text to video>python check.py
CUDA is available on your system.
CUDA device count: 2
CUDA device name: NVIDIA GeForce RTX 3090 Ti
(venv) F:\potat text to video>cd Text-To-Video-Finetuning
(venv) F:\potat text to video\Text-To-Video-Finetuning>python inference.py -m "F:\potat text to video\potat1" -p "fast moving fancy sports car" -W 1024 -H 576 -o "F:\potat text to video" -d cuda -x -s 50 -g 23 -f 24 -T 48
Traceback (most recent call last):
File "F:\potat text to video\Text-To-Video-Finetuning\inference.py", line 194, in <module>
videos = inference(**args)
File "F:\potat text to video\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "F:\potat text to video\Text-To-Video-Finetuning\inference.py", line 122, in inference
pipeline = initialize_pipeline(model, device, xformers, sdp)
File "F:\potat text to video\Text-To-Video-Finetuning\inference.py", line 24, in initialize_pipeline
unet.disable_gradient_checkpointing()
File "F:\potat text to video\venv\lib\site-packages\diffusers\models\modeling_utils.py", line 216, in disable_gradient_checkpointing
self.apply(partial(self._set_gradient_checkpointing, value=False))
File "F:\potat text to video\venv\lib\site-packages\torch\nn\modules\module.py", line 884, in apply
module.apply(fn)
File "F:\potat text to video\venv\lib\site-packages\torch\nn\modules\module.py", line 885, in apply
fn(self)
TypeError: UNet3DConditionModel._set_gradient_checkpointing() got multiple values for argument 'value'
(venv) F:\potat text to video\Text-To-Video-Finetuning>
I want to train my own video model; please give me some help.
How long should I cut each video? How many frames per video? How many videos are needed?
After I have finished training, how do I call the model in the webui?
How do I enable multi-GPU training? No matter how many GPUs I use, only one process starts.
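(For what it's worth: launching with plain python train.py always starts a single process. Multi-process training needs the accelerate launcher, along the lines of
accelerate config
accelerate launch --multi_gpu --num_processes 2 train.py --config ./configs/my_config.yaml
where --multi_gpu and --num_processes are standard accelerate CLI flags; whether the rest of this repo's training loop is distributed-safe is a separate question.)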
The model_index.json generated on my finetuned model has a null unet entry:
{
"_class_name": "TextToVideoSDPipeline",
"_diffusers_version": "0.15.0.dev0",
"scheduler": [
"diffusers",
"DDIMScheduler"
],
"text_encoder": [
"transformers",
"CLIPTextModel"
],
"tokenizer": [
"transformers",
"CLIPTokenizer"
],
"unet": [
null,
"UNet3DConditionModel"
],
"vae": [
"diffusers",
"AutoencoderKL"
]
}
For comparison, text-to-video-ms-1.7b is correct:
{
"_class_name": "TextToVideoSDPipeline",
"_diffusers_version": "0.15.0.dev0",
"scheduler": [
"diffusers",
"DDIMScheduler"
],
"text_encoder": [
"transformers",
"CLIPTextModel"
],
"tokenizer": [
"transformers",
"CLIPTokenizer"
],
"unet": [
"diffusers",
"UNet3DConditionModel"
],
"vae": [
"diffusers",
"AutoencoderKL"
]
}
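(If the saved checkpoint differs only by that null, one hedged workaround is to restore the library name in model_index.json so from_pretrained can resolve the class; the path below is hypothetical, and if the unet was saved from a custom class this may still not load:)

```python
import json

path = "./outputs/my_finetune/model_index.json"  # hypothetical checkpoint path
with open(path) as f:
    index = json.load(f)
# Restore the library entry that came out as null in the fine-tuned save.
index["unet"] = ["diffusers", "UNet3DConditionModel"]
with open(path, "w") as f:
    json.dump(index, f, indent=2)
```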
When trying to run inference using the --lora_path parameter, I get:
LoRA rank 64 is too large. setting to: 4
list index out of range
Couldn't inject LoRA's due to an error.
0%| | 0/50 [00:00<?, ?it/s]
0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/content/drive/MyDrive/Text-To-Video-Finetuning/inference.py", line 194, in <module>
videos = inference(**args)
File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/content/drive/MyDrive/Text-To-Video-Finetuning/inference.py", line 141, in inference
videos = pipeline(
File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py", line 646, in __call__
noise_pred = self.unet(
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/Text-To-Video-Finetuning/models/unet_3d_condition.py", line 399, in forward
emb = self.time_embedding(t_emb, timestep_cond)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/diffusers/models/embeddings.py", line 192, in forward
sample = self.linear_1(sample)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/Text-To-Video-Finetuning/utils/lora.py", line 60, in forward
+ self.dropout(self.lora_up(self.selector(self.lora_down(input))))
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (6x320 and 1280x16)
I'm running it on a Colab
How to use the vid2vid function? Do I only need to provide an initial video?
After the fine-tuning of version 2 is completed, how do I perform model inference? For version 1 it is as follows:
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

# Load the fine-tuned pipeline in fp16 and swap in the multistep scheduler.
my_trained_model_path = "./trained_model_path/"
pipe = DiffusionPipeline.from_pretrained(my_trained_model_path, torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

# Generate the frames and export them as an .mp4.
prompt = "Your prompt based on train data"
video_frames = pipe(prompt, num_inference_steps=25).frames
out_file = "./my_video.mp4"
video_path = export_to_video(video_frames, out_file)
Should I train it or not? I tested it and I didn't see any difference; is it better to just keep it on all the time?
After I comment out the get_logger code, I can run with the following output, but then hit another error.
────────────────────────────────────────────
(aigc) cwh8szh@SZH-C-006RW:/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning$ CUDA_VISIBLE_DEVICES=1 python train.py --config ./configs/v2/train_config_caixukun.yaml
{'variance_type', 'timestep_spacing'} was not found in config. Values will be initialized to default values.
{'downsample_padding', 'mid_block_scale_factor'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
LoRA rank 16 is too large. setting to: 4
LoRA rank 16 is too large. setting to: 4
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Non-existant JSON path. Skipping.
Caching Latents.: 100%|██████████| 38/38 [00:14<00:00, 2.68it/s]
Steps: 0%| | 0/10000 [00:00<?, ?it/s]2628 params have been unfrozen for training.
Traceback (most recent call last):
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 1035, in
main(**OmegaConf.load(args.config))
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 908, in main
loss, latents = finetune_unet(batch, train_encoder=train_text_encoder)
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 810, in finetune_unet
use_offset_noise = use_offset_noise and not rescale_schedule
UnboundLocalError: local variable 'use_offset_noise' referenced before assignment
Steps: 0%| | 0/10000 [00:55<?, ?it/s]
────────────────────────────────────────────
Are there more guidelines for finetune training? Many thanks.
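(The UnboundLocalError suggests finetune_unet assigns use_offset_noise after reading it, while the value actually lives in the enclosing main scope; in Python, any assignment makes the name local to the whole function. A self-contained illustration of the pitfall and one way out; the structure is simplified from the real train.py:)

```python
def main(use_offset_noise=False, rescale_schedule=False):
    def finetune_unet(batch):
        # Reading `use_offset_noise` and then assigning it makes the name
        # local to finetune_unet, so the read raises UnboundLocalError.
        # Declaring it nonlocal (or passing it as an argument) fixes that.
        nonlocal use_offset_noise
        use_offset_noise = use_offset_noise and not rescale_schedule
        return use_offset_noise

    return finetune_unet(batch=None)

print(main(use_offset_noise=True, rescale_schedule=False))  # prints: True
```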
Thank you for making this. It seems to work, and I have a model.
I wanted to ask if there is:
Thank you!
After I run the script with train_config.yaml, I get this error below:
2023-04-09 13:40:38.702636: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py:249: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
warnings.warn(
04/09/2023 13:40:40 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
{'variance_type'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
/usr/local/lib/python3.9/dist-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
/usr/local/lib/python3.9/dist-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
/usr/local/lib/python3.9/dist-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
{'mid_block_scale_factor', 'downsample_padding'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Non-existant JSON path. Skipping.
╭───────────────────────── Traceback (most recent call last) ─────────────────────────╮
│ /content/Text-To-Video-Finetuning/train.py:915 in <module>
│
│   912     parser.add_argument("--config", type=str, default="./configs/my_co
│   913     args = parser.parse_args()
│   914
│ ❱ 915     main(**OmegaConf.load(args.config))
│   916
│
│ /content/Text-To-Video-Finetuning/train.py:582 in main
│
│   579     )
│   580
│   581     # Get the training dataset based on types (json, single_video, ima
│ ❱ 582     train_datasets = get_train_dataset(dataset_types, train_data, toke
│   583
│   584     # Extend datasets that are less than the greatest one. This allows
│   585     attrs = ['train_data', 'frames', 'image_dir', 'video_files']
│
│ /content/Text-To-Video-Finetuning/train.py:86 in get_train_dataset
│
│   83     for DataSet in [VideoJsonDataset, SingleVideoDataset, ImageDataset
│   84         for dataset in dataset_types:
│   85             if dataset == DataSet.__getname__():
│ ❱ 86                 train_datasets.append(DataSet(**train_data, tokenizer=
│   87
│   88     if len(train_datasets) > 0:
│   89         return train_datasets
│
│ /content/Text-To-Video-Finetuning/utils/dataset.py:487 in __init__
│
│   484
│   485         self.fallback_prompt = fallback_prompt
│   486
│ ❱ 487         self.video_files = glob(f"{path}/*.mp4")
│   488
│   489         self.width = width
│   490         self.height = height
╰──────────────────────────────────────────────────────────────────────────────────────╯
NameError: name 'glob' is not defined
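(That NameError points at a missing import in utils/dataset.py: the dataset calls glob(f"{path}/*.mp4"), but the name was apparently never imported. The likely one-line fix, assuming nothing else shadows it:)

```python
# At the top of utils/dataset.py:
from glob import glob
```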
Also, what layers should I unfreeze to get the best possible quality? Even if it consumes a ton of VRAM.
So, for example, if I have set it to 4, will it finetune using only 4 frames of the video, no matter the length of the video?
will it be difficult to modify the code to support multi-gpu training?