Hi, thank you for your excellent work.
I tried to run referring-expression inference with LGVI using the checkpoint from https://huggingface.co/jianzongwu/lgvi.
I cloned that repo into `./checkpoints` and ran the following command as described in the README:
```
python -m inference_referring \
    --video_path videos/two-birds \
    --ckpt_path checkpoints/lgvi \
    --expr "remove the bird on left"
```
But I encountered an error like this:
```
The config attributes {'st_attn': False} were passed to RoviModel, but are not expected and will be ignored. Please verify your config.json configuration file.
Traceback (most recent call last):
  File "/home/hboh/anaconda3/envs/rovi/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hboh/anaconda3/envs/rovi/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hboh/neurips24/Language_Driven_Video_Inpainting/inference_referring.py", line 86, in <module>
    main(args)
  File "/home/hboh/anaconda3/envs/rovi/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/hboh/neurips24/Language_Driven_Video_Inpainting/inference_referring.py", line 45, in main
    unet = RoviModel.from_pretrained(args.ckpt_path, subfolder='unet')
  File "/home/hboh/anaconda3/envs/rovi/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/hboh/anaconda3/envs/rovi/lib/python3.9/site-packages/diffusers/models/modeling_utils.py", line 660, in from_pretrained
    raise ValueError(
ValueError: Cannot load <class 'rovi.models.unet.RoviModel'> from /home/hboh/neurips24/Language_Driven_Video_Inpainting/checkpoints/lgvi because the following keys are missing:
 condition_conv_in.bias, condition_conv_in.weight.
 Please make sure to pass low_cpu_mem_usage=False and device_map=None if you want to randomly initialize those weights or else make sure your checkpoint file is correct.
```
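For reference, here is a sketch of the fallback the error message itself suggests: passing `low_cpu_mem_usage=False` and `device_map=None` should let diffusers randomly initialize the missing weights so the load succeeds. The helper function below is my own, and with random `condition_conv_in` weights the output would of course not be meaningful, so this can only serve as a diagnostic:

```python
# Sketch of the fallback named in the error message (helper name is my own):
# diffusers randomly initializes missing keys when low_cpu_mem_usage is False
# and device_map is None, so the load itself should go through.

def fallback_load_kwargs(ckpt_path="checkpoints/lgvi"):
    """Keyword arguments for RoviModel.from_pretrained, per the error message."""
    return dict(
        pretrained_model_name_or_path=ckpt_path,
        subfolder="unet",
        low_cpu_mem_usage=False,  # allow missing keys to be randomly initialized
        device_map=None,
    )

# Usage (requires the repo on PYTHONPATH; outputs would use random weights):
# from rovi.models.unet import RoviModel
# unet = RoviModel.from_pretrained(**fallback_load_kwargs())
```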
I think something may be wrong with the lgvi checkpoint itself.
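To check this, I compared the keys named in the error against what the downloaded checkpoint actually contains. The snippet below is a hypothetical diagnostic (the function name and the safetensors file path are my assumptions, not from the repo):

```python
# Hypothetical diagnostic: check whether the keys reported missing are
# actually absent from the downloaded checkpoint's parameter names.

MISSING_IN_ERROR = ("condition_conv_in.bias", "condition_conv_in.weight")

def absent_keys(ckpt_keys, expected=MISSING_IN_ERROR):
    """Return the expected parameter names that the checkpoint does not contain."""
    present = set(ckpt_keys)
    return sorted(k for k in expected if k not in present)

# Usage, assuming the UNet weights are stored as safetensors under
# checkpoints/lgvi/unet (the exact file name may differ):
# from safetensors import safe_open
# with safe_open("checkpoints/lgvi/unet/diffusion_pytorch_model.safetensors",
#                framework="pt") as f:
#     print(absent_keys(f.keys()))
```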
When I run the same referring-expression inference with the lgvi-i checkpoint from https://huggingface.co/jianzongwu/lgvi-i, it works fine on the provided videos (two-birds and city-bird).
Thank you in advance for your reply! :)
P.S. I would also appreciate your insights on how well the model generalizes to other in-the-wild videos. Your honest feedback would be very helpful.