
All-in-one

Code for the paper "All in One: Exploring Unified Video-Language Pre-training" (arXiv).


News

  • 2022.06.07 Released the AllInOne+ model pre-trained on eight datasets (YTT+WebVid+HowTo+CC3+CC12+COCO+VG+SBU).
  • 2022.05.07 AllInOne+ released. The main difference from AllInOne is the image and video co-training.
  • 2022.03.25 Updated the README.
  • 2022.03.14 The first version of AllInOne released.

Install

1. PyTorch Lightning

In this work, we use PyTorch Lightning for distributed training with mixed precision. Install PyTorch and PyTorch Lightning first.

conda create -n allinone python=3.7
source activate allinone
cd [Path_To_This_Code]
pip install -r requirements.txt

If all required packages, including ffmpeg, are already installed, you may skip step 2.

2. On-the-fly decoding (optional)

To speed up pre-training, we adopt on-the-fly decoding for fast I/O. Install ffmpeg as below.

1. ffmpeg

conda install -y ffmpeg

Install any additionally required packages that are not covered by requirements.txt.

If your server cannot reach the network, or installs ffmpeg slowly, download a static binary from FFmpeg Static Builds and add it to your PATH variable, as follows:

export PATH=[PATH_TO_Dir/]ffmpeg-git-20220108-amd64-static:$PATH

2. PyTorchVideo

Install pytorchvideo (for data augmentation) as below:

pip install ffmpeg-python
pip install pytorchvideo
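
For reference, here is a minimal sketch of what on-the-fly decoding with ffmpeg-python can look like (illustrative only; the actual loaders live in AllInOne/datasets and may differ):

import ffmpeg
import numpy as np

def decode_video(path, width=224, height=224):
    # Decode a video straight into a (T, H, W, 3) uint8 array, no temp files.
    out, _ = (
        ffmpeg
        .input(path)
        .filter("scale", width, height)
        .output("pipe:", format="rawvideo", pix_fmt="rgb24")
        .run(capture_stdout=True, quiet=True)
    )
    return np.frombuffer(out, np.uint8).reshape(-1, height, width, 3)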

Download Pretrained Weights

We provide the following pretrained weights on Google Drive.

| Model | PT Data | Parameters | Pretrained Weight | Training Log | Hparams |
| --- | --- | --- | --- | --- | --- |
| All-in-one-Ti | WebVid+HowTo | 12M | Google Drive | Google Drive | Google Drive |
| All-in-one-S | WebVid+HowTo | 33M | Google Drive | Google Drive | Google Drive |
| All-in-one-B | WebVid+HowTo | 110M | Google Drive | Google Drive | Google Drive |
| All-in-one-B+ | WebVid+HowTo+CC3 | 110M | Google Drive | Google Drive | Google Drive |
| All-in-one-B+ | WebVid+YTT+HowTo+CC3+CC12+COCO+VG+SBU | 110M | Google Drive | Google Drive | Google Drive |

After downloading these pretrained weights, move them into the pretrained directory:

mkdir pretrained
cp *.ckpt pretrained/
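
To sanity-check a downloaded checkpoint before training, you can inspect it directly (a minimal sketch; the exact state_dict layout depends on the Lightning module that saved it):

import torch

ckpt = torch.load("pretrained/all-in-one-base.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # Lightning checkpoints nest weights under 'state_dict'
print(len(state_dict), "tensors")
for name in list(state_dict)[:5]:          # peek at the first few parameter names/shapes
    print(name, tuple(state_dict[name].shape))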

Comparison with the state of the art

| Model | Param | Data | Frames | TGIF-Action | TGIF-Frame | MSR R@5 | MSR R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ClipBERT | 137M | I: COCO+VG | 8 x 2 | 82.9 | 59.4 | 49.2 | 63.5 |
| VIOLET | 198M | V: WebVid; I: CC3 | 16 | 87.1 | - | 63.0 | 73.4 |
| All-in-one-S | 33M | V: WebVid+HowTo | 3 | 91.2 | 64.0 | 61.5 | 70.9 |
| All-in-one-B | 110M | V: WebVid+HowTo | 3 | 92.9 | 64.2 | 67.0 | 77.1 |
| All-in-one-B+ | 110M | V: WebVid; I: CC3 | 3 | 95.4 | 67.2 | 68.1 | 77.3 |
| All-in-one-B+ | 110M | V: WebVid+YTT+HowTo; I: CC3+CC12+COCO+VG+SBU | 3 | 96.3 | 68.5 | 70.3 | 79.2 |

I is short for Image and V is short for Video in this table.

Dataset Preparation

See DATA.md

Pre-training

Full Video Pre-training

See TRAIN.md

Co-training with Image Dataset (All-in-one+)

See COTRAIN.md

Evaluation on Downstream Tasks

See EVAL.md

Thanks to the unified design and sparse sampling, All-in-one requires far fewer FLOPs than comparable models.
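
As a rough illustration of sparse sampling (the 3-frame setting in the table above), frame indices are drawn uniformly across a clip instead of densely; this is a sketch of the idea, not the repo's exact sampler:

import numpy as np

def sample_frame_indices(vlen, num_frames=3, mode="test"):
    # Split [0, vlen) into num_frames equal intervals.
    intervals = np.linspace(0, vlen, num_frames + 1).astype(int)
    if mode == "train":
        # A random index inside each interval adds temporal jitter.
        return [int(np.random.randint(lo, max(lo + 1, hi)))
                for lo, hi in zip(intervals[:-1], intervals[1:])]
    # Deterministic midpoints for evaluation.
    return [int((lo + hi) // 2) for lo, hi in zip(intervals[:-1], intervals[1:])]

print(sample_frame_indices(300))  # -> [50, 150, 250]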

Citation

If you find our work helpful, please cite our paper.

@inproceedings{wang2022allinone,
  title={All in One: Exploring Unified Video-Language Pre-training},
  author={Wang, Alex Jinpeng and Ge, Yixiao and Yan, Rui and Ge, Yuying and Lin, Xudong and Cai, Guanyu and Wu, Jianping and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}

Contact

Email: awinyimgprocess at gmail dot com

If you have any problems, or have difficulty reproducing the results reported for this code, you can email me or open an issue. We are also willing to merge your code if you transfer All-in-one to different tasks or datasets.

Acknowledgement

This work is mainly based on ViLT, Frozen and Merlot.

License

MIT


Issues

Details of fine-tuning on MSRVTT-QA

Hi,

I am wondering about some of your experimental settings of MSRVTT-QA. Could you please clarify it?

  1. What is the image resolution, 224x224?

  2. How do you deal with open-ended VQA like MSRVTT-QA? The paper only mentions that you converted it to a classification task. Did you choose the top-k answers? What is k, then?

Thanks!
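
For context, the common recipe for casting open-ended VideoQA as classification (a sketch of the general approach, not necessarily the exact setting used in this repo) is to classify over the k most frequent training answers:

from collections import Counter

def build_answer_vocab(train_answers, k=1000):
    # Map the k most frequent training answers to class labels;
    # out-of-vocabulary answers are typically dropped or scored as wrong.
    most_common = Counter(train_answers).most_common(k)
    return {ans: i for i, (ans, _) in enumerate(most_common)}

print(build_answer_vocab(["cat", "dog", "cat", "run"], k=2))  # {'cat': 0, 'dog': 1}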

Video Retrieval MSRVTT train/test split.

Hello!

Could you please tell me which train/test split you used when reporting the results in the paper?
I see hardcoded use of the jsfusion split in AllInOne/datasets/msrvtt.py. So did you use only the jsfusion test set for both train-7k and train-9k?

Also note that when you report here, you should use the 'full' split.

LSMDC FiB

Hi, is there any dataset for the LSMDC fill-in-the-blank task?

add model to Huggingface

Hi, would you be interested in adding all-in-one to the Hugging Face Hub? The Hub offers free hosting, and it would make your work more accessible and visible to the rest of the ML community. We can set up an organization or a user account under which all-in-one can be added, similar to GitHub.

Example from other organizations:
Keras: https://huggingface.co/keras-io
Microsoft: https://huggingface.co/microsoft
Facebook: https://huggingface.co/facebook

Example spaces with repos:
github: https://github.com/salesforce/BLIP
Spaces: https://huggingface.co/spaces/akhaliq/BLIP

github: https://github.com/facebookresearch/omnivore
Spaces: https://huggingface.co/spaces/akhaliq/omnivore

and here are guides for adding spaces/models/datasets to your org

How to add a Space: https://huggingface.co/blog/gradio-spaces
how to add models: https://huggingface.co/docs/hub/adding-a-model
uploading a dataset: https://huggingface.co/docs/datasets/upload_dataset.html

Please let us know if you would be interested and if you have any questions, we can also help with the technical implementation.
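
For reference, uploading a checkpoint with the huggingface_hub client takes only a few lines (a sketch; the repo id below is hypothetical):

from huggingface_hub import HfApi

api = HfApi()
api.create_repo(repo_id="showlab/all-in-one-base", exist_ok=True)  # hypothetical repo id
api.upload_file(
    path_or_fileobj="pretrained/all-in-one-base.ckpt",
    path_in_repo="all-in-one-base.ckpt",
    repo_id="showlab/all-in-one-base",
)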

install conflicts. pip install -r requirements.txt

I tried what the README.md states.

conda create -n allinone python=3.7
source activate allinone
cd [Path_To_This_Code]
pip install -r requirements.txt

However, when running pip install -r requirements.txt, the following error occurs.

$ pip install -r requirements.txt
Collecting absl-py==0.13.0
Downloading absl_py-0.13.0-py3-none-any.whl (132 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 132.1/132.1 kB 5.2 MB/s eta 0:00:00
Collecting addict==2.4.0
Downloading addict-2.4.0-py3-none-any.whl (3.8 kB)
Collecting aiohttp==3.8.1
Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 11.9 MB/s eta 0:00:00
Collecting aiosignal==1.2.0
Downloading aiosignal-1.2.0-py3-none-any.whl (8.2 kB)
ERROR: Could not find a version that satisfies the requirement apex==0.1 (from versions: 0.9.8dev.linux-i686, 0.9.8.dev0, 0.9.8a0.dev0, 0.9.9.dev0, 0.9.10.dev0)
ERROR: No matching distribution found for apex==0.1

Can I modify the requirements.txt file?

Fine-tuning TGIF-QA FrameQA

The TGIF dataset folder is intact, as verified by md5.
However, I get an error.

$ python run.py with data_root=DATAROOT num_gpus=1 num_nodes=1 num_frames=3 per_gpu_batchsize=8 task_finetune_tgifqa load_path="pretrained/all-in-one-plus-224.ckpt"

initalize data augmentation for a100 gpus
convert to numpy
^MValidation sanity check: 0it [00:00, ?it/s]ERROR - AllInOne - Failed after 0:00:05!
Traceback (most recent calls WITHOUT Sacred internals):
File "run.py", line 84, in main
trainer.fit(model, datamodule=dm)
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 473, in fit
results = self.accelerator_backend.train()
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 152, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 305, in ddp_train
results = self.train_or_test()
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 69, in train_or_test
results = self.trainer.train()
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 495, in train
self.run_sanity_check(self.get_model())
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 693, in run_sanity_check
_, eval_results = self.run_evaluation(test_mode=False, max_batches=self.num_sanity_val_batches)
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 596, in run_evaluation
for batch_idx, batch in enumerate(dataloader):
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 681, in next
data = self._next_data()
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
return self._process_data(data)
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
data.reraise()
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/torch/_utils.py", line 461, in reraise
raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/torch/utils/data/dataset.py", line 235, in getitem
return self.datasets[dataset_idx][sample_idx]
File "/myhome/all-in-one/AllInOne/datasets/tgif.py", line 87, in getitem
image_tensor = self.get_video(sample)
File "/myhome/all-in-one/AllInOne/datasets/video_base_dataset.py", line 107, in get_video
imgs = self.get_raw_video(sample).permute(1, 0, 2, 3) # to cthw
File "/myhome/all-in-one/AllInOne/datasets/tgif.py", line 55, in get_raw_video
imgs, idxs, vlen = read_frames_gif(abs_fp, self.num_frames, mode=self.split)
File "/myhome/all-in-one/AllInOne/datasets/video_base_dataset.py", line 292, in read_frames_gif
gif = imageio.get_reader(video_path)
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/imageio/core/functions.py", line 186, in get_reader
return format.get_reader(request)
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/imageio/core/format.py", line 170, in get_reader
return self.Reader(self, request)
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/imageio/core/format.py", line 221, in init
self._open(**self.request.kwargs.copy())
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/imageio/plugins/pillowmulti.py", line 60, in _open
return PillowFormat.Reader._open(self)
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/imageio/plugins/pillow.py", line 138, in _open
as_gray=as_gray, is_gray=_palette_is_grayscale(self._im)
File "/myhome/.conda/envs/allinone/lib/python3.7/site-packages/imageio/plugins/pillow.py", line 689, in _palette_is_grayscale
palette = np.asarray(pil_image.getpalette()).reshape((256, 3))
ValueError: cannot reshape array of size 96 into shape (256,3)
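
A possible workaround (a sketch, not the repo's official fix): read GIFs with PIL directly and convert each frame to RGB, which sidesteps the 256-entry palette assumption in imageio's Pillow plugin that fails here on a 96-entry palette:

import numpy as np
from PIL import Image, ImageSequence

def read_gif_frames(path):
    # Converting every frame to RGB means the palette size no longer matters.
    gif = Image.open(path)
    return [np.asarray(f.convert("RGB")) for f in ImageSequence.Iterator(gif)]

frames = read_gif_frames("some_clip.gif")  # hypothetical path
print(len(frames), frames[0].shape)        # (H, W, 3) per frame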

How to test the model?

Hi! I want to test video retrieval with all-in-one-base.ckpt on MSR-VTT and see the metrics, to compare with the paper. Can you please help with the command?
I tried the following command, but it started training the model, and I only need testing.
python run.py with data_root=data/ num_gpus=2 num_nodes=1 per_gpu_batchsize=32 task_finetune_only_ind_itc_msrvtt_randaug num_frames=3 load_path="pretrained/all-in-one-base.ckpt"
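
For what it's worth, this codebase builds on ViLT, whose Sacred config exposes a test_only flag; if All-in-one keeps that convention (an assumption worth verifying in the config), appending test_only=True to the command above should run evaluation without training.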

meta_data folder is missing

Error

/all-in-one$ python run.py with data_root/datasets/msvd/data num_gpus=2 num_nodes=1 num_frames=3 per_gpu_batchsize=16 task_finetune_msvdqa load_path="pretrained/all-in-one-base.ckpt"

....

video datasets: ['msvd_qa_train']
frames for base dataset is: 3
no arrow available for msvd_qa_train, load from disk
initalize data augmentation for a100 gpus
initalize data augmentation for a100 gpus
convert to numpy
ERROR - AllInOne - Failed after 0:00:11!
Traceback (most recent calls WITHOUT Sacred internals):
File "run.py", line 84, in main
trainer.fit(model, datamodule=dm)
File "/home/miniconda3/envs/allinone/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 473, in fit
results = self.accelerator_backend.train()
File "/home/miniconda3/envs/allinone/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 152, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/miniconda3/envs/allinone/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 268, in ddp_train
self.trainer.call_setup_hook(model)
File "/home/miniconda3/envs/allinone/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 859, in call_setup_hook
self.datamodule.setup(stage_name)
File "/home/miniconda3/envs/allinone/lib/python3.7/site-packages/pytorch_lightning/core/datamodule.py", line 92, in wrapped_fn
return fn(*args, **kwargs)
File "/home/all-in-one/AllInOne/datamodules/multitask_datamodule.py", line 34, in setup
dm.setup(stage)
File "/home/miniconda3/envs/allinone/lib/python3.7/site-packages/pytorch_lightning/core/datamodule.py", line 92, in wrapped_fn
return fn(*args, **kwargs)
File "/home/all-in-one/AllInOne/datamodules/msvdqa_datamodule.py", line 19, in setup
super().setup(stage)
File "/home/all-in-one/AllInOne/datamodules/datamodule_base.py", line 157, in setup
self.set_train_dataset()
File "/home/all-in-one/AllInOne/datamodules/datamodule_base.py", line 92, in set_train_dataset
backend=self.backend
File "/home/all-in-one/AllInOne/datasets/msvdqa.py", line 28, in init
self._load_metadata()
File "/home/all-in-one/AllInOne/datasets/msvdqa.py", line 41, in _load_metadata
with open(os.path.join(metadata_dir, 'msvd_youtube_mapping.txt')) as f:
FileNotFoundError: [Errno 2] No such file or directory: './meta_data/msvd/msvd_youtube_mapping.txt'

Hi.

I think the meta_data folder has not been uploaded. When I check the code, there is nothing that writes or generates the meta_data files, so I assume they are meant to be included in the repository, as DemoVLP does.

DemoVLP has meta_data folder. https://github.com/showlab/DemoVLP/tree/master/meta_data

downstream task checkpoint

Thank you very much for sharing this excellent code. Could you also share your downstream task checkpoints?

401 Client Error: Unauthorized for url: https://huggingface.co/pretrained/bert-base-uncased/resolve/main/vocab.txt

$ python run.py with data_root=/datasets/msvd/data num_gpus=2 num_nodes=1 num_frames=3 per_gpu_batchsize=16 task_finetune_msvdqa load_path="pretrained/all-in-one-base.ckpt"

WARNING - root - Changed type of config entry "max_steps" from int to NoneType
WARNING - AllInOne - No observers have been added to this run
INFO - AllInOne - Running command 'main'
INFO - AllInOne - Started
Global seed set to 0
INFO - lightning - Global seed set to 0

ERROR - AllInOne - Failed after 0:00:01!
Traceback (most recent calls WITHOUT Sacred internals):
File "run.py", line 15, in main
dm = MTDataModule(_config, dist=True)
File "/home/miniconda3/envs/allinone/lib/python3.7/site-packages/pytorch_lightning/core/datamodule.py", line 49, in call
obj = type.call(cls, *args, **kwargs)
File "/home/all-in-one/AllInOne/datamodules/multitask_datamodule.py", line 19, in init
self.dm_dicts = {key: _datamoduleskey for key in datamodule_keys}
File "/home/all-in-one/AllInOne/datamodules/multitask_datamodule.py", line 19, in
self.dm_dicts = {key: _datamoduleskey for key in datamodule_keys}
File "/home/miniconda3/envs/allinone/lib/python3.7/site-packages/pytorch_lightning/core/datamodule.py", line 49, in call
obj = type.call(cls, *args, **kwargs)
File "/home/all-in-one/AllInOne/datamodules/msvdqa_datamodule.py", line 8, in init
super().init(*args, **kwargs)
File "/home/all-in-one/AllInOne/datamodules/datamodule_base.py", line 57, in init
self.tokenizer = get_pretrained_tokenizer(tokenizer)
File "/home/all-in-one/AllInOne/datamodules/datamodule_base.py", line 20, in get_pretrained_tokenizer
from_pretrained, do_lower_case="uncased" in from_pretrained
File "/home/miniconda3/envs/allinone/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1752, in from_pretrained
raise err
File "/home/miniconda3/envs/allinone/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1745, in from_pretrained
use_auth_token=use_auth_token,
File "/home/miniconda3/envs/allinone/lib/python3.7/site-packages/transformers/file_utils.py", line 1056, in cached_path
local_files_only=local_files_only,
File "/home/miniconda3/envs/allinone/lib/python3.7/site-packages/transformers/file_utils.py", line 1186, in get_from_cache
r.raise_for_status()
File "/home/miniconda3/envs/allinone/lib/python3.7/site-packages/requests/models.py", line 953, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/pretrained/bert-base-uncased/resolve/main/vocab.txt


There are two possible reasons for a 401 Client Error from the Hugging Face transformers library:

  1. The repository does not exist.
  2. The repository is private.

The bert-base-uncased model is definitely not private.

How could I resolve this issue?

python == 3.7.13
transformers == 4.2.1
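
One way around this (a sketch; it assumes the config points the tokenizer at the local path pretrained/bert-base-uncased) is to materialize the tokenizer files at that path once, on a machine with network access, so that from_pretrained() resolves a local directory instead of a nonexistent Hub repo:

from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")  # needs network once
tok.save_pretrained("pretrained/bert-base-uncased")       # later runs resolve locally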

config file set

If an all-in-one-tiny checkpoint is used, how should the config file be updated to fit the parameter size?

video captioning checkpoint

Thanks for the awesome work!
I'm interested in video captioning; could you share the captioning checkpoint?
Thanks a lot!

Drive access

Hi! Could you please provide access to the Google Drive with the dataset annotations and model weights?

zero-shot on custom dataset

Hi,
I want to know whether we can evaluate this model on our custom dataset without fine-tuning. Could you show me how to run inference from the pre-trained ckpt? My task is VideoQA and video-text retrieval, and the format of my dataset is quite similar to MSRVTT. Thanks a lot!

video captioning finetuning datasets and codes

Hello, I could not find the data preparation or fine-tuning code for video captioning; I hope this is just an oversight on my part. Could you point me to where the data preparation for TVC and YouCook2, and the code for captioning fine-tuning and testing, are located? Many thanks.

Annotation (Caption file) missing for HowTo100M dataset

Dear author,

Thank you for your great effort, especially the very neat and organized data-loading code! It helps a lot with our research. In the meantime, I'm wondering whether you plan to release the processed caption file for the HowTo100M dataset, as it does not seem to be included in the Google Drive. I understand you must have used the officially provided caption file, but I wanted to check with you first, since your processed file might fit the data loader more easily.

Appreciate your help in advance.

Regards
