
egovlp's People

Contributors

chuhanxx, qinghonglin


egovlp's Issues

Video Compression is Very Slow

Hi,
I am trying to use the proposed EgoClip dataset and am following your video compression code. The first step is resizing the videos:
Resize the source videos with a short size equal to 256 by script ./utils/video_resize.py.
I have found this step to be extremely slow, with an ETA of 10 days. Did you also spend a similar amount of time on this, or do you have a faster way of resizing and computing the chunks?
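For reference, a minimal sketch of one way to speed this step up by running several ffmpeg processes in parallel. This is not the repo's utils/video_resize.py; it assumes ffmpeg is installed, and the SRC/DST folder names and the worker count are placeholders to adapt.

import subprocess
from multiprocessing import Pool
from pathlib import Path

SRC, DST = Path("videos_src"), Path("videos_256")  # hypothetical folder names

def resize(src: Path) -> None:
    # Keep the short side at 256 and the aspect ratio intact; copy audio as-is.
    dst = DST / src.name
    if dst.exists():
        return
    subprocess.run([
        "ffmpeg", "-y", "-i", str(src),
        "-vf", "scale=w='if(lt(iw,ih),256,-2)':h='if(lt(iw,ih),-2,256)'",
        "-c:a", "copy", str(dst),
    ], check=True)

if __name__ == "__main__":
    DST.mkdir(exist_ok=True)
    with Pool(processes=8) as pool:  # tune to the number of CPU cores
        pool.map(resize, sorted(SRC.glob("*.mp4")))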

EPIC-Kitchens MIR dataset

Thanks for your great work! Are you directly downloading the RGB features of the EPIC-100 dataset, without using any other features?

How to extract the EgoVLP features from an Ego4D video?

Hi, I want to extract EgoVLP features from Ego4D videos for the task of action recognition.

You have provided features for the NLQ task: https://github.com/showlab/EgoVLP?tab=readme-ov-file#nlq--ego4d and the MQ task: https://github.com/showlab/EgoVLP?tab=readme-ov-file#mq--ego4d

I have downloaded the Ego4D videos and preprocessed them accordingly:

Resize the source videos with a short size equal to 256 by script utils/video_resize.py.

Chunk the resized videos into multiple segments (up to 600 sec) by script utils/video_chunk.py.

What other steps do I need in order to perform action recognition?
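One possible final step, sketched below and not part of the repo: once clip-level features have been extracted with the EgoVLP encoder, a simple linear probe can be trained on top of them for action recognition. The feature dimension, class count, and random tensors are placeholders; substitute the real extracted features and labels.

import torch
import torch.nn as nn

feat_dim, num_classes = 256, 10                   # placeholder values
feats = torch.randn(1000, feat_dim)               # stand-in for extracted clip features
labels = torch.randint(0, num_classes, (1000,))   # stand-in for action labels

probe = nn.Linear(feat_dim, num_classes)
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    for i in range(0, len(feats), 64):
        x, y = feats[i:i + 64], labels[i:i + 64]
        loss = loss_fn(probe(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()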

Two questions on EgoVLP and EPIC-Kitchens 100

Hi there, great work! I'm trying to use the video backbone of EgoVLP alone to extract intermediate feature maps (for a downstream task) on EPIC-Kitchens 100 videos. Two questions:

  • Is there any demo code available to load just the video weights and extract embeddings, without worrying about text? I only have the videos to start with.
  • How is LOCAL_RANK set? When running python -m run.test_epic -r pretrained/egovlp_ek100_zs.pth -d 0, I'm finding that LOCAL_RANK isn't actually set even though it's supposed to be. What parameters might I be missing? (The guide indicates I only need to run python run/test_epic, but this runs into package import problems.)

Edited the second question because I solved the original second question by getting captions from https://github.com/mwray/Joint-Part-of-Speech-Embeddings.
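On the first question, a rough sketch of loading only the video tower from the released checkpoint. The "video_model." prefix (and a possible "module." DDP prefix) are assumptions about how the state_dict keys are named; print a few keys first to confirm.

import torch

ckpt = torch.load("pretrained/egovlp_ek100_zs.pth", map_location="cpu")
state = ckpt.get("state_dict", ckpt)
state = {k.replace("module.", "", 1): v for k, v in state.items()}  # strip DDP prefix if present

video_state = {
    k.replace("video_model.", "", 1): v
    for k, v in state.items()
    if k.startswith("video_model.")   # assumed key prefix for the video tower
}
print(len(video_state), "video-tower tensors found")

# Then build the repo's TimeSformer/ViT backbone and load with strict=False:
# missing, unexpected = video_encoder.load_state_dict(video_state, strict=False)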

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3

I am encountering the following issue while running the pretraining code:

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3 (pid: 407727) of binary: /fs/nexus-scratch/sanjoyc/anaconda/envs/egovlp/bin/python


raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I am trying to run the code on a single A5000 node with 4 GPUs.

Any help is appreciated.

Pre-trained TimesFormer or backbone ViT model

Hi!

I would like to know if you have released a separate video TimesFormer encoder or backbone ViT model that can be directly used for further fine-tuning on other egocentric tasks. Is there a way to extract pre-trained features for loading into existing ViT models?

Additionally, can you please clarify which TimeSformer architecture you are using? Is it the vanilla TimeSformer, TimeSformer-HR, or TimeSformer-L?

Thanks!

On the setting of `num_frames`

Thanks for your great work!

I am curious about the setting of num_frames of the pretrained model EgoVLP_PT_BEST.

I noticed that, in one of the closed issues, you clarified that num_frames=4 is used to train EgoVLP_PT_BEST. However, in configs/pt/egoclip.json, num_frames=16 is used.

Also, when loading EgoVLP_PT_BEST using a config file with num_frames=4, I get an error of size mismatch (shown in the attached image). It seems that the num_frames is 16 in EgoVLP_PT_BEST.

Could you please further explain how the num_frames is defined across the different configs and checkpoints?

Thanks for your help in advance!

[Attached screenshot: size-mismatch error when loading EgoVLP_PT_BEST with num_frames=4]
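One way to check this directly, sketched below under the assumption that the checkpoint stores its weights under a 'state_dict' key: the temporal position embedding's shape typically reflects the number of frames, so listing the time/temporal-related parameters and their shapes shows what the checkpoint was trained with.

import torch

ckpt = torch.load("egovlp.pth", map_location="cpu")  # path to the downloaded checkpoint
state = ckpt.get("state_dict", ckpt)

for k, v in state.items():
    if "time" in k.lower() or "temporal" in k.lower():
        print(k, tuple(v.shape))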

Resaving `egovlp.pth` to only contain `state_dict`

Can the pretrained model (i.e., egovlp.pth) be re-saved with just the state_dict? I reimplemented this model but had a hard time loading the checkpoint in my implementation, I believe because of the config objects stored alongside the weights. Besides, I don't believe those are needed at inference time either. Thanks!
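A minimal sketch of such a re-save, assuming the weights live under a 'state_dict' key. Note that unpickling the original checkpoint may still require the repo's config classes to be importable, so running this once from inside the EgoVLP repo is the safest bet.

import torch

ckpt = torch.load("egovlp.pth", map_location="cpu")
torch.save({"state_dict": ckpt["state_dict"]}, "egovlp_state_dict.pth")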

Question about Ego4D annotation

Hi, thanks for your great work! I have a small question about Ego4D annotation.

I notice that in the narration.json file, the i-th narration for a video is labeled like:
{timestamp_sec: 19.2, timestamp_frame: 1823, narration_text: "hello"}

I understand that 'timestamp_sec' is the start timestamp of the i-th narration, but what is its end timestamp? Is it the (i+1)-th narration's timestamp_sec? I also notice that quite often the (i+1)-th timestamp_sec is smaller than the i-th timestamp_sec; does that mean the annotation is faulty?

How did you generate egoclip.csv using the narration?
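For illustration only (this is explicitly not the authors' EgoClip pipeline): one naive way to assign end timestamps is to sort the narrations by timestamp_sec, which also resolves the out-of-order cases, and use the next narration's start as the current one's end.

narrs = [
    {"timestamp_sec": 19.2, "timestamp_frame": 1823, "narration_text": "hello"},
    {"timestamp_sec": 17.5, "timestamp_frame": 1660, "narration_text": "..."},
    {"timestamp_sec": 23.0, "timestamp_frame": 2185, "narration_text": "..."},
]

narrs = sorted(narrs, key=lambda n: n["timestamp_sec"])   # fix out-of-order entries
for cur, nxt in zip(narrs, narrs[1:]):
    cur["end_sec"] = nxt["timestamp_sec"]                 # end = next narration's start
narrs[-1]["end_sec"] = narrs[-1]["timestamp_sec"]         # last narration has no successor

print(narrs)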

About EgoClip Dataset

Hi, no doubt this is a great job! I have a question about the EgoClip dataset: should I just follow the Ego4D link to obtain the meta dataset? The link corresponding to the EgoClip dataset points to egoclip.csv, which is not the meta dataset itself. Thanks a lot!

EgoVLP_FT_EPIC* checkpoint of a model trained on another task or with another architecture

Hello EgoVLP,

Thank you very much for sharing your work with the broad community.

I'm interested in using the model you submitted to the EPIC-Kitchens Multi-Instance Retrieval challenge. I'll use it as a starting point for further research we are doing in our lab!

I was able to use the code and load the model provided here EgoVLP_FT_EPIC* (the file name is epic_mir_plus.pth). However, I got the following message:

- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Does this message mean that I have to fine-tune the model to obtain the numbers in your README's table with Epic-Kitchens results?

Thank you!

number of frames per clip

Thanks for the great work!

I am confused by how "num_frames" is set in video_params in the config files. If I understand correctly, the pre-trained Frozen model has num_frames=16, whereas only four frames are given as input to the model at training and inference time. In Table 4 of the paper, there are two entries for Frozen+EgoNCE with #frames equal to 4 and 16, respectively. I am wondering what the difference is here, and which one corresponds to the pre-trained model weights (EgoVLP_PT_BEST) available in the repository. May I still provide 16 frames instead of four to the provided model for feature extraction? Thank you!

Error with local rank while testing

I'm running into an error while testing at the following line:

model/model.py", line 89, in init
local_rank = int(os.environ['LOCAL_RANK']) # fixed by qinghong.

Is this an issue with setting the local rank?
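A common workaround, sketched below: launchers such as torchrun export LOCAL_RANK into the environment, but a plain python invocation of the test script does not, so a defensive default avoids the KeyError for single-GPU testing.

import os

# Fall back to rank 0 when no distributed launcher has set LOCAL_RANK.
local_rank = int(os.environ.get("LOCAL_RANK", 0))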

Request for Filtering Code Used in EgoClip Creation

Hello,

First off, I'd like to express my appreciation for the incredible work on this project. It has been immensely beneficial to my team, and we are grateful for the effort put into it.

We're currently interested in extending our work by applying the same filtering process to the captions of Ego4D validation videos, similar to the approach taken with EgoClip.

Unfortunately, after an extensive search through the documentation and codebase, I've been unable to locate the specific filtering code used for creating EgoClip. If possible, could you please share the filtering code or direct us to where it can be found? Additionally, access to the EgoClip for the validation set would greatly aid our efforts.

In summary, what we need is either of the following:

  1. The filtering code used for making EgoClip.
  2. EgoClip for validation videos.

Thank you in advance for your assistance.

Best regards,

Hyogun Lee

Commands to MQ Training with VSGN

Hi, thanks for releasing the code!

Could you provide some instructions on how to run VSGN training with EgoVLP features (hyper-parameters, learning rate, etc.)? Thanks!

Junwei

Usage

If I want to directly use a pre-trained model instead of applying it to downstream tasks, for example, to obtain the similarity between a piece of text and a video, how should I proceed? Which code should I refer to?
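A minimal sketch of the similarity computation itself: EgoVLP-style dual encoders map the video and the text into a shared embedding space, and the score is the cosine similarity between the two embeddings. The random tensors and the 256-dimensional size below are placeholders for the actual encoder outputs.

import torch
import torch.nn.functional as F

video_emb = torch.randn(1, 256)    # placeholder for the video encoder output
text_emb = torch.randn(5, 256)     # placeholders for 5 candidate captions

video_emb = F.normalize(video_emb, dim=-1)
text_emb = F.normalize(text_emb, dim=-1)

sims = video_emb @ text_emb.T      # [1, 5] cosine similarities
print(sims)
print(sims.argmax(dim=-1))         # index of the best-matching caption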

About the verb frequency

Thanks for your great work!

I calculated the statistics on the created narration annotations in EgoClip myself:

#C C looks around 79405
#C C walks around 28064
#C C turns around 13345
#C C moves around 7310
#C C walks around the room 5806
#C C looks around the room 5531
#C C walks in the room 4641
#C C walks around the house 4136
#C C stands up 3907
#C C adjusts the camera 3762

It seems the most frequent verbs are look and walk, but in Fig. 7(a) of your paper they are put and take.
[Fig. 7(a) from the paper: verb distribution in EgoClip]


I also notice that in B.1 (iv) you removed narrations with fewer than 3 words, such as "#C C looks", but "#C C looks around" has 4 words, so it should not be excluded. Have you filtered out some sentences like "#C C looks around"?
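For reference, a sketch of how such counts can be reproduced from egoclip.csv. The 'clip_text' column name and the tab separator are assumptions; check the file header before running.

import pandas as pd
from collections import Counter

df = pd.read_csv("egoclip.csv", sep="\t", on_bad_lines="skip")
counts = Counter(df["clip_text"].str.strip())
for text, n in counts.most_common(10):
    print(n, text)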

Any zero-shot/few-shot action recognition data on EPIC-Kitchens?

Hi there, thanks so much for open-sourcing this! This looks super cool!

Is there any zero-shot/few-shot action recognition baseline for EPIC-Kitchens using EgoVLP? It looks like this model has downstream baselines for action recognition (including zero-shot/few-shot) on Charades, and for multi-instance retrieval on EPIC-Kitchens, but are there zero-shot/few-shot action recognition baselines on EPIC-Kitchens? If not, which checkpoint would you recommend I start from to train a zero-shot or few-shot baseline for action recognition?

Thanks again for your help,
Vineet
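In case it helps frame the question, here is a rough sketch (not from the repo) of how a zero-shot baseline could be assembled from the retrieval setup: embed one text prompt per action class and assign each clip to the class whose prompt embedding is closest. The class names, prompts, and random tensors are placeholders.

import torch
import torch.nn.functional as F

class_names = ["cut onion", "wash pan", "open fridge"]               # toy label set
text_emb = F.normalize(torch.randn(len(class_names), 256), dim=-1)   # one prompt embedding per class
clip_emb = F.normalize(torch.randn(1, 256), dim=-1)                  # embedding of one video clip

pred = (clip_emb @ text_emb.T).argmax(dim=-1).item()
print("predicted class:", class_names[pred])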

EPIC-Kitchens MIR Finetuning parameters

Hi,

Thank you for your good work!

I am trying to finetune your egovlp.pth checkpoint to obtain the numbers reported in your paper. With this checkpoint I am able to get similar numbers for the zero-shot MIR experiment. However, I am not able to finetune the model to reach the corresponding numbers reported in the paper. Could you please provide the hyperparameters you used to get those results? In particular, the number of nodes (and GPUs), the batch size per GPU, and the learning rate.

For reference, I am training on one node with eight GPUs (batch size 4 per GPU) with lr=3e-5, and I get the following values for the first 9 epochs:

[mir_metrics]EpicKitchens_MIR epoch 1, nDCG_V2T: 53.056, nDCG_T2V: 49.662, nDCG_AVG: 51.359,, mAP_V2T: 40.498, mAP_T2V: 32.386, mAP_AVG: 36.442
[mir_metrics]EpicKitchens_MIR epoch 2, nDCG_V2T: 50.703, nDCG_T2V: 48.978, nDCG_AVG: 49.841,, mAP_V2T: 37.345, mAP_T2V: 30.398, mAP_AVG: 33.871
[mir_metrics]EpicKitchens_MIR epoch 3, nDCG_V2T: 52.399, nDCG_T2V: 50.459, nDCG_AVG: 51.429,, mAP_V2T: 37.134, mAP_T2V: 30.674, mAP_AVG: 33.904
[mir_metrics]EpicKitchens_MIR epoch 4, nDCG_V2T: 52.612, nDCG_T2V: 51.101, nDCG_AVG: 51.856,, mAP_V2T: 37.135, mAP_T2V: 30.797, mAP_AVG: 33.966
[mir_metrics]EpicKitchens_MIR epoch 5, nDCG_V2T: 49.996, nDCG_T2V: 49.994, nDCG_AVG: 49.995,, mAP_V2T: 36.629, mAP_T2V: 30.758, mAP_AVG: 33.694
[mir_metrics]EpicKitchens_MIR epoch 6, nDCG_V2T: 53.716, nDCG_T2V: 51.267, nDCG_AVG: 52.492,, mAP_V2T: 39.028, mAP_T2V: 31.126, mAP_AVG: 35.077
[mir_metrics]EpicKitchens_MIR epoch 7, nDCG_V2T: 51.784, nDCG_T2V: 50.258, nDCG_AVG: 51.021,, mAP_V2T: 37.062, mAP_T2V: 29.698, mAP_AVG: 33.380
[mir_metrics]EpicKitchens_MIR epoch 8, nDCG_V2T: 54.027, nDCG_T2V: 51.747, nDCG_AVG: 52.887,, mAP_V2T: 39.233, mAP_T2V: 31.393, mAP_AVG: 35.313
[mir_metrics]EpicKitchens_MIR epoch 9, nDCG_V2T: 54.211, nDCG_T2V: 51.660, nDCG_AVG: 52.935,, mAP_V2T: 40.166, mAP_T2V: 31.139, mAP_AVG: 35.653

TIA

About NLQ results.

Hello.
Thanks for such nice work!
We have some questions and would appreciate your help.
We use your EgoVLP_PT_BEST checkpoint to extract the video features.
We train VSLNet with these features and the BERT checkpoint from EgoVLP_PT_BEST.
We can't seem to reach the precision you have in the report; we only get about 7~8 [email protected].
