efficient-prompt's People

Contributors

ju-chen


efficient-prompt's Issues

Questions on calculating the accuracy in validation phase

In the val.py file:

```python
sim_ensemble = torch.zeros(0).to(device)
test_num = int(len(similarity) / valEnsemble)
# Stack the similarity scores of the valEnsemble chunks into (valEnsemble, test_num, num_classes).
for enb in range(valEnsemble):
    sim_ensemble = torch.cat([sim_ensemble, similarity[enb * test_num:enb * test_num + test_num].unsqueeze(0)], dim=0)
target_final = targets[:test_num]

# Average over the ensemble dimension, then score top-1 / top-5 against the first chunk's targets.
sim_final = torch.mean(sim_ensemble, 0)
values, indices = sim_final.topk(5)
top1 = (indices[:, 0] == target_final).tolist()
top5 = ((indices == einops.repeat(target_final, 'b -> b k', k=5)).sum(-1)).tolist()

top1ACC = np.array(top1).sum() / len(top1)
top5ACC = np.array(top5).sum() / len(top5)
```

I see that you split the validation set into `valEnsemble` parts, average the similarity scores across those parts, and then compute the accuracy on that averaged part only, not on the full `0:len(similarity)` range. Could you explain why you do this? To my knowledge, wouldn't it be more convincing to compute the accuracy over the whole `0:len(similarity)` range?
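
For reference, this is roughly what I had in mind when suggesting accuracy over the full range. The function below is only a sketch of mine, not code from the repo, and it runs on random toy inputs just to show the shapes involved:

```python
import einops
import numpy as np
import torch

def full_range_accuracy(similarity: torch.Tensor, targets: torch.Tensor):
    """Top-1 / top-5 accuracy over every row of `similarity` (N x C) against `targets` (N,)."""
    _, indices = similarity.topk(5, dim=-1)
    top1 = (indices[:, 0] == targets).tolist()
    top5 = (indices == einops.repeat(targets, 'b -> b k', k=5)).sum(-1).tolist()
    return np.array(top1).sum() / len(top1), np.array(top5).sum() / len(top5)

# Toy usage with random scores, only to illustrate the shapes involved.
sim = torch.randn(20, 51)            # 20 samples, 51 classes
tgt = torch.randint(0, 51, (20,))
print(full_range_accuracy(sim, tgt))
```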

I am looking forward to your reply!
Thanks!

Query regarding the testing setup for Action Localization

In the supplementary file, it is written that for action localization on the ActivityNet dataset, each snippet consists of 768 frames. This is an extremely dense feature, because the usual benchmarked approaches use 8 or 16 frames! Is this claim correct? If yes, could you at least share these features, if not the code?
[Screenshot from 2021-12-17 01-46-38]

About feature extraction

Thank you for your excellent work! But I have some questions about the feature extraction.
Could you please upload the feature extraction code? I want to reproduce the work on my own dataset.
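
In the meantime, this is the kind of per-frame CLIP feature extraction I am assuming. The model variant (ViT-B/16), the frame stride, and the decord-based decoding are my own guesses, not the authors' settings:

```python
# Sketch of per-frame CLIP feature extraction; assumes the OpenAI CLIP package
# (pip install git+https://github.com/openai/CLIP.git) and decord for video decoding.
import clip
import numpy as np
import torch
from decord import VideoReader
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

def extract_frame_features(video_path: str, stride: int = 16) -> np.ndarray:
    """Encode every `stride`-th frame with the CLIP image encoder -> (T, 512) array."""
    vr = VideoReader(video_path)
    feats = []
    with torch.no_grad():
        for i in range(0, len(vr), stride):
            frame = Image.fromarray(vr[i].asnumpy())
            image = preprocess(frame).unsqueeze(0).to(device)
            feats.append(model.encode_image(image).float().cpu())
    return torch.cat(feats, dim=0).numpy()

# np.save("my_video.npy", extract_frame_features("my_video.mp4"))
```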
Thanks a lot!

About the K400 checkpoint

Hello, thank you for releasing the code. I would like to ask if you plan to release the model checkpoint trained on the Kinetics400 dataset.

About the K700 open-set split

Thanks for your wonderful work! I would like to know the details of your K700 open-set split.

Is it correct that videos from 400 classes of the original K700 training split are used for open-set training, and videos from the other 300 classes of the original K700 val split are used for open-set validation?

Or do you directly gather all K700 videos from both the training and val lists, pick 400 classes for training, and use the remaining videos from the other 300 classes for validation?
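
To make the two readings concrete, here is a toy sketch of both. The class names and video lists are placeholders I made up, not the actual K700 metadata:

```python
from collections import namedtuple

Video = namedtuple("Video", "path label")

# Toy stand-ins for the real K700 lists (placeholders, not the repo's data).
all_classes = [f"class_{i}" for i in range(700)]
seen, unseen = set(all_classes[:400]), set(all_classes[400:])
k700_train_list = [Video(f"train_{i}.mp4", all_classes[i % 700]) for i in range(1400)]
k700_val_list = [Video(f"val_{i}.mp4", all_classes[i % 700]) for i in range(700)]

# Reading 1: open-set training videos come only from the original train split (seen classes),
#            open-set validation videos only from the original val split (unseen classes).
train_r1 = [v for v in k700_train_list if v.label in seen]
val_r1 = [v for v in k700_val_list if v.label in unseen]

# Reading 2: pool the original train+val lists first, then split purely by class.
pooled = k700_train_list + k700_val_list
train_r2 = [v for v in pooled if v.label in seen]
val_r2 = [v for v in pooled if v.label in unseen]
```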

About CLIP feature extraction?

There appears to be a discrepancy in the extracted ActivityNet features. The features you obtained from the video __c8enCfzqw have dimension (4130, 512), while the features in ANet_CLIP for v___c8enCfzqw.npy have dimension (864, 512). This suggests a difference in processing.
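
For reference, this is how I compared the two feature files; the paths are placeholders for wherever the files live locally:

```python
import numpy as np

# Placeholder paths; adjust to wherever the two feature sets are stored locally.
feats_a = np.load("extracted/v___c8enCfzqw.npy")   # the (4130, 512) features mentioned above
feats_b = np.load("ANet_CLIP/v___c8enCfzqw.npy")   # the (864, 512) features from ANet_CLIP

print(feats_a.shape, feats_b.shape)
# Ratio of temporal lengths, to see how much denser one sampling is than the other.
print(feats_a.shape[0] / feats_b.shape[0])
```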
So I am confused about this. Could you tell me how you processed ActivityNet? Thanks!

What are the training epochs for the new parameters?

Thanks for your great work and the upcoming code. Pretraining on images and fine-tuning on video is a standard pipeline. Usually, only a few new parameters are introduced in the fine-tuning process, and, as we all know, a well-trained pre-trained model can reduce the number of fine-tuning epochs. However, this work introduces more new parameters than others (apologies if I am wrong), even though the pre-trained model is very strong. I am just wondering: does this method need more fine-tuning epochs?

About few-shot hyperparameters for HMDB51

Hi,

Thank you for your solid work and code releases.
I'm trying to reproduce the results of the few-shot settings, but I'm not able to get the same result for the HMDB51 dataset as reported in the paper. I wonder if the hyperparameter setup for HMDB51 differs greatly from that for UCF101.

Thanks!

Questions on action localization

Hi,

Thank you for sharing your fantastic work. I have a few questions related to the action localization application.

  1. In your work you mention that you follow a two-stage pipeline: class-agnostic localization, then action classification (my understanding of this pipeline is sketched after this list). In the class-agnostic proposal generation step, it is understood that a generic detector is trained from scratch using CLIP image features (instead of I3D features). Could you please clarify whether the detector is trained in a class-agnostic way in your implementation, or whether the class predictions are simply discarded?

  2. In step 1, it is mentioned in the supplementary that you "utilise three parallel prediction heads to determine" the localization. Can you please explain why three heads are used?

  3. In Section 3.2 (training loss), it is explained that for the localization task the mean pool of dense features from the stage-1 proposals is used to obtain v_i. So in the second step, action classification, is the model (prompting) trained for classification? If so, is the training data the original dataset videos sampled at 10 fps with a length of 256 frames (following AFSD), together with the corresponding action class labels?

  4. My last question: could you provide some insight into why the off-the-shelf detector is trained on the CLIP image features instead of purely using off-the-shelf detections?
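
For context, the sketch below is my current mental model of the two-stage pipeline referred to in question 1. All names, shapes, and the cosine-similarity scoring are my own assumptions, not the released implementation:

```python
import torch

def localize_and_classify(frame_feats: torch.Tensor,
                          proposal_net,              # stage 1: class-agnostic proposal generator
                          text_feats: torch.Tensor): # prompted class embeddings from the CLIP text encoder
    """frame_feats: (T, 512) CLIP image features; text_feats: (num_classes, 512)."""
    # Stage 1: class-agnostic proposals over the feature sequence, e.g. a list of (start, end) indices.
    proposals = proposal_net(frame_feats)

    results = []
    for start, end in proposals:
        # Stage 2: mean-pool the dense features inside each proposal to obtain v_i,
        # then classify it by cosine similarity to the prompted text embeddings.
        v_i = frame_feats[start:end].mean(dim=0, keepdim=True)   # (1, 512)
        v_i = v_i / v_i.norm(dim=-1, keepdim=True)
        t = text_feats / text_feats.norm(dim=-1, keepdim=True)
        scores = v_i @ t.t()                                     # (1, num_classes)
        results.append(((start, end), scores.argmax(dim=-1).item()))
    return results

# Toy usage: random features, a dummy proposal generator, random "text" embeddings.
feats = torch.randn(256, 512)
dummy_proposals = lambda f: [(0, 64), (100, 180)]
texts = torch.randn(200, 512)
print(localize_and_classify(feats, dummy_proposals, texts))
```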

I would really appreciate it if you could provide answers at your earliest convenience.
Thank you.

Intersection between test and train categories

Thank you for open-sourcing the code. For zero-shot temporal action recognition, I used the HMDB51 dataset and read train_split01.txt and test_split01.txt, which I assume are the training set and the test set; since this is a zero-shot task, the action categories should not overlap between them. But why are the action categories duplicated across these two files? Looking forward to your reply.
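
For what it's worth, this is how I checked the overlap. I am assuming each line of the split files ends with the class label, which may not match the actual format:

```python
# Sketch of the overlap check; the "<video_path> <class_label>" line layout is my assumption.
def read_classes(split_file: str) -> set:
    with open(split_file) as f:
        return {line.split()[-1] for line in f if line.strip()}

train_classes = read_classes("train_split01.txt")
test_classes = read_classes("test_split01.txt")
print(sorted(train_classes & test_classes))   # expected to be empty for a zero-shot split
```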

About K-400 Performance

Hello, thank you for kindly releasing the code!
I'm trying to reproduce the closed-set action recognition results (Table 1), but I fail to get comparable performance either with or without temporal modeling (the results are actually much worse than reported) :(
Can you share the extracted K-400 features for a further check? @ju-chen
