jpthu17 / dicosa

[IJCAI 2023] Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment

License: Apache License 2.0

Python 100.00%

cross-modal-retrieval ijcai video-retrieval

dicosa's Issues

Strange results occur when reproducing code on a GPU

I get strange results when running the code on an RTX 3090 GPU. I first compressed the videos to 3 fps using the script from CLIP4Clip:
https://github.com/ArrowLuo/CLIP4Clip/blob/master/preprocess/compress_video.py
Then I froze the CLIP model with the following code:
    for param in self.clip.parameters():
        param.requires_grad = False  # excluded from gradient updates
The training log on MSRVTT is as follows:
[2024-05-12 08:25:31,329 tvr 320 INFO]: eta: 4:50:08 epoch: 2/5 iteration: 3800/7030 time: 1.3135 (5.3897) data: 0.4849 (4.5103) loss: 6.1797 (6.1809) E_loss: 6.1559 (6.1561) M_loss: 0.0250 (0.0248) lr: logit_scale: 100.00max mem: 8443
[2024-05-12 08:28:30,665 tvr 320 INFO]: eta: 4:44:24 epoch: 2/5 iteration: 3850/7030 time: 1.3637 (5.3663) data: 0.4905 (4.4867) loss: 6.1970 (6.1808) E_loss: 6.1726 (6.1559) M_loss: 0.0248 (0.0248) lr: logit_scale: 100.00max mem: 8443
[2024-05-12 08:31:26,774 tvr 320 INFO]: eta: 4:38:42 epoch: 2/5 iteration: 3900/7030 time: 1.2943 (5.3427) data: 0.4724 (4.4631) loss: 6.1943 (6.1810) E_loss: 6.1701 (6.1561) M_loss: 0.0245 (0.0248) lr: logit_scale: 100.00max mem: 8443
[2024-05-12 08:31:26,780 tvr 485 INFO]: [start] extract train feature
[2024-05-12 08:35:03,700 tvr 505 INFO]: [finish] extract train feature
[2024-05-12 08:35:03,700 tvr 546 INFO]: [start] extract text+video feature
[2024-05-12 08:35:33,605 tvr 573 INFO]: [finish] extract text+video feature
[2024-05-12 08:35:33,605 tvr 577 INFO]: 1000 1000 1000 1000
[2024-05-12 08:35:33,605 tvr 581 INFO]: [start] calculate the similarity
[2024-05-12 08:35:33,605 tvr 387 INFO]: [finish] map to main gpu
[2024-05-12 08:35:33,609 tvr 401 INFO]: [finish] map to main gpu
[2024-05-12 08:36:08,858 tvr 584 INFO]: [end] calculate the similarity
[2024-05-12 08:36:08,858 tvr 587 INFO]: [start] compute_metrics
[2024-05-12 08:36:08,858 tvr 613 INFO]: sim matrix size: 1000, 1000
[2024-05-12 08:36:08,878 tvr 616 INFO]: Length-T: 1000, Length-V:1000
[2024-05-12 08:36:08,878 tvr 618 INFO]: [end] compute_metrics
[2024-05-12 08:36:08,878 tvr 621 INFO]: time profile: feat 29.9s match 35.25275s metrics 0.01992s
[2024-05-12 08:36:08,878 tvr 623 INFO]: Text-to-Video: R@1: 0.5 - R@5: 1.1 - R@10: 1.4 - R@50: 4.4 - Median R: 798.0 - Mean R: 683.1
[2024-05-12 08:36:08,878 tvr 625 INFO]: Video-to-Text: R@1: 0.6 - R@5: 1.1 - R@10: 1.7 - R@50: 4.6 - Median R: 810.5 - Mean R: 686.7
[2024-05-12 08:36:09,399 tvr 239 INFO]: Model saved to /root/autodl-tmp/outputs/pytorch_model.bin.step3900.2
[2024-05-12 08:36:10,072 tvr 239 INFO]: Model saved to /root/autodl-tmp/outputs/pytorch_model.bin.best.2
Can you give me some suggestions for dealing with this problem? Thanks.
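
For reference when reading the Text-to-Video and Video-to-Text lines in the log above, here is a minimal sketch of how recall and rank metrics are commonly derived from a similarity matrix. It assumes the ground-truth pair sits on the diagonal (text i matches video i); the function name `retrieval_metrics` is illustrative and is not taken from the dicosa code.

```python
from statistics import mean, median


def retrieval_metrics(sim, ks=(1, 5, 10, 50)):
    """Compute R@K, median rank and mean rank from a square similarity matrix.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the true match = 1 + number of candidates scoring strictly higher.
        rank = 1 + sum(1 for score in row if score > row[i])
        ranks.append(rank)

    n = len(ranks)
    metrics = {}
    for k in ks:
        # Percentage of queries whose true match is ranked within the top k.
        metrics[f"R@{k}"] = 100.0 * sum(1 for r in ranks if r <= k) / n
    metrics["MedianR"] = median(ranks)
    metrics["MeanR"] = mean(ranks)
    return metrics
```

With 1000 candidates, a uniformly random ranking gives an expected R@1 of about 0.1 and a median rank near 500. The logged values (R@1 of 0.5 to 0.6, median rank around 800) are no better than that, which suggests this run is not learning a useful text-video alignment.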

Where can I read the paper?

Hello, could you tell me when the paper will be available?

Questions about the inference stage

Thank you for sharing such great work!

You concatenate the latent factors of the text and video subspaces and compute the similarity through an MLP, which means that at test time this operation must also be run on the query and every candidate. Compared with the cosine similarity used in many previous methods, this does not seem efficient. I would like to hear your opinion on this issue.
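
The cost difference raised in this question can be sketched as follows. This is only an illustration under stated assumptions: `mlp_pair_scores` is a hypothetical two-layer MLP over concatenated features with made-up weights, not the architecture used in the dicosa code.

```python
import math


def cosine_scores(query, candidates):
    """Cosine similarity: one dot product per candidate.

    Candidate norms (or normalized candidates) could be precomputed offline.
    """
    q_norm = math.sqrt(sum(x * x for x in query))
    scores = []
    for cand in candidates:
        c_norm = math.sqrt(sum(x * x for x in cand))
        dot = sum(a * b for a, b in zip(query, cand))
        scores.append(dot / (q_norm * c_norm))
    return scores


def mlp_pair_scores(query, candidates, w_hidden, w_out):
    """Hypothetical pairwise scorer: an MLP forward pass per (query, candidate) pair.

    w_hidden: list of weight rows, each of length len(query) + len(candidate).
    w_out:    list of output weights, one per hidden unit.
    """
    scores = []
    for cand in candidates:
        x = list(query) + list(cand)  # concatenate the two feature vectors
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w_hidden]  # ReLU
        scores.append(sum(w * h for w, h in zip(w_out, hidden)))
    return scores
```

In the cosine case the per-candidate work is a single dot product, while the pairwise scorer repeats the whole hidden layer for each pair, so its cost grows with the hidden size as well as with the number of candidates.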

Training on one GPU

Hello,
Thank you for the repo, and well done on the project.

I would like to know whether it is possible to train on a single GPU and, if so, how.

qb_norm issues

Hello, author. I found that qb_norm is used for inference in the code, but it does not seem to be mentioned in the paper?

MSVD checkpoint

Hi,

Congrats on your amazing work! Could you please upload the MSVD checkpoint and the steps for inference?

Discrepancy between paper and code regarding attention pooling temperature

Hello,

While going through your paper and code, I noticed a discrepancy regarding the temperature parameter used in attention pooling. In the paper, it's mentioned that the softmax temperature is set to 0.01. However, in the code, the default temperature value appears to be 5, and in practice it seems to be set to 3.

Could you please clarify what the correct value of the temperature should be? It would be greatly appreciated if you could provide an explanation for the differences between these values.

Thanks in advance for your time and assistance.

Best regards
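
To make the effect of the values in question concrete, here is a minimal sketch of a softmax with a temperature, assuming the temperature divides the logits. Whether the dicosa code divides or multiplies by its value is not stated in this issue, so the function below is illustrative only.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature assumed to be a divisor)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [1.0, 2.0, 3.0]
    for tau in (0.01, 3.0, 5.0):
        weights = softmax_with_temperature(logits, tau)
        print(tau, [round(w, 3) for w in weights])
```

Under this convention, a temperature of 0.01 makes the pooling weights nearly one-hot, while 3 or 5 makes them close to uniform. If the code instead multiplies the logits by its value, a setting of 5 would correspond to a divisor of 0.2, which is one possible reason the numbers in the paper and the code look so different.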
