jpthu17 / dicosa
[IJCAI 2023] Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment
License: Apache License 2.0
I'm getting strange results when running the code on an RTX 3090 GPU. I first used the script from CLIP4Clip to compress the videos to 3 fps:
https://github.com/ArrowLuo/CLIP4Clip/blob/master/preprocess/compress_video.py
I then froze the CLIP model with the following code:
for param in self.clip.parameters():
    param.requires_grad = False  # not updated by gradient
The training log on MSRVTT is as follows:
[2024-05-12 08:25:31,329 tvr 320 INFO]: eta: 4:50:08 epoch: 2/5 iteration: 3800/7030 time: 1.3135 (5.3897) data: 0.4849 (4.5103) loss: 6.1797 (6.1809) E_loss: 6.1559 (6.1561) M_loss: 0.0250 (0.0248) lr: logit_scale: 100.00max mem: 8443
[2024-05-12 08:28:30,665 tvr 320 INFO]: eta: 4:44:24 epoch: 2/5 iteration: 3850/7030 time: 1.3637 (5.3663) data: 0.4905 (4.4867) loss: 6.1970 (6.1808) E_loss: 6.1726 (6.1559) M_loss: 0.0248 (0.0248) lr: logit_scale: 100.00max mem: 8443
[2024-05-12 08:31:26,774 tvr 320 INFO]: eta: 4:38:42 epoch: 2/5 iteration: 3900/7030 time: 1.2943 (5.3427) data: 0.4724 (4.4631) loss: 6.1943 (6.1810) E_loss: 6.1701 (6.1561) M_loss: 0.0245 (0.0248) lr: logit_scale: 100.00max mem: 8443
[2024-05-12 08:31:26,780 tvr 485 INFO]: [start] extract train feature
[2024-05-12 08:35:03,700 tvr 505 INFO]: [finish] extract train feature
[2024-05-12 08:35:03,700 tvr 546 INFO]: [start] extract text+video feature
[2024-05-12 08:35:33,605 tvr 573 INFO]: [finish] extract text+video feature
[2024-05-12 08:35:33,605 tvr 577 INFO]: 1000 1000 1000 1000
[2024-05-12 08:35:33,605 tvr 581 INFO]: [start] calculate the similarity
[2024-05-12 08:35:33,605 tvr 387 INFO]: [finish] map to main gpu
[2024-05-12 08:35:33,609 tvr 401 INFO]: [finish] map to main gpu
[2024-05-12 08:36:08,858 tvr 584 INFO]: [end] calculate the similarity
[2024-05-12 08:36:08,858 tvr 587 INFO]: [start] compute_metrics
[2024-05-12 08:36:08,858 tvr 613 INFO]: sim matrix size: 1000, 1000
[2024-05-12 08:36:08,878 tvr 616 INFO]: Length-T: 1000, Length-V:1000
[2024-05-12 08:36:08,878 tvr 618 INFO]: [end] compute_metrics
[2024-05-12 08:36:08,878 tvr 621 INFO]: time profile: feat 29.9s match 35.25275s metrics 0.01992s
[2024-05-12 08:36:08,878 tvr 623 INFO]: Text-to-Video: R@1: 0.5 - R@5: 1.1 - R@10: 1.4 - R@50: 4.4 - Median R: 798.0 - Mean R: 683.1
[2024-05-12 08:36:08,878 tvr 625 INFO]: Video-to-Text: R@1: 0.6 - R@5: 1.1 - R@10: 1.7 - R@50: 4.6 - Median R: 810.5 - Mean R: 686.7
[2024-05-12 08:36:09,399 tvr 239 INFO]: Model saved to /root/autodl-tmp/outputs/pytorch_model.bin.step3900.2
[2024-05-12 08:36:10,072 tvr 239 INFO]: Model saved to /root/autodl-tmp/outputs/pytorch_model.bin.best.2
Can you give me some suggestions for dealing with these problems? Thanks.
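For context on the numbers in the log above, here is a minimal sketch of how recall@K and the rank statistics can be computed from a square similarity matrix, assuming the matching video for text i sits at column i. The names `retrieval_metrics` and `chance_recall` and the toy matrix are hypothetical and are not taken from the repository. Under a uniformly random ranking over 1000 candidates, the expected R@K is K/1000 (0.1% for R@1, 1.0% for R@10) and the expected rank is about 500, so the reported values are roughly at chance level.

```python
from statistics import mean, median

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K, median rank and mean rank for text-to-video retrieval.

    sim[i][j] is the similarity of text i to video j; the correct video
    for text i is assumed to be video i (an assumption of this sketch).
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank is 1 + the number of candidates scoring strictly higher.
        rank = 1 + sum(1 for score in row if score > target)
        ranks.append(rank)
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["MedianR"] = median(ranks)
    metrics["MeanR"] = mean(ranks)
    return metrics

def chance_recall(k, num_candidates):
    """Expected R@K (in percent) when candidates are ranked at random."""
    return 100.0 * min(k, num_candidates) / num_candidates

# Made-up 3x3 similarity matrix, for illustration only.
toy_sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
print(retrieval_metrics(toy_sim, ks=(1, 2)))
print(chance_recall(1, 1000), chance_recall(10, 1000))
```

Comparing the logged metrics with `chance_recall` is a quick way to tell whether a model has learned anything at all before looking for subtler problems.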
Hello, may I ask when the paper will be available?
Thank you for sharing such great work!
You concatenate the latent factors of the text and video subspaces and compute similarity through an MLP, which means that at test time this operation must also be performed for the query and every candidate. Compared with the cosine similarity used in many previous methods, this does not seem efficient. I would like to hear your opinion on this issue.
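To make the efficiency concern concrete, the sketch below counts multiply-add operations for the two scoring schemes under simple assumptions: embeddings are precomputed, cosine scoring costs one dot product per text-video pair, and the pairwise scorer is a hypothetical two-layer MLP applied to each concatenated pair. The layer sizes and function names are illustrative and are not taken from the DiCoSA code.

```python
def cosine_scoring_macs(num_texts, num_videos, dim):
    """Multiply-adds for dot products over all pairs of normalized embeddings."""
    return num_texts * num_videos * dim

def pairwise_mlp_macs(num_texts, num_videos, dim, hidden):
    """Multiply-adds for a hypothetical 2-layer MLP on each concatenated pair.

    Layer 1 maps the 2*dim concatenation to `hidden` units;
    layer 2 maps `hidden` units to a single score.
    """
    per_pair = (2 * dim) * hidden + hidden * 1
    return num_texts * num_videos * per_pair

n_text, n_video = 1000, 1000  # a 1000 x 1000 evaluation, as an example
dim, hidden = 512, 512        # assumed sizes, for illustration only

cos = cosine_scoring_macs(n_text, n_video, dim)
mlp = pairwise_mlp_macs(n_text, n_video, dim, hidden)
print(f"cosine: {cos:,} MACs")
print(f"pairwise MLP: {mlp:,} MACs ({mlp / cos:.0f}x)")
```

Both schemes touch every pair, so the gap is a constant factor per pair. The practical difference is that dot-product scores reduce to a single matrix multiplication and work with approximate nearest-neighbor indexes, whereas a pairwise scorer has to be evaluated on each pair.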
Hello,
Thank you for the repo and well done for the project.
I have a question about whether, and how, it is possible to train on a single GPU.
Hello author, I found that qb_norm is used for inference in the code, but it doesn't seem to be mentioned in the paper?
Hi,
Congrats on your amazing work! Can you please upload the MSVD checkpoint and steps for inference?
Hello,
While going through your paper and code, I noticed a discrepancy regarding the temperature parameter used in attention pooling. In the paper, it's mentioned that the softmax temperature is set to 0.01. However, in the code, the default temperature value appears to be 5, and in practice it seems to be set to 3.
Could you please clarify what the correct value of the temperature should be? It would be greatly appreciated if you could provide an explanation for the differences between these values.
Thanks in advance for your time and assistance.
Best regards
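The effect of the temperature can be illustrated with a small sketch. It assumes the scores are divided by the temperature before the softmax; whether the repository divides or multiplies by this value is not stated in the issue, so treat the convention as an assumption. The function name and the example scores are hypothetical.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Softmax over scores / temperature (division is an assumption here)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

frame_scores = [0.30, 0.25, 0.10]  # made-up frame-to-text similarities

for tau in (0.01, 3.0, 5.0):
    weights = softmax_with_temperature(frame_scores, tau)
    print(tau, [round(w, 3) for w in weights])
```

Under this convention, a temperature of 0.01 puts almost all of the weight on the highest-scoring frame, while 3 or 5 yields nearly uniform weights, so the settings behave very differently.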