CrossmodalGroup / GSMN
Implementation of our CVPR2020 paper, Graph Structured Network for Image-Text Matching
Hi,
Thanks for sharing your code. Could you please provide more details about how to reproduce the paper's results? I simply ran train.py and test.py on Flickr30K and could not obtain the reported performance.
Thanks a lot!
Hi,
I have a small question about your data loader that came up while running training. When I debug it, the code below updates self.bbox in place at every epoch, so the boxes shrink toward zero after a few epochs and stop influencing the model. Am I misunderstanding something here? Many thanks.
bboxes = self.bbox[img_id]
...
for i in range(k):
    bbox = bboxes[i]
    bbox[0] /= imsize['image_w']
    bbox[1] /= imsize['image_h']
    bbox[2] /= imsize['image_w']
    bbox[3] /= imsize['image_h']
    bboxes[i] = bbox
captions = torch.Tensor(caps)
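A minimal sketch of one possible fix, reusing the names from the snippet above (self.bbox, img_id, imsize, k): normalize a copy of the boxes so the cached self.bbox entries keep their original pixel coordinates across epochs. This is only a suggestion, not the authors' confirmed solution.

import numpy as np

# Hypothetical fix: work on a copy so self.bbox is never divided in place.
bboxes = np.array(self.bbox[img_id], dtype=np.float32, copy=True)
for i in range(k):
    bboxes[i, 0] /= imsize['image_w']
    bboxes[i, 1] /= imsize['image_h']
    bboxes[i, 2] /= imsize['image_w']
    bboxes[i, 3] /= imsize['image_h']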
What a wonderful job you have done! I would like to know when you will update your code on GitHub. Thanks very much!
Dear contributor,
Many thanks for your time and attention.
When I run the code, I find that the loss does not change until epoch 6, whereas normally the loss drops quickly at the beginning. Are there any settings, or other reasons, behind this behavior?
2020-09-15 21:05:17,981 Epoch: [5][1069/2266] Eit 12400 lr 0.0002 Le 25.600017547607422 Time 0.900 (0.000) Data 0.039 (0.000)
2020-09-15 21:06:52,140 Epoch: [5][1169/2266] Eit 12500 lr 0.0002 Le 25.60001564025879 Time 0.898 (0.000) Data 0.038 (0.000)
2020-09-15 21:08:22,427 Epoch: [5][1269/2266] Eit 12600 lr 0.0002 Le 25.600017547607422 Time 0.895 (0.000) Data 0.038 (0.000)
2020-09-15 21:09:56,861 Epoch: [5][1369/2266] Eit 12700 lr 0.0002 Le 25.600017547607422 Time 1.087 (0.000) Data 0.039 (0.000)
2020-09-15 21:11:45,949 Epoch: [5][1469/2266] Eit 12800 lr 0.0002 Le 25.600017547607422 Time 1.071 (0.000) Data 0.039 (0.000)
2020-09-15 21:13:34,645 Epoch: [5][1569/2266] Eit 12900 lr 0.0002 Le 25.600013732910156 Time 1.076 (0.000) Data 0.038 (0.000)
2020-09-15 21:15:23,490 Epoch: [5][1669/2266] Eit 13000 lr 0.0002 Le 25.60000991821289 Time 1.143 (0.000) Data 0.101 (0.000)
2020-09-15 21:16:54,946 Epoch: [5][1769/2266] Eit 13100 lr 0.0002 Le 25.600013732910156 Time 0.894 (0.000) Data 0.039 (0.000)
2020-09-15 21:18:24,783 Epoch: [5][1869/2266] Eit 13200 lr 0.0002 Le 25.600013732910156 Time 0.884 (0.000) Data 0.037 (0.000)
2020-09-15 21:19:54,308 Epoch: [5][1969/2266] Eit 13300 lr 0.0002 Le 25.600011825561523 Time 0.891 (0.000) Data 0.039 (0.000)
2020-09-15 21:21:27,408 Epoch: [5][2069/2266] Eit 13400 lr 0.0002 Le 25.60000991821289 Time 0.980 (0.000) Data 0.045 (0.000)
2020-09-15 21:22:58,489 Epoch: [5][2169/2266] Eit 13500 lr 0.0002 Le 25.600013732910156 Time 0.885 (0.000) Data 0.039 (0.000)
2020-09-15 21:24:27,616 Test: [0/79] Time 3.476 (0.000)
shard_xattn batch (15,78)
calculate similarity time: 432.85513067245483
2020-09-15 21:31:49,165 Image to text: 11.9, 31.0, 42.9, 16.0, 65.0
2020-09-15 21:31:49,413 Text to image: 5.7, 17.3, 26.1, 44.0, 103.4
runs/log
test
2020-09-15 21:31:53,452 Epoch: [6][3/2266] Eit 13600 lr 0.0002 Le 25.600006103515625 Time 0.962 (0.000) Data 0.106 (0.000)
2020-09-15 21:33:23,890 Epoch: [6][103/2266] Eit 13700 lr 0.0002 Le 25.600017547607422 Time 0.895 (0.000) Data 0.038 (0.000)
2020-09-15 21:34:53,814 Epoch: [6][203/2266] Eit 13800 lr 0.0002 Le 25.6002197265625 Time 0.894 (0.000) Data 0.038 (0.000)
2020-09-15 21:36:27,509 Epoch: [6][303/2266] Eit 13900 lr 0.0002 Le 25.59981346130371 Time 1.075 (0.000) Data 0.039 (0.000)
2020-09-15 21:38:15,702 Epoch: [6][403/2266] Eit 14000 lr 0.0002 Le 25.51906967163086 Time 1.076 (0.000) Data 0.039 (0.000)
2020-09-15 21:39:46,590 Epoch: [6][503/2266] Eit 14100 lr 0.0002 Le 23.482261657714844 Time 0.907 (0.000) Data 0.047 (0.000)
2020-09-15 21:41:17,698 Epoch: [6][603/2266] Eit 14200 lr 0.0002 Le 21.11315155029297 Time 0.883 (0.000) Data 0.038 (0.000)
2020-09-15 21:42:47,759 Epoch: [6][703/2266] Eit 14300 lr 0.0002 Le 19.28626251220703 Time 0.904 (0.000) Data 0.039 (0.000)
2020-09-15 21:44:18,304 Epoch: [6][803/2266] Eit 14400 lr 0.0002 Le 18.601877212524414 Time 0.903 (0.000) Data 0.040 (0.000)
2020-09-15 21:45:48,979 Epoch: [6][903/2266] Eit 14500 lr 0.0002 Le 18.63102149963379 Time 0.906 (0.000) Data 0.039 (0.000)
2020-09-15 21:47:19,877 Epoch: [6][1003/2266] Eit 14600 lr 0.0002 Le 16.385482788085938 Time 0.903 (0.000) Data 0.048 (0.000)
2020-09-15 21:48:54,785 Epoch: [6][1103/2266] Eit 14700 lr 0.0002 Le 12.813457489013672 Time 0.977 (0.000) Data 0.044 (0.000)
2020-09-15 21:50:33,562 Epoch: [6][1203/2266] Eit 14800 lr 0.0002 Le 16.835599899291992 Time 0.978 (0.000) Data 0.045 (0.000)
2020-09-15 21:52:05,988 Epoch: [6][1303/2266] Eit 14900 lr 0.0002 Le 14.264854431152344 Time 0.897 (0.000) Data 0.039 (0.000)
2020-09-15 21:53:36,136 Epoch: [6][1403/2266] Eit 15000 lr 0.0002 Le 14.423147201538086
Many Thanks!
Could the author tell me how to fine-tune the BiGRU part of the code from the paper, and what the specific settings are?
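A generic way to fine-tune a text encoder is to give it its own parameter group and learning rate in the optimizer. The sketch below is only an illustration under the assumption that the model is an nn.Module exposing the BiGRU as a txt_enc submodule; the attribute name and learning rates are assumptions, not settings confirmed by the authors.

import torch

# Hypothetical sketch: smaller learning rate for the (assumed) BiGRU text encoder.
# model is assumed to be an nn.Module holding the txt_enc submodule.
gru_ids = {id(p) for p in model.txt_enc.parameters()}
other_params = [p for p in model.parameters() if id(p) not in gru_ids]
optimizer = torch.optim.Adam([
    {'params': other_params, 'lr': 2e-4},
    {'params': model.txt_enc.parameters(), 'lr': 2e-5},  # lower lr for the BiGRU
])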
Hello, I have some questions I'd like to ask you.
The validation set has 5070 images and 5070 texts, which are processed into 5000 image-sentence pairs. Why are the images then reduced to 1000?
Does that mean the image at index 0 corresponds to the texts at indices 0-4, so the images at indices 1-4 have no relation to the texts at indices 1-4?
Thank you.
Hello, is there a video about the paper? There is a part I haven't fully understood.
I'm also not clear on the dataset download part. Do I need to re-run SCAN to produce the required dataset files, or is there a ready-to-use dataset file available? Looking forward to your answer.
Excuse me, and thank you for your great work. As the README says, the text features, image bounding boxes, and semantic dependencies are precomputed. I want to try your idea on another dataset; could you share that part of the code with me? Thank you very much.
No such file or directory: "/media/ubuntu/data/chunxiao/vocab/f30k_precomp_vocab.json"
I am interested in the paper "Graph Structured Network for Image-Text Matching" and have also been working on this task recently. Your code at https://github.com/CrossmodalGroup/GSMN is very professional! I am particularly interested in the visualization part of the work; I think it is very interesting and impressive, but I don't know how to draw it in a professional way. Could you please release the visualization code?
I would appreciate it very much!
Thanks for your work. Referring to the accuracy of the pre-trained models you provided on Flickr30K: GSMN-dense rsum 481.4 and GSMN-sparse rsum 476.8, whereas in the paper GSMN-dense rsum is 483.6 and GSMN-sparse rsum is 480.1. Were two models used (ensembled) during evaluation to obtain the dense and sparse results, as in SCAN?
Hello, I downloaded your code and the data mentioned in the README and in #17, but the best Recall after training is never higher than 1. The program arguments are identical to those in the README; could you tell me what might be wrong?
Below is my directory structure:
./GSMN/
├── coco_dense.log
├── data.py
├── dependency_parser.py
├── evaluation.py
├── f30k_sparse.log
├── graph_model.py
├── layers.py
├── model.py
├── README.md
├── testall.py
├── test.py
├── test_stack.py
├── train.py
├── vocab.py
├── data
│   └── f30k_precomp
│       ├── dev_caps.json
│       ├── dev_caps.txt
│       ├── dev_ids.txt
│       ├── dev_ims_bbx.npy
│       ├── dev_ims.npy
│       ├── dev_ims_size.npy
│       ├── dev_precaps_stan.txt
│       ├── dev_tags.txt
│       ├── test_caps.json
│       ├── test_caps.txt
│       ├── test_ids.txt
│       ├── test_ims_bbx.npy
│       ├── test_ims.npy
│       ├── test_ims_size.npy
│       ├── test_precaps_stan.txt
│       ├── test_tags.txt
│       ├── train_caps.json
│       ├── train_caps.txt
│       ├── train_ids.txt
│       ├── train_ims_bbx.npy
│       ├── train_ims.npy
│       ├── train_ims_size.npy
│       ├── train_precaps_stan.txt
│       └── train_tags.txt
└── vocab
    └── f30k_precomp_vocab.json
Below are the arguments I used to train the sparse model on the f30k dataset:
python train.py --data_path ./data/ --data_name f30k_precomp --vocab_path ./vocab/ --logger_name ./runs/run_f30k_sparse/log --model_name ./runs/run_f30k_sparse/checkpoint --bi_gru --max_violation --lambda_softmax=20 --num_epochs=30 --lr_update=15 --learning_rate=.0002 --embed_size=1024 --batch_size=64 --is_sparse
Below are the evaluation results after 30 epochs of training:
Image to text: 0.1, 0.2, 0.3, 2230.0, 2225.5
Text to image: 0.1, 0.5, 0.9, 500.0, 500.4
Evaluating with the provided pre-trained model gives results close to those in the paper, but the model I trained myself performs very poorly. At first I suspected a data problem, so I downloaded the two files from SCAN, data (wget https://iudata.blob.core.windows.net/scan/data.zip) and vocab (wget https://iudata.blob.core.windows.net/scan/vocab.zip), and replaced the corresponding parts of the files you provided, but the results did not change. So I don't know where the problem is.
Hi,
First of all, I appreciate the paper you wrote; the content is very clear. However, with the code you released and the parameters you provided, it is very difficult to reproduce the results in the paper, and our results are far from them. Would it be possible to release the pretrained model for us to use? I really want to build something new on top of your work. I hope to get your reply.
I want to know where to download "train_precaps_stan.txt". I don't think there is a train_precaps_stan.txt in SCAN's data.
def get_gaussian_weights(self, pseudo_coord):
    ...
    weights = weights_rho * weights_theta                         # Line 117
    weights[(weights != weights).detach()] = 0                    # Line 118
    # (Line 119 is blank)
    # normalise weights                                           # Line 120
    weights = weights / torch.sum(weights, dim=1, keepdim=True)   # Line 121
    ...
The dimension of weights should be (batch_size * K, neighbourhood_size, n_kernels), but the actual dimension is (batch_size * K * neighbourhood_size, n_kernels), meaning that the weights are normalized over n_kernels instead of over the neighbourhood_size dimension. This differs from the operation in compute_weights(self, neighbourhood_weights) (Line 227). I wonder whether this is intentional, and I'm not sure whether it affects network performance. Maybe add weights = weights.view(batch_size * K, neighbourhood_size, -1) at Line 119.
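A minimal sketch of that proposed change in context, assuming batch_size, K, neighbourhood_size, weights_rho and weights_theta are in scope as in get_gaussian_weights; this simply applies the suggestion above and is not a confirmed fix.

import torch

weights = weights_rho * weights_theta                              # Line 117
weights[(weights != weights).detach()] = 0                         # Line 118: zero out NaNs
weights = weights.view(batch_size * K, neighbourhood_size, -1)     # proposed Line 119
# normalise over the neighbourhood dimension instead of n_kernels  # Line 120
weights = weights / torch.sum(weights, dim=1, keepdim=True)        # Line 121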
Dear contributor,
I read your paper, and the best performance comes from sparse+dense. May I ask how to use sparse+dense? You only provide config files for sparse or dense individually.
Thank you very much.
Hi, thanks for sharing your work.
How can I generate the bounding boxes for my own dataset? Can you provide the tool for generating them?
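For reference, the precomputed image features used here come from SCAN, which relies on bottom-up-attention Faster R-CNN region features. The sketch below only illustrates the general idea of extracting region boxes with an off-the-shelf torchvision detector; it is not the detector used to produce the released *_ims_bbx.npy files.

import torch
import torchvision
from torchvision.transforms import functional as F
from PIL import Image

# Illustrative only: a generic pretrained detector, not bottom-up attention.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()
image = Image.open('example.jpg').convert('RGB')   # hypothetical input image
with torch.no_grad():
    output = model([F.to_tensor(image)])[0]
keep = output['scores'] > 0.5                      # keep confident regions only
boxes = output['boxes'][keep]                      # (N, 4) boxes as (x1, y1, x2, y2)
print(boxes)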
I'm here to cheer this on!
Hi all,
I've encountered a challenge while working with my own data. Specifically, I'm unsure how to compute the %s_precaps_stan files. Could anyone provide some guidance or assistance with this?
Thank you,
Yasmeen
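The *_precaps_stan.txt files appear to hold the precomputed semantic dependency information mentioned in the README; their exact format is not documented in this thread. If a Stanford-style dependency parse is what is needed, the sketch below shows one way to obtain (word, relation, head) triples for a caption with Stanza, purely as an illustration and not as the authors' confirmed preprocessing pipeline.

import stanza

# Illustrative only: dependency triples for a caption. How they must be
# serialized into *_precaps_stan.txt is an assumption, not confirmed.
stanza.download('en')  # one-time model download
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse')
doc = nlp('A man in a red shirt rides a bicycle down the street.')
for sent in doc.sentences:
    for word in sent.words:
        head = sent.words[word.head - 1].text if word.head > 0 else 'ROOT'
        print(word.text, word.deprel, head)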
Now I am confused about how to set the parameters for SCAN training. Can you provide more details about the model?
Hello, when will the code for the paper be uploaded?
Traceback (most recent call last):
  File "/home/zyh/simulation/cvpr2020/GSMN-master/train.py", line 276, in <module>
    main()
  File "/home/zyh/simulation/cvpr2020/GSMN-master/train.py", line 100, in main
    opt.vocab_path, '%s_vocab.json' % opt.data_name))
  File "/home/zyh/simulation/cvpr2020/GSMN-master/vocab.py", line 57, in deserialize_vocab
    with open(src) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/zyh/simulation/cvpr2020/GSMN-master/coco_precomp/mscoco_precomp_vocab.json'
#######################################################################
The file "mscoco_precomp_vocab.json" is not found in data.zip; I would be grateful if you could provide it.
Are the test_caps.json, test_ims_bbx.npy, and test_ims_size.npy files missing for the COCO dataset?
I found that the region box and size files you provided do not match the original images, yet they still work.
For example, in the validation set, the 23rd image is 1155138244.jpg, and its size is 333 * 500 (h * w), but in the file dev_ims_size.npy it is 375 * 500.
How can this be explained?