BAAI-WuDao / BriVL
Bridging Vision and Language Model
License: MIT License
Hi, thanks for the great work!
I tested the pretrained model for zero-shot img2text and text2img retrieval on the Flickr30k-CN validation set. The bboxes are obtained as described in https://github.com/chuhaojin/BriVL-BUA-applications. For each image, we only select the one caption with the highest fluency score. However, recall@1 for the two tasks is only 15.93% and 13.74%, respectively. The same evaluation for ViLT reaches 73.2% and 55.0%. I'm wondering whether you tested on this dataset? Any comments on my results?
P.S. Example JSON lines from the dataset are as follows:
{"sentences": [["0", "一个小男孩正在玩呼啦圈。"]], "bbox": [[78, 92, 183, 124], [179, 137, 363, 214], [68, 21, 170, 101], [73, 326, 206, 498], [338, 150, 379, 187], [0, 305, 363, 396], [105, 273, 179, 342], [30, 32, 261, 483], [89, 192, 130, 210], [12, 155, 389, 498], [173, 150, 192, 167], [17, 134, 237, 353], [10, 341, 389, 496], [90, 76, 170, 169], [29, 118, 282, 363], [17, 357, 339, 402], [129, 133, 152, 155], [6, 423, 78, 498], [97, 231, 138, 250], [74, 22, 174, 175], [165, 167, 197, 191], [34, 77, 242, 494], [316, 145, 341, 197], [33, 167, 164, 323], [294, 1, 382, 19], [199, 8, 382, 158], [15, 385, 389, 497], [1, 366, 379, 396], [179, 126, 371, 228], [204, 13, 379, 130], [57, 23, 189, 235], [59, 71, 230, 482], [55, 23, 203, 167], [44, 29, 213, 248], [61, 27, 210, 219], [32, 124, 264, 367], [44, 39, 236, 286], [18, 326, 338, 445], [198, 383, 389, 496], [61, 344, 209, 498], [95, 269, 186, 340], [46, 302, 331, 471], [19, 123, 344, 307], [11, 14, 374, 409], [31, 132, 234, 357], [20, 134, 271, 354], [16, 10, 358, 360], [32, 20, 297, 478], [39, 19, 206, 157], [2, 330, 62, 443], [29, 168, 175, 331], [153, 312, 389, 404], [2, 408, 272, 498], [0, 328, 347, 467], [317, 148, 349, 197], [35, 302, 227, 458], [38, 143, 229, 366], [11, 367, 385, 492], [191, 320, 380, 389], [323, 148, 347, 199], [61, 324, 244, 498], [79, 0, 385, 495], [47, 143, 222, 355], [6, 0, 389, 221], [0, 367, 377, 407], [0, 194, 389, 498], [103, 123, 356, 222], [14, 7, 222, 183], [20, 4, 389, 164], [0, 286, 389, 497], [14, 4, 191, 132], [21, 331, 308, 438], [59, 118, 352, 219], [70, 88, 181, 128], [0, 227, 389, 498], [4, 327, 389, 490], [0, 330, 363, 451], [15, 348, 302, 436], [126, 116, 156, 147], [48, 52, 269, 480], [17, 0, 224, 154], [34, 54, 245, 478], [8, 98, 389, 491], [24, 12, 167, 110], [17, 116, 316, 361], [32, 0, 305, 476], [4, 110, 37, 201], [48, 135, 223, 349], [14, 410, 370, 497], [38, 13, 265, 391], [51, 301, 219, 483], [54, 332, 244, 484], [22, 127, 256, 356], [47, 172, 216, 360], [81, 92, 178, 124], [75, 82, 174, 140], [27, 150, 230, 361], [53, 20, 192, 152], [0, 269, 356, 357], [18, 2, 195, 118]], "image_id": "/export/PTM_dataset/flickr30k-cn/flickr30k-images/2954461906.jpg"}
{"sentences": [["0", "妇女们正在喝酒和编织。"]], "bbox": [[74, 113, 383, 271], [451, 159, 499, 273], [6, 20, 75, 106], [5, 16, 114, 277], [0, 7, 481, 251], [434, 195, 454, 221], [353, 34, 478, 264], [217, 8, 320, 161], [287, 127, 317, 209], [376, 15, 439, 72], [28, 260, 84, 277], [163, 12, 245, 154], [333, 163, 465, 269], [115, 152, 196, 195], [147, 3, 179, 78], [440, 49, 499, 185], [293, 182, 321, 211], [198, 136, 237, 180], [241, 8, 291, 58], [325, 139, 344, 178], [394, 126, 411, 149], [2, 205, 320, 277], [1, 70, 93, 197], [210, 125, 228, 156], [123, 95, 141, 152], [146, 0, 499, 65], [162, 6, 324, 152], [167, 50, 237, 131], [16, 167, 90, 274], [51, 0, 149, 80], [0, 64, 100, 233], [111, 139, 184, 181], [385, 63, 452, 151], [230, 54, 302, 138], [378, 50, 490, 264], [18, 180, 88, 266], [54, 142, 80, 163], [65, 259, 85, 277], [6, 9, 80, 112], [162, 53, 396, 151], [177, 11, 486, 254], [397, 94, 494, 267], [121, 89, 141, 148], [5, 4, 111, 277], [165, 6, 244, 149], [423, 58, 499, 254], [336, 12, 477, 273], [338, 14, 465, 258], [83, 84, 144, 142], [119, 16, 440, 163], [293, 160, 319, 214], [9, 162, 90, 270], [9, 16, 120, 277], [441, 157, 499, 272], [111, 142, 188, 184], [164, 14, 491, 271], [15, 174, 137, 275], [7, 32, 139, 276], [5, 0, 114, 277], [347, 120, 494, 277], [4, 12, 126, 277], [213, 5, 309, 161], [429, 35, 494, 175], [88, 209, 319, 276], [140, 0, 499, 75], [222, 6, 305, 153], [6, 8, 106, 277], [340, 90, 492, 277], [108, 123, 401, 274], [95, 1, 488, 268], [434, 157, 499, 271], [347, 214, 452, 274], [114, 88, 147, 154], [157, 14, 251, 154], [48, 139, 257, 271], [194, 128, 238, 181], [80, 120, 384, 273], [169, 47, 233, 133], [170, 43, 235, 133], [346, 12, 470, 195], [54, 6, 451, 244], [12, 1, 161, 88], [67, 195, 350, 275], [345, 170, 469, 269], [379, 23, 484, 201], [350, 213, 475, 273], [6, 13, 67, 109], [60, 85, 328, 266], [7, 2, 338, 263], [293, 127, 314, 203], [11, 11, 84, 107], [211, 13, 463, 205], [342, 79, 496, 274], [71, 15, 483, 169], [198, 132, 233, 175], [54, 104, 384, 269], [161, 9, 246, 152], [367, 181, 478, 270], [93, 1, 499, 103], [16, 190, 366, 276]], "image_id": "/export/PTM_dataset/flickr30k-cn/flickr30k-images/2314492671.jpg"}
The current test code is written for the sample pictures. Could you release more convenient code for extracting features from an arbitrary picture, or provide an API?
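Until such an API is released, a convenience wrapper might look like the following sketch; `brivl_model.encode_image` and the preprocessing resolution are assumptions, not the repository's actual interface (check the inference config for the real values):

```python
import torch
from PIL import Image
from torchvision import transforms

# Hypothetical preprocessing; the resolution and normalization used by the
# released BriVL checkpoint may differ.
preprocess = transforms.Compose([
    transforms.Resize((456, 456)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_image_feature(brivl_model, image_path: str) -> torch.Tensor:
    """Return an L2-normalized feature vector for a single arbitrary image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    feat = brivl_model.encode_image(img)   # assumed method name, not the official API
    return torch.nn.functional.normalize(feat, dim=-1).squeeze(0)
```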
I used the provided pre-trained BriVL model to obtain text and image embeddings for classification tasks, but the results are unsatisfactory. Will the fine-tuning code be provided?
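While fine-tuning code is unavailable, one common workaround is a linear probe on the frozen embeddings. A minimal sketch with scikit-learn, assuming `train_feats`/`test_feats` are BriVL image embeddings already extracted for the labelled data (names are illustrative):

```python
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, test_feats, test_labels):
    """Fit a logistic-regression classifier on frozen BriVL embeddings
    and return its accuracy on the held-out split."""
    clf = LogisticRegression(max_iter=1000, C=1.0)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```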
Hello, did you use the full validation set for testing, or only one of its subsets?
I have tested the model on the ImageNet-1k validation set in the zero-shot setting, with the labels translated into Chinese. However, the top-1 accuracy is only around 25%; for comparison, the corresponding figure for CLIP is 65%.
On AIC-ICC, the text2image recall@top10 is 13%, which is also far from the figure in the BriVL paper (~40%).
Could the authors give some reference results to verify the numbers on these two datasets?
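For context, the zero-shot protocol described above reduces to nearest-text-embedding classification. A minimal sketch, assuming `img_feats` are image embeddings and `label_feats` are text embeddings of the translated Chinese class names, both L2-normalized (variable names are illustrative):

```python
import numpy as np

def zero_shot_top1(img_feats: np.ndarray, label_feats: np.ndarray,
                   targets: np.ndarray) -> float:
    """Top-1 accuracy: each image is assigned the class whose text embedding
    has the highest cosine similarity with the image embedding."""
    preds = (img_feats @ label_feats.T).argmax(axis=1)
    return float((preds == targets).mean())
```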
Is the evaluation at character granularity or word granularity? If it is evaluated at word granularity, which word segmentation tool is used?
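To make the distinction concrete, here is a small illustration of the same Chinese caption tokenized at character vs. word granularity (jieba is used purely as an example segmenter; it is not necessarily the tool used for the reported evaluation):

```python
import jieba

caption = "一个小男孩正在玩呼啦圈。"
char_tokens = list(caption)              # character granularity
word_tokens = list(jieba.cut(caption))   # word granularity (example segmenter)
print(char_tokens)   # e.g. ['一', '个', '小', '男', '孩', ...]
print(word_tokens)   # e.g. ['一个', '小男孩', '正在', '玩', '呼啦圈', '。']
```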
Following the README, some extra models are required, including chinese-roberta-wwm-ext, used as the sub-model of the text encoder, and tf_efficientnet_b5_ns-6f26d0cf.pth, used as the sub-model of the image encoder (according to BriVL-BUA-applications).
In the ImgLearnableEncoder.init_param and TextLearnableEncoder.init_param functions, we noticed conditions that control whether some parameters of these backbones (i.e. the EfficientNet and chinese-roberta-wwm-ext mentioned above) have requires_grad set, in other words whether these parameters are trainable.
These two classes are used during evaluation from the VL_model class.
This evaluation confuses me: if VL_model is trainable, then the downloaded official sub-models (EfficientNet and chinese-roberta-wwm-ext) are not sufficient and their fine-tuned versions would be required. Is there something wrong?
I don't know whether I missed some details or misunderstood something.
Looking forward to your reply:)
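For anyone debugging the same question, here is a small sketch for inspecting (or explicitly freezing) backbone parameters after the model is built; `model` is assumed to be an instantiated VL_model, and the "backbone" keyword is a placeholder to be matched against the actual parameter names in the repository:

```python
import torch

def summarize_trainable(model: torch.nn.Module):
    """Print how many parameters in each top-level submodule require gradients."""
    for name, module in model.named_children():
        total = sum(p.numel() for p in module.parameters())
        trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
        print(f"{name}: {trainable}/{total} parameters trainable")

def freeze_backbones(model: torch.nn.Module, keywords=("backbone",)):
    """Freeze every parameter whose name contains one of the given keywords.
    The keywords are placeholders; adapt them to the real parameter names."""
    for name, p in model.named_parameters():
        if any(k in name for k in keywords):
            p.requires_grad = False
```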
Thanks for your excellent work!
In Chapter 3.5, you gave examples of outstanding text generation results. Could you provide more details about the image-to-text generation model?
I have seen many image-text retrieval applications; I would like to know how to use BriVL for image annotation. Is there any reference code?
How do I get the 'bbox' field in BriVL/BriVL-code-inference/data/jsonls/example.jsonl?
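The boxes in the example JSONL come from an object detector; https://github.com/chuhaojin/BriVL-BUA-applications describes the bottom-up-attention pipeline used by the authors. As a stand-in illustration only (not the authors' detector), the following sketch produces boxes in the same [x1, y1, x2, y2] format with torchvision's off-the-shelf Faster R-CNN:

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Requires torchvision >= 0.13 for the `weights` argument.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect_boxes(image_path: str, score_thresh: float = 0.5):
    """Return [x1, y1, x2, y2] boxes for one image, highest-scoring first."""
    img = transforms.ToTensor()(Image.open(image_path).convert("RGB"))
    out = detector([img])[0]
    keep = out["scores"] >= score_thresh
    return out["boxes"][keep].round().int().tolist()
```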
Dear authors,
Thanks for open-sourcing this work. I cannot find the pretrained model download link. Did you forget to include it?