BAAI-WuDao / BriVL
Bridging Vision and Language Model
License: MIT License
Hi, thanks for the great work!
I tested the pretrained model for zero-shot img2text and text2img retrieval on the Flickr30k-CN validation set. The bboxes are obtained as described in https://github.com/chuhaojin/BriVL-BUA-applications. For each image, we only select the one caption with the highest fluency score. However, recall@1 for the two tasks is only 15.93% and 13.74%, respectively. The same evaluation for ViLT reaches 73.2% and 55.0%. I'm wondering whether you tested on this dataset? Any comments on my results?
P.S. Example JSON lines from the dataset are as follows:
{"sentences": [["0", "一个小男孩正在玩呼啦圈。"]], "bbox": [[78, 92, 183, 124], [179, 137, 363, 214], [68, 21, 170, 101], [73, 326, 206, 498], [338, 150, 379, 187], [0, 305, 363, 396], [105, 273, 179, 342], [30, 32, 261, 483], [89, 192, 130, 210], [12, 155, 389, 498], [173, 150, 192, 167], [17, 134, 237, 353], [10, 341, 389, 496], [90, 76, 170, 169], [29, 118, 282, 363], [17, 357, 339, 402], [129, 133, 152, 155], [6, 423, 78, 498], [97, 231, 138, 250], [74, 22, 174, 175], [165, 167, 197, 191], [34, 77, 242, 494], [316, 145, 341, 197], [33, 167, 164, 323], [294, 1, 382, 19], [199, 8, 382, 158], [15, 385, 389, 497], [1, 366, 379, 396], [179, 126, 371, 228], [204, 13, 379, 130], [57, 23, 189, 235], [59, 71, 230, 482], [55, 23, 203, 167], [44, 29, 213, 248], [61, 27, 210, 219], [32, 124, 264, 367], [44, 39, 236, 286], [18, 326, 338, 445], [198, 383, 389, 496], [61, 344, 209, 498], [95, 269, 186, 340], [46, 302, 331, 471], [19, 123, 344, 307], [11, 14, 374, 409], [31, 132, 234, 357], [20, 134, 271, 354], [16, 10, 358, 360], [32, 20, 297, 478], [39, 19, 206, 157], [2, 330, 62, 443], [29, 168, 175, 331], [153, 312, 389, 404], [2, 408, 272, 498], [0, 328, 347, 467], [317, 148, 349, 197], [35, 302, 227, 458], [38, 143, 229, 366], [11, 367, 385, 492], [191, 320, 380, 389], [323, 148, 347, 199], [61, 324, 244, 498], [79, 0, 385, 495], [47, 143, 222, 355], [6, 0, 389, 221], [0, 367, 377, 407], [0, 194, 389, 498], [103, 123, 356, 222], [14, 7, 222, 183], [20, 4, 389, 164], [0, 286, 389, 497], [14, 4, 191, 132], [21, 331, 308, 438], [59, 118, 352, 219], [70, 88, 181, 128], [0, 227, 389, 498], [4, 327, 389, 490], [0, 330, 363, 451], [15, 348, 302, 436], [126, 116, 156, 147], [48, 52, 269, 480], [17, 0, 224, 154], [34, 54, 245, 478], [8, 98, 389, 491], [24, 12, 167, 110], [17, 116, 316, 361], [32, 0, 305, 476], [4, 110, 37, 201], [48, 135, 223, 349], [14, 410, 370, 497], [38, 13, 265, 391], [51, 301, 219, 483], [54, 332, 244, 484], [22, 127, 256, 356], [47, 172, 216, 360], [81, 92, 178, 124], [75, 82, 174, 140], [27, 150, 230, 361], [53, 20, 192, 152], [0, 269, 356, 357], [18, 2, 195, 118]], "image_id": "/export/PTM_dataset/flickr30k-cn/flickr30k-images/2954461906.jpg"}
{"sentences": [["0", "妇女们正在喝酒和编织。"]], "bbox": [[74, 113, 383, 271], [451, 159, 499, 273], [6, 20, 75, 106], [5, 16, 114, 277], [0, 7, 481, 251], [434, 195, 454, 221], [353, 34, 478, 264], [217, 8, 320, 161], [287, 127, 317, 209], [376, 15, 439, 72], [28, 260, 84, 277], [163, 12, 245, 154], [333, 163, 465, 269], [115, 152, 196, 195], [147, 3, 179, 78], [440, 49, 499, 185], [293, 182, 321, 211], [198, 136, 237, 180], [241, 8, 291, 58], [325, 139, 344, 178], [394, 126, 411, 149], [2, 205, 320, 277], [1, 70, 93, 197], [210, 125, 228, 156], [123, 95, 141, 152], [146, 0, 499, 65], [162, 6, 324, 152], [167, 50, 237, 131], [16, 167, 90, 274], [51, 0, 149, 80], [0, 64, 100, 233], [111, 139, 184, 181], [385, 63, 452, 151], [230, 54, 302, 138], [378, 50, 490, 264], [18, 180, 88, 266], [54, 142, 80, 163], [65, 259, 85, 277], [6, 9, 80, 112], [162, 53, 396, 151], [177, 11, 486, 254], [397, 94, 494, 267], [121, 89, 141, 148], [5, 4, 111, 277], [165, 6, 244, 149], [423, 58, 499, 254], [336, 12, 477, 273], [338, 14, 465, 258], [83, 84, 144, 142], [119, 16, 440, 163], [293, 160, 319, 214], [9, 162, 90, 270], [9, 16, 120, 277], [441, 157, 499, 272], [111, 142, 188, 184], [164, 14, 491, 271], [15, 174, 137, 275], [7, 32, 139, 276], [5, 0, 114, 277], [347, 120, 494, 277], [4, 12, 126, 277], [213, 5, 309, 161], [429, 35, 494, 175], [88, 209, 319, 276], [140, 0, 499, 75], [222, 6, 305, 153], [6, 8, 106, 277], [340, 90, 492, 277], [108, 123, 401, 274], [95, 1, 488, 268], [434, 157, 499, 271], [347, 214, 452, 274], [114, 88, 147, 154], [157, 14, 251, 154], [48, 139, 257, 271], [194, 128, 238, 181], [80, 120, 384, 273], [169, 47, 233, 133], [170, 43, 235, 133], [346, 12, 470, 195], [54, 6, 451, 244], [12, 1, 161, 88], [67, 195, 350, 275], [345, 170, 469, 269], [379, 23, 484, 201], [350, 213, 475, 273], [6, 13, 67, 109], [60, 85, 328, 266], [7, 2, 338, 263], [293, 127, 314, 203], [11, 11, 84, 107], [211, 13, 463, 205], [342, 79, 496, 274], [71, 15, 483, 169], [198, 132, 233, 175], [54, 104, 384, 269], [161, 9, 246, 152], [367, 181, 478, 270], [93, 1, 499, 103], [16, 190, 366, 276]], "image_id": "/export/PTM_dataset/flickr30k-cn/flickr30k-images/2314492671.jpg"}
The current test code is written for the sample pictures. Could you release more convenient code for extracting features from an arbitrary picture, or provide an API?
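Until such an API is released, a convenience wrapper might look like the following sketch; `brivl_model.encode_image` and the preprocessing resolution are assumptions, not the repository's actual interface (check the inference config for the real values):

```python
import torch
from PIL import Image
from torchvision import transforms

# Hypothetical preprocessing; the resolution and normalization used by the
# released BriVL checkpoint may differ.
preprocess = transforms.Compose([
    transforms.Resize((456, 456)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_image_feature(brivl_model, image_path: str) -> torch.Tensor:
    """Return an L2-normalized feature vector for a single arbitrary image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    feat = brivl_model.encode_image(img)   # assumed method name, not the official API
    return torch.nn.functional.normalize(feat, dim=-1).squeeze(0)
```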
I used the provided pre-trained BriVL model to obtain text and image embeddings for classification tasks, but the results are unsatisfactory. Will the fine-tuning code be provided?
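While fine-tuning code is unavailable, one common workaround is a linear probe on the frozen embeddings. A minimal sketch with scikit-learn, assuming `train_feats`/`test_feats` are BriVL image embeddings already extracted for the labelled data (names are illustrative):

```python
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, test_feats, test_labels):
    """Fit a logistic-regression classifier on frozen BriVL embeddings
    and return its accuracy on the held-out split."""
    clf = LogisticRegression(max_iter=1000, C=1.0)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```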
Hello, did you use the full validation set for testing, or only one of its subsets?
I have tested the model on the ImageNet-1k validation set in the zero-shot setting, with the labels translated into Chinese. However, the top-1 accuracy is only around 25%; for comparison, the corresponding figure for CLIP is 65%.
On AIC-ICC, the text2image recall@top10 is 13%, which is also far from the figure in the BriVL paper (~40%).
Could the authors give some reference results to verify the numbers on these two datasets?
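For context, the zero-shot protocol described above reduces to nearest-text-embedding classification. A minimal sketch, assuming `img_feats` are image embeddings and `label_feats` are text embeddings of the translated Chinese class names, both L2-normalized (variable names are illustrative):

```python
import numpy as np

def zero_shot_top1(img_feats: np.ndarray, label_feats: np.ndarray,
                   targets: np.ndarray) -> float:
    """Top-1 accuracy: each image is assigned the class whose text embedding
    has the highest cosine similarity with the image embedding."""
    preds = (img_feats @ label_feats.T).argmax(axis=1)
    return float((preds == targets).mean())
```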
Is the evaluation at character granularity or word granularity? If it is evaluated at word granularity, which word segmentation tool is used?
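To make the distinction concrete, here is a small illustration of the same Chinese caption tokenized at character vs. word granularity (jieba is used purely as an example segmenter; it is not necessarily the tool used for the reported evaluation):

```python
import jieba

caption = "一个小男孩正在玩呼啦圈。"
char_tokens = list(caption)              # character granularity
word_tokens = list(jieba.cut(caption))   # word granularity (example segmenter)
print(char_tokens)   # e.g. ['一', '个', '小', '男', '孩', ...]
print(word_tokens)   # e.g. ['一个', '小男孩', '正在', '玩', '呼啦圈', '。']
```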
Following the README, some extra models are required, including chinese-roberta-wwm-ext, used as the sub-model of the text encoder, and tf_efficientnet_b5_ns-6f26d0cf.pth, used as the sub-model of the image encoder (according to BriVL-BUA-applications).
In the ImgLearnableEncoder.init_param and TextLearnableEncoder.init_param functions, we noticed conditions that control whether some parameters of these backbones (i.e. the EfficientNet and chinese-roberta-wwm-ext mentioned above) have requires_grad set, in other words whether these parameters are trainable.
These two classes are used during evaluation from the VL_model class.
This evaluation confuses me: if VL_model is trainable, then the downloaded official sub-models (EfficientNet and chinese-roberta-wwm-ext) are not sufficient and their fine-tuned versions would be required. Is there something wrong?
I don't know whether I missed some details or misunderstood something.
Looking forward to your reply:)
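For anyone debugging the same question, here is a small sketch for inspecting (or explicitly freezing) backbone parameters after the model is built; `model` is assumed to be an instantiated VL_model, and the "backbone" keyword is a placeholder to be matched against the actual parameter names in the repository:

```python
import torch

def summarize_trainable(model: torch.nn.Module):
    """Print how many parameters in each top-level submodule require gradients."""
    for name, module in model.named_children():
        total = sum(p.numel() for p in module.parameters())
        trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
        print(f"{name}: {trainable}/{total} parameters trainable")

def freeze_backbones(model: torch.nn.Module, keywords=("backbone",)):
    """Freeze every parameter whose name contains one of the given keywords.
    The keywords are placeholders; adapt them to the real parameter names."""
    for name, p in model.named_parameters():
        if any(k in name for k in keywords):
            p.requires_grad = False
```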
Thanks for your excellent work!
In Chapter 3.5, you gave examples of outstanding text generation results. Could you provide more details about the image-to-text generation model?
I have seen many image-text retrieval applications; I would like to know how to use BriVL for image annotation. Is there any reference code?
How do I get the 'bbox' field in BriVL/BriVL-code-inference/data/jsonls/example.jsonl?
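The boxes in the example JSONL come from an object detector; https://github.com/chuhaojin/BriVL-BUA-applications describes the bottom-up-attention pipeline used by the authors. As a stand-in illustration only (not the authors' detector), the following sketch produces boxes in the same [x1, y1, x2, y2] format with torchvision's off-the-shelf Faster R-CNN:

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Requires torchvision >= 0.13 for the `weights` argument.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect_boxes(image_path: str, score_thresh: float = 0.5):
    """Return [x1, y1, x2, y2] boxes for one image, highest-scoring first."""
    img = transforms.ToTensor()(Image.open(image_path).convert("RGB"))
    out = detector([img])[0]
    keep = out["scores"] >= score_thresh
    return out["boxes"][keep].round().int().tolist()
```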
Dear authors,
Thanks for open-sourcing this work. I cannot find the pretrained model download link. Did you forget to include it?