researchmm / soho

[CVPR'21 Oral] Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

soho's Issues

Is it abnormal to get so many unexpected keys?
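Unexpected keys generally mean that some parameter names in the checkpoint do not match the names the model defines. A minimal sketch for narrowing this down, assuming the mismatch may come from a wrapper prefix such as `module.` (an assumption, not something confirmed for this repository). The key names below are hypothetical and only illustrate the comparison:

```python
def diff_state_dict_keys(model_keys, checkpoint_keys, strip_prefix="module."):
    """Compare parameter names, optionally dropping a wrapper prefix first."""
    def normalize(key):
        # Drop the prefix only when it is actually present.
        if strip_prefix and key.startswith(strip_prefix):
            return key[len(strip_prefix):]
        return key

    model = {normalize(k) for k in model_keys}
    ckpt = {normalize(k) for k in checkpoint_keys}
    return {
        # Expected by the model but absent from the checkpoint.
        "missing": sorted(model - ckpt),
        # Present in the checkpoint but unknown to the model.
        "unexpected": sorted(ckpt - model),
    }


# Hypothetical key names, for illustration only.
model_keys = ["backbone.conv1.weight", "head.fc.weight"]
checkpoint_keys = ["module.backbone.conv1.weight", "module.neck.vd.embed"]

report = diff_state_dict_keys(model_keys, checkpoint_keys)
print(report)
```

With PyTorch, the two inputs would typically be `model.state_dict().keys()` and the keys of the loaded checkpoint dictionary. Some checkpoints nest the weights under a top-level entry, so it is worth printing the top-level keys first.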

Pretrained models cannot be downloaded

"wget https://sohose.s3.ap-southeast-1.amazonaws.com/checkpoint/soho_res18_fp16_40-9441cdd3.pth"

I cannot download this .pth file, even when using a VPN.

It throws an "ERROR 403: Forbidden" error.

Can you fix it? Thanks.

The Accuracy of Masked Visual Modeling

Hi, what MVM accuracy does your pretrained model reach? I only got about 30% during pretraining and wanted to know whether that is normal.
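When comparing MVM accuracy numbers, it helps to state exactly how the number is computed. A minimal sketch, assuming accuracy is counted only over masked positions and that unmasked positions carry a label of `-1` (both are assumptions for illustration, not confirmed against the repository):

```python
def masked_accuracy(predictions, labels, ignore_index=-1):
    """Fraction of correct predictions, counted over masked positions only."""
    correct = 0
    counted = 0
    for pred, label in zip(predictions, labels):
        if label == ignore_index:
            # Position was not masked, so it does not contribute.
            continue
        counted += 1
        correct += int(pred == label)
    # Avoid dividing by zero when nothing was masked.
    return correct / counted if counted else 0.0


# Toy example: positions 1 and 4 were not masked.
preds = [3, 7, 7, 1, 0]
labels = [3, -1, 2, 1, -1]
acc = masked_accuracy(preds, labels)
print(f"MVM accuracy over masked positions: {acc:.3f}")
```

Averaging over all positions instead of only the masked ones would give a different figure, so two reported accuracies are only comparable if they use the same convention.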

Where can I find the VD?

In the paper, there is a Visual Dictionary (VD) that remodels the query image, but the class SOHO_direct_VD (SOHO/models/necks/utils.py) only applies torch.argmax to the image in the code, which does not match what you describe in the paper. Please tell me where I can find the VD as it is described in the paper. Thank you.
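For reference, this is how I read the lookup step of the VD in the paper: each visual feature is mapped to the index of its nearest dictionary embedding and then replaced by that embedding. The sketch below is a plain-Python illustration of that reading only. The function name `vd_lookup` and the toy values are hypothetical, it is not the repository's code, and the update of the dictionary entries during training is left out:

```python
def vd_lookup(features, dictionary):
    """Map each feature vector to its nearest dictionary entry (L2 distance)."""
    indices, quantized = [], []
    for feat in features:
        # Squared L2 distance from this feature to every dictionary embedding.
        dists = [sum((f - d) ** 2 for f, d in zip(feat, emb)) for emb in dictionary]
        # Index of the smallest distance (argmin).
        idx = min(range(len(dists)), key=dists.__getitem__)
        indices.append(idx)
        # The feature is represented by the selected embedding.
        quantized.append(list(dictionary[idx]))
    return indices, quantized


# Toy dictionary with three 2-d embeddings and three query features.
dictionary = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
features = [[0.9, 1.2], [3.5, 0.1], [0.1, -0.2]]

indices, quantized = vd_lookup(features, dictionary)
print(indices)
print(quantized)
```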

So you have open-sourced basically nothing..

Where is the VD?

The download links appear to be broken; can you update them? Thank you.

wget https://sohose.s3.ap-southeast-1.amazonaws.com/data/pretraining/coco_cap_train_pre.json
wget https://sohose.s3.ap-southeast-1.amazonaws.com/data/pretraining/coco_cap_val_pre.json
wget https://sohose.s3.ap-southeast-1.amazonaws.com/data/pretraining/vg_cap_pre.json

Cannot reproduce the performance on the Visual Entailment dataset

Hi,
I ran pretraining with a ResNet-18 + 3-layer Transformer on the in-domain data (without the MVM loss).

I get a similar result on the VQA downstream task, around 66.5 accuracy.
But the performance on Visual Entailment is noticeably lower than reported in the paper: I only get 74 accuracy (~82% reported in the paper).
I am wondering why the ResNet-18 + 3-layer model outperforms UNITER-Base.
Are there any training strategies specific to this downstream task?

Thanks