clipscore's Introduction

What's in here?

This repo contains the code for our EMNLP 2021 paper: CLIPScore: A Reference-free Evaluation Metric for Image Captioning. CLIPScore is a metric that you can use to evaluate the quality of an automatic image captioning system. In our paper, we show that CLIPScore achieves high correlation with human judgment on literal image captioning tasks. However, unlike BLEU or CIDEr, CLIPScore doesn't require reference captions.

If you find the paper or this code useful, please consider citing:

@inproceedings{hessel2021clipscore,
  title={{CLIPScore:} A Reference-free Evaluation Metric for Image Captioning},
  author={Hessel, Jack and Holtzman, Ari and Forbes, Maxwell and Bras, Ronan Le and Choi, Yejin},
  booktitle={EMNLP},
  year={2021}
}

How do I run the code?

Command Line

Example usage

> python clipscore.py example/good_captions.json example/images/
...
CLIPScore: 0.8584

If you optionally include references, you will also see RefCLIPScore, alongside the usual set of reference-based caption generation evaluation metrics.

> python clipscore.py example/good_captions.json example/images/ --references_json example/refs.json
...
BLEU-1: 0.6667
BLEU-2: 0.4899
BLEU-3: 0.3469
BLEU-4: 0.0000
METEOR: 0.3444
ROUGE: 0.4280
CIDER: 0.5637
SPICE: 0.4000
CLIPScore: 0.8584
RefCLIPScore: 0.8450

Worse captions should get lower scores:

> python clipscore.py example/bad_captions.json example/images/ --references_json example/refs.json
...
BLEU-1: 0.4815
BLEU-2: 0.2404
BLEU-3: 0.1359
BLEU-4: 0.0000
METEOR: 0.1861
ROUGE: 0.3121
CIDER: 0.2790
SPICE: 0.1500
CLIPScore: 0.7153
RefCLIPScore: 0.7253

You can treat/report CLIPScore and RefCLIPScore similarly to the other evaluation metrics. See the paper for more details about CLIPScore and RefCLIPScore. Full usage options can be listed with python clipscore.py -h. An example set of inputs, including a candidate json, an image directory, and a references json, is given in this repo under example/.
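
To make the metrics concrete, here is a minimal, illustrative sketch of the CLIP-S and RefCLIP-S formulas from the paper (w = 2.5, cosine similarity clipped at 0, and RefCLIP-S as a harmonic mean). It assumes you already have unit-normalized CLIP embeddings for the image, the candidate, and the references; the authoritative implementation is in clipscore.py.

import numpy as np

def clip_s(image_emb, cand_emb, w=2.5):
    # CLIP-S(c, v) = w * max(cos(c, v), 0); with unit-normalized embeddings,
    # the dot product equals the cosine similarity.
    return w * max(float(np.dot(image_emb, cand_emb)), 0.0)

def ref_clip_s(image_emb, cand_emb, ref_embs, w=2.5):
    # RefCLIP-S is the harmonic mean of CLIP-S and the maximum (clipped) cosine
    # similarity between the candidate and any reference caption.
    cs = clip_s(image_emb, cand_emb, w=w)
    max_ref = max(max(float(np.dot(cand_emb, r)), 0.0) for r in ref_embs)
    if cs == 0.0 or max_ref == 0.0:
        return 0.0
    return 2 * cs * max_ref / (cs + max_ref)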

The input files are formatted as follows:

The candidates json should be a dictionary that maps from {"string_image_identifier": "candidate"}, e.g.,

{"image1": "an orange cat and a grey cat are lying together.",
 "image2": "a black dog looks at the camera.",
 ...}

The image directory should be a directory containing the images whose file names (without extensions) serve as the keys in the candidates json, e.g.,

images/
├── image1.jpg
└── image2.jpg

and, finally, the references json should be a dictionary that maps from {"string_image_identifier": ["list", "of", "references"]}, e.g.,

{"image1": ["two cats are sleeping next to each other.",
            "a grey cat is cuddling with an orange cat on a blanket.",
	    "the orange cat is happy that the black cat is close to it."],
 "image2": ["a dog is wearing ear muffs as it lies on a carpet.",
            "a black dog and an orange cat are looking at the photographer.",
	    "headphones are placed on a dogs ears."]}

MSCOCO dataset in pycocoevalcap

If you're running on the MSCOCO dataset and using the standard evaluation toolkit, you can use our version of pycocoevalcap to evaluate. You won't even need to download the original MSCOCO images, thanks to a bit of magic :-)

To use pycocoevalcap on the MSCOCO dataset in the MSCOCO format, you can simply use:

pip install git+https://github.com/jmhessel/pycocoevalcap.git

There is an example evaluation in that repo under examples/eval.py. After pip installing, if you clone the pycocoevalcap repo and run

python eval.py

after a bit of time, the output should be:

Bleu_1: 0.579
Bleu_2: 0.404
Bleu_3: 0.279
Bleu_4: 0.191
METEOR: 0.195
ROUGE_L: 0.396
CIDEr: 0.600
SPICE: 0.133
CLIPScore: 0.528
RefCLIPScore: 0.605
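
For orientation, the evaluation in that repo follows the standard pycocoevalcap pattern, roughly like the sketch below (the annotation and results paths are placeholders; see examples/eval.py for the actual script). In this fork, the reported metrics should additionally include CLIPScore and RefCLIPScore.

from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

annotation_file = "annotations/captions_val2014.json"            # ground-truth MSCOCO captions
results_file = "results/captions_val2014_fakecap_results.json"   # your system's captions

coco = COCO(annotation_file)
coco_result = coco.loadRes(results_file)

coco_eval = COCOEvalCap(coco, coco_result)
coco_eval.params['image_id'] = coco_result.getImgIds()  # evaluate only the images with results
coco_eval.evaluate()

for metric, score in coco_eval.eval.items():
    print(f'{metric}: {score:.3f}')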

Reproducibility notes:

  • CLIPScore can run on either CPU or GPU, but there are slight differences due to floating point precision. As discussed here, on CPU all operations run in float32, while on GPU some operations run in float16. The differences are generally small (e.g., for the example run above, with the example/good_captions.json captions and example/images/ images, the CPU output is CLIPScore: 0.8585, while the GPU output is CLIPScore: 0.8584). All experiments in the paper were run on GPU, and this code will raise a warning if you're not using a GPU.

  • Because CLIPScore is computed from the images themselves, resizing, compressing, etc. can all cause slight differences in the resulting score. Even saving a jpg twice can result in different compression, because that format is lossy! For this reason, we release the checksums of the images we used for the paper; see checksums/ for more info (a sketch of how you might verify them is given after this list). For the pycocoevalcap repo, we have also included the checksums for MSCOCO --- see here for more info.

  • The prompt we used for the text side of CLIP, as mentioned in the paper, is "A photo depicts". This is hard-coded into this repo. Other prompts will result in slightly different results, and we don't recommend them, for the sake of reproducibility.
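
As a concrete example for the checksum note above, here is a small sketch of computing digests for your local images so you can compare them against the released checksum files; the MD5 algorithm and the "digest  filename" output format are assumptions, so check the files under checksums/ for the actual convention.

import hashlib
import pathlib

def file_digest(path, algo="md5", chunk_size=1 << 20):
    # Hash the file in chunks so large images don't need to fit in memory at once.
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

image_dir = pathlib.Path("example/images")
for image_path in sorted(image_dir.glob("*.jpg")):
    print(f"{file_digest(image_path)}  {image_path.name}")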

Acknowledgment

The authors would like to thank Jungo Kasai for being the first to use this repo. Thanks to Jungo, we fixed a few issues, and added some information about reproducibility that was missing before.

clipscore's Issues

About subset FOIL dataset

To adapt the corpus to our setting, for each of the 32K test images, we sample a (FOIL, true) pair, and compute the accuracy of each evaluation metric in their capacity to assign a higher score to the true candidate versus the FOIL.

Hi there, I want to reproduce Table 5 (corresponding to Section 4.4, Sensitivity of CLIP-S to hallucination), and found that there is some randomness in sampling the FOIL pairs. Do you happen to have a log of the sampled image keys, so that I can reproduce the same result?

How to replicate the COCO challenge result?

Hi, I tried to replicate the result on the dev set of the COCO 2015 challenge, but I could not find the corresponding dataset and its human annotations. Can you point me to the location of the dataset? Thank you for the help!

Correlations on Composite

Hello,

Thanks for the nice contribution.
I am trying to understand how you calculated the correlations with the Composite caption-level Likert judgements.

You mentioned in the paper that Composite contains 12K judgements, with F8K (997 imgs), F30K (991 imgs), and MSCOCO (2007 imgs).
In the judgement .csv files at the AMT eval link you provided, there are 3 judgements for each F8K img, 4 for each F30K img, and 4 for each MSCOCO img. This adds up to ~15K judgements.

Is there a reason why you considered only 12K, or am I missing something?

Difference in CPU/GPU results larger than expected

Hello,

I've successfully reproduced the results from the README on the CPU, but the results are very different when I switch the code to the GPU. I'm aware of the minor difference you mention in the README, but this is larger.

For example, the "good" captions have a CLIPScore of 0.8585, but when I switch to the GPU the score is 2.52734375. I think this may be specific to my environment/GPU, because this didn't happen when I ran it on a GPU with Codalab.

Have you seen this issue before? Here is my exact environment:

clip @ git+git://github.com/openai/CLIP.git@2867559c5fe0b02a2d3167aeacd333b3c4276847
cycler==0.11.0
Cython==0.29.24
ftfy==6.0.3
joblib==1.1.0
kiwisolver==1.3.2
matplotlib==3.4.3
numpy==1.21.3
Pillow==8.4.0
pycocoevalcap @ git+git://github.com/jmhessel/pycocoevalcap.git@273d8d5c42ca81fb43fd6bd699cfc4aa26bcda9a
pycocotools==2.0.2
pyparsing==3.0.4
python-dateutil==2.8.2
regex==2021.11.2
scikit-learn==1.0.1
scipy==1.7.1
six==1.16.0
sklearn==0.0
threadpoolctl==3.0.0
torch==1.7.1
torchvision==0.8.2
tqdm==4.62.3
typing-extensions==3.10.0.2
wcwidth==0.2.5

I tried with CUDA 10.2 and 11.0 with the same result.

Thanks!
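
For reference, here is a quick sketch of checking which device and parameter dtype the CLIP model actually loads with (the stock openai/CLIP loader keeps float16 weights on CUDA but float32 on CPU), in case that helps pin down the difference:

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
model.eval()

# Expect torch.float16 on CUDA and torch.float32 on CPU with the stock openai/CLIP loader.
print(device, next(model.parameters()).dtype)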

Access to the Dataset Used in Your 'Case Study' Section

Thanks for your excellent work on CLIPScore! I am also working on caption evaluation and am wondering where to download the alt-text, personality, and news image captioning human rating datasets used in the 'Case Study' section of your paper. I would appreciate it if you could offer the download links.

Error in running code

When I run python clipscore.py example/good_captions.json example/images/, the following error occurs:
Traceback (most recent call last):
File "D:\ConZIC-main\clipscore-main\clipscore.py", line 273, in <module>
main()
File "D:\ConZIC-main\clipscore-main\clipscore.py", line 232, in main
image_feats = extract_all_images(
File "D:\ConZIC-main\clipscore-main\clipscore.py", line 126, in extract_all_images
for b in tqdm.tqdm(data):
File "C:\Users\Prc\anaconda3\envs\pytorch\lib\site-packages\tqdm\std.py", line 1180, in __iter__
for obj in iterable:
File "C:\Users\Prc\anaconda3\envs\pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 359, in __iter__
return self._get_iterator()
File "C:\Users\Prc\anaconda3\envs\pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 305, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "C:\Users\Prc\anaconda3\envs\pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 918, in __init__
w.start()
File "C:\Users\Prc\anaconda3\envs\pytorch\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "C:\Users\Prc\anaconda3\envs\pytorch\lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\Prc\anaconda3\envs\pytorch\lib\multiprocessing\context.py", line 327, in _Popen
return Popen(process_obj)
File "C:\Users\Prc\anaconda3\envs\pytorch\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
reduction.dump(process_obj, to_child)
File "C:\Users\Prc\anaconda3\envs\pytorch\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'CLIPImageDataset._transform_test.<locals>.<lambda>'

COCO challenge result reproduction

I want to reproduce the COCO challenge results, but I got very different results.
My idea is to calculate the average CLIPScore over the predictions of each of the 12 models, then use the resulting 12-dimensional vector to compute the correlation with the human metrics (M1, M2).
The data I use is MSCOCO val2014.
This file is downloaded from https://cocodataset.org/#captions-leaderboard.
The predictions of the 12 models are from the BERTScore issue with a similar experiment: Tiiiger/bert_score#79 (comment)
Here is my reproduction code:

import json
from pathlib import Path

import clip
import pandas as pd
from scipy import stats

# filename2id, init_image_features, get_clip_score, images_root_dir, and device
# are defined/imported elsewhere in my script.

def cal_metric(model_name, clip_model, images_filename, device, image_features):
    # Load the submitted captions for one model and score them against the precomputed image features.
    root_dir = Path('test_dataset/coco_captioning_challenge') / model_name
    data_path = list(root_dir.glob('captions_val2014*_results.json'))[0]
    with open(data_path, 'r') as f:
        data = json.load(f)
    id2text = {
        d['image_id']: d['caption'] for d in data
    }

    text = [id2text[filename2id(k)] for k in images_filename]
    res = get_clip_score(clip_model, image_features, text, device)
    return res


model2company = {
    'kolarmartin': 'Brno University',
    'karpathy': 'NeuralTalk',
    'rakshithShetty': 'PicSOM',
    'junhua.mao': 'm-RNN',
    'OriolVinyals': 'Google',
    'myamaguchi': 'MIL',
    'Q.Wu': 'ACVT',
    'jeffdonahue': 'Berkeley LRCN',
    'mRNN_share.JMao': 'm-RNN',
    'TsinghuaBigeye': 'Tsinghua Bigeye',
    'ryank': 'MLBL',
    'kelvin_xu': 'Montreal/Toronto'
}
model_list = model2company.keys()

clip_model, transform = clip.load("ViT-B/32", device=device, jit=False)
clip_model.eval()
human_metric = pd.read_csv('./test_dataset/coco_captioning_challenge/leaderboard.csv')

images_filename, images_features = init_image_features(clip_model, images_root_dir, device)
clip_score_res = []
human_metric_res_m1 = []
human_metric_res_m2 = []
for model_name in model_list:
    mean_score, per_score, _ = cal_metric(model_name, clip_model, images_filename, device, images_features)
    clip_score_res.append(mean_score)
    human_metric_res_m1.append(human_metric[model2company[model_name]]['M1'])
    human_metric_res_m2.append(human_metric[model2company[model_name]]['M2'])


m1_spearmanr, m1_p_value = stats.spearmanr(clip_score_res, human_metric_res_m1)
print(f'CLIPScore for M1 Spearmanr: {m1_spearmanr}, p-value: {m1_p_value}')
m2_spearmanr, m2_p_value = stats.spearmanr(clip_score_res, human_metric_res_m2)
print(f'CLIPScore for M2 Spearmanr: {m2_spearmanr}, p-value: {m2_p_value}')

The results I got:

CLIPScore for M1 Spearmanr: 0.6984224745578604, p-value: 0.011522247104936925
CLIPScore for M2 Spearmanr: 0.6984224745578604, p-value: 0.011522247104936925

Is there anything wrong?

Reproducing Pascal-50S

Thanks for sharing the code.

I am trying to reproduce Pascal-50S, but I get somewhat different scores, as follows:

            HC     HI     HM     MM     Mean
CLIP-S      0.551  0.992  0.960  0.715  0.804
RefCLIP-S   0.650  0.995  0.956  0.737  0.835

The scores do not change much when resampling the five references from among the 48 candidates.

Observations are:

  1. CLIP-S is lower than RefCLIP-S.
  2. The mean of RefCLIP-S is similar to the paper's report, but the HC and MM results are significantly different.

For integrity, I checked the number of samples for each category (=1K each) and double-checked the categories using the Pascal-50S dataset fields new_data and category.

I believe there are very few factors that could swing the results. Here is what I did:

  1. I preprocessed all captions using strip() and added the prefix "A photo depicts ".
  2. For the CLIP encoding, I used the CLIP tokenizer, i.e., clip.tokenize(caption, truncate=True).
  3. I randomly selected 5 references from among the 48 candidates for RefCLIP-S.

Is there any other factor I need to control to reproduce the results as in the paper?
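
For concreteness, here is a minimal sketch of the preprocessing/encoding in items 1 and 2, using the standard openai/CLIP API (the example caption and variable names are just illustrative):

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()

captions = ["  a man rides a bike down the road "]           # raw candidate captions
texts = ["A photo depicts " + c.strip() for c in captions]   # item 1: strip + prompt prefix
tokens = clip.tokenize(texts, truncate=True).to(device)      # item 2: tokenize with truncation

with torch.no_grad():
    text_features = model.encode_text(tokens)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)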

PASCAL-50S

Hi, Thank you so much for your awesome work!

I am wondering if, by any chance, you have the data and instructions/a script to set up the PASCAL-50S dataset? The link to the annotations and original data on the official webpage does not work anymore.

Thanks,
David

Reproducing the results of the paper

Dear Authors,
Thanks a lot for open-sourcing the code, and for your great work.

You mentioned an examples/eval.py, but I cannot find it in your repo?

Do you plan to include the Flickr8K experiment as well?
Thanks again for your paper and the code.
Kindest regards,
Pierre

About Composite dataset

Sorry to bother you, but where can I download the Composite dataset? I can't find a link in the original paper.

Clipscore replication error: AttributeError: Can't pickle local object

After installing all the requirements, I try to run the example usage line from the README, python clipscore.py example/good_captions.json example/images/, and I get this error:

AttributeError: Can't pickle local object 'CLIPImageDataset._transform_test.<locals>.<lambda>'

Please let me know if you have any advice!

About the composite datasets

I want to reproduce the results on the Composite dataset.

  1. I downloaded the Composite dataset from https://imagesdg.wordpress.com/image-to-scene-description-graph/; there are two types: correctness and thoroughness. I use the first one, correctness.
  2. In the correctness file, the Composite data usually includes only 3 or 4 captions rated by humans per image. Some candidate captions look like a paragraph; do I need to truncate them into a short sentence? One example is as follows:
    person is pulling bow in the back.A person might be wearing helmet in the scene.person is having tattoo.The scene contains grass and well-maintained grass and garden and playhouse.
  3. In your paper, the correlation scores are computed with ground-truth references removed. Could you give me some guidance on how to process the Composite dataset and reproduce the scores on it?
