
Crisscrossed Captions

Crisscrossed Captions (CxC) contains 247,315 human-labeled annotations including positive and negative associations between image pairs, caption pairs and image-caption pairs.

For more details, please refer to the accompanying paper:
Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO

Motivation

Image captioning datasets have proven useful for multimodal representation learning, and a common evaluation paradigm based on multimodal retrieval has emerged. Unfortunately, these datasets contain only limited cross-modal associations: images are not paired with other images, captions are paired only with other captions that describe the same image, there are no negative associations, and many positive cross-modal associations are missing. This undermines retrieval evaluation and limits research into how inter-modality learning affects intra-modality tasks. CxC addresses this gap by extending MS-COCO (the dev and test sets of the Karpathy split) with new semantic similarity judgments.

Below are some examples of caption pairs rated for Semantic Textual Similarity (STS):

  • Rating 0:

Caption 1: A kite flying in the air over a sand castle.

Caption 2: Scattered people on a wide dry beach including surfers.

  • Rating 1:

Caption 1: Giraffe watching a man push a wheelbarrow loaded with hay.

Caption 2: Two giraffes stand outside of a large building.

  • Rating 2:

Caption 1: A man is sitting on a bench while another takes a nap.

Caption 2: There is an old woman sitting on a bench.

  • Rating 3:

Caption 1: A train is driving down the tracks in front of a building.

Caption 2: A purple and yellow train traveling down train tracks.

  • Rating 4:

Caption 1: A cut pizza and a glass on a table.

Caption 2: Small pizza sits on a plate on a restaurant table.

  • Rating 5:

Caption 1: A family of sheep standing next to each other on a lush green field.

Caption 2: A herd of sheep standing next to each other on a lush green field.

Structure of the data

There are two CSV files per task (STS, SIS, SITS) and per split (val, test): one with raw annotator scores ('*_raw.csv') and one with scores aggregated per example. The first two columns contain the MS-COCO IDs of the corresponding images or captions, followed by the annotation score. The last column indicates the method by which the example was sampled (a short loading sketch follows the list below):

  • STS:

c2c_cocaption: caption pairs from the same MS-COCO example

c2c_isim: caption pairs from different MS-COCO examples sampled based on image similarity

  • SIS:

i2i_csim: image pairs from different MS-COCO examples sampled based on caption similarity.

  • SITS:

c2i_intrasim: caption-image pairs from different MS-COCO examples.

c2i_original: caption-image pairs from the same MS-COCO examples.
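
As a concrete illustration of this layout, here is a minimal sketch that loads one aggregated STS file with pandas and filters it by sampling method. The file name and column names are assumptions based on the description above (two MS-COCO IDs, a score, and a sampling-method tag), not necessarily the exact headers in the released files.

    # Minimal sketch: read an aggregated CxC STS file and split it by sampling method.
    # File name and column order are assumed from the layout described above.
    import pandas as pd

    sts = pd.read_csv("sts_val.csv")
    sts.columns = ["caption_id_1", "caption_id_2", "score", "method"]  # assumed order

    # Keep only caption pairs that were sampled based on image similarity.
    isim_pairs = sts[sts["method"] == "c2c_isim"]
    print(len(isim_pairs), "caption pairs sampled via image similarity")
    print(isim_pairs.head())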

Examples

Following are some examples for each task:

  • STS:

    • Rating 1:

      • An old car sitting on top of a lush green field.
      • A couple of motorcycles parked next to each other.
    • Rating 3:

      • A yellow tray topped with a cup of coffee and a donut.
      • A white plate topped with donuts sitting on a stove top.
    • Rating 5:

      • A man standing on a tennis court holding a racquet.
      • A man standing on a tennis court holding a tennis racquet.
  • SIS:

    • Rating 1:

      [image pair: SIS rating 1]

    • Rating 3:

      [image pair: SIS rating 3]

    • Rating 5:

      [image pair: SIS rating 5]

  • SITS:

    • Rating 1:

      A man in a hat rides an elephant in a river.

      [image: SITS rating 1]

    • Rating 3:

      A man is riding a surfboard at the beach.

      [image: SITS rating 3]

    • Rating 5:

      A man poses with a surfboard on a beach.

      [image: SITS rating 5]

Augment MS-COCO examples with CxC labels

Download the MS-COCO Karpathy split annotations and pass them via --coco_input to the following merge script:

python -m crisscrossed_captions.setup --coco_input "/path/to/coco/json" --cxc_input "/path/to/cxc/sits/*" --output "/path/to/combined/json"
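
For intuition, here is a rough sketch (not the repository's setup script) of the kind of join this command performs: attaching CxC SITS scores to the Karpathy-split COCO annotations. The SITS column order and the output field name are assumptions; the Karpathy JSON fields ("images", "sentences", "sentid", "cocoid") follow the standard split file.

    # Rough illustration of merging CxC SITS scores into the Karpathy COCO JSON.
    # Not the actual setup script; column order and added field name are assumptions.
    import glob
    import json
    import pandas as pd

    coco = json.load(open("/path/to/coco/json"))  # Karpathy split annotations

    # Read the aggregated (non-raw) SITS files and index scores by (caption, image).
    paths = [p for p in glob.glob("/path/to/cxc/sits/*") if not p.endswith("_raw.csv")]
    sits = pd.concat(pd.read_csv(p) for p in paths)
    sits.columns = ["caption_id", "image_id", "score", "method"]  # assumed order
    scores = {(r.caption_id, r.image_id): r.score for r in sits.itertuples()}

    # Attach a CxC score to each caption entry when one exists.
    for image in coco["images"]:
        for sent in image.get("sentences", []):
            key = (sent["sentid"], image["cocoid"])
            if key in scores:
                sent["cxc_sits_score"] = scores[key]

    json.dump(coco, open("/path/to/combined/json", "w"))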

Reference

If you use or discuss this dataset in your work, please cite our paper:

@article{parekh2020crisscrossed,
  title={Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO},
  author={Parekh, Zarana and Baldridge, Jason and Cer, Daniel and Waters, Austin and Yang, Yinfei},
  journal={arXiv preprint arXiv:2004.15020},
  year={2020}
}

Contact

If you have a technical question regarding the dataset or publication, please create an issue in this repository.


crisscrossed-captions's Issues

Problems reproducing the I2I results in the paper

Hi,

Thanks for sharing the great dataset!

I'm trying to implement the evaluation pipeline based on the released annotations, and I'm using the official VSRN (the "VSRN-github" entry in your original paper) to verify the correctness of my implementation. I have successfully reproduced the I2T, T2I, and T2T numbers, but I ran into problems with I2I. Using the positive annotations available in "sis_test.csv", the recall values I get are lower than the ones reported in the paper, so I suspect I am missing some details in my current implementation.

There are a couple of points I would like to confirm:
(1). Do you use the annotations in SITS or STS to augment the positive pairs for I2I? For example, if there are two positive image-text pairs (i1, t1), (i2, t1) in the SITS annotation, will you consider (i1, i2) as a positive I2I pair? Or if (t3, t4) is a positive caption pair in the STS annotation, and they belong to two different MS-COCO images i3 and i4, will you consider (i3, i4) as a positive I2I pair?
(2). There are 5,000 images in the test set, but some of them do not have any associated positive pairs in the annotations. Do you still include these images in the retrieval process? If yes, how do you evaluate their retrieval results?
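
(For concreteness, below is a minimal sketch of the kind of I2I Recall@K computation described above; the variable names and the choice to skip images without positives are assumptions for illustration, not the actual evaluation code.)

    # Sketch: I2I Recall@K over L2-normalized image embeddings and CxC positive pairs.
    import numpy as np

    def recall_at_k(embeddings, positives, k=1):
        """embeddings: dict image_id -> vector; positives: dict image_id -> set of image_ids."""
        ids = list(embeddings)
        mat = np.stack([embeddings[i] for i in ids])
        sims = mat @ mat.T                    # cosine similarity for normalized vectors
        np.fill_diagonal(sims, -np.inf)       # exclude self-retrieval

        hits, queries = 0, 0
        for row, query_id in enumerate(ids):
            pos = positives.get(query_id, set())
            if not pos:                       # images with no CxC positives are skipped --
                continue                      # this relates to question (2) above
            queries += 1
            topk = {ids[j] for j in np.argsort(-sims[row])[:k]}
            if pos & topk:
                hits += 1
        return hits / queries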

Thank you and looking forward to your reply!

R@1 for image-to-image and text-to-text retrieval are 0

Hello, thanks for contributing such a valuable dataset for research. I have run into a problem: for image-to-image and text-to-text retrieval, I get 0 for R@1. I think there must be an error in my evaluation code. Would you mind open-sourcing your evaluation code, or do you have any idea what might cause this? Any help would be appreciated.
