Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine learned image captioning systems.

License: Other

Dockerfile 5.82% Python 40.36% Shell 53.82%

conceptual-captions's Issues

Download broken?

It seems that something is broken on the website. The buttons / links do not work for me.

These download links do not work when I try to download from the terminal. Could you please help me with this, or provide direct links or a snippet for doing so?
Thanks in advance.

[Download link broken]

Hi, I find that the download link is broken, and "Save link as" does not work either.

Only ~2m out of ~3m links are working currently. It is CC2M now!

Validation data - Access denied

I was able to download the training split but not the validation split. I got this message:
Anonymous caller does not have storage.objects.get access to gcc-data/Validation/GCC-Validation.tsv.

Can I get a captioned wild animal image dataset?

Download of the dataset is not working

Hello, is the link for the training split of the dataset working? I am in South Korea and it is not working for me.

Any code for downloading the dataset?

Thank you for your great work.
Can you provide any example code for downloading images by URL from the TSV files? Some URLs cannot be downloaded with urllib in Python (IOError: [Errno socket error] [Errno 110] Connection timed out), but I can see the images in the browser.
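
A minimal sketch of such a download script, using only the Python standard library. It assumes each TSV row is "caption<TAB>url" (the column order used by the script in a later issue on this page). The function names, the `USER_AGENT` string, the 10-second timeout, and the `.img` file naming are all illustrative choices, not anything provided by the dataset authors. Sending a browser-like User-Agent header and an explicit timeout sometimes helps with servers that stall or reject default urllib requests, but that is a guess about the cause of the timeouts, not a confirmed fix.

```python
import csv
import http.client
import urllib.request
from pathlib import Path

# Hypothetical User-Agent string; some servers respond differently to the urllib default.
USER_AGENT = "Mozilla/5.0 (compatible; cc-download-sketch)"


def read_pairs(tsv_path, limit=None):
    """Return a list of (caption, url) tuples from a 'caption<TAB>url' TSV file."""
    pairs = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) >= 2:  # skip rows that have no URL column
                pairs.append((row[0], row[1]))
            if limit is not None and len(pairs) >= limit:
                break
    return pairs


def fetch(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the request fails."""
    try:
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with opener(request, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError, HTTPError and socket timeouts;
        # ValueError covers malformed URLs.
        return None


def download_all(pairs, out_dir, fetcher=fetch):
    """Save each body as <index>.img under out_dir; return the URLs that failed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for index, (_caption, url) in enumerate(pairs):
        body = fetcher(url)
        if body is None:
            failed.append(url)
            continue
        (out / f"{index:08d}.img").write_bytes(body)
    return failed
```

For example, `download_all(read_pairs("data/Train%2FGCC-training.tsv", limit=100), "images")` would try the first 100 rows and return the list of URLs that could not be fetched. The `opener` and `fetcher` parameters exist only so the network call can be swapped out when testing.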

A lot of the links from the training/validation set do not exist or cannot be read.

Hey!
I recently started working on the competition, and I want to thank the Google AI Research team for open-sourcing such data sets for us to use and learn from.
While going through the data set (both train and validation), I found that some of the links do not exist or for some reason cannot be read. A lot of URLs can be parsed but cannot be read using Image from the Pillow package in Python. I will post some of the scripts and some of the output to give everyone a better idea of what I am seeing. Please feel free to correct me if I am doing something wrong. I hope this helps anyone else facing these errors.

I am using the requests library to fetch the URLs and the Pillow library in Python to read the images from them.
The code is very primitive since I am still in the exploration stage, and hence I'm appending images to the list.

from PIL import Image
import requests

train_file = 'data/Train%2FGCC-training.tsv'  # train file
with open(train_file, 'r') as f:
    train_read = f.readlines()

sample_train = train_read[:10000]

# each line is "caption<TAB>url"; build a map from url to caption
train_map = {}
for line in sample_train:
    caption, url = line.rstrip("\n").split("\t")[:2]
    train_map[url] = caption
links = list(train_map)

not_read = 0  # keep a count of images that were not possible to read

# loop over the links and read whichever possible
for link in links:
    try:
        # the timeout keeps a single unresponsive server from stalling the loop
        response = requests.get(link, stream=True, timeout=10)
        im = Image.open(response.raw)
    except Exception:
        print(link)
        not_read += 1

Here are some of the links that did not work.

https://cdn.mantelligence.com/wp-content/uploads/2017/08/Questions-to-Ask-a-Girl-to-Get-to-Know-Her-What-do-you-want-most-out-of-life.jpg
http://duro6.com/weather/images/gallery3_lightning_rainbow_shot.jpg
http://image.dailyfreeman.com/storyimage/DF/20170505/NEWS/170509808/AR/0/AR-170509808.jpg&maxh=400&maxw=667
https://cdn.bravehunters.com/wp-content/uploads/2017/09/Guide-to-Living-in-a-Tent-800x416.jpg
http://www.saltandpinephoto.com/wp-content/uploads/2016/06/Bride-and-Groom-Walking-through-the-Forest.jpg
http://blog.visitmo.com/wp-content/uploads/2014/03/12506026093_092d091fc2_b.jpg
https://lynismael.com/wp-content/uploads/2014/07/Belwood-Lake-Conservation-wedding-sara-ayron-_0011(pp_w768_h534).jpg
http://www.eurasianet.org/sites/default/files/imagecache/galleria_fullscreen/060613_0.jpg
https://www.bailiwickexpress.com/files/cache/88ec9331c05013c55b49024a551341ac_f587432.jpg
https://i2-prod.mirror.co.uk/incoming/article1443634.ece/ALTERNATES/s615/%C2%A3%C2%A3%C2%A3%20%20Police%20car%20driving%20straight%20into%20a%20road%20of%20freshly%20layed%20cement
http://www.nerjarob.com/nature/wp-content/uploads/Cormorants-in-tree-sized.jpg
http://grantbaldwin.com/wp-content/uploads/2015/11/ScottVoelker.jpeg
https://drawinglics.com/view/186698/how-to-draw-flowers-and-leaves-in-a-vase-9-steps-with-pictures-image-titled-draw-flowers-and-leaves-in-a-vase-step-9bullet1.jpg

In a sample of 10,000 links, I found at least 51 that did not work.
Looking forward to hearing more from you guys.
Thanks!
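
One possible reason a URL resolves but cannot be opened as an image is that the server returns something other than image bytes, such as an HTML error page or an empty body. The sketch below is a hypothetical helper, not part of the script above, that labels a response from its HTTP status code and leading bytes so that failures can be tallied by kind. The function names and label strings are made up for illustration, and only four common formats are recognised.

```python
from collections import Counter


def sniff_image_type(body):
    """Guess an image format from the leading bytes; return None if unrecognised."""
    if body.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if body.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if body[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    if body[:4] == b"RIFF" and body[8:12] == b"WEBP":
        return "webp"
    return None


def classify_response(status, body):
    """Label a fetched response given its HTTP status code and raw body bytes."""
    if status != 200:
        return f"http_{status}"
    if not body:
        return "empty_body"
    if sniff_image_type(body) is None:
        return "not_an_image"
    return "ok"


def summarise(results):
    """Count labels over an iterable of (status, body) pairs."""
    return Counter(classify_response(status, body) for status, body in results)
```

With requests, the inputs would be `response.status_code` and `response.content`. A breakdown like this may show whether the failing links are mostly missing pages, blocked requests, or non-image content.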

Could you release the class labels of the images obtained by Google Cloud Vision API?

Thanks for the awesome paper. Could you release the class labels of the images? Calling the Cloud API is too expensive for me.

Not able to download the dataset

Hej,
thank you for your great work!
However, it seems like the dataset cannot be downloaded from the website - the button simply is not working. Could you please tell me if there is a way to get the dataset?

Image ownership

Since Google collected the images from different websites, what is the ownership status of the images? Does Google own them?
In other words, are we allowed to use these images freely without knowing the license of the original images?

Conceptual Captions Dataset with Proper Names

I was wondering if there is a version of the Conceptual Captions Dataset without the proper names cleaned out, i.e. a version with captions like "Crowd at a concert in Los Angeles" and "Former Miss World Priyanka Chopra on the red carpet"?

Thanks!

Ethics of this research set

An organisation that I work for has been having problems with robots requesting images from their website for AI training. We've managed to contact one of the people operating the robots who said they were using this dataset, and claimed because https://github.com/google-research-datasets/conceptual-captions/blob/master/LICENSE says "The dataset may be freely used for any purpose" that they had the right to use these images.

The problem is that you are publishing a dataset of non-Google URLs:

1. Google has no control of the hosted images, and they may be changed or removed or blocked, e.g. #17.
2. Google is not paying the hosting costs of these images. Organisations have to pay for bandwidth, CPU time, or even the number of requests.
3. So every time a user of this dataset requests the image, somebody else pays for it. (This is an incentive to block or remove the images; see no. 1.)
4. These images were added without the consent of the organisations, who have to pay the costs of hosting (see no. 2).
5. The images were added without the consent of the copyright holders (who may be different from the server hosts).
6. This dataset was created before 2018, before concerns about the use of images for AI training were common, and before protocols to disallow use of web-hosted media for machine learning existed.
7. Many of the image URLs are hosted by stock photo agencies, and may not be licensed for machine-learning use. They may also regard the captions (which require human effort to write) as part of their intellectual property.
8. Many of the images are on news websites, and were licensed from stock photo agencies, so may not be licensed for machine-learning use.
9. Many of the photos are hosted outside of the USA, by organisations which are not based in the USA, so US "fair use" copyright exceptions do not apply.

It would have been ethical for Google to license copies of the images, and then host them as part of the dataset (but still publish the URLs where they originally came from).

Data for Metrics on the Flickr 1K Test

Would it be possible to release just the data (raw output captions generated by the models) used specifically to make Table 4 and Table 7 from the paper? I have an idea for a new automatic metric, and would like to test whether or not it does a better job of capturing human-like evaluations, or performs more like the usual automatic metrics.

Contributing.

Some models trained on this (like LLaVA) do not perform well at understanding comic book pages.

Would you be open to a PR with some data related to comic book pages?

Using Creative Commons images, obviously, or even public domain images if that's required.

What is the process to offer contributions?

Thanks a lot in advance.
