conceptual-captions's Issues

A lot of the links from the training/validation set do not exist or cannot be read.

Hey!
I recently started working on the competition. Thank you so much to the Google AI Research team for open-sourcing data sets like this for us to use and learn from.
While going through the data set (both train and validation), I found that some of the links do not exist or, for some reason, cannot be read. Many URLs can be parsed, but the result cannot be opened with Image from the Pillow package in Python. I am posting my script and some of its output to give everyone a better idea of what I am seeing. Please feel free to correct me if I am doing something wrong. I hope this also helps anyone else facing the same errors.

I am using the requests library to fetch the URLs and the Pillow library in Python to open the images.
The code is very primitive since I am still in the exploration stage, and hence I'm appending images to the list.

from PIL import Image
import requests

train_file = 'data/Train%2FGCC-training.tsv'  # train file
with open(train_file, 'r') as f:
    train_read = f.readlines()

sample_train = train_read[:10000]

# each TSV line is "caption<TAB>url"; map url -> caption
train_map = {}
for line in sample_train:
    fields = line.rstrip("\n").split("\t")
    train_map[fields[1]] = fields[0]
links = list(train_map)

not_read = 0  # keep a count of images that were not possible to read

# loop over the links and read whichever possible
for link in links:
    try:
        # the timeout keeps one unresponsive host from stalling the loop
        im = Image.open(requests.get(link, stream=True, timeout=10).raw)
    except Exception:  # a bare "except:" would also swallow Ctrl-C
        print(link)
        not_read += 1

Here are some of the links that did not work.

https://cdn.mantelligence.com/wp-content/uploads/2017/08/Questions-to-Ask-a-Girl-to-Get-to-Know-Her-What-do-you-want-most-out-of-life.jpg
http://duro6.com/weather/images/gallery3_lightning_rainbow_shot.jpg
http://image.dailyfreeman.com/storyimage/DF/20170505/NEWS/170509808/AR/0/AR-170509808.jpg&maxh=400&maxw=667
https://cdn.bravehunters.com/wp-content/uploads/2017/09/Guide-to-Living-in-a-Tent-800x416.jpg
http://www.saltandpinephoto.com/wp-content/uploads/2016/06/Bride-and-Groom-Walking-through-the-Forest.jpg
http://blog.visitmo.com/wp-content/uploads/2014/03/12506026093_092d091fc2_b.jpg
https://lynismael.com/wp-content/uploads/2014/07/Belwood-Lake-Conservation-wedding-sara-ayron-_0011(pp_w768_h534).jpg
http://www.eurasianet.org/sites/default/files/imagecache/galleria_fullscreen/060613_0.jpg
https://www.bailiwickexpress.com/files/cache/88ec9331c05013c55b49024a551341ac_f587432.jpg
https://i2-prod.mirror.co.uk/incoming/article1443634.ece/ALTERNATES/s615/%C2%A3%C2%A3%C2%A3%20%20Police%20car%20driving%20straight%20into%20a%20road%20of%20freshly%20layed%20cement
http://www.nerjarob.com/nature/wp-content/uploads/Cormorants-in-tree-sized.jpg
http://grantbaldwin.com/wp-content/uploads/2015/11/ScottVoelker.jpeg
https://drawinglics.com/view/186698/how-to-draw-flowers-and-leaves-in-a-vase-9-steps-with-pictures-image-titled-draw-flowers-and-leaves-in-a-vase-step-9bullet1.jpg

From a sample of 10000, I was able to get at least 51 links that did not work.
Looking forward to hearing more from you guys.
Thanks!
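
To narrow down why a given link fails, it may help to separate HTTP errors from responses that arrive but are not images (for example, an HTML error page served in place of the picture). The following is a minimal sketch using only the standard library. The names `sniff_image_format` and `classify_response` are hypothetical, and the signature check covers only a few common formats, not everything Pillow can read.

```python
from typing import Optional

# Leading "magic bytes" for a few common image formats.
_SIGNATURES = (
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
)


def sniff_image_format(data: bytes) -> Optional[str]:
    """Return a format name if data starts like a known image, else None."""
    for prefix, name in _SIGNATURES:
        if data.startswith(prefix):
            return name
    # WebP is a RIFF container: "RIFF" + 4 size bytes + "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


def classify_response(status_code: int, body: bytes) -> str:
    """Label a fetched response as 'ok', 'http_error', or 'not_an_image'."""
    if status_code >= 400:
        return "http_error"
    if sniff_image_format(body) is None:
        return "not_an_image"
    return "ok"
```

With requests, this could be called as `classify_response(r.status_code, r.content)` inside the loop, and the labels tallied instead of keeping a single `not_read` counter.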

Sharing Alt-Text associated with each image

Will the alt-text associated with each image be released? Will the code used to generate the captions also be released?

Would be great to know if that's about to happen anytime soon.

Any code for downloading the dataset?

Thank you for your great work.
Can you provide any example code for downloading images by URL from the tsv files? Some URLs cannot be downloaded with urllib in Python (IOError: [Errno socket error] [Errno 110] Connection timed out), even though I can see the images in the browser.
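
For anyone with the same question, here is a rough sketch of one way to do it with the standard library, not an official downloader. It assumes each tsv line is "caption<TAB>url" (as in the script from the first issue), sets an explicit timeout, and sends a browser-like User-Agent, which might explain why a browser succeeds where the default urllib client does not. The names `parse_tsv_line`, `build_request`, `fetch_image`, and the `USER_AGENT` value are all hypothetical.

```python
import http.client
import urllib.error
import urllib.request
from typing import Optional, Tuple

# Hypothetical header value; an assumption, not something the dataset requires.
USER_AGENT = "Mozilla/5.0 (compatible; cc-downloader-sketch)"


def parse_tsv_line(line: str) -> Tuple[str, str]:
    """Split one 'caption<TAB>url' line into (caption, url)."""
    caption, url = line.rstrip("\n").split("\t", 1)
    return caption, url


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_image(url: str, timeout: float = 10.0, retries: int = 2) -> Optional[bytes]:
    """Return the response body, or None if the download fails."""
    for _ in range(retries + 1):
        try:
            with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.HTTPError, ValueError):
            # The server answered with an error, or the URL is malformed:
            # retrying is unlikely to help.
            return None
        except (OSError, http.client.HTTPException):
            # Timeouts and connection problems: try again.
            continue
    return None
```

A caller would loop over the tsv file, call `parse_tsv_line` on each line, and write the bytes returned by `fetch_image` to disk whenever the result is not None.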

Ethics of this research set

An organisation that I work for has been having problems with robots requesting images from their website for AI training. We've managed to contact one of the people operating the robots who said they were using this dataset, and claimed because https://github.com/google-research-datasets/conceptual-captions/blob/master/LICENSE says "The dataset may be freely used for any purpose" that they had the right to use these images.

The problem is that you are publishing a dataset of non-Google URLs:

  1. Google has no control of the hosted images, and they may be changed or removed or blocked, e.g. #17.

  2. Google is not paying the hosting costs of these images. Organisations have to pay for bandwidth, CPU time, or even the number of requests.

    So every time a user of this dataset requests an image, somebody else pays for it. (This is an incentive to block or remove the images, see no. 1.)

  3. These images were added without the consent of the organisations, who have to pay the costs of hosting (see no. 2).

  4. The images were added without the consent of the copyright holders (who may be different from the server hosts).

  5. This dataset was created before 2018, before concerns about the use of images for AI training were common, and before protocols to disallow use of web-hosted media for machine learning existed.

  6. Many of the image URLs are hosted by stock photo agencies, and may not be licensed for machine-learning use. They may also regard the captions (which require human effort to write) as part of their intellectual property.

  7. Many of the images are on news websites, and were licensed from stock photo agencies, so may not be licensed for machine-learning use.

  8. Many of the photos are hosted outside of the USA, by organisations which are not based in the USA, so US "fair use" copyright exceptions do not apply.

It would have been ethical for Google to license copies of the images, and then host them as part of the dataset (but still publish the URLs where they originally came from).
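
On the practical side, one long-standing way for a host to signal which paths automated clients should avoid is robots.txt, and a downloader built on this dataset could consult it before requesting an image. The following is a minimal sketch using the standard library's `urllib.robotparser`. The helper name `may_fetch` and the user-agent string are hypothetical, the robots.txt lines are assumed to have been downloaded already, and robots.txt is only advisory: it does not settle the consent or licensing points above.

```python
from typing import Iterable
from urllib.robotparser import RobotFileParser


def may_fetch(robots_lines: Iterable[str], user_agent: str, url: str) -> bool:
    """Check a URL against the lines of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Example rules a host might publish: block one named bot entirely,
# and keep every other client out of /images/.
EXAMPLE_RULES = [
    "User-agent: example-dataset-bot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /images/",
]
```

A downloader would skip any URL for which `may_fetch` returns False for its own user agent.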

Download over terminal

The download links do not work when I try to download from the terminal. Could you please help me with this, or provide direct links or a snippet for doing so?
Thanks in advance.

Validation data - Access denied

I was able to download the training split but not the validation split. I got this message:
Anonymous caller does not have storage.objects.get access to gcc-data/Validation/GCC-Validation.tsv.

Data for Metrics on the Flickr 1K Test

Would it be possible to release just the data (raw output captions generated by the models) used specifically to make Table 4 and Table 7 from the paper? I have an idea for a new automatic metric, and would like to test whether or not it does a better job of capturing human-like evaluations, or performs more like the usual automatic metrics.

Conceptual Captions Dataset with Proper Names

I was wondering if there is a version of the Conceptual Captions Dataset without the proper names cleaned out, i.e. a version of the dataset with captions like "Crowd at a concert in Los Angeles" and "Former Miss World Priyanka Chopra on the red carpet"?

Thanks!

Contributing.

Some models trained on this dataset (like LLaVA) do not perform well at understanding comic book pages.

Would you be open to a PR with some data related to comic book pages?

I would use Creative Commons images, obviously, or even public domain images if that's required.

What is the process to offer contributions?

Thanks a lot in advance.

Download broken?

It seems that something is broken on the website. The buttons / links do not work for me.

Not able to download the dataset

Hi,
thank you for your great work!
However, it seems that the dataset cannot be downloaded from the website: the button simply does not work. Could you please tell me if there is another way to get the dataset?
