
Comments (32)

gabrielilharco commented on June 26, 2024

Yes, absolutely, that would be amazing!

gabrielilharco commented on June 26, 2024

@dpaleka I manually looked at 100 images where there was a hash mismatch between our metadata and the redownloaded image and couldn't find anything like that. I put a copy of a few of our shards here if you want to investigate further: https://drive.google.com/file/d/1898MDL_fXOYPjIzNYTt_B6nH0nZqRiTh/view?usp=sharing

mranzinger commented on June 26, 2024

So you may already have done this, but I scanned the xlarge pool metadata, looking for that URL, and this is what I found:

uid: 38f76e4b1b4a77ca66a62b453da17912
url: https://images.eanixter.com/viewex/PR108844V6.JPG
text: "Cable Manager Horizontal Recessed Flat 3-Ring Rack Mount 1RU 19"" Width x 4.8"" Depth x 1.72"" Height 16 Gauge Steel Powder Coated Black With 3"" Metal D-Ring"
original_width: 250
original_height: 250
clip_b32_similarity_score: 0.33520508
clip_l14_similarity_score: 0.28076172
face_bboxes: []
sha256: 0e77fada7a972fc2fa2ad8991701968ea0953d03f799787c4d27f072bdfa4164

So this URL only occurs once in the data, and it's the same as in the metadata for DataComp-1B.

I then searched the metadata for the hash 6f392bcd683005e33356a2c003fb8378c7ff445d0b54b3429c84bd112299d273 which is what we get when we download the image, and it never occurs in the metadata.
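For reference, a minimal sketch of this kind of metadata scan, assuming the parquet shards sit in a local metadata/ directory (the glob pattern and paths are placeholders; the column names follow the record above):

import glob

import pandas as pd

target_url = "https://images.eanixter.com/viewex/PR108844V6.JPG"
target_sha = "6f392bcd683005e33356a2c003fb8378c7ff445d0b54b3429c84bd112299d273"

for path in sorted(glob.glob("metadata/*.parquet")):
    # load only the columns needed for the lookup to keep memory low
    df = pd.read_parquet(path, columns=["uid", "url", "sha256"])
    hits = df[(df["url"] == target_url) | (df["sha256"] == target_sha)]
    if not hits.empty:
        print(path)
        print(hits.to_string(index=False))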


pfischer-nvidia commented on June 26, 2024

We have our own downloading framework, but all we really do is hashlib.sha256(image_bytes).hexdigest().
Let me check what img2dataset does and whether we can make it match.
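A minimal sketch of what that amounts to, assuming the hash is taken over the raw bytes returned by the server, before any decoding or resizing (the User-Agent string here is an arbitrary choice):

import hashlib
import urllib.request

url = "https://images.eanixter.com/viewex/PR108844V6.JPG"
request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(request, timeout=20) as r:
    image_bytes = r.read()  # raw bytes as served, no decoding or resizing

print(hashlib.sha256(image_bytes).hexdigest())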

pfischer-nvidia commented on June 26, 2024

Just checked: img2dataset does the same thing. I downloaded some images with img2dataset and I'm getting the same hashes as with our code, and they still don't match the given metadata.

I ran it like img2dataset --url_list=myimglist.txt --output_folder=$PWD --thread_count=2 --image_size=1024 --compute_hash=sha256

The result for the image above is:

{
    "url": "https://images.eanixter.com/viewex/PR108844V6.JPG",
    "key": "000000000",
    "status": "success",
    "error_message": null,
    "width": 1024,
    "height": 1024,
    "original_width": 250,
    "original_height": 250,
    "exif": "{}",
    "sha256": "6f392bcd683005e33356a2c003fb8378c7ff445d0b54b3429c84bd112299d273"
}

Can you provide the version (commit ID) and command line of img2dataset that you used?
Also, please try it for yourself.

rom1504 commented on June 26, 2024

@gabrielilharco should be able to say who ran this hash computation and what commit was used.

pfischer-nvidia commented on June 26, 2024

We are blocked by this issue, @gabrielilharco, @rom1504. Please help soon by checking how this was done.

Vaishaal commented on June 26, 2024

Is the problem only for DC-1B or do you have the problem downloading CommonPool as well?

pfischer-nvidia commented on June 26, 2024

We haven't downloaded CommonPool with hash checking. Can you please explain the thought behind your question? What is the conclusion if it's different / the same?

Given the above example, you can just compute the hash yourself (independent of our download) and you will see it doesn't match.
Note that most of the hashes do match; it's only for about 14% of them that they don't.


gabrielilharco commented on June 26, 2024

Also tagging @GeorgiosSmyrnis

mranzinger commented on June 26, 2024

Has anybody been able to look into this?

Vaishaal commented on June 26, 2024

Hello, we are looking into this right now!

So the download code that computes the hash is right here: https://github.com/rom1504/img2dataset/blob/main/img2dataset/downloader.py#L203-L318C6

We are checking our internal pool to see what the computed hash was. We have 3-4 working hypotheses that we are trying to resolve:

  1. The hash was corrupted somehow when subsetting CommonPool-12.8B -> DataComp-1B. Since we downloaded CommonPool-12.8B first and then picked a subset to generate DataComp-1B, the hash needed to be carried over; perhaps there was a pandas error here (see the sketch below for one way to check this).
  2. An incomplete stream during downloading led to a wrong hash. Since the hash is computed directly from the URL stream, if the stream somehow truncated early (even by 1-2 bytes), we would get a different hash. We might have a higher failure rate than you because we downloaded the images very quickly using many CPU cores and many machines, leading to many failures we had to handle.
  3. Similar to the above, but when we were hitting certain domains with a heavy load, we were served a different CDN's copy of the image with slightly different exif/metadata, which caused the hash to differ.
  4. Something to do with the resizer affecting the image hash.

We are exploring all of these and hopefully in the next day or so we can come to a conclusion. What happens if you just skip the 14% for now? Would that unblock you?
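As a minimal sketch of checking hypothesis 1, assuming you have a CommonPool-12.8B metadata shard and the corresponding DataComp-1B shard locally (file names are placeholders), one could join on uid and compare the carried-over sha256 columns:

import pandas as pd

dc1b = pd.read_parquet("datacomp_1b_metadata_shard.parquet", columns=["uid", "sha256"])
pool = pd.read_parquet("commonpool_metadata_shard.parquet", columns=["uid", "sha256"])

# join on uid and look for rows where the sha256 changed during subsetting
merged = dc1b.merge(pool, on="uid", suffixes=("_dc1b", "_pool"))
mismatched = merged[merged["sha256_dc1b"] != merged["sha256_pool"]]
print(f"{len(mismatched)} of {len(merged)} shared uids have a different sha256 after subsetting")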

mranzinger commented on June 26, 2024

Yes, we're able to move forward with initial experimentation with the 14% missing. We are eagerly awaiting your findings, though. Your hypotheses 2-4 are interesting; would they suggest that we'd also run into hash mismatches coming from CommonPool-12.8B?

Vaishaal commented on June 26, 2024

Yeah, I believe so; my money is on 2. What we can do is look at our internal copy of one of the hash-mismatched images and see whether the exif data or the actual image data is truncated.

sagadre commented on June 26, 2024

Here is a preliminary investigation into incomplete streams. It seems that this does not explain the hash mismatch, at least not at clean byte boundaries. I will now try to track down a mismatched image in our tar files; hopefully a visual inspection will be useful.

Sorry about this and thanks for raising the issue!

import urllib.request
import io
import hashlib
import os
from tqdm import tqdm

# hash we get when redownloading the image now vs. the hash recorded in the DataComp metadata
current_hash = "6f392bcd683005e33356a2c003fb8378c7ff445d0b54b3429c84bd112299d273"
datacomp_hash = "0e77fada7a972fc2fa2ad8991701968ea0953d03f799787c4d27f072bdfa4164"

url = "https://images.eanixter.com/viewex/PR108844V6.JPG"

img_stream = None
user_agent_string = (
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"
)

request = urllib.request.Request(
    url, data=None, headers={"User-Agent": user_agent_string}
)
with urllib.request.urlopen(request, timeout=20) as r:
    img_stream = io.BytesIO(r.read())

img_stream.seek(0, os.SEEK_END)
max_bytes = img_stream.tell()
img_stream.seek(0)

# brute-force check: hash every contiguous byte slice (offset j, length i) of the stream
for i in tqdm(range(max_bytes + 1)):
    for j in range(max_bytes):
        img_stream.seek(j)
        computed_hash = hashlib.sha256(img_stream.read(i)).hexdigest()

        if computed_hash == datacomp_hash:
            print("hit the datacomp hash")  # does not happen
        if computed_hash == current_hash:
            print("hit the current hash!")  # happens


mranzinger commented on June 26, 2024

Do you still have a local copy of this image? Can you verify that you get the hash in the dataset metadata? Or is that impossible due to preprocessing changing the image bytes in your stored copy?

gabrielilharco commented on June 26, 2024

We do, but we don't have an efficient way of finding it since we don't have a mapping from uid to which shard the datapoint is in. Do you happen to have any example like this in CommonPool small?

gabrielilharco commented on June 26, 2024

We were able to find some images where the sha256 we have on file doesn't match the one we get if we re-download the image; we're currently investigating further.

gabrielilharco commented on June 26, 2024

Here's an example of an image where there is a mismatch. The two versions are visually very similar, but there's a small difference. They are both JPEGs, with the same size and properties (JPEG image data, JFIF standard 1.01, resolution (DPI), density 300x300, segment length 16, baseline, precision 8, 260x260, components 3).

The original image url is http://www.dhresource.com/260x260s/f2-albu-g5-M01-35-B5-rBVaJFip2JWAWPjrAAG7KGesIUY669.jpg/wholesale-new-casual-leather-men-bag-small.jpg

gabrielilharco commented on June 26, 2024

This could be because of CDNs and how aggressively we were downloading. I can't think of a way out of it that doesn't involve re-downloading the entire pool, which is prohibitively expensive for us at the moment. If there is concern about poisoning attacks, the safest thing to do is to trust only the hashes that match and ignore the rest; unfortunately, that means throwing away some data that is probably good. Alternatively, if you're less worried about attacks but still want some guarantee of the integrity of the image-text pair, you could compare against our metadata features (e.g. the CLIP similarity scores) to check that they are roughly the same.
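A minimal sketch of that last check, assuming the stored clip_b32_similarity_score was computed with an OpenAI-weight ViT-B/32 model loadable through open_clip (the file name, caption, stored score, and tolerance below are illustrative, not an official recipe):

import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def clip_b32_score(image_path, caption):
    # cosine similarity between image and caption features
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()

# accept a hash-mismatched sample if the recomputed score is close to the stored one
recomputed = clip_b32_score("redownloaded.jpg", "caption from the metadata")
stored = 0.33520508  # clip_b32_similarity_score from the metadata row
if abs(recomputed - stored) < 0.05:
    print("CLIP scores roughly agree; the pair is probably intact")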

mranzinger commented on June 26, 2024

Thanks for investigating this. I think what we're going to do is download the mismatched 14%, place it in "quarantine", and then, as you said, compare CLIP scores and bring in those within some threshold. If we go down this path, would you be interested in us sharing the results? E.g. we could create new parquets for the images that failed the hash check but passed the CLIP check, along with the new hashes.
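A rough sketch of how such a "quarantine pass" could be written out, assuming a table that already holds both the recorded and recomputed hashes and CLIP scores (all column names, file names, and the threshold are made up for illustration):

import pandas as pd

df = pd.read_parquet("redownloaded_shard.parquet")

# keep samples whose hash changed but whose recomputed CLIP score is close to the
# score recorded in the original metadata
passed = df[
    (df["sha256_redownloaded"] != df["sha256_metadata"])
    & ((df["clip_b32_redownloaded"] - df["clip_b32_similarity_score"]).abs() < 0.05)
]
passed.to_parquet("quarantine_passed_shard.parquet", index=False)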

pfischer-nvidia commented on June 26, 2024

@gabrielilharco: I think your finding (small pixel differences) does not fully explain the issue. Please see my original post above, where the image is exactly the same as it was back in 2019. How can we explain that?
Could you try to find that image in your local storage?


pfischer-nvidia commented on June 26, 2024

Sure, I can create a list. But for each sample there should be some explanation of why the hash is different, and I don't think we have found an explanation for the initial example yet.
I.e., we currently don't know what content was hashed to get the hash 0e77fada...

gabrielilharco commented on June 26, 2024

@pfischer-nvidia my best guess is that it's because of CDNs, in a way I don't fully understand yet. But in general, my understanding is that there is no guarantee that you'll always get the same image when querying a URL. The fact that we were doing so many queries in parallel when downloading might have played a role too. Visually inspecting some images where there's a mismatch, the images we have locally look very similar to the ones we download again.

A few other data points:

  • I looked at ~50 more images where there's a mismatch. The average absolute difference in pixel space was 59 +- 44. The relative size (in bytes) of the images we have compared to the ones we download again was 1.06 +- 0.30, so some redownloaded images are larger and some are smaller than our stored copies (a sketch of this comparison is below).
  • I also found images where the JPEG config is different. Here's an example. Redownloaded image: JPEG image data, JFIF standard 1.01, resolution (DPI), density 72x72, segment length 16, baseline, precision 8, 500x500, components 3. Image we have on file: JPEG image data, JFIF standard 1.01, resolution (DPI), density 96x96, segment length 16, comment: "CREATOR: gd-jpeg v1.0 (using IJG JPEG v62), quality = 70", baseline, precision 8, 500x500, components 3
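A sketch of the comparison described in the first bullet, assuming the stored and redownloaded copies are saved locally with the same resolution; the exact metric used above is not specified, so the mean absolute per-pixel difference here is an assumption (file names are placeholders):

import os
import numpy as np
from PIL import Image

ours, redownloaded = "ours.jpg", "redownloaded.jpg"
a = np.asarray(Image.open(ours).convert("RGB"), dtype=np.int16)
b = np.asarray(Image.open(redownloaded).convert("RGB"), dtype=np.int16)

# only meaningful if both copies decode to the same width and height
print("mean absolute pixel difference:", np.abs(a - b).mean())
print("size ratio (ours / redownloaded):", os.path.getsize(ours) / os.path.getsize(redownloaded))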

pfischer-nvidia commented on June 26, 2024

Ok, I understand that images on the internet change over time (even if only slightly).
But my example above is byte-wise exactly the same as in 2019 and yields the same hash when retrieved from the web archive.

gabrielilharco commented on June 26, 2024

That's not the kind of change I'm talking about. CDNs can dynamically optimize images in an attempt to deliver faster loading times. Even minor dynamic changes such as compression can lead to different hashes. Another example:

{
    "uid": "dfbee2ca55081959ae35db0a5f390618",
    "caption": "Aceito Murda",
    "url": "https://thumbnailer.mixcloud.com/unsafe/60x60/tmp/7/9/0/f/c08c-b512-40fd-812c-539f2d6c7c00",
    "key": "00026667506",
    "status": "success",
    "error_message": null,
    "width": 60,
    "height": 60,
    "original_width": 60,
    "original_height": 60,
    "exif": "{\"Image Orientation\": \"Horizontal (normal)\", \"Image XResolution\": \"72\", \"Image YResolution\": \"72\", \"Image ResolutionUnit\": \"Pixels/Inch\", \"Image YCbCrPositioning\": \"Centered\", \"Image ExifOffset\": \"102\", \"EXIF ExifVersion\": \"\", \"EXIF ComponentsConfiguration\": \"\", \"EXIF FlashPixVersion\": \"\", \"EXIF ColorSpace\": \"Uncalibrated\", \"EXIF ExifImageWidth\": \"60\", \"EXIF ExifImageLength\": \"60\"}",
    "sha256": "2404bdf21b07450db94d602045a537982a9db00ea20fe9c099eeb3f33358a89b"
}

This image also hasn't changed since 2019 according to the web archive https://web.archive.org/web/20230000000000*/https://thumbnailer.mixcloud.com/unsafe/60x60/tmp/7/9/0/f/c08c-b512-40fd-812c-539f2d6c7c00

The image we have on file differs slightly from the one we redownload, but they look visually very similar. Here's a comparison (our image first):

[attached images: ours, then redownloaded]
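To illustrate the point about dynamic compression: a minimal sketch showing that re-encoding a visually identical JPEG (roughly what an optimizing CDN might do on the fly) changes the bytes and therefore the sha256 (the file name and quality setting are arbitrary):

import hashlib
import io
from PIL import Image

with open("ours.jpg", "rb") as f:  # placeholder file name
    original_bytes = f.read()

# decode and re-encode the image at a different JPEG quality
buf = io.BytesIO()
Image.open(io.BytesIO(original_bytes)).save(buf, format="JPEG", quality=70)
reencoded_bytes = buf.getvalue()

print(hashlib.sha256(original_bytes).hexdigest())
print(hashlib.sha256(reencoded_bytes).hexdigest())  # almost certainly different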

dpaleka commented on June 26, 2024

The three Aceito Murda images (the first attached, the second attached, the Internet Archive one) produce three different hashes for me (2404bdf21b07450db94d602045a537982a9db00ea20fe9c099eeb3f33358a89b, d390cd59881005416fc25f93b497df1105fbb674fd581941aab03e817d5dec85, 767c1a47a8996bc372f2bb2280fdc966d303a6a3f2d8de6e30b89ba425162b4a respectively).

Are there other examples like the image in row 21, where the current download and the Internet Archive copy match, but the sha256 column in DataComp doesn't? It could be worth running @rom1504's proposed experiment on those.
