Comments (20)

rom1504 commented on July 24, 2024

Hi, I advise you to set up the right DNS resolver to increase your success rate.
See the img2dataset readme for details.
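If you are not sure whether DNS is actually the bottleneck on your machine, a quick sanity check is to resolve the hostnames of a small URL sample and look at the failure rate. This is just a rough sketch using the Python standard library; urls.txt (one URL per line, sampled from the metadata) is a placeholder, not part of the DataComp tooling.

import socket
from urllib.parse import urlparse
from concurrent.futures import ThreadPoolExecutor

def resolves(url):
    # Try to resolve the hostname; a gaierror here means DNS failed for this host.
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
        return True
    except socket.gaierror:
        return False

# urls.txt is a placeholder: a few thousand URLs sampled from the metadata parquets.
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(resolves, urls))

print(f"resolved {sum(results)}/{len(results)} hostnames ({sum(results) / len(results):.1%})")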

rom1504 commented on July 24, 2024

The reality of the web is that it's ever-changing, and a number of laws restrict redistribution of its content.
So I would argue that instead of trying to make the web immutable, it makes more sense to accept its mutability and adapt training and evaluation recipes to it.
For example, a way to estimate that two collections follow the same distribution would go a long way.
Going further, continual training and on-demand dataset collection would be a true adaptation to the web.

That's even more true once you go beyond the web and consider the mutability of the world itself.

All that said, if you really do want a few billion images that you can redistribute to everyone, and that should stay available for many years, then I think the only way is to build a service that hosts only perpetually granted public-domain images, and then probably to incentivize a lot of people to put content on it.

TL;DR: it's an interesting topic, and there are a lot of possible solutions. However, tuning a downloader tool while keeping the same collection of image links extracted from the web is unlikely to achieve immutability.

alexanderremmerie commented on July 24, 2024

We finally managed to get 92%. For future reference: our GCP instance didn't have a public IP (for security reasons). By giving the instance a public IP we were able to get much faster downloads. Now, using 8 instances of the Knot DNS resolver, 88 cores, 88 processes, and 128 threads, we can download the small dataset (13 million images, 450 GB) in about 2 hours.

zzzzzero commented on July 24, 2024

On my machine, ~94% of images can be downloaded successfully.
I'm also wondering if there's a way to re-download the failed parts: I've noticed that some downloads don't complete due to broken links or network issues, and a considerable portion of the failed links can be downloaded successfully on a second attempt. If no such method exists, we may have to write a Python script ourselves (see the sketch below).
Here are two methods that can improve the download success rate: 1. try reducing the number of download threads and processes; 2. try changing the network or machine used for downloading.
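Here is a rough sketch of such a retry script. It assumes the per-shard .parquet metadata files written by img2dataset contain url, caption, and status columns plus the saved uid column (check one shard's parquet to confirm the exact column names in your version); all paths are placeholders.

import glob
import os

import pandas as pd
import img2dataset

# Collect the entries that did not download successfully in the previous run.
failed_parts = []
for path in glob.glob("shards/*.parquet"):  # placeholder for your shard_dir
    df = pd.read_parquet(path, columns=["uid", "url", "caption", "status"])
    failed_parts.append(df[df["status"] != "success"])
failed = pd.concat(failed_parts)[["uid", "url", "caption"]]

# Write them out as a new metadata parquet and re-run the downloader on it.
os.makedirs("retry/metadata", exist_ok=True)
failed.to_parquet("retry/metadata/failed.parquet")

img2dataset.download(
    url_list="retry/metadata",
    output_folder="retry/shards",
    input_format="parquet",
    url_col="url",
    caption_col="caption",
    save_additional_columns=["uid"],
    output_format="webdataset",
    retries=2,
    # plus the same image_size / resize settings as in download_upstream.py
)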

vtddggg commented on July 24, 2024

Thanks for your nice advice!

Since the raw image (tar) data is not very large for the small-scale track (450 GB), will the organizers consider releasing the tar data directly? It would help us conduct a more rigorous academic exploration of the image-text data filtering problem.

vtddggg commented on July 24, 2024

@rom1504 Thanks, I will give it a try. What proportion of downloads succeed on your machine?
Just as a reference for what a reasonable success rate looks like.

gabrielilharco commented on July 24, 2024

Hey @vtddggg. I'm also getting a 94-95% success rate currently. I suspect this difference will matter little in terms of the accuracy of the trained models. In many experiments we found that changing the size of datasets coming from the same distribution has little impact on performance. For example, in Figure 3 we show that using only 50% of the unfiltered pool performs very closely to using the entire pool (see https://arxiv.org/abs/2304.14108).

Will the organizers consider releasing the tar data directly?

We understand that releasing the tars directly would make things simpler for participants. However, our dataset is designed to be an index of public images on the internet, which means that if any image is deleted from its original source, it is also deleted from our dataset. Releasing the tars directly would mean creating public copies of the images, which is problematic. For those reasons, we won't be able to share the data directly, and we hope you understand the decision.

vtddggg commented on July 24, 2024

Thanks for sharing your success rate.
I agree with your point. Since changing the size of datasets coming from the same distribution has little impact on performance, it is perfectly acceptable to lose some images during downloading.

pfischer-nvidia commented on July 24, 2024

Hi, I believe the success rate will become lower and lower over time. We downloaded the 45 TB set and our rates look as follows:

Success: 89.9%
Failed to download: 9.5%
Failed to resize: 0.6%

So roughly 10% of the data is already missing.

afang-story commented on July 24, 2024

Hello @pfischer-nvidia,
Sorry for the late reply. Can you confirm whether you have tried this: #3 (comment)?

pfischer-nvidia commented on July 24, 2024

We did change the DNS servers to the Google ones (8.8.8.8 and 8.8.4.4) but we did not change the resolver to bind9. Instead we used dnspython.

rom1504 commented on July 24, 2024

See https://github.com/rom1504/img2dataset#setting-up-a-high-performance-dns-resolver

zzzzzero commented on July 24, 2024

I found that I can increase the success rate by increasing the number of retries: when I set retries=2, I currently get a ~95% success rate.

Just add the retries=2 setting on line 136 of download_upstream.py, like this:

img2dataset.download(
    url_list=str(metadata_dir),
    image_size=args.image_size,
    output_folder=str(shard_dir),
    processes_count=args.processes_count,
    thread_count=args.thread_count,
    resize_mode=args.resize_mode,
    resize_only_if_bigger=not args.no_resize_only_if_bigger,
    encode_format=args.encode_format,
    output_format=args.output_format,
    input_format='parquet',
    url_col='url',
    caption_col='text',
    bbox_col=bbox_col,
    save_additional_columns=['uid'],
    number_sample_per_shard=10000,
    oom_shard_count=8,
    retries=2,  # newly added: retry each failed download up to 2 times
)

Vaishaal commented on July 24, 2024

Updated download_upstream!

alexanderremmerie commented on July 24, 2024

Hi, we are getting a success rate of about 86% (downloading the small dataset, 12 million images):

15it [44:26, 19.65s/it]worker - success: 0.867 - failed to download: 0.128 - failed to resize: 0.005 - images per sec: 4 - count: 10000 total - success: 0.865 - failed to download: 0.130 - failed to resize: 0.005 - images per sec: 56 - count: 150000
16it [44:49, 20.77s/it]worker - success: 0.860 - failed to download: 0.134 - failed to resize: 0.006 - images per sec: 4 - count: 10000 total - success: 0.864 - failed to download: 0.131 - failed to resize: 0.005 - images per sec: 60 - count: 160000

We are using the Knot DNS resolver (8 instances), the default 16 processes, and just 16 threads (instead of the default 128); we did this to slow down the DNS resolution requests. We use an e2-highcpu-32 (Efficient Instance, 32 vCPUs, 32 GB RAM) instance on GCP.

86% is the best we could get, but due to the low number of threads the downloading is very slow. Can you share any more details on how you got the 95% success rate (configuration, machine type, number of threads and processes, GCP/AWS/Azure, DNS resolver, etc.) and any advice to improve speed and success rate? Many thanks!

zzzzzero commented on July 24, 2024

My current download success rate is around 94%. My suggestion is to use wandb to monitor the download process; this way, you can analyze the reasons for download failures more effectively. I've shared the analysis of the download run recorded on my end with wandb; my wandb report link is here. From that analysis it's evident that the primary reason for failures is invalid download links (2% of the links are invalid), while 0.8% are due to IP bans.

Your current download speed appears to be very slow, and bandwidth and CPU don't seem to be fully utilized. It also doesn't look like a DNS server problem. I believe the main reason for the failures could be related to your IP; you could consider trying to download the dataset using an IP address from a different region. If your bandwidth and CPU utilization aren't maxed out, increasing the number of processes and threads might not significantly impact your download success rate. My thread-to-process ratio is 4:1.
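For reference, turning on wandb logging is just two extra arguments to the img2dataset.download(...) call in download_upstream.py: img2dataset exposes enable_wandb and wandb_project for this. A sketch (the project name is a placeholder, and the other arguments stay as in the original call):

img2dataset.download(
    url_list=str(metadata_dir),
    output_folder=str(shard_dir),
    input_format='parquet',
    url_col='url',
    caption_col='text',
    save_additional_columns=['uid'],
    retries=2,
    enable_wandb=True,                  # log per-shard success/failure stats to wandb
    wandb_project='datacomp-download',  # placeholder project name
    # ... remaining arguments as in the original call ...
)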

rom1504 commented on July 24, 2024

#39 (comment) answered there

alexanderremmerie commented on July 24, 2024

This is the wandb output we are getting (https://api.wandb.ai/links/alexander-remmerie/n1kikknk for the full report). As you can see, most of the errors are "network unreachable" and network timeout errors. Knot CPU usage is 2-5%, so it is indeed using Knot as the DNS resolver. Our resolv.conf file:

(base) jupyter@datacomp-ubuntu:/etc$ cat /etc/resolv.conf
nameserver 127.0.0.1

nahidalam commented on July 24, 2024

@rom1504 @Vaishaal Can this not be solved without the DNS (or other networking) setup? Even 92% is not the full download, and how do I know my 92% is the same as @alexanderremmerie's 92%? Is there a way we can move towards more 'immutability' for these open vision-language datasets?
