Comments (20)

rom1504 commented on July 24, 2024

Hi, I advise you to set up the right DNS resolver to increase your success rate.
See the img2dataset readme for details.
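If you are not sure whether DNS is actually the bottleneck on your machine, a quick sanity check is to resolve the hostnames of a small URL sample and look at the failure rate. This is just a rough sketch using the Python standard library; urls.txt (one URL per line, sampled from the metadata) is a placeholder, not part of the DataComp tooling.

import socket
from urllib.parse import urlparse
from concurrent.futures import ThreadPoolExecutor

def resolves(url):
    # Try to resolve the hostname; a gaierror here means DNS failed for this host.
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
        return True
    except socket.gaierror:
        return False

# urls.txt is a placeholder: a few thousand URLs sampled from the metadata parquets.
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(resolves, urls))

print(f"resolved {sum(results)}/{len(results)} hostnames ({sum(results) / len(results):.1%})")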

rom1504 commented on July 24, 2024

The reality of the web is that it's ever-changing, and a number of laws restrict redistribution of its content.
So I would argue that instead of trying to make the web immutable, it makes more sense to accept its mutability and adapt training and evaluation recipes to it.
For example, a way to estimate that two collections follow the same distribution would go a long way.
Going further, continual training and on-demand dataset collection would be a true adaptation to the web.

That's even more true once you go beyond the web and consider the mutability of the world itself.

All that said, if you really do want a few billion images that you can redistribute to everyone, and that should stay available for many years, then I think the only way is to build a service that hosts only perpetually granted public-domain images, and then probably to incentivize a lot of people to put content on it.

TL;DR: it's an interesting topic, and there are a lot of possible solutions. However, tuning a downloader tool while keeping the same collection of image links extracted from the web is unlikely to achieve immutability.

alexanderremmerie commented on July 24, 2024

We finally managed to get 92%. For future reference: our GCP instance didn't have a public IP (for security reasons). By giving the instance a public IP we were able to get much faster downloads. Now, using 8 instances of the Knot DNS resolver, 88 cores, 88 processes, and 128 threads, we can download the small dataset (13 million images, 450 GB) in about 2 hours.

zzzzzero commented on July 24, 2024

On my machine, ~94% of images can be downloaded successfully.
I'm also wondering if there's a way to re-download the failed parts: I've noticed that some downloads don't complete due to broken links or network issues, and a considerable portion of the failed links can be downloaded successfully on a second attempt. If no such method exists, we may have to write a Python script ourselves (see the sketch below).
Here are two methods that can improve the download success rate: 1. try reducing the number of download threads and processes; 2. try changing the network or machine used for downloading.
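Here is a rough sketch of such a retry script. It assumes the per-shard .parquet metadata files written by img2dataset contain url, caption, and status columns plus the saved uid column (check one shard's parquet to confirm the exact column names in your version); all paths are placeholders.

import glob
import os

import pandas as pd
import img2dataset

# Collect the entries that did not download successfully in the previous run.
failed_parts = []
for path in glob.glob("shards/*.parquet"):  # placeholder for your shard_dir
    df = pd.read_parquet(path, columns=["uid", "url", "caption", "status"])
    failed_parts.append(df[df["status"] != "success"])
failed = pd.concat(failed_parts)[["uid", "url", "caption"]]

# Write them out as a new metadata parquet and re-run the downloader on it.
os.makedirs("retry/metadata", exist_ok=True)
failed.to_parquet("retry/metadata/failed.parquet")

img2dataset.download(
    url_list="retry/metadata",
    output_folder="retry/shards",
    input_format="parquet",
    url_col="url",
    caption_col="caption",
    save_additional_columns=["uid"],
    output_format="webdataset",
    retries=2,
    # plus the same image_size / resize settings as in download_upstream.py
)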

vtddggg commented on July 24, 2024

Thanks for your nice advice!

Since the raw image (tar) data is not very large for the small-scale track (450 GB), will the organizers consider releasing the tar data directly? It would help us conduct a more rigorous academic exploration of the image-text data filtering problem.

vtddggg commented on July 24, 2024

@rom1504 Thanks, I will give it a try. What proportion of downloads succeed on your machine?
Just as a reference for what a reasonable success rate looks like.

gabrielilharco commented on July 24, 2024

Hey @vtddggg. I'm also getting a 94-95% success rate currently. I suspect this difference will matter little in terms of the accuracy of the trained models. In many experiments we found that changing the size of datasets coming from the same distribution has little impact on performance. For example, in Figure 3 we show that using only 50% of the unfiltered pool performs very closely to using the entire pool (see https://arxiv.org/abs/2304.14108).

Will the organizers consider releasing the tar data directly?

We understand that releasing the tars directly would make things simpler for participants. However, our dataset is designed to be an index of public images on the internet, which means that if any image is deleted from its original source, it is also deleted from our dataset. Releasing the tars directly would mean creating public copies of the images, which is problematic. For those reasons, we won't be able to share the data directly, and we hope you understand the decision.

vtddggg commented on July 24, 2024

Thanks for sharing your success rate.
I agree with your point. Since changing the size of datasets coming from the same distribution has little impact on performance, it is perfectly acceptable to lose some images during downloading.

pfischer-nvidia commented on July 24, 2024

Hi, I believe the success rate will become lower and lower over time. We downloaded the 45 TB set and our rates look as follows:

Success: 89.9%
Failed to download: 9.5%
Failed to resize: 0.6%

So roughly 10% of the data is already missing.

afang-story commented on July 24, 2024

Hello @pfischer-nvidia,
Sorry for the late reply. Can you confirm whether you have tried this: #3 (comment)?

pfischer-nvidia commented on July 24, 2024

We did change the DNS servers to the Google ones (8.8.8.8 and 8.8.4.4) but we did not change the resolver to bind9. Instead we used dnspython.

rom1504 commented on July 24, 2024

See https://github.com/rom1504/img2dataset#setting-up-a-high-performance-dns-resolver

zzzzzero commented on July 24, 2024

I found that I can increase the success rate by increasing the number of retries: when I set retries=2, I currently get a ~95% success rate.

Just add the retries=2 setting on line 136 of download_upstream.py, like this:

img2dataset.download(
    url_list=str(metadata_dir),
    image_size=args.image_size,
    output_folder=str(shard_dir),
    processes_count=args.processes_count,
    thread_count=args.thread_count,
    resize_mode=args.resize_mode,
    resize_only_if_bigger=not args.no_resize_only_if_bigger,
    encode_format=args.encode_format,
    output_format=args.output_format,
    input_format='parquet',
    url_col='url',
    caption_col='text',
    bbox_col=bbox_col,
    save_additional_columns=['uid'],
    number_sample_per_shard=10000,
    oom_shard_count=8,
    retries=2,  # newly added: retry each failed download up to 2 times
)

Vaishaal commented on July 24, 2024

Updated download_upstream!

alexanderremmerie commented on July 24, 2024

Hi, we are getting a success rate of about 86% (downloading the small dataset, 12 million images):

15it [44:26, 19.65s/it]worker - success: 0.867 - failed to download: 0.128 - failed to resize: 0.005 - images per sec: 4 - count: 10000 total - success: 0.865 - failed to download: 0.130 - failed to resize: 0.005 - images per sec: 56 - count: 150000
16it [44:49, 20.77s/it]worker - success: 0.860 - failed to download: 0.134 - failed to resize: 0.006 - images per sec: 4 - count: 10000 total - success: 0.864 - failed to download: 0.131 - failed to resize: 0.005 - images per sec: 60 - count: 160000

We are using the Knot DNS resolver (8 instances), the default 16 processes, and just 16 threads (instead of the default 128); we did this to slow down the DNS resolution requests. We use an e2-highcpu-32 (Efficient Instance, 32 vCPUs, 32 GB RAM) instance on GCP.

86% is the best we could get, but due to the low number of threads the downloading is very slow. Can you share any more details on how you got the 95% success rate (configuration, machine type, number of threads and processes, GCP/AWS/Azure, DNS resolver, etc.) and any advice to improve speed and success rate? Many thanks!

zzzzzero commented on July 24, 2024

My current download success rate is around 94%. My suggestion is to use wandb to monitor the download process; this way, you can analyze the reasons for download failures more effectively. I've shared the analysis of the download run recorded on my end with wandb; my wandb report link is here. From that analysis it's evident that the primary reason for failures is invalid download links (2% of the links are invalid), while 0.8% are due to IP bans.

Your current download speed appears to be very slow, and bandwidth and CPU don't seem to be fully utilized. It also doesn't look like a DNS server problem. I believe the main reason for the failures could be related to your IP; you could consider trying to download the dataset using an IP address from a different region. If your bandwidth and CPU utilization aren't maxed out, increasing the number of processes and threads might not significantly impact your download success rate. My thread-to-process ratio is 4:1.
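For reference, turning on wandb logging is just two extra arguments to the img2dataset.download(...) call in download_upstream.py: img2dataset exposes enable_wandb and wandb_project for this. A sketch (the project name is a placeholder, and the other arguments stay as in the original call):

img2dataset.download(
    url_list=str(metadata_dir),
    output_folder=str(shard_dir),
    input_format='parquet',
    url_col='url',
    caption_col='text',
    save_additional_columns=['uid'],
    retries=2,
    enable_wandb=True,                  # log per-shard success/failure stats to wandb
    wandb_project='datacomp-download',  # placeholder project name
    # ... remaining arguments as in the original call ...
)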

rom1504 commented on July 24, 2024

#39 (comment) answered there

alexanderremmerie commented on July 24, 2024

This is the wandb output we are getting (https://api.wandb.ai/links/alexander-remmerie/n1kikknk for the full report). As you can see, most of the errors are "network unreachable" and network timeout errors. Knot CPU usage is 2-5%, so it is indeed using Knot as the DNS resolver. Our resolv.conf file:

(base) jupyter@datacomp-ubuntu:/etc$ cat /etc/resolv.conf
nameserver 127.0.0.1

nahidalam commented on July 24, 2024

@rom1504 @Vaishaal Can this not be solved without the DNS (or other networking) setup? Even 92% is not the full download, and how do I know my 92% is the same as @alexanderremmerie's 92%? Is there a way we can move towards more 'immutability' for these open vision-language datasets?
