Comments (3)
I've reduced the download code to the following:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mlfoundations/datacomp_medium",
    allow_patterns="*.parquet",
    local_dir="medium/metadata",
    cache_dir="medium/hf",
    local_dir_use_symlinks=False,
    repo_type="dataset",
    resume_download=True,
)
Even with resume_download=True, it still downloads the same files again from scratch every time after an error.
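Until resume works, one stopgap is to just retry the whole call in a loop. A minimal sketch; the wrapper name and its parameters are my own, not part of huggingface_hub:

```python
import time

def retry_download(download_fn, max_retries=10, wait_seconds=10):
    """Call download_fn until it succeeds or max_retries is exhausted."""
    for attempt in range(1, max_retries + 1):
        try:
            return download_fn()
        except Exception as exc:
            print(f"attempt {attempt} failed: {exc}")
            time.sleep(wait_seconds)
    raise RuntimeError(f"download failed after {max_retries} attempts")
```

Used as, e.g., retry_download(lambda: snapshot_download(...)). Files that already completed in local_dir should not need to be re-fetched on later attempts, though as noted above that is exactly what seems broken.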
from datacomp.
Same here. Is there any solution?
A temporary workaround is to capture the URLs as they are requested and then download them manually.
Change download_upstream.py:

# add at the beginning
from tqdm import tqdm

class QuietTqdm(tqdm):
    # Silence per-file progress bars so only the printed URLs reach stdout.
    def __init__(self, *a, **kw):
        kw["disable"] = True
        super().__init__(*a, **kw)

# change the snapshot arguments
hf_snapshot_args = dict(
    repo_id=hf_repo,
    allow_patterns="*.parquet",
    local_dir=metadata_dir,
    cache_dir=cache_dir,
    local_dir_use_symlinks=False,
    repo_type="dataset",
    max_workers=1,
    tqdm_class=QuietTqdm,
)

# delete this line: print(f"Downloading metadata to {metadata_dir}...")
Then edit site-packages/huggingface_hub/file_download.py: find line 1245 and add a print and an early return directly below it.

# find this line (around line 1245 in the installed version)
url = hf_hub_url(repo_id, filename, repo_type=repo_type, revision=revision, endpoint=endpoint)
# add the print and return statements
print(url)
return "none"
Finally, run the downloader:
HF_HUB_DISABLE_PROGRESS_BARS=1 python download_upstream.py --scale xlarge --data_dir data/datacomp --skip_shards > urls.txt
This gives you a list of roughly 24K URLs to download manually. All that is left is a utility that can batch-download a URL list, and you have the metadata.