
Comments (3)

sedol1339 commented on July 24, 2024

I've reduced the download code to the following:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id='mlfoundations/datacomp_medium',
    allow_patterns='*.parquet',
    local_dir='medium/metadata',
    cache_dir='medium/hf',
    local_dir_use_symlinks=False,
    repo_type='dataset',
    resume_download=True,
)

Even with resume_download=True, it re-downloads the same files from scratch after every error.

from datacomp.

LengSicong commented on July 24, 2024

same here, any solution?


simon-ging commented on July 24, 2024

A temporary workaround is to capture the URLs that would be downloaded and then download them manually.

First, change download_upstream.py:

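# add at the beginning (import tqdm if the script does not already do so)
from tqdm import tqdm

class QuietTqdm(tqdm):
    # Progress bar that is always disabled, so it prints nothing.
    def __init__(self, *a, **kw):
        kw["disable"] = True
        super().__init__(*a, **kw)

# change the snapshot arguments to
    hf_snapshot_args = dict(
        repo_id=hf_repo,
        allow_patterns="*.parquet",
        local_dir=metadata_dir,
        cache_dir=cache_dir,
        local_dir_use_symlinks=False,
        repo_type="dataset",
        max_workers=1,  # process files one at a time
        tqdm_class=QuietTqdm,
    )

# delete this line so that stdout contains only URLs:  print(f"Downloading metadata to {metadata_dir}...")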

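Next, locate site-packages/huggingface_hub/file_download.py in your Python environment.

Find line 1245 (the exact line number may differ between huggingface_hub versions) and add the print and return statements directly after it: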

# find this line (1245)
    url = hf_hub_url(repo_id, filename, repo_type=repo_type, revision=revision, endpoint=endpoint)
# add the print and return statement
    print(url)
    return "none"
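If editing an installed package is undesirable, a similar URL list can probably be produced without patching anything. The sketch below is an alternative under stated assumptions, not the method described in this comment: it relies on huggingface_hub's list_repo_files and hf_hub_url, and the helper names build_urls and list_parquet_urls are made up for illustration.

```python
from fnmatch import fnmatch


def build_urls(filenames, make_url, pattern="*.parquet"):
    """Return one URL per filename that matches the glob pattern."""
    return [make_url(name) for name in filenames if fnmatch(name, pattern)]


def list_parquet_urls(repo_id):
    """List download URLs for every parquet file in a dataset repo."""
    # Imported lazily so build_urls stays usable without huggingface_hub.
    from huggingface_hub import hf_hub_url, list_repo_files

    # Assumption: the repo is a public dataset repo on the Hub.
    files = list_repo_files(repo_id, repo_type="dataset")
    return build_urls(
        files,
        lambda name: hf_hub_url(repo_id, name, repo_type="dataset"),
    )
```

For example, print("\n".join(list_parquet_urls("mlfoundations/datacomp_medium"))) redirected to a file would give a URL list for the repo named in the first comment. Whether the result matches the list produced by download_upstream.py for a given scale has not been checked here.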

Finally, run the downloader and redirect its output to a file:

HF_HUB_DISABLE_PROGRESS_BARS=1 python download_upstream.py --scale xlarge --data_dir data/datacomp --skip_shards > urls.txt

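This gives you a list of ~24K URLs in urls.txt. Any utility that can batch-download a list of URLs will then fetch the metadata for you. Remember to undo the edit to file_download.py afterwards, because the added return statement stops huggingface_hub from downloading anything.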
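Any batch downloader will do for this last step. As a minimal sketch using only the Python standard library, the helpers below (local_path_for, fetch, download_all and read_urls are hypothetical names) skip files that already exist, so rerunning after an error continues where the previous run stopped. It assumes the URLs have the form .../resolve/<revision>/<file path> and that no authentication is needed.

```python
import os
import shutil
import urllib.parse
import urllib.request


def local_path_for(url, out_dir):
    """Map a Hub file URL to a local path under out_dir."""
    path = urllib.parse.unquote(urllib.parse.urlparse(url).path)
    # Assumption: URLs look like .../resolve/<revision>/<file path>.
    if "/resolve/" in path:
        rel = path.split("/resolve/", 1)[1].split("/", 1)[-1]
    else:
        rel = os.path.basename(path)
    return os.path.join(out_dir, *rel.split("/"))


def fetch(url, dest):
    """Download url to dest, writing to a .part file first."""
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    tmp = dest + ".part"
    with urllib.request.urlopen(url) as resp, open(tmp, "wb") as fh:
        shutil.copyfileobj(resp, fh)
    # Rename only after a complete download, so partial files never count as done.
    os.replace(tmp, dest)


def download_all(urls, out_dir, fetch_fn=fetch):
    """Download each URL unless its file already exists; return the failed URLs."""
    failed = []
    for url in urls:
        dest = local_path_for(url, out_dir)
        if os.path.exists(dest):
            continue  # finished in an earlier run
        try:
            fetch_fn(url, dest)
        except OSError as err:  # urllib.error.URLError is a subclass of OSError
            print(f"failed: {url} ({err})")
            failed.append(url)
    return failed


def read_urls(list_path):
    """Read a URL list, ignoring blank lines and any non-URL output."""
    with open(list_path) as fh:
        return [line.strip() for line in fh if line.strip().startswith("http")]
```

Calling download_all(read_urls("urls.txt"), "data/datacomp/metadata") would fetch everything in the list; the output directory here is only a guess and should match wherever the rest of the pipeline expects the metadata. The function returns the URLs that failed, and running it again retries only those.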
