
Comments (5)

ccl-core commented on May 24, 2024

Hello @christian-steinmeyer !

As HF and TFDS have different naming rules, you will have to adapt the dataset name to follow TFDS' naming: in this case, the correct name would be huggingface:imagenet_1k

As a pointer, you can refer to the from_hf_to_tfds function under:

We will update our documentation so that this is clearer for users!
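To illustrate the kind of renaming involved, here is a minimal sketch of a name normalization. It is not the actual from_hf_to_tfds implementation: the helper name hf_name_to_tfds_sketch and its exact rules (lowercasing, "/" becoming "__", other disallowed characters becoming "_") are assumptions made for illustration, and the Hugging Face name "imagenet-1k" is assumed as the input.

```python
import re


def hf_name_to_tfds_sketch(hf_name: str) -> str:
    """Hypothetical approximation of an HF -> TFDS dataset name conversion.

    Assumed rules (not taken from the TFDS source): lowercase the name,
    turn the namespace separator "/" into "__", and replace every other
    character that is not a lowercase letter, digit, or underscore with "_".
    """
    name = hf_name.lower().replace("/", "__")
    return re.sub(r"[^a-z0-9_]", "_", name)


# Build the name that would be passed to tfds.load.
tfds_name = "huggingface:" + hf_name_to_tfds_sketch("imagenet-1k")
print(tfds_name)
```

For names that fall outside these assumed rules, the from_hf_to_tfds function mentioned above remains the authoritative reference.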


christian-steinmeyer commented on May 24, 2024

That worked, thanks! And yes, an update in the documentation would be very helpful!


christian-steinmeyer commented on May 24, 2024

@ccl-core Quick follow-up question: Downloading the dataset worked - however, after generating splits, the load function also includes the step of generating tfrecords (Output "Generating training examples..."), which is pretty slow for me (~20 examples/s). Is there any way to speed this up? I couldn't find anything in the builder config or the download and prepare config. The number of available CPUs doesn't seem to be a factor. For Imagenet-1k, this is taking many hours.


christian-steinmeyer commented on May 24, 2024

Hi again! I found the tfds_num_proc argument of the Hugging Face dataset builder. However, it doesn't seem to be what I'm looking for. With a value equal to my CPU count, or half or a quarter of it, no progress is printed during the "Generating training examples..." step; my RAM just fills up until the process eventually crashes.

tfds.load(
    'huggingface:imagenet_1k',
    data_dir=IMAGE_DIR,
    shuffle_files=True,
    builder_kwargs={"tfds_num_proc": N_JOBS}
)

In the meantime, my original attempt (without builder_kwargs) ran through. However, when I use the dataset in a training run, I get tons of warnings like W tensorflow/core/lib/png/png_io.cc:88] PNG warning: iCCP: known incorrect profile or profile 'ICC Profile': 'RGB ': RGB color space not permitted on grayscale PNG. Both of these look to me like a misconfiguration of the dataset. Or is this expected?


maziarzamani commented on May 24, 2024

@ccl-core Quick follow-up question: Downloading the dataset worked - however, after generating splits, the load function also includes the step of generating tfrecords (Output "Generating training examples..."), which is pretty slow for me (~20 examples/s). Is there any way to speed this up? I couldn't find anything in the builder config or the download and prepare config. The number of available CPUs doesn't seem to be a factor. For Imagenet-1k, this is taking many hours.

Same problem here. It runs ~20 examples/s and eventually after a day or so it crashes.

