Comments (6)
Parallelization is done by processing each config in a separate process (datasets/tensorflow_datasets/scripts/cli/build.py, lines 314 to 315 in 69e781f).
This means that there's no reason to set --num-processes higher than the number of distinct configs for your dataset.
The way multiprocessing works in Python, each builder (including references to the Mk0 module) is pickled in the main process and then unpickled in the child processes. During unpickling, Python must be able to import Mk0, which is most probably not installed in your Python environment.
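To illustrate the failure mode, here is a standalone sketch (not from the thread; `mk0_demo` is a made-up stand-in for the Mk0 module): pickling only records the class's module and name, so unpickling in a process that cannot import that module fails.

```python
import pickle
import sys
import types

# Build a throwaway module with a class, mimicking a dataset builder defined
# in a module (like Mk0) that is importable only in the parent process.
mod = types.ModuleType("mk0_demo")
exec("class Builder:\n    pass", mod.__dict__)
sys.modules["mk0_demo"] = mod

payload = pickle.dumps(mod.Builder())  # pickle stores just "mk0_demo.Builder"
del sys.modules["mk0_demo"]            # simulate a child that can't import it

try:
    pickle.loads(payload)              # unpickling re-imports the module...
    import_failed = False
except ModuleNotFoundError:            # ...and fails, as in this issue
    import_failed = True
```

This is why making the builder's module importable (e.g. installing it) fixes the child-process error.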
You could make your dataset builder code discoverable by Python (e.g. by including it in TFDS and reinstalling the library locally). Alternatively, you could implement parallelization inside your dataset builder, or use Beam.
#5279 should make children processes aware of your dataset builder.
I can now run with the --num-processes flag which is pretty cool. Thank you for fixing that!
This means that there's no reason to set --num-processes higher than the number of distinct configs for your dataset.
Is it possible to effectively split my dataset into different configs then merge them together when done?
Or perhaps you could consider implementing parallelization in your dataset builder
My code unfortunately isn't the slow part. It's my abuse of TFDS, due to my having 138 features in my dataset. My code takes 2.5 seconds per 3000 examples, while generation happens at 80 examples per second.
I can now run with the --num-processes flag which is pretty cool. Thank you for fixing that!
Great, really glad that it worked for you! I'll be closing this issue then.
Is it possible to effectively split my dataset into different configs then merge them together when done?
TFDS doesn't natively support mixing datasets, but you can use some other tools for that, e.g. https://github.com/google/seqio
My code unfortunately isn't the slow part. It's my abuse of TFDS, due to my having 138 features in my dataset. My code takes 2.5 seconds per 3000 examples, while generation happens at 80 examples per second.
It's usually very straightforward to parallelize example generation:
Or with Beam: datasets/tensorflow_datasets/datasets/beir/beir_dataset_builder.py, lines 302 to 320 in 38727f7.
Ok. Thank you for your help with this. I'll look into parallel example generation and see if I can get that working with my dataset.
Hi, I have unfortunately encountered the same issue and had no luck trying tf-nightly or multiprocessing with Python 3.9/3.10 (3.10 was necessary to use tensorflow-datasets==4.9.4 and the nightly versions with the fix) on macOS. However, I think I managed to bypass this using BUILDER_CONFIGS. Let's say I want to generate the dataset in parallel with 10 workers; I simply need to define 10 configs as below,
from typing import ClassVar

import tensorflow_datasets as tfds

class Builder(tfds.core.GeneratorBasedBuilder, skip_registration=True):
    VERSION = tfds.core.Version("1.0.0")
    BUILDER_CONFIGS: ClassVar[list[tfds.core.BuilderConfig]] = [
        tfds.core.BuilderConfig(name=str(group)) for group in range(1, 11)
    ]
then run the CLI per group, one by one:
tfds build my_dataset --config 1
Later we can load and combine the datasets as below:
ds1 = tfds.load('my_dataset/1', split='train')
ds2 = tfds.load('my_dataset/2', split='train')
ds = tf.data.Dataset.sample_from_datasets([ds1, ds2])
I tested locally and it does seem to work. Please let me know if there are any hidden traps 🙌 If not, I hope it helps others who are facing the same issue!