Comments (8)
Thank you all for the minimal reproducer and the information about versions. If folks want to take this further, the next step is probably to use git bisect to find the change that caused this. My guess is that this would make it much easier to diagnose.
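For anyone who has not automated a bisect before, here is a minimal sketch of a check script that `git bisect run` could call at each revision. The exit codes are the ones git documents (0 for good, 125 for "skip this revision", other values from 1 to 127 for bad). Everything else is an assumption for illustration: `workload` is a hypothetical stand-in for the reproducer in this issue, and the 100 MiB limit is an arbitrary threshold. Note also that `tracemalloc` only sees allocations made through Python's allocators, so for a real run a timing check or a process-level memory measurement may be a better signal.

```python
import tracemalloc

# Exit codes understood by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125


def classify(peak_bytes, limit_bytes):
    """Map a measured peak allocation to a bisect exit code."""
    return GOOD if peak_bytes <= limit_bytes else BAD


def measure_peak(workload):
    """Run workload() and return the peak number of bytes traced by tracemalloc."""
    tracemalloc.start()
    try:
        workload()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def workload():
    # Hypothetical stand-in: replace with the reproducer from the issue.
    data = bytearray(1_000_000)
    return len(data)


def main(limit_bytes=100 * 1024**2):
    """Return the exit code for the revision that is currently checked out."""
    try:
        peak = measure_peak(workload)
    except ImportError:
        # The package cannot even be imported at this revision: ask git to skip it.
        return SKIP
    return classify(peak, limit_bytes)
```

To use it, end the script with `raise SystemExit(main())`, mark one known good and one known bad revision, and then run `git bisect run python check.py` (the file name `check.py` is made up here).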
from dask.
I also tested this on a Windows computer, with a similar Python install: same issue.
- 2024.1.1: (almost) no memory use
- 2024.5.1: about 13 GB memory use, and takes about 1 minute
from dask.
I tested this on my Windows computer and I get the same behaviour:
- 2024.5.1 and 2024.4.1: about 21 GB memory use, taking about 30 secs
- 2024.1.1: Instant and no memory use
from dask.
I ran git bisect to locate the bad commit, testing through about 80 revisions. Here is the final output:
f51fa77f4cbba7a92a54a760da65a4b6e712e4ad is the first bad commit
commit f51fa77f4cbba7a92a54a760da65a4b6e712e4ad
Author: crusaderky <[email protected]>
Date: Tue Feb 6 17:50:42 2024 +0000
Make tokenization more deterministic (#10876)
dask/array/ufunc.py | 4 +-
dask/base.py | 176 ++++++------
dask/tests/test_base.py | 12 +-
dask/tests/test_delayed.py | 12 +-
dask/tests/test_highgraph.py | 6 +-
dask/tests/test_tokenize.py | 670 +++++++++++++++++++++++++++----------------
dask/utils.py | 18 +-
7 files changed, 534 insertions(+), 364 deletions(-)
The commit is this one here. There seem to be a lot of changes to the base.py file, where the problem could have arisen.
from dask.
Thanks for doing that work!
Looking more closely at the commit, it looks like some code that used to handle memmap files was removed.
f51fa77#diff-10422b02c591d63ee295724faa14f7698b4a742c98ba20771c5f70d1a6926d06L1309-L1327
cc @fjetter when he gets back (on PTO currently) or maybe @jrbourbeau if he wants a quick fix
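To illustrate why special handling of memory-mapped files can matter for memory use and run time, here is a rough sketch contrasting two ways of deriving a token for file-backed data. This is an assumption about the general idea only, not a description of what dask did or does: `content_token` and `metadata_token` are hypothetical helper names, and the choice of metadata fields is mine.

```python
import hashlib
import os


def content_token(path, chunk_size=1 << 20):
    """Hash every byte of the file: the cost grows with the file size."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def metadata_token(path, dtype, shape, offset=0):
    """Hash only file metadata plus the array description: the file is never read."""
    st = os.stat(path)
    key = repr(
        (os.path.abspath(path), st.st_mtime_ns, st.st_size, dtype, tuple(shape), offset)
    )
    return hashlib.sha1(key.encode()).hexdigest()
```

The trade-off is that a metadata-based token is cheap but only as trustworthy as the file's modification time and size, whereas a content-based token has to touch every byte, which is where multi-gigabyte files become expensive.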
from dask.
Excellent! Would this be a regression? Or is there some other way of "lazily" loading binary data like with dask?
from dask.
Thing used to work. Thing no longer works. Sounds like a regression to me.
I'd say most people doing lazy loading of array data tend to use more established formats like Zarr or HDF today. Memory-mapped NumPy files are rare these days (which is probably why this issue went unreported for so long).
If someone was starting a new project today I'd point them to https://zarr.dev/ which seems to be becoming the standard surprisingly quickly across a variety of fields.
from dask.
For most of our work with large array data, we do use Zarr. The reason for working with binary data here is to load data generated by a type of very fast camera; the raw data itself comes out as binary files.
For all of my own data processing needs it is zarr all the way :)
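For readers unfamiliar with the pattern, here is a minimal sketch of reading a single frame from a raw binary file through a memory map, so that the rest of the file is not loaded. The file layout is an assumption made up for illustration (frames stored back to back, no header, 2 bytes per pixel), and `read_frame` is a hypothetical helper; it uses only the standard library's `mmap` module, while NumPy's `np.memmap` offers the same mechanism with array semantics.

```python
import mmap


def read_frame(path, index, height, width, itemsize=2):
    """Return the raw bytes of one frame without reading the rest of the file.

    Assumes frames are stored contiguously with no header (hypothetical layout).
    """
    frame_bytes = height * width * itemsize
    start = index * frame_bytes
    with open(path, "rb") as f:
        # Map the whole file read-only; pages are loaded only when touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            if index < 0 or start + frame_bytes > len(mm):
                raise IndexError("frame index out of range")
            # Slicing copies just this frame's bytes out of the mapping.
            return mm[start:start + frame_bytes]
```

Calling `read_frame("capture.bin", 10, 512, 512)` (file name made up) would return the bytes of the eleventh frame, which can then be decoded into pixel values.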
from dask.