Comments (8)

mrocklin commented on July 22, 2024

Thank you all for the minimal reproducer and the information about versions. If folks want to take this further, the next step is probably to use git bisect to find the change that caused this. My guess is that this would make the problem much easier to diagnose.
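
A bisect like this can be automated with a small check script. The sketch below is only an illustration: the reproducer (wrapping a large np.memmap with da.from_array), the file size, the time limit, and the --bisect flag are all assumptions, not the reproducer from this issue. It also assumes dask is installed in development mode from the checkout being bisected, so that each revision git checks out is the one that gets imported.

```python
"""Hypothetical `git bisect run` helper: exit 0 = good, 1 = bad, 125 = skip."""
import os
import sys
import tempfile
import time

TIME_LIMIT_S = 5.0        # assumed cut-off between "instant" and "tens of seconds"
N_BYTES = 2 * 1024**3     # assumed size of the test file (2 GiB)


def timed(fn):
    """Return the wall-clock seconds taken by calling fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def verdict(elapsed, limit=TIME_LIMIT_S):
    """Map a timing to the exit code git bisect expects (0 good, 1 bad)."""
    return 0 if elapsed < limit else 1


def reproducer():
    """Assumed reproducer: wrap a large memory-mapped file in a dask array."""
    import numpy as np        # imported here so import failures can be skipped
    import dask.array as da

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "big.bin")
        mm = np.memmap(path, dtype="uint8", mode="w+", shape=(N_BYTES,))
        da.from_array(mm, chunks=N_BYTES // 16)
        del mm  # release the mapping so the directory can be removed


def main():
    try:
        elapsed = timed(reproducer)
    except ImportError:
        return 125  # revision cannot be imported: ask git bisect to skip it
    return verdict(elapsed)


# The flag keeps the slow reproducer from running unless explicitly requested.
if __name__ == "__main__" and "--bisect" in sys.argv:
    sys.exit(main())
```

Saved as, say, check_memmap.py outside the repository, it could be driven with `git bisect start`, `git bisect bad <bad revision>`, `git bisect good <good revision>`, and then `git bisect run python /path/to/check_memmap.py --bisect`.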

from dask.

magnunor commented on July 22, 2024

I also tested this on a Windows computer, with a similar python install: same issue.

  • 2024.1.1: (almost) no memory use
  • 2024.5.1: about 13 GB memory use, taking about 1 minute

sivborg commented on July 22, 2024

I tested this on my Windows computer and I get the same behaviour:

  • 2024.5.1 and 2024.4.1: about 21 GB memory use, taking about 30 secs
  • 2024.1.1: Instant and no memory use

sivborg commented on July 22, 2024

I ran git bisect to locate the bad commit, testing about 80 revisions. Here is the final output:

f51fa77f4cbba7a92a54a760da65a4b6e712e4ad is the first bad commit
commit f51fa77f4cbba7a92a54a760da65a4b6e712e4ad
Author: crusaderky <[email protected]>
Date:   Tue Feb 6 17:50:42 2024 +0000

    Make tokenization more deterministic (#10876)

 dask/array/ufunc.py          |   4 +-
 dask/base.py                 | 176 ++++++------
 dask/tests/test_base.py      |  12 +-
 dask/tests/test_delayed.py   |  12 +-
 dask/tests/test_highgraph.py |   6 +-
 dask/tests/test_tokenize.py  | 670 +++++++++++++++++++++++++++----------------
 dask/utils.py                |  18 +-
 7 files changed, 534 insertions(+), 364 deletions(-)

The commit is f51fa77, "Make tokenization more deterministic" (#10876). There are a lot of changes to the base.py file, which is where the problem could have arisen.

mrocklin commented on July 22, 2024

Thanks for doing that work!

Looking more closely at the commit it looks like some code was removed that used to handle memmap files.

f51fa77#diff-10422b02c591d63ee295724faa14f7698b4a742c98ba20771c5f70d1a6926d06L1309-L1327
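
The removed code is not reproduced here. As a rough illustration of why special handling for file-backed arrays matters, the sketch below builds a token from file metadata (path, modification time, size, dtype, shape, offset) instead of hashing the bytes, so its cost does not depend on the file size. The helper name file_backed_token and the choice of fields are hypothetical; this is not dask's implementation.

```python
import hashlib
import os
import tempfile


def file_backed_token(filename, dtype, shape, offset=0):
    """Token built from file metadata only; the file contents are never read."""
    st = os.stat(filename)
    key = (
        os.path.abspath(filename),
        st.st_mtime_ns,   # changes when the file is rewritten
        st.st_size,
        str(dtype),
        tuple(shape),
        offset,
    )
    return hashlib.sha1(repr(key).encode()).hexdigest()


# Small demonstration with a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frames.bin")
    with open(path, "wb") as f:
        f.write(b"\x00" * 1024)

    first = file_backed_token(path, "uint8", (1024,))
    second = file_backed_token(path, "uint8", (1024,))
    other_view = file_backed_token(path, "uint16", (512,))
```

The trade-off of any metadata-based token is that an edit which preserves both the size and the modification time would go unnoticed, whereas hashing the contents catches it at the price of reading the whole file.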

cc @fjetter when he gets back (on PTO currently) or maybe @jrbourbeau if he wants a quick fix

magnunor commented on July 22, 2024

Excellent! Would this be a regression? Or is there some other way of "lazily" loading binary data like with dask?
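
One possible stopgap, assuming the slowdown comes from dask hashing the memmap contents to name the array: da.from_array accepts name=False, which generates a random name instead of hashing the input. The sketch below is untested against the affected versions; the helper names, the frame layout, and the header_bytes parameter are made up for illustration.

```python
import os
import tempfile

import numpy as np


def open_frames(path, frame_shape, dtype, header_bytes=0):
    """Memory-map a raw binary file as (n_frames, *frame_shape) without reading it."""
    dtype = np.dtype(dtype)
    frame_bytes = dtype.itemsize * int(np.prod(frame_shape))
    n_frames = (os.path.getsize(path) - header_bytes) // frame_bytes
    return np.memmap(path, dtype=dtype, mode="r", offset=header_bytes,
                     shape=(n_frames, *frame_shape))


def as_dask(mm, frames_per_chunk):
    """Wrap the memmap lazily; name=False requests a random name, not a content hash."""
    import dask.array as da  # imported lazily so open_frames works without dask

    chunks = (frames_per_chunk,) + mm.shape[1:]
    return da.from_array(mm, chunks=chunks, name=False)


# Demonstration on a tiny synthetic file: 6 frames of 4x5 uint16 pixels.
with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "camera.raw")
    np.arange(6 * 4 * 5, dtype=np.uint16).tofile(raw_path)

    frames = open_frames(raw_path, frame_shape=(4, 5), dtype=np.uint16)
    frames_shape = frames.shape
    first_pixel_of_frame_1 = int(frames[1, 0, 0])
    del frames  # release the mapping before the directory is removed
```

With a random name, dask can no longer recognise two arrays created from the same file as identical, so this trades deduplication for speed.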

mrocklin commented on July 22, 2024

Thing used to work. Thing no longer works. Sounds like a regression to me 🙂

I'd say most people doing lazy loading of array data tend to use more established formats like Zarr or HDF today. memmapped numpy files are rare these days (probably why this issue went unreported for so long).

If someone was starting a new project today I'd point them to https://zarr.dev/ which seems to be becoming the standard surprisingly quickly across a variety of fields.

magnunor commented on July 22, 2024

For most of our work with large array data, we do use Zarr. The reason we have to work with binary data here is that it is generated by a type of very fast camera, so the raw data itself comes out as binary files.

For all of my own data processing needs it is zarr all the way :)
