
dask-histogram's Introduction

dask-histogram

Scale up histogramming with Dask.


The boost-histogram library provides a performant, object-oriented API for histogramming in Python. Building on that foundation, dask-histogram adds support for lazy histogram computations on Dask collections.
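The lazy evaluation rests on the fact that histogram filling is associative: each chunk can be histogrammed independently and the per-chunk counts combined by addition. A numpy-only sketch of that pattern (illustrative; dask-histogram's actual implementation builds a Dask task graph for these steps):

```python
import numpy as np

# Histogram each chunk independently, then sum the per-chunk counts.
# This is the parallelization pattern dask-histogram exploits.
rng = np.random.default_rng(42)
chunks = [rng.uniform(size=250) for _ in range(4)]  # four "partitions"
edges = np.linspace(0.0, 1.0, 11)                   # 10 bins on [0, 1]
per_chunk = [np.histogram(c, bins=edges)[0] for c in chunks]
total = np.sum(per_chunk, axis=0)                   # combine by addition
```

Because the combine step is a plain sum, the reduction can be arranged as a tree, which is what makes the approach scale on a cluster.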

See the documentation for examples and the API reference.

dask-histogram's People

Contributors

douglasdavis, lgray, martindurant, pre-commit-ci[bot]


dask-histogram's Issues

bug: dask histograms are not recognized as `compute`-able

If I have a dict of objects to compute, like:

out = {
    'mass': Histogram(
        StrCategory(['ZJets', 'Data'], growth=True),
        Regular(30000, 0.25, 300),
        storage=Double(),
    ),  # has staged fills
    'pt': Histogram(
        StrCategory(['ZJets', 'Data'], growth=True),
        Regular(30000, 0.24, 300),
        storage=Double(),
    ),  # has staged fills
    'cutflow': {
        'ZJets_pt': dask.awkward<sum-agg, type=Scalar, dtype=Unknown>,
        'ZJets_mass': dask.awkward<sum-agg, type=Scalar, dtype=Unknown>,
        'Data_pt': dask.awkward<sum-agg, type=Scalar, dtype=Unknown>,
        'Data_mass': dask.awkward<sum-agg, type=Scalar, dtype=Unknown>,
    },
}

where `mass` and `pt` are dask-histogram objects. When I call `dask.compute(out)`, the dask-awkward sums are computed but the histograms are not!

and in fact:

dask.compute(out["mass"])

simply returns the dask-histogram again, as though it is not recognized as a Dask collection at all.
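For background, `dask.compute` decides what is computable via Dask's collection protocol: objects implementing `__dask_graph__`, `__dask_keys__`, `__dask_scheduler__`, and `__dask_postcompute__` get computed, everything else passes through unchanged. A toy class (illustrative only, not dask-histogram's actual code) showing the methods involved:

```python
import dask
from dask.local import get_sync


class Always42:
    """Toy object implementing Dask's collection protocol."""

    def __dask_graph__(self):
        # Task graph: one key mapping to a constant.
        return {"out": 42}

    def __dask_keys__(self):
        return ["out"]

    # Which scheduler to use when computing this collection.
    __dask_scheduler__ = staticmethod(get_sync)

    def __dask_postcompute__(self):
        # (finalize function, extra args) applied to computed keys.
        return lambda results: results[0], ()


assert dask.is_dask_collection(Always42())
(value,) = dask.compute(Always42())
```

An object lacking (or failing) these hooks is exactly what produces the symptom above: `dask.compute` hands it back untouched.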

Feature request: support float weights for histograms

Originally opened in hist as scikit-hep/hist#493.

Being able to use a plain float as a weight would be convenient, but it is currently not supported:

import boost_histogram as bh
import dask.array as da
import dask_histogram as dh

x = da.random.uniform(size=(1000,), chunks=(250,))
ref = bh.Histogram(bh.axis.Regular(10, 0, 1), storage=bh.storage.Weight())
h = dh.factory(x, weights=0.5, histref=ref)  # AttributeError: 'float' object has no attribute 'ndim'
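The error suggests the weights path expects an array-like with an `ndim` attribute. A numpy-only sketch of the scalar-to-array coercion the library would need before that check (the `coerce_weight` helper is hypothetical, named here only for illustration):

```python
import numpy as np


def coerce_weight(weight, data):
    """Hypothetical helper: promote a scalar weight to an array
    matching data's shape, so downstream code can rely on .ndim."""
    w = np.asarray(weight)
    if w.ndim == 0:  # scalar -> broadcast to the data's shape
        w = np.broadcast_to(w, np.shape(data))
    return w


x = np.arange(4.0)
w = coerce_weight(0.5, x)
```

Until something like this lands, a workaround is to pass the weight as an array of the same shape (and chunking) as the data yourself.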

Computing histogram with dask distributed cluster

When running one of the examples with dask.distributed, I get an error. If I comment out the `client = Client()` line, it works.

from dask.distributed import Client
import dask_histogram as dh
import dask_histogram.boost as dhb
import dask.array as da
client = Client()
x = da.random.standard_normal(size=(100_000_000, 2), chunks=(10_000_000, 2))
h = dhb.Histogram(dh.axis.Regular(10, -3, 3), dh.axis.Regular(10, -3, 3), storage=dh.storage.Double())
h.fill(x)
h.compute()

I am using Python 3.8.6 installed with conda, boost-histogram version 1.2.1, dask version 2021.09.1, and dask-histogram version 0.2.1.dev1+g4907a18.
Thank you for your work on this project!
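As a diagnostic aside: deserialization failures like the one below can also arise from a dask/distributed version mismatch between the client environment and the workers. A quick check (sketch; uses an in-process cluster purely for illustration, and `check=True` raises if versions disagree):

```python
from dask.distributed import Client

# In-process cluster: no separate worker processes, no dashboard.
client = Client(processes=False, dashboard_address=None)
versions = client.get_versions(check=True)  # raises on mismatch
client.close()
```

On a real deployment, run the same `get_versions(check=True)` against the actual scheduler address to confirm the environments agree before digging into pickling errors.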

distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 111, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 195, in msgpack._cmsgpack.unpackb
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 103, in _decode_default
    return merge_and_deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 475, in merge_and_deserialize
    return deserialize(header, merged_frames, deserializers=deserializers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 391, in deserialize
    deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 407, in deserialize
    return loads(header, frames)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 86, in pickle_loads
    return pickle.loads(x, buffers=new)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 73, in loads
    return pickle.loads(x, buffers=buffers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/dask/bag/core.py", line 401, in __setstate__
    self.dask, self.key = state
AttributeError: can't set attribute
distributed.core - ERROR - can't set attribute
Traceback (most recent call last):
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/core.py", line 557, in handle_stream
    msgs = await comm.read()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/tcp.py", line 226, in read
    msg = await from_frames(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/utils.py", line 78, in from_frames
    res = _from_frames()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/utils.py", line 61, in _from_frames
    return protocol.loads(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 111, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 195, in msgpack._cmsgpack.unpackb
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 103, in _decode_default
    return merge_and_deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 475, in merge_and_deserialize
    return deserialize(header, merged_frames, deserializers=deserializers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 391, in deserialize
    deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 407, in deserialize
    return loads(header, frames)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 86, in pickle_loads
    return pickle.loads(x, buffers=new)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 73, in loads
    return pickle.loads(x, buffers=buffers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/dask/bag/core.py", line 401, in __setstate__
    self.dask, self.key = state
AttributeError: can't set attribute
distributed.worker - ERROR - can't set attribute
Traceback (most recent call last):
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/worker.py", line 1047, in handle_scheduler
    await self.handle_stream(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/core.py", line 557, in handle_stream
    msgs = await comm.read()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/tcp.py", line 226, in read
    msg = await from_frames(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/utils.py", line 78, in from_frames
    res = _from_frames()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/utils.py", line 61, in _from_frames
    return protocol.loads(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 111, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 195, in msgpack._cmsgpack.unpackb
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 103, in _decode_default
    return merge_and_deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 475, in merge_and_deserialize
    return deserialize(header, merged_frames, deserializers=deserializers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 391, in deserialize
    deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 407, in deserialize
    return loads(header, frames)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 86, in pickle_loads
    return pickle.loads(x, buffers=new)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 73, in loads
    return pickle.loads(x, buffers=buffers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/dask/bag/core.py", line 401, in __setstate__
    self.dask, self.key = state
AttributeError: can't set attribute
distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 111, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 195, in msgpack._cmsgpack.unpackb
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 103, in _decode_default
    return merge_and_deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 475, in merge_and_deserialize
    return deserialize(header, merged_frames, deserializers=deserializers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 391, in deserialize
    deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 407, in deserialize
    return loads(header, frames)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 86, in pickle_loads
    return pickle.loads(x, buffers=new)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 73, in loads
    return pickle.loads(x, buffers=buffers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/dask/bag/core.py", line 401, in __setstate__
    self.dask, self.key = state
AttributeError: can't set attribute
distributed.core - ERROR - can't set attribute
Traceback (most recent call last):
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/core.py", line 557, in handle_stream
    msgs = await comm.read()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/tcp.py", line 226, in read
    msg = await from_frames(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/utils.py", line 78, in from_frames
    res = _from_frames()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/utils.py", line 61, in _from_frames
    return protocol.loads(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 111, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 195, in msgpack._cmsgpack.unpackb
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 103, in _decode_default
    return merge_and_deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 475, in merge_and_deserialize
    return deserialize(header, merged_frames, deserializers=deserializers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 391, in deserialize
    deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 407, in deserialize
    return loads(header, frames)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 86, in pickle_loads
    return pickle.loads(x, buffers=new)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 73, in loads
    return pickle.loads(x, buffers=buffers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/dask/bag/core.py", line 401, in __setstate__
    self.dask, self.key = state
AttributeError: can't set attribute
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f14d26bed60>>, <Task finished name='Task-5' coro=<Worker.handle_scheduler() done, defined at /dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/worker.py:1045> exception=AttributeError("can't set attribute")>)
Traceback (most recent call last):
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/worker.py", line 1047, in handle_scheduler
    await self.handle_stream(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/core.py", line 557, in handle_stream
    msgs = await comm.read()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/tcp.py", line 226, in read
    msg = await from_frames(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/utils.py", line 78, in from_frames
    res = _from_frames()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/utils.py", line 61, in _from_frames
    return protocol.loads(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 111, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 195, in msgpack._cmsgpack.unpackb
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 103, in _decode_default
    return merge_and_deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 475, in merge_and_deserialize
    return deserialize(header, merged_frames, deserializers=deserializers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 391, in deserialize
    deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 407, in deserialize
    return loads(header, frames)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 86, in pickle_loads
    return pickle.loads(x, buffers=new)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 73, in loads
    return pickle.loads(x, buffers=buffers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/dask/bag/core.py", line 401, in __setstate__
    self.dask, self.key = state
AttributeError: can't set attribute
distributed.worker - ERROR - can't set attribute
Traceback (most recent call last):
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/worker.py", line 1047, in handle_scheduler
    await self.handle_stream(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/core.py", line 557, in handle_stream
    msgs = await comm.read()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/tcp.py", line 226, in read
    msg = await from_frames(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/utils.py", line 78, in from_frames
    res = _from_frames()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/utils.py", line 61, in _from_frames
    return protocol.loads(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 111, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 195, in msgpack._cmsgpack.unpackb
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 103, in _decode_default
    return merge_and_deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 475, in merge_and_deserialize
    return deserialize(header, merged_frames, deserializers=deserializers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 391, in deserialize
    deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 407, in deserialize
    return loads(header, frames)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 86, in pickle_loads
    return pickle.loads(x, buffers=new)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 73, in loads
    return pickle.loads(x, buffers=buffers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/dask/bag/core.py", line 401, in __setstate__
    self.dask, self.key = state
AttributeError: can't set attribute
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fa9b10ddd60>>, <Task finished name='Task-5' coro=<Worker.handle_scheduler() done, defined at /dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/worker.py:1045> exception=AttributeError("can't set attribute")>)
Traceback (most recent call last):
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/worker.py", line 1047, in handle_scheduler
    await self.handle_stream(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/core.py", line 557, in handle_stream
    msgs = await comm.read()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/tcp.py", line 226, in read
    msg = await from_frames(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/utils.py", line 78, in from_frames
    res = _from_frames()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/utils.py", line 61, in _from_frames
    return protocol.loads(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 111, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 195, in msgpack._cmsgpack.unpackb
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 103, in _decode_default
    return merge_and_deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 475, in merge_and_deserialize
    return deserialize(header, merged_frames, deserializers=deserializers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 391, in deserialize
    deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 407, in deserialize
    return loads(header, frames)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 86, in pickle_loads
    return pickle.loads(x, buffers=new)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 73, in loads
    return pickle.loads(x, buffers=buffers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/dask/bag/core.py", line 401, in __setstate__
    self.dask, self.key = state
AttributeError: can't set attribute
distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 111, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 195, in msgpack._cmsgpack.unpackb
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 103, in _decode_default
    return merge_and_deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 475, in merge_and_deserialize
    return deserialize(header, merged_frames, deserializers=deserializers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 391, in deserialize
    deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 407, in deserialize
    return loads(header, frames)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 86, in pickle_loads
    return pickle.loads(x, buffers=new)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 73, in loads
    return pickle.loads(x, buffers=buffers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/dask/bag/core.py", line 401, in __setstate__
    self.dask, self.key = state
AttributeError: can't set attribute
distributed.core - ERROR - can't set attribute
Traceback (most recent call last):
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/core.py", line 557, in handle_stream
    msgs = await comm.read()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/tcp.py", line 226, in read
    msg = await from_frames(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/utils.py", line 78, in from_frames
    res = _from_frames()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/utils.py", line 61, in _from_frames
    return protocol.loads(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 111, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 195, in msgpack._cmsgpack.unpackb
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 103, in _decode_default
    return merge_and_deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 475, in merge_and_deserialize
    return deserialize(header, merged_frames, deserializers=deserializers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 391, in deserialize
    deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 407, in deserialize
    return loads(header, frames)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 86, in pickle_loads
    return pickle.loads(x, buffers=new)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 73, in loads
    return pickle.loads(x, buffers=buffers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/dask/bag/core.py", line 401, in __setstate__
    self.dask, self.key = state
AttributeError: can't set attribute
distributed.worker - ERROR - can't set attribute
Traceback (most recent call last):
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/worker.py", line 1047, in handle_scheduler
    await self.handle_stream(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/core.py", line 557, in handle_stream
    msgs = await comm.read()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/tcp.py", line 226, in read
    msg = await from_frames(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/utils.py", line 78, in from_frames
    res = _from_frames()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/utils.py", line 61, in _from_frames
    return protocol.loads(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 111, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 195, in msgpack._cmsgpack.unpackb
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 103, in _decode_default
    return merge_and_deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 475, in merge_and_deserialize
    return deserialize(header, merged_frames, deserializers=deserializers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 391, in deserialize
    deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 407, in deserialize
    return loads(header, frames)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 86, in pickle_loads
    return pickle.loads(x, buffers=new)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 73, in loads
    return pickle.loads(x, buffers=buffers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/dask/bag/core.py", line 401, in __setstate__
    self.dask, self.key = state
AttributeError: can't set attribute
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f2d1b8fed60>>, <Task finished name='Task-5' coro=<Worker.handle_scheduler() done, defined at /dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/worker.py:1045> exception=AttributeError("can't set attribute")>)
Traceback (most recent call last):
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/worker.py", line 1047, in handle_scheduler
    await self.handle_stream(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/core.py", line 557, in handle_stream
    msgs = await comm.read()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/tcp.py", line 226, in read
    msg = await from_frames(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/utils.py", line 78, in from_frames
    res = _from_frames()
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/comm/utils.py", line 61, in _from_frames
    return protocol.loads(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 111, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 195, in msgpack._cmsgpack.unpackb
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/core.py", line 103, in _decode_default
    return merge_and_deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 475, in merge_and_deserialize
    return deserialize(header, merged_frames, deserializers=deserializers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 391, in deserialize
    deserialize(
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 407, in deserialize
    return loads(header, frames)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 86, in pickle_loads
    return pickle.loads(x, buffers=new)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 73, in loads
    return pickle.loads(x, buffers=buffers)
  File "/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/dask/bag/core.py", line 401, in __setstate__
    self.dask, self.key = state
AttributeError: can't set attribute
[The same deserialization traceback, ending in "AttributeError: can't set attribute" inside dask/bag/core.py __setstate__, is then repeated under distributed.protocol.core (CRITICAL), distributed.core, distributed.worker, and tornado.application as the worker shuts down.]
---------------------------------------------------------------------------
KilledWorker                              Traceback (most recent call last)
<ipython-input-3-ff431c4d1e1e> in <module>
----> 1 h.compute()

/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/dask_histogram/boost.py in compute(self)
    401             result_view = self.view(flow=True) + self._staged.compute().view(flow=True)
    402         else:
--> 403             result_view = self._staged.compute().view(flow=True)
    404         self[...] = result_view
    405         self._staged = None

/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/dask/base.py in compute(self, **kwargs)
    286         dask.base.compute
    287         """
--> 288         (result,) = compute(self, traverse=False, **kwargs)
    289         return result
    290 

/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/dask/base.py in compute(*args, **kwargs)
    568         postcomputes.append(x.__dask_postcompute__())
    569 
--> 570     results = schedule(dsk, keys, **kwargs)
    571     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    572 

/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/client.py in get(self, dsk, keys, workers, allow_other_workers, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2687                     should_rejoin = False
   2688             try:
-> 2689                 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2690             finally:
   2691                 for f in futures.values():

/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
   1964             else:
   1965                 local_worker = None
-> 1966             return self.sync(
   1967                 self._gather,
   1968                 futures,

/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    858             return future
    859         else:
--> 860             return sync(
    861                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    862             )

/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    324     if error[0]:
    325         typ, exc, tb = error[0]
--> 326         raise exc.with_traceback(tb)
    327     else:
    328         return result[0]

/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/utils.py in f()
    307             if callback_timeout is not None:
    308                 future = asyncio.wait_for(future, callback_timeout)
--> 309             result[0] = yield future
    310         except Exception:
    311             error[0] = sys.exc_info()

/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

/dark/dmpoehlm/software/miniconda/envs/lm2/lib/python3.8/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1829                             exc = CancelledError(key)
   1830                         else:
-> 1831                             raise exception.with_traceback(traceback)
   1832                         raise exc
   1833                     if errors == "skip":

KilledWorker: ("('standard_normal-hist-on-block-0efa75cec56b59492ad97060b3c80d95', 7)", <WorkerState 'tcp://169.237.42.20:46463', name: 2, memory: 0, processing: 10>)

dask_histogram creates MaterializedLayers which are slow to optimize over multiple partitions

Long story short:

https://github.com/dask-contrib/dask-histogram/blob/main/src/dask_histogram/core.py#L728-L766
and
https://github.com/dask-contrib/dask-histogram/blob/main/src/dask_histogram/core.py#L1019-L1049

create MaterializedLayers because we pass dicts to HighLevelGraph.from_collections, which immediately wraps them in MaterializedLayers. With MaterializedLayers, graph optimization time scales with the number of input partitions, which is untenable for realistic dataset chunk multiplicities.

I have tried to use https://github.com/dask/dask/blob/main/dask/layers.py#L1090 (DataFrameTreeReduction) but that's resulted in some fairly odd outcomes that I don't quite understand.

This must be fixed before we even begin to think about telling people they should use this package for HEP analysis.
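A minimal sketch of the mechanism (assuming current dask behavior): HighLevelGraph.from_collections wraps any plain dict of tasks in a MaterializedLayer, so every task key is enumerated up front and optimization has to walk all of them. The layer name and task below are illustrative, not dask-histogram's actual keys.

```python
from dask.highlevelgraph import HighLevelGraph, MaterializedLayer

# A plain dict of tasks, the same way dask-histogram builds its
# aggregation layer; "toy-agg" is a made-up layer name.
dsk = {("toy-agg", 0): (sum, [1, 2, 3])}
hlg = HighLevelGraph.from_collections("toy-agg", dsk, dependencies=[])

# The dict is wrapped in a MaterializedLayer: all keys exist concretely,
# so the optimizer cannot treat the layer abstractly.
print(type(hlg.layers["toy-agg"]).__name__)  # MaterializedLayer
```

Avoiding this means building a proper Layer subclass (e.g. a Blockwise or tree-reduction layer) instead of handing the scheduler a materialized dict.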

copied hist will retain the original name after computing it

I just found out that if you change an axis name on a copied histogram, the computed histogram retains the original name.
Of course I can create multiple similar histograms by copy-pasting code, but I feel like it shouldn't work this way.
Here is a minimal example; the fill is only there so that compute() doesn't complain about an empty dask graph.

import hist
import hist.dask
import dask.array as da

x = da.random.normal(size=100)
h1 = hist.dask.Hist(hist.axis.Regular(10, -2, 2, name="ax1"))
h2 = h1.copy()
h2.axes.name = ("ax2",)
h2.fill(x)

print(h2.axes.name)
print(h2.compute().axes.name)

Output:

('ax2',)
('ax1',)

Merge results with a tree reduction.

When fills are staged/queued, the resulting histograms are aggregated with delayed(sum)(hists), where hists is a list of delayed filled histograms.

A simple implementation of a tree reduction is copied below, but it would be nice to use existing parts of the dask API if possible; general reduction functions already exist (e.g. for arrays).

Example implementation
import operator
from typing import List

from dask.delayed import Delayed, delayed


def _tree_reduce(hists: List[Delayed]) -> Delayed:
    """Tree summation of delayed histogram objects.

    Parameters
    ----------
    hists : List[Delayed]
        Delayed histograms to be aggregated.

    Returns
    -------
    Delayed
        Final histogram aggregation.

    """
    hist_list = hists
    while len(hist_list) > 1:
        updated_list = []
        # even N, do all
        if len(hist_list) % 2 == 0:
            for i in range(0, len(hist_list), 2):
                lazy_comp = delayed(operator.add)(hist_list[i], hist_list[i + 1])
                updated_list.append(lazy_comp)
        # odd N, hold back the tail and add it later
        else:
            for i in range(0, len(hist_list[:-1]), 2):
                lazy_comp = delayed(operator.add)(hist_list[i], hist_list[i + 1])
                updated_list.append(lazy_comp)
            updated_list.append(hist_list[-1])

        hist_list = updated_list
    return hist_list[0]
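The pairing logic above can be exercised without any dask machinery; here is a plain-Python sketch of the same tree reduction (the function name is illustrative, not part of any API):

```python
import operator

def tree_reduce(items, op=operator.add):
    """Pairwise (tree) aggregation: same shape as the delayed version above."""
    while len(items) > 1:
        # pair off neighbours; an odd tail is carried to the next round
        paired = [op(a, b) for a, b in zip(items[::2], items[1::2])]
        if len(items) % 2:
            paired.append(items[-1])
        items = paired
    return items[0]

print(tree_reduce(list(range(10))))  # 45, reached in 4 rounds instead of 9 serial adds
```

With delayed histograms substituted for the integers, the result is an aggregation whose depth grows as log2(n) rather than a single task summing all n partials.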

Mean storage for histograms

When using a Mean or WeightedMean storage for a histogram to track an average value (the sample kwarg), I get an error.

import dask_histogram.boost as dhb
import dask.array as da
h = dhb.Histogram(
    dhb.axis.Regular(10, 0, 1),
    dhb.axis.Regular(10, 0, 1),
    storage=dhb.storage.Mean(),
)
h.fill(da.random.uniform(size=1_000_000), da.random.uniform(size=1_000_000), sample=da.random.normal(size=1_000_000))
h = h.compute()

Thanks again for your fast responses and for developing this package, it's really helping me to scale up my analysis and finally move away from ROOT!

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_2999949/3335659655.py in <module>
      7 )
      8 h.fill(da.random.uniform(size=1_000_000), da.random.uniform(size=1_000_000), sample=da.random.normal(size=1_000_000))
----> 9 h = h.compute()

/dark/dmpoehlm/software/miniconda/envs/lm/lib/python3.8/site-packages/dask_histogram/boost.py in compute(self)
    401             result_view = self.view(flow=True) + self._staged.compute().view(flow=True)
    402         else:
--> 403             result_view = self._staged.compute().view(flow=True)
    404         self[...] = result_view
    405         self._staged = None

/dark/dmpoehlm/software/miniconda/envs/lm/lib/python3.8/site-packages/dask/base.py in compute(self, **kwargs)
    286         dask.base.compute
    287         """
--> 288         (result,) = compute(self, traverse=False, **kwargs)
    289         return result
    290 

/dark/dmpoehlm/software/miniconda/envs/lm/lib/python3.8/site-packages/dask/base.py in compute(*args, **kwargs)
    568         postcomputes.append(x.__dask_postcompute__())
    569 
--> 570     results = schedule(dsk, keys, **kwargs)
    571     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    572 

/dark/dmpoehlm/software/miniconda/envs/lm/lib/python3.8/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, pool, **kwargs)
     77             pool = MultiprocessingPoolExecutor(pool)
     78 
---> 79     results = get_async(
     80         pool.submit,
     81         pool._max_workers,

/dark/dmpoehlm/software/miniconda/envs/lm/lib/python3.8/site-packages/dask/local.py in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
    515                             _execute_task(task, data)  # Re-execute locally
    516                         else:
--> 517                             raise_exception(exc, tb)
    518                     res, worker_id = loads(res_info)
    519                     state["cache"][key] = res

/dark/dmpoehlm/software/miniconda/envs/lm/lib/python3.8/site-packages/dask/local.py in reraise(exc, tb)
    323     if exc.__traceback__ is not tb:
    324         raise exc.with_traceback(tb)
--> 325     raise exc
    326 
    327 

/dark/dmpoehlm/software/miniconda/envs/lm/lib/python3.8/site-packages/dask/local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    221     try:
    222         task, data = loads(task_info)
--> 223         result = _execute_task(task, data)
    224         id = get_id()
    225         result = dumps((result, id))

/dark/dmpoehlm/software/miniconda/envs/lm/lib/python3.8/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg

/dark/dmpoehlm/software/miniconda/envs/lm/lib/python3.8/site-packages/dask/core.py in <genexpr>(.0)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg

/dark/dmpoehlm/software/miniconda/envs/lm/lib/python3.8/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
    113     """
    114     if isinstance(arg, list):
--> 115         return [_execute_task(a, cache) for a in arg]
    116     elif istask(arg):
    117         func, args = arg[0], arg[1:]

/dark/dmpoehlm/software/miniconda/envs/lm/lib/python3.8/site-packages/dask/core.py in <listcomp>(.0)
    113     """
    114     if isinstance(arg, list):
--> 115         return [_execute_task(a, cache) for a in arg]
    116     elif istask(arg):
    117         func, args = arg[0], arg[1:]

/dark/dmpoehlm/software/miniconda/envs/lm/lib/python3.8/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg

/dark/dmpoehlm/software/miniconda/envs/lm/lib/python3.8/site-packages/dask/optimization.py in __call__(self, *args)
    967         if not len(args) == len(self.inkeys):
    968             raise ValueError("Expected %d args, got %d" % (len(self.inkeys), len(args)))
--> 969         return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
    970 
    971     def __reduce__(self):

/dark/dmpoehlm/software/miniconda/envs/lm/lib/python3.8/site-packages/dask/core.py in get(dsk, out, cache)
    149     for key in toposort(dsk):
    150         task = dsk[key]
--> 151         result = _execute_task(task, cache)
    152         cache[key] = result
    153     result = _execute_task(out, cache)

/dark/dmpoehlm/software/miniconda/envs/lm/lib/python3.8/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg

/dark/dmpoehlm/software/miniconda/envs/lm/lib/python3.8/site-packages/dask/utils.py in apply(func, args, kwargs)
     33 def apply(func, args, kwargs=None):
     34     if kwargs:
---> 35         return func(*args, **kwargs)
     36     else:
     37         return func(*args)

/dark/dmpoehlm/software/miniconda/envs/lm/lib/python3.8/site-packages/dask_histogram/core.py in _blocked_ma(histref, *sample)
     97 ) -> bh.Histogram:
     98     """Blocked calculation; multiargument; unweighted."""
---> 99     return clone(histref).fill(*sample, weight=None)
    100 
    101 

/dark/dmpoehlm/software/miniconda/envs/lm/lib/python3.8/site-packages/boost_histogram/_internal/hist.py in fill(self, weight, sample, threads, *args)
    470 
    471         if threads is None or threads == 1:
--> 472             self._hist.fill(*args_ars, weight=weight_ars, sample=sample_ars)
    473             return self
    474 

ValueError: Sample array must be 1D
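For reference, the quantity a Mean storage tracks per bin can be reproduced eagerly with NumPy. This is a sketch of the semantics only (the bin edges and sizes are illustrative), not of dask-histogram internals:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(size=1_000)      # fill coordinates
s = rng.normal(size=1_000)       # the `sample` values being averaged

edges = np.linspace(0, 1, 11)    # 10 regular bins on [0, 1)
idx = np.digitize(x, edges) - 1  # bin index for each entry

counts = np.bincount(idx, minlength=10)
sums = np.bincount(idx, weights=s, minlength=10)
means = sums / np.maximum(counts, 1)  # per-bin mean of `sample`
```

A blocked lazy fill has to carry (count, mean, variance-of-mean) per bin per partition and merge them correctly, which is why the sample kwarg needs dedicated handling in the blocked fill functions.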

feature request: Filling really large histograms that are distributed across a cluster.

It's often the case in HEP analysis that scientists prefer making a very large (6 dense dimensions, 20-30 bins per dense dim) multidimensional histogram to store their reduced data, and this histogram will often have a large number of categories (between datasets and systematic variations). The full size of such a histogram often exceeds 2 GB and can easily reach 8 GB in practice, depending on the analysis! This creates a clear resource problem when the standard batch slot size in LHC experiments is 2 GB/core, and the most freely available batch slot is 2 GB/1 core. Usually larger allocations are penalized heavily in terms of normal users' batch priority and/or require waiting for node drainage to schedule.

An immediate solution to this would be the ability to scatter the histogram across the accumulated compute resources so that the complete histogram with all its categories is never materialized until it is on the node the user submitted the task from in the first place (which tends to have 10s of GB of memory available). Or to write the histograms immediately to disk somewhere in parts (as a dask dataset too possibly)?

Since a histogram distributed over its dense dimensions is comparatively harder to reassemble, perhaps a first place to start would be allowing a histogram to be distributed by its categorical dimensions. That is, each tuple of categories (dataset, systematic1, systematic2, ...) would be a node in a dask task graph to which we can send data, filling that instance of the histogram for that category tuple. This would restrict the typical per-histogram memory need to about 1 GB each (i.e. just the dense bins), which fits stably in these slots. The user is then free to collect those histogram parts on their local machine and merge them into the final data product.

I'm happy to discuss this further, this is really just writing down thoughts about what would grab the attention of many HEP physicists actively doing analysis (giant multi-dim histograms using too much memory is the most common mode of failure for awkward-based analyses). Structurally I don't think it would be too bad to generate such a structure in dask, I don't know the specifics of really pulling it off, but could likely do so with guidance. We could also probably hide it all beneath the current fairly pleasant interface!
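The category-splitting idea can be sketched with plain NumPy; everything here (the names, the edges, the parts dict) is hypothetical, and in the real scheme each category block would live on a different worker instead of in one process:

```python
import numpy as np

edges = np.linspace(0, 300, 31)  # one dense axis, 30 bins (illustrative)
parts = {}                       # (dataset, systematic) -> dense counts

def fill_part(category, values):
    """Fill only the dense block for one category tuple; never
    materialize the full multi-category histogram during filling."""
    counts, _ = np.histogram(values, bins=edges)
    parts[category] = parts.get(category, np.zeros(30)) + counts

rng = np.random.default_rng(0)
fill_part(("ZJets", "nominal"), rng.uniform(0, 300, size=500))
fill_part(("Data", "nominal"), rng.uniform(0, 300, size=700))

# The complete histogram is only assembled on the submitting node,
# where tens of GB of memory are typically available.
total_entries = sum(p.sum() for p in parts.values())
```

Each value of parts stays at the size of the dense bins alone, so no single worker ever holds the cross product of all categories.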

histogram filling not reproducible except in "processes" scheduler?

@iasonkrom @NJManganelli

@douglasdavis we've run into a problem where the bins filled for the histogram are mangled when running with the sync, threads, and distributed schedulers, but somehow not with processes (that, or we haven't triggered it yet).

The plot below shows the difference between the first run of filling a histogram and then resetting that histogram and filling again with a varying number of threads. If things were OK this would be a flat line!
[image: per-bin difference between the first and repeated fills]

What's interesting is that the total number of events filled does not change but the per-bin contents do!

We're trying to work on a minimal repro - but this is clearly not really acceptable behavior, so posting before we have the complete picture.
