
Comments (4)

quasiben commented on July 22, 2024

Thanks @joshua-gould. I can easily reproduce this with what you have above. It seems that when block_info is provided, dask takes a code path that assumes an Array collection:

dask/dask/array/core.py

Lines 901 to 926 in 484fc3f

    if has_keyword(func, "block_info"):
        starts = {}
        num_chunks = {}
        shapes = {}
        for i, (arg, in_ind) in enumerate(argpairs):
            if in_ind is not None:
                shapes[i] = arg.shape
                if drop_axis:
                    # We concatenate along dropped axes, so we need to treat them
                    # as if there is only a single chunk.
                    starts[i] = [
                        (
                            cached_cumsum(arg.chunks[j], initial_zero=True)
                            if ind in out_ind
                            else [0, arg.shape[j]]
                        )
                        for j, ind in enumerate(in_ind)
                    ]
                    num_chunks[i] = tuple(len(s) - 1 for s in starts[i])
                else:
                    starts[i] = [
                        cached_cumsum(c, initial_zero=True) for c in arg.chunks
                    ]
                    num_chunks[i] = arg.numblocks
        out_starts = [cached_cumsum(c, initial_zero=True) for c in out.chunks]

This is a problem because meta coerces the resulting collection into a DataFrame, while this code reads Array attributes such as chunks and numblocks. I don't think we can just swap out partitions for chunks here.
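For contrast, here is a minimal sketch of the case this code path is written for: a function mapped over an Array whose output is also an array. The function name `row_start` and the array sizes are illustrative only, and `meta` is passed explicitly so that dask does not have to infer it.

```python
import dask.array as da
import numpy as np


def row_start(block, block_info=None):
    # block_info[0] describes the first input array; "array-location"
    # holds one (start, stop) pair per dimension for the current block.
    y0 = block_info[0]["array-location"][0][0]
    # Fill the block with the row offset at which it starts in the full array.
    return np.full(block.shape, y0, dtype=np.int64)


x = da.zeros((24, 24), chunks=(5, 6))
starts = da.map_blocks(
    row_start, x, dtype=np.int64, meta=np.array((), dtype=np.int64)
)
result = starts.compute()
```

Because the output here is still an Array, `out.chunks` exists and the offsets can be computed. That is the assumption that appears to break once the output is a DataFrame.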

Would it be possible to convert the array to a dataframe and call map_partitions instead?

a.to_dask_dataframe().map_partitions(test2)


joshua-gould commented on July 22, 2024

I'm trying to extract the indices and values of a 2-d array where the value > 0. I have a solution below using dask.delayed:

import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask.array.core import slices_from_chunks

x = da.random.random((24, 24), chunks=(5, 6))
# numpy solution
_x = x.compute()
_indices = np.where(_x > 0)
df = pd.DataFrame({"value": _x[_indices], "y": _indices[0], "x": _indices[1]})


@dask.delayed
def process_chunk(a, offset):
    indices = np.where(a > 0)
    y = indices[0] + offset[0]
    x = indices[1] + offset[1]
    return pd.DataFrame({"value": a[indices], "y": y, "x": x})


output = []
for s in slices_from_chunks(x.chunks):
    r = process_chunk(x[s], (s[0].start, s[1].start))
    output.append(r)
meta = dd.utils.make_meta([("value", x.dtype), ("y", np.int64), ("x", np.int64)])
ddf = dd.from_delayed(output, meta=meta).compute()

# compare with numpy
df = df.sort_values(["y", "x", "value"]).reset_index(drop=True)
ddf = ddf.sort_values(["y", "x", "value"]).reset_index(drop=True)
pd.testing.assert_frame_equal(df, ddf)


quasiben commented on July 22, 2024

Hmm, could you instead do this with nonzero and a mask?

arr = np.array([[-1, 2, 0], [4, -5, 6], [0, 0, 7]])
arr = da.from_array(arr)
indices = da.nonzero(arr > 0)  # or rely on dispatching with np.nonzero
values = arr[arr > 0]
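As a sketch of how the indices and values might be combined afterwards, assuming the result fits in memory: the example below computes all three pieces together and assembles the frame in pandas. The column names follow the earlier comment in this thread. The array is a single chunk here, so indices and values come back in the same row-major order; that ordering has not been checked for other chunkings.

```python
import dask
import dask.array as da
import numpy as np
import pandas as pd

arr = da.from_array(np.array([[-1, 2, 0], [4, -5, 6], [0, 0, 7]]))
mask = arr > 0
y, x = da.nonzero(mask)
values = arr[mask]

# The results have unknown (NaN) chunk sizes until they are computed,
# so compute them in one pass and build the frame with pandas.
y, x, values = dask.compute(y, x, values)
df = pd.DataFrame({"value": values, "y": y, "x": x})
```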


joshua-gould commented on July 22, 2024

Thanks for your response. I'm not sure how to create a dask dataframe using this approach. I tried:

x = da.random.random((24, 24), chunks=(5, 6))
indices = da.where(x > 0)
vals = x.reshape(-1)[indices[0] * x.shape[1] + indices[1]]

ddf = dd.concat(
    [
        dd.from_array(vals, columns=["value"]),
        dd.from_array(da.stack(indices, axis=1, allow_unknown_chunksizes=True), columns=["y", "x"]),
    ],
    axis=1,
).compute()

But I get the warnings:

dask/array/slicing.py:1089: PerformanceWarning: Increasing number of chunks by factor of 20
  p = blockwise(
dask_expr/_concat.py:146: UserWarning: Concatenating dataframes with unknown divisions.
We're assuming that the indices of each dataframes are 
 aligned. This assumption is not generally safe.
