map_blocks returning pd.DataFrame fails with block_info parameter (dask, open issue)

joshua-gould commented on July 22, 2024

map_blocks returning pd.DataFrame fails with block_info parameter

Comments (4)

quasiben commented on July 22, 2024

Thanks @joshua-gould. I can easily reproduce with what you have above. It seems that when block_info is provided, dask takes a code path that assumes an Array collection:

dask/dask/array/core.py

Lines 901 to 926 in 484fc3f

if has_keyword(func, "block_info"):
    starts = {}
    num_chunks = {}
    shapes = {}

    for i, (arg, in_ind) in enumerate(argpairs):
        if in_ind is not None:
            shapes[i] = arg.shape
            if drop_axis:
                # We concatenate along dropped axes, so we need to treat them
                # as if there is only a single chunk.
                starts[i] = [
                    (
                        cached_cumsum(arg.chunks[j], initial_zero=True)
                        if ind in out_ind
                        else [0, arg.shape[j]]
                    )
                    for j, ind in enumerate(in_ind)
                ]
                num_chunks[i] = tuple(len(s) - 1 for s in starts[i])
            else:
                starts[i] = [
                    cached_cumsum(c, initial_zero=True) for c in arg.chunks
                ]
                num_chunks[i] = arg.numblocks
    out_starts = [cached_cumsum(c, initial_zero=True) for c in out.chunks]

This is a problem because meta coerces the resulting collection into a DataFrame, and I don't think we can just swap out partitions for chunks here.
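For readers unfamiliar with the excerpt from core.py above, here is a rough, standard-library-only sketch of what the drop_axis branch appears to compute: per-axis chunk boundaries, with dropped axes collapsed to a single chunk. The function name axis_starts, the string axis labels, and the example chunking are illustrative assumptions, and itertools.accumulate stands in for dask's cached_cumsum, which is assumed here to return cumulative sums with a leading zero.

```python
from itertools import accumulate


def axis_starts(chunks, shape, in_ind, out_ind):
    """Sketch of the drop_axis branch: per-axis chunk boundaries.

    Axes kept in the output get cumulative boundaries with a leading zero;
    dropped axes are treated as one chunk spanning the whole axis.
    """
    return [
        [0, *accumulate(chunks[j])] if ind in out_ind else [0, shape[j]]
        for j, ind in enumerate(in_ind)
    ]


# Hypothetical 24x24 array chunked as (5, 6), with the second axis dropped.
chunks = ((5, 5, 5, 5, 4), (6, 6, 6, 6))
starts = axis_starts(chunks, shape=(24, 24), in_ind=("i", "j"), out_ind=("i",))

# Same expression as in the excerpt: boundaries minus one gives the chunk count.
num_chunks = tuple(len(s) - 1 for s in starts)

print(starts)
print(num_chunks)
```

Every quantity here is derived from chunks and shape, which is why the code path only makes sense for a collection that exposes array-style chunk metadata.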

Would it be possible to convert the array to a dataframe and call map_partitions instead?

a.to_dask_dataframe().map_partitions(test2)

joshua-gould commented on July 22, 2024

I'm trying to extract the indices and values of a 2-d array where the value > 0. I have a solution below using dask.delayed:

import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask.array.core import slices_from_chunks

x = da.random.random((24, 24), chunks=(5, 6))
# numpy solution
_x = x.compute()
_indices = np.where(_x > 0)
df = pd.DataFrame({"value": _x[_indices], "y": _indices[0], "x": _indices[1]})


@dask.delayed
def process_chunk(a, offset):
    indices = np.where(a > 0)
    y = indices[0] + offset[0]
    x = indices[1] + offset[1]
    return pd.DataFrame({"value": a[indices], "y": y, "x": x})


output = []
for s in slices_from_chunks(x.chunks):
    r = process_chunk(x[s], (s[0].start, s[1].start))
    output.append(r)
meta = dd.utils.make_meta([("value", x.dtype), ("y", np.int64), ("x", np.int64)])
ddf = dd.from_delayed(output, meta=meta).compute()

# compare with numpy
df = df.sort_values(["y", "x", "value"]).reset_index(drop=True)
ddf = ddf.sort_values(["y", "x", "value"]).reset_index(drop=True)
pd.testing.assert_frame_equal(df, ddf)
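The delayed solution above relies on each block knowing its own offset in the full array. As a minimal sketch of that idea using only the standard library, the snippet below splits a nested list into blocks, finds positive entries per block, and shifts the local coordinates by the block's start. The helper chunk_slices is a hypothetical stand-in for slices_from_chunks (assumed to yield one tuple of slices per block), and the grid values and chunk sizes are made up for illustration.

```python
from itertools import accumulate, product


def chunk_slices(chunks):
    """Hypothetical stand-in for slices_from_chunks: one tuple of slices per block."""
    per_axis = []
    for sizes in chunks:
        starts = [0, *accumulate(sizes)]
        per_axis.append([slice(s, s + n) for s, n in zip(starts, sizes)])
    return list(product(*per_axis))


def positive_entries(rows, offset=(0, 0)):
    """Return (value, y, x) for every entry > 0, shifted by the block offset."""
    return [
        (v, y + offset[0], x + offset[1])
        for y, row in enumerate(rows)
        for x, v in enumerate(row)
        if v > 0
    ]


# Made-up 4x6 grid with a mix of positive and non-positive values.
grid = [[(y * 7 + x * 3) % 5 - 2 for x in range(6)] for y in range(4)]
chunks = ((3, 1), (4, 2))

per_block = []
for ys, xs in chunk_slices(chunks):
    block = [row[xs] for row in grid[ys]]
    # The slice starts play the role of (s[0].start, s[1].start) above.
    per_block.extend(positive_entries(block, (ys.start, xs.start)))

# Block-wise results match a single pass over the whole grid.
assert sorted(per_block) == sorted(positive_entries(grid))
```

Because each block carries its own offset, the blocks can be processed independently and concatenated in any order, which is the property the dd.from_delayed approach depends on.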

quasiben commented on July 22, 2024

Hmm, could you instead do this with nonzero and a mask?

import dask.array as da
import numpy as np

arr = np.array([[-1, 2, 0], [4, -5, 6], [0, 0, 7]])
arr = da.from_array(arr)
indices = da.nonzero(arr > 0)  # or rely on dispatching with np.nonzero
values = arr[arr > 0]  # boolean mask selects the positive values

joshua-gould commented on July 22, 2024

Thanks for your response. I'm not sure how to create a dask dataframe using this approach. I tried:

x = da.random.random((24, 24), chunks=(5, 6))
indices = da.where(x > 0)
vals = x.reshape(-1)[indices[0] * x.shape[1] + indices[1]]

ddf = dd.concat(
    [
        dd.from_array(vals, columns=["value"]),
        dd.from_array(da.stack(indices, axis=1, allow_unknown_chunksizes=True), columns=["y", "x"]),
    ],
    axis=1,
).compute()

But I get the warnings:

dask/array/slicing.py:1089: PerformanceWarning: Increasing number of chunks by factor of 20
  p = blockwise(
dask_expr/_concat.py:146: UserWarning: Concatenating dataframes with unknown divisions.
We're assuming that the indices of each dataframes are 
 aligned. This assumption is not generally safe.
