Comments (4)
Thanks @joshua-gould. I can easily reproduce this with what you have above. It seems that when block_info is provided, dask takes a code path that assumes an Array collection:
Lines 901 to 926 in 484fc3f
This is a problem because meta is coercing the collection into a DataFrame. I don't think we can just swap out partitions for chunks here.
Would it be possible to convert the array to a dataframe and call map_partitions instead?
a.to_dask_dataframe().map_partitions(test2)
I'm trying to extract the indices and values of a 2-d array where the value > 0. I have a solution below using dask.delayed:
import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask.array.core import slices_from_chunks
x = da.random.random((24, 24), chunks=(5, 6))
# numpy solution
_x = x.compute()
_indices = np.where(_x > 0)
df = pd.DataFrame({"value": _x[_indices], "y": _indices[0], "x": _indices[1]})
@dask.delayed
def process_chunk(a, offset):
    indices = np.where(a > 0)
    y = indices[0] + offset[0]
    x = indices[1] + offset[1]
    return pd.DataFrame({"value": a[indices], "y": y, "x": x})
output = []
for s in slices_from_chunks(x.chunks):
    r = process_chunk(x[s], (s[0].start, s[1].start))
    output.append(r)
meta = dd.utils.make_meta([("value", x.dtype), ("y", np.int64), ("x", np.int64)])
ddf = dd.from_delayed(output, meta=meta).compute()
# compare with numpy
df = df.sort_values(["y", "x", "value"]).reset_index(drop=True)
ddf = ddf.sort_values(["y", "x", "value"]).reset_index(drop=True)
pd.testing.assert_frame_equal(df, ddf)
Hmm, could you instead do this with nonzero and a mask?
arr = np.array([[-1, 2, 0], [4, -5, 6], [0, 0, 7]])
arr = da.from_array(arr)
indices = da.nonzero(arr > 0)  # or rely on dispatching with np.nonzero
arr[arr > 0]
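A minimal sketch of how the two pieces could be combined, assuming it is acceptable to materialize the result as a pandas DataFrame rather than a dask one. It also assumes the values and indices come back in the same row-major order, which holds for this small single-chunk example:

```python
import dask
import dask.array as da
import numpy as np
import pandas as pd

arr = da.from_array(np.array([[-1, 2, 0], [4, -5, 6], [0, 0, 7]]))
mask = arr > 0

ys, xs = da.nonzero(mask)  # lazy; chunk sizes are unknown until computed
values = arr[mask]         # lazy 1-d array of the selected values

# Compute all three together so the shared mask is evaluated once.
values, ys, xs = dask.compute(values, ys, xs)
df = pd.DataFrame({"value": values, "y": ys, "x": xs})
```

Here df holds one row per positive entry, with its value and its (y, x) position in the original array.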
Thanks for your response. I'm not sure how to create a dask dataframe using this approach. I tried:
x = da.random.random((24, 24), chunks=(5, 6))
indices = da.where(x > 0)
vals = x.reshape(-1)[indices[0] * x.shape[1] + indices[1]]
ddf = dd.concat(
    [
        dd.from_array(vals, columns=["value"]),
        dd.from_array(da.stack(indices, axis=1, allow_unknown_chunksizes=True), columns=["y", "x"]),
    ],
    axis=1,
).compute()
But I get the warnings:
dask/array/slicing.py:1089: PerformanceWarning: Increasing number of chunks by factor of 20
p = blockwise(
dask_expr/_concat.py:146: UserWarning: Concatenating dataframes with unknown divisions.
We're assuming that the indices of each dataframes are
aligned. This assumption is not generally safe.