Comments (3)
@hhuuggoo might have thoughts on this too.
from dask.
cytoolz solution
import numpy as np
from cytoolz import merge_sorted, concat, map, partition_all

def shard(n, x):
    """
    >>> list(shard(3, list(range(10))))
    [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
    """
    for i in range(0, len(x), n):
        yield x[i: i + n]

def array_merge_sorted(seqs, out=None, out_chunksize=2**14):
    """ Merge step of external sort

    Merge sorted sequences of numpy arrays into the ``out`` array
    """
    assert out is not None
    # Flatten each sequence of arrays into one lazy stream of Python scalars
    seqs2 = [concat(x.tolist() for x in seq) for seq in seqs]
    # Lazily merge the already-sorted streams
    seq = merge_sorted(*seqs2)
    # Regroup the merged stream into arrays of at most out_chunksize elements
    chunks = map(np.array, partition_all(out_chunksize, seq))
    for i, chunk in enumerate(chunks):
        out[i*out_chunksize: min(len(out), (i+1)*out_chunksize)] = chunk
    return out
test file
from dask.frame.esort import array_merge_sorted, shard
import numpy as np

def test_shard():
    result = list(shard(3, np.arange(10)))
    assert result[0].tolist() == [0, 1, 2]
    assert result[1].tolist() == [3, 4, 5]
    assert result[2].tolist() == [6, 7, 8]
    assert result[3].tolist() == [9]

def test_esort():
    seqs = [np.random.random(size=(np.random.randint(100))) for i in range(5)]
    sorteds = [np.sort(x) for x in seqs]
    chunks = [shard(5, x) for x in sorteds]
    out = np.empty(shape=(sum(len(x) for x in seqs),))
    array_merge_sorted(chunks, out=out)
    assert (out == np.sort(np.concatenate(seqs))).all()
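For anyone who wants to try the merge step without cytoolz or NumPy, here is a rough standard-library sketch of the same idea. It is not the code above: it swaps in `heapq.merge` for `merge_sorted`, `itertools.chain` for `concat`, and plain lists for arrays. The name `merge_sorted_chunks` and the tiny `out_chunksize` are made up for illustration.

```python
import heapq
from itertools import chain, islice


def merge_sorted_chunks(seqs, out, out_chunksize=4):
    """Merge several chunked, individually sorted sequences into ``out``.

    Each element of ``seqs`` is an iterable of sorted chunks (lists) whose
    concatenation is itself sorted. ``out`` is a preallocated list.
    """
    # Flatten each chunked sequence lazily, then do a lazy k-way merge.
    flat = [chain.from_iterable(seq) for seq in seqs]
    merged = heapq.merge(*flat)

    # Write the merged stream into ``out`` one fixed-size block at a time.
    start = 0
    while True:
        block = list(islice(merged, out_chunksize))
        if not block:
            break
        out[start:start + len(block)] = block
        start += len(block)
    return out


# Three sorted sequences, each already split into chunks.
a = [[1, 4], [7, 10]]
b = [[2, 3], [8]]
c = [[0, 5, 6, 9]]
result = merge_sorted_chunks([a, b, c], out=[None] * 11)
```

Only one output block is held in memory at a time, plus whatever chunk each input iterator is currently reading, which is the point of doing the merge lazily.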
Closing this in favor of approximate percentiles