Comments (6)
One question that comes up here (and it's the same question for spatial repartitioning a dask.dataframe to conform to given regions): how to deal with possible duplicates?
If you specify the regions for the spatial partitions and use `ST_INTERSECTS`, a given geometry can intersect with multiple regions.
In @mrocklin's prototype, he had some special code to check the representative point of a geometry instead, to avoid such duplicates: https://github.com/mrocklin/dask-geopandas/blob/8133969bf03d158f51faf85d020641e86c9a7e28/dask_geopandas/core.py#L339-L406
from dask-geopandas.
I was hoping that this would be resolved by the spatial repartitioning work, so we could just use the same principle here :).
Using an intersection with a representative point instead of the full geometry is one way, and it would resolve a lot of cases. But not all: the input can be points fixed to some grid, with the same grid used for partitioning. Then one point could intersect the edge of 2 (or even 4) partitions, so not even that is robust enough.
I would probably try the following:
- Generate a representative_point/centroid for each non-point geometry
- Partition based on those points (this is where `ST_INTERSECTS` would be used)
- Find the duplicated points (if we use `query_bulk`, then it is easy)
- Keep each duplicate in one partition only, no matter which one. Or use the clever thing in the code you linked - https://github.com/mrocklin/dask-geopandas/blob/8133969bf03d158f51faf85d020641e86c9a7e28/dask_geopandas/core.py#L394-L406
I am just afraid that it can be expensive.
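The steps above can be sketched with shapely's `STRtree` as a stand-in for the `query_bulk` spatial-index lookup. The grid, the point coordinates, and the "first match wins" deduplication rule are illustrative assumptions, not the actual dask-geopandas implementation:

```python
import numpy as np
from shapely import STRtree, Point, box

# Hypothetical 2x2 partition grid covering [0, 2] x [0, 2]
grid = [box(0, 0, 1, 1), box(1, 0, 2, 1), box(0, 1, 1, 2), box(1, 1, 2, 2)]

# Representative points of the input geometries; (1, 1) sits on the corner
# shared by all four cells, so a plain intersects query duplicates it.
points = [Point(0.5, 0.5), Point(1, 1), Point(1.5, 0.5)]

tree = STRtree(grid)
# Bulk query: returns parallel arrays of (input index, tree/cell index) pairs
geom_idx, cell_idx = tree.query(points, predicate="intersects")
assert len(geom_idx) == 6  # the corner point matched all four cells

# Deduplicate: keep each geometry in exactly one partition (first match wins)
_, first = np.unique(geom_idx, return_index=True)
assignment = dict(zip(geom_idx[first], cell_idx[first]))
```

Each geometry ends up assigned to exactly one grid cell, regardless of how many cells its representative point touched.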
> I am just afraid that it can be expensive.
When starting from an existing dask.dataframe (not necessarily in memory; it can also be backed by reading from e.g. Parquet), doing a repartition is expensive anyway (since it does a full re-shuffle of the data), so the additional checks for intersects / duplicates might not be that much of a problem.
And for this case, your points above seem like a good workflow (and I think we can probably simply keep the duplicates in one of the partitions, no matter which one, without using the clever touch/shift trick).
For reading from PostGIS, it's a bit different, though. In that case, ideally the queries you do against PostGIS directly give you the correct data for the partitions of the GeoDataFrame, I think? (So we wouldn't use `query_bulk` on those data afterwards to check whether the partitioning is correct.) In which case the logic that determines which rows to select should live in the SQL query?
Are we actually able to put this logic into the SQL query? If we can get "clean" chunks from SQL that can be directly mapped to partitions, that is ideal, but it would not be a simple query. That is why I mentioned that it can be expensive: I imagined dumb reading from PostGIS and filtering on the dask side (which would require a unique id coming from PostGIS).
> Are we actually able to put this logic into the SQL query?
If we cannot (and thus end up doing the "dumb reading from PostGIS and filtering/repartitioning on the dask side"), I am not sure how useful the method would be.
Repartitioning will typically be something you want to do once, and then try to "persist" the result or write it to disk in a more efficient format (one that preserves the spatial partitioning information). Certainly when the PostGIS table is larger than memory, doing the full shuffle every time you read and compute will not be very efficient, I think.
I have had good experience letting PostGIS handle the partitioning.
In general, the partitioning should be done right at the IO level, either by having the data preprocessed in a partitioned structure, or by having an index on the data that allows efficient spatial queries (and thus spatial partitioning). The PostGIS database falls into the second category.
When using the standard GIST index on geometry, the most efficient query would be `SELECT (...) WHERE geom_col && partition_box`. So not using `ST_Intersects`, which can become expensive for complex geometries.
You could then add a filter that checks whether a representative point is in the bbox (using half-open intervals) to make the assignment unique in all cases. That is a rather efficient (but verbose) addition to the query. You could also build the index `ON ST_Centroid(geometry)`.
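A minimal sketch of such a per-partition query, assuming a table with a `geom` column and using `ST_PointOnSurface` as the representative point; the function name, table name, and column name are all hypothetical:

```python
def partition_query(table, xmin, ymin, xmax, ymax):
    """Build a PostGIS query for one spatial partition.

    The && operator hits the GIST index via a bounding-box test, and the
    half-open interval check on the representative point ensures each row
    lands in exactly one partition, even on shared grid edges.
    """
    return f"""
    SELECT *
    FROM {table}
    WHERE geom && ST_MakeEnvelope({xmin}, {ymin}, {xmax}, {ymax})
      AND ST_X(ST_PointOnSurface(geom)) >= {xmin}
      AND ST_X(ST_PointOnSurface(geom)) <  {xmax}
      AND ST_Y(ST_PointOnSurface(geom)) >= {ymin}
      AND ST_Y(ST_PointOnSurface(geom)) <  {ymax}
    """

query = partition_query("roads", 0, 0, 1, 1)
```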
However in my experience it is often faster to accept some duplicates and filter them out when the partitions are merged. The duplicates give some superfluous computations, but this is an edge effect. For many spatial analyses you actually need the duplicates to be there.
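The "accept duplicates, filter on merge" approach can be sketched with plain pandas, assuming each row carries a unique id from PostGIS (here a hypothetical `gid` column):

```python
import pandas as pd

# Two partitions read with overlapping bbox queries; the row with gid=3
# sits on the shared edge and was returned by both queries.
left = pd.DataFrame({"gid": [1, 2, 3], "value": [10, 20, 30]})
right = pd.DataFrame({"gid": [3, 4], "value": [30, 40]})

# When merging partition results, drop the duplicated rows by their id.
merged = pd.concat([left, right]).drop_duplicates(subset="gid")
```

The duplicates cost some superfluous per-partition computation, but as noted above that is only an edge effect.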
Another issue with an external database: how to choose the partitions without doing a full scan of the data? I always used a square grid and supplied its extent myself, which mostly results in a big imbalance between partition sizes. This is certainly not a show-stopper, and when comparing geometries with array tiles it is actually a good idea.
For more balanced partitions: maybe we could leverage the internal index structure of the PostGIS R-tree?