Comments (11)
We constantly re-create the expressions underneath the collection, creating completely new collections in the optimiser. Ensuring that this propagates seems non-trivial if the information is only available on the collection.
Are we ever interested in maintaining the correct value of `.attrs` for any intermediate expression, or only in its value on the root collection? In the latter case, attaching it to the collection and fixing it during the optimization step feels manageable.
from dask.
I also experienced a problem with pandas and `.attrs`: copied `DataFrame` objects did not contain a copy of `.attrs`. (I initially thought it was a geopandas problem, geopandas/geopandas#2920, but it turned out to come from pandas: pandas-dev/pandas#54134.) Fortunately that bug is now fixed. Which other problems did you find?
The use case for `.attrs` we have in the library we developed, https://github.com/scverse/spatialdata/, is to store metadata associated with `GeoDataFrame` and `DataFrame` objects (both lazy and non-lazy). The metadata mostly contains JSON-like information that describes how various spatial objects are aligned with each other.
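For context, a minimal sketch of this kind of usage: attaching JSON-like metadata to a pandas `DataFrame` via `.attrs`. The keys and values below are made up for illustration; they are not spatialdata's actual schema.

```python
import pandas as pd

# Illustrative sketch: JSON-like metadata attached via .attrs.
# The "transform" key and its contents are hypothetical, not spatialdata's schema.
df = pd.DataFrame({"x": [0.0, 1.0], "y": [2.0, 3.0]})
df.attrs["transform"] = {"type": "affine", "matrix": [[1, 0], [0, 1]]}

# .attrs is a plain dict on the object, so reading it back is immediate.
assert df.attrs["transform"]["type"] == "affine"

# Whether .attrs survives copies and derived objects depends on pandas'
# __finalize__ machinery; pandas-dev/pandas#54134 fixed the .copy() case.
copied = df.copy()
assert copied.attrs == df.attrs
```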
from dask.
Thanks for the report.
Could you add a bit more context about what you are trying to achieve with `.attrs`? `.attrs` doesn't really work in pandas either, and support is spotty at best.
from dask.
Do you have any update on this? 😊 Some users reported problems with installation due to having pinned an older version of Dask in our config so it's an important issue for us. Thank you for your time.
from dask.
This is non-trivial to add, and the semantics aren't completely clear either. It can't live on the collection level; it has to live on the expression level, since we are constantly recreating the underlying expressions, which makes it non-trivial. That said, the following is unclear to me:
```python
df = dd.from_pandas(...)
df.attrs = "foo"
df = df.fillna(100)
df = df["a"]
df.attrs = "bar"
```
dask-expr will reorder the query and push the projection in front of the fillna. So which `.attrs` should take precedence here?
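The ambiguity can be made concrete without dask at all: a toy expression tree in which an optimizer rule swaps a projection below a fillna. This is only a sketch of the problem, not dask-expr's actual internals.

```python
# Toy model: attrs attached at different points in a query, then the optimizer
# reorders the plan. Sketch only; not how dask-expr represents expressions.

class Expr:
    def __init__(self, name, child=None, attrs=None):
        self.name = name
        self.child = child
        self.attrs = attrs or {}

# User builds Projection(Fillna(FromPandas)), setting attrs twice.
root = Expr("from_pandas", attrs={"label": "foo"})                    # df.attrs = "foo"
filled = Expr("fillna", child=root)
projected = Expr("projection", child=filled, attrs={"label": "bar"})  # df.attrs = "bar"

def optimize(expr):
    # Pushdown rule: move the projection below the fillna. The rewritten tree
    # is built from *new* nodes, so attrs stored on the old nodes survive only
    # if a rule explicitly copies them -- and it is unclear whether "foo" or
    # "bar" should win on the new root.
    if expr.name == "projection" and expr.child.name == "fillna":
        new_proj = Expr("projection", child=expr.child.child)
        return Expr("fillna", child=new_proj)
    return expr

optimized = optimize(projected)
print(optimized.name)   # fillna -- the new root is a different node ...
print(optimized.attrs)  # {} -- ... and the user's attrs are gone
```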
Contributions are very welcome. I won't have much time to think about this though
from dask.
Thanks for the answer. From my understanding (and for the purposes of our use case), the `df.attrs` slot should not be treated as lazy but should always be set immediately. This was the behavior implemented before Dask 2024.5.1. In particular, in your example the computational graph would never contain nodes related to modifying `.attrs`.

I think making a PR for this should be quick; if you agree with these semantics I could give it a try.
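One way to read these "eager" semantics is that `.attrs` lives only on the collection object, assignment happens immediately, and derived collections copy it forward. A minimal sketch of that idea, using a hypothetical wrapper class rather than dask's actual API:

```python
# Sketch of "eager" attrs: metadata lives on the collection wrapper only and
# is copied forward whenever a new collection is derived. Hypothetical code,
# not dask's actual classes.

class Collection:
    def __init__(self, expr, attrs=None):
        self._expr = expr               # stand-in for the lazy expression tree
        self.attrs = dict(attrs or {})  # eager: a plain dict, set immediately

    def _derive(self, new_expr):
        # Every derived collection copies attrs from its parent, so the
        # metadata survives method chaining without touching the graph.
        return Collection(new_expr, attrs=self.attrs)

    def fillna(self, value):
        return self._derive(("fillna", self._expr, value))

df = Collection(("from_pandas",))
df.attrs["label"] = "foo"
out = df.fillna(100)
print(out.attrs)  # {'label': 'foo'} -- carried over eagerly, no graph nodes
```

This covers method chaining, but (as noted below) not what happens when the optimizer rebuilds the collection.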
from dask.
That would be fine, but this still doesn't cover what should happen after `df.optimize()` is called, which is the tricky part here.
from dask.
I will run some experiments and get back to you.
from dask.
> This can't live on the collection level, it has to be on the expression level

Can you elaborate? My initial gut reaction is that this should only live on the collection level and not on `expr`.
--
pandas-dev/pandas#52166 reads like there are a lot of open questions around this feature, and I wouldn't be surprised if some of it is subject to change.
from dask.
> Can you elaborate? My initial gut reaction is that this should only live on the collection level and not on `expr`.
We constantly re-create the expressions underneath the collection, creating completely new collections in the optimiser. Ensuring that this propagates seems non-trivial if the information is only available on the collection.
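This can be sketched in isolation: an optimizer that builds a brand-new collection from the rewritten expression drops anything stored only on the old collection object, unless it is explicitly copied across. Hypothetical code, not dask's internals:

```python
# Sketch of why collection-level metadata is fragile under optimization:
# optimize() returns a *new* collection built from the rewritten expression,
# so state stored on the old collection object is lost unless every rewrite
# site remembers to copy it. Hypothetical code, not dask's internals.

class Collection:
    def __init__(self, expr):
        self._expr = expr
        self.attrs = {}

def optimize(collection):
    rewritten = ("optimized", collection._expr)
    # A naive optimizer constructs a fresh collection and forgets attrs:
    return Collection(rewritten)

df = Collection(("from_pandas",))
df.attrs["label"] = "foo"
opt = optimize(df)
print(opt.attrs)  # {} -- the metadata did not survive the rebuild
```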
from dask.
To summarize a bit of an offline conversation, since I got confused by some of the earlier comments:

- `attrs` is currently a poorly defined API in pandas where the semantics are not always clear (some examples are given in pandas-dev/pandas#52166, for instance around copy-on-write). This lack of specification makes it very hard for us to implement this.
- While we could attach high-level metadata to a collection, we could not rely on this metadata being there on an intermediate layer. This is pretty much impossible right now with how the optimizer works.
- This may be a problem for libraries that rely on this to define/control behavior.

Note: this is not related to 2024.5.1 but was introduced in 2024.3.0, when we enabled query optimization by default.
from dask.