Comments (11)
We constantly re-create the expressions underneath the collection, creating completely new collections in the optimiser. Ensuring that this propagates seems non-trivial if the information is only available on the collection.
Are we ever interested in maintaining the correct value of `.attrs` for any intermediate expression, or only in its value on the root collection? In the latter case, attaching it to the collection and fixing it during the optimization step feels manageable.
from dask.
I also experienced a problem with pandas and `.attrs`: copied `DataFrame` objects did not contain a copy of `.attrs`. (I initially thought it was a geopandas problem, geopandas/geopandas#2920, but it turned out to come from pandas: pandas-dev/pandas#54134.) Fortunately that bug is now fixed. Which other problems did you find?
The use case for `.attrs` we have in the library we developed, https://github.com/scverse/spatialdata/, is to store metadata associated with `GeoDataFrame` and `DataFrame` objects (both lazy and non-lazy). The metadata mostly contains JSON-like information that describes how various spatial objects are aligned with each other.
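For context, a minimal sketch of this kind of usage: attaching JSON-like metadata to a pandas `DataFrame` via `.attrs`. The keys and values below are made up for illustration; they are not spatialdata's actual schema.

```python
import pandas as pd

# Illustrative sketch: JSON-like metadata attached via .attrs.
# The "transform" key and its contents are hypothetical, not spatialdata's schema.
df = pd.DataFrame({"x": [0.0, 1.0], "y": [2.0, 3.0]})
df.attrs["transform"] = {"type": "affine", "matrix": [[1, 0], [0, 1]]}

# .attrs is a plain dict on the object, so reading it back is immediate.
assert df.attrs["transform"]["type"] == "affine"

# Whether .attrs survives copies and derived objects depends on pandas'
# __finalize__ machinery; pandas-dev/pandas#54134 fixed the .copy() case.
copied = df.copy()
assert copied.attrs == df.attrs
```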
from dask.
Thanks for the report.
Could you add a bit more context about what you are trying to achieve with `.attrs`? `.attrs` doesn't really work in pandas either, and support is spotty at best.
from dask.
Do you have any update on this? 😊 Some users reported problems with installation due to having pinned an older version of Dask in our config so it's an important issue for us. Thank you for your time.
from dask.
This is non-trivial to add, and the semantics aren't completely clear either. It can't live on the collection level; it has to live on the expression level, since we are constantly recreating the underlying expressions, which makes it non-trivial. That said, the following is unclear to me:
```python
df = dd.from_pandas(...)
df.attrs = "foo"
df = df.fillna(100)
df = df["a"]
df.attrs = "bar"
```
dask-expr will reorder the query and push the projection in front of the fillna. So which `.attrs` should take precedence here?
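The ambiguity can be made concrete without dask at all: a toy expression tree in which an optimizer rule swaps a projection below a fillna. This is only a sketch of the problem, not dask-expr's actual internals.

```python
# Toy model: attrs attached at different points in a query, then the optimizer
# reorders the plan. Sketch only; not how dask-expr represents expressions.

class Expr:
    def __init__(self, name, child=None, attrs=None):
        self.name = name
        self.child = child
        self.attrs = attrs or {}

# User builds Projection(Fillna(FromPandas)), setting attrs twice.
root = Expr("from_pandas", attrs={"label": "foo"})                    # df.attrs = "foo"
filled = Expr("fillna", child=root)
projected = Expr("projection", child=filled, attrs={"label": "bar"})  # df.attrs = "bar"

def optimize(expr):
    # Pushdown rule: move the projection below the fillna. The rewritten tree
    # is built from *new* nodes, so attrs stored on the old nodes survive only
    # if a rule explicitly copies them -- and it is unclear whether "foo" or
    # "bar" should win on the new root.
    if expr.name == "projection" and expr.child.name == "fillna":
        new_proj = Expr("projection", child=expr.child.child)
        return Expr("fillna", child=new_proj)
    return expr

optimized = optimize(projected)
print(optimized.name)   # fillna -- the new root is a different node ...
print(optimized.attrs)  # {} -- ... and the user's attrs are gone
```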
Contributions are very welcome. I won't have much time to think about this though
from dask.
Thanks for the answer. From my understanding (and for the purposes of our use case), the `df.attrs` slot should not be treated as lazy but should always be set immediately. This was the behavior implemented before Dask 2024.5.1. In particular, in your example the computational graph would never contain nodes related to modifying `.attrs`.

I think making a PR for this should be quick; if you agree with these semantics I could give it a try.
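One way to read these "eager" semantics is that `.attrs` lives only on the collection object, assignment happens immediately, and derived collections copy it forward. A minimal sketch of that idea, using a hypothetical wrapper class rather than dask's actual API:

```python
# Sketch of "eager" attrs: metadata lives on the collection wrapper only and
# is copied forward whenever a new collection is derived. Hypothetical code,
# not dask's actual classes.

class Collection:
    def __init__(self, expr, attrs=None):
        self._expr = expr               # stand-in for the lazy expression tree
        self.attrs = dict(attrs or {})  # eager: a plain dict, set immediately

    def _derive(self, new_expr):
        # Every derived collection copies attrs from its parent, so the
        # metadata survives method chaining without touching the graph.
        return Collection(new_expr, attrs=self.attrs)

    def fillna(self, value):
        return self._derive(("fillna", self._expr, value))

df = Collection(("from_pandas",))
df.attrs["label"] = "foo"
out = df.fillna(100)
print(out.attrs)  # {'label': 'foo'} -- carried over eagerly, no graph nodes
```

This covers method chaining, but (as noted below) not what happens when the optimizer rebuilds the collection.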
from dask.
That would be fine, but this still doesn't cover what should happen after `df.optimize()` is called, which is the tricky part here.
from dask.
I will run some experiments and get back to you.
from dask.
> This can't live on the collection level, it has to be on the expression level

Can you elaborate? My initial gut reaction is that this should only live on the collection level and not on `expr`.
--
pandas-dev/pandas#52166 reads like there are a lot of open questions around this feature, and I wouldn't be surprised if some of it is subject to change.
from dask.
> Can you elaborate? My initial gut reaction is that this should only live on the collection level and not on `expr`.
We constantly re-create the expressions underneath the collection, creating completely new collections in the optimiser. Ensuring that this propagates seems non-trivial if the information is only available on the collection.
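This can be sketched in isolation: an optimizer that builds a brand-new collection from the rewritten expression drops anything stored only on the old collection object, unless it is explicitly copied across. Hypothetical code, not dask's internals:

```python
# Sketch of why collection-level metadata is fragile under optimization:
# optimize() returns a *new* collection built from the rewritten expression,
# so state stored on the old collection object is lost unless every rewrite
# site remembers to copy it. Hypothetical code, not dask's internals.

class Collection:
    def __init__(self, expr):
        self._expr = expr
        self.attrs = {}

def optimize(collection):
    rewritten = ("optimized", collection._expr)
    # A naive optimizer constructs a fresh collection and forgets attrs:
    return Collection(rewritten)

df = Collection(("from_pandas",))
df.attrs["label"] = "foo"
opt = optimize(df)
print(opt.attrs)  # {} -- the metadata did not survive the rebuild
```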
from dask.
To summarize a bit of an offline conversation, since I got confused by some of the earlier comments:

- `attrs` is currently a poorly defined API in pandas where the semantics are not always clear (some examples are given in pandas-dev/pandas#52166, for instance around copy-on-write). This lack of specification makes it very hard for us to implement this.
- While we could attach high-level metadata to a collection, we could not rely on this metadata being there on an intermediate layer. This is pretty much impossible right now with how the optimizer works.
- This may be a problem for libraries that rely on this to define/control behavior.

Note: this is not related to 2024.5.1 but was introduced in 2024.3.0, when we enabled query optimization by default.
from dask.