Git Product home page Git Product logo

Comments (20)

shoyer avatar shoyer commented on September 27, 2024 2

from datatree.

TomNicholas avatar TomNicholas commented on September 27, 2024

I think this really comes down to two questions:

  1. Do we want isinstance(DataTree, Dataset) to return True?

  2. Do we want to automatically inherit all Dataset methods, or only wrap the methods we have specifically chosen?

Pros of automatic inheritance:

  • We would not have to write out the entire massive API of xarray.Dataset again within the definition of DataTree.
  • Methods which rely on other Dataset methods to work should then automatically work, for example lots of methods internally rely on Dataset._construct_direct() to rebuild the result before returning. These don't need to be wrapped with map_over_subtree, so inheriting them undecorated is what we want.
  • We also get all property methods defined immediately.

Pros of manual wrapping:

  • There won't be any unexpected behaviour resulting from inheriting a method we hadn't thought about. Any method we didn't explicitly consider would not be defined on DataTree.
  • We still have to decorate most of the Dataset methods with @map_over_subtree, and this is easier to control without mistakes if they are wrapped rather than inherited.
  • We have a separate namespace for Dataset methods vs DataTree methods. This is helpful if we want to make identically-named methods on both classes easily available to the user. For example DataTree.merge(another_dt) and Dataset.merge(another_ds) clearly need different implementations, but would collide if both defined on DataTree. Calling the latter via DataTree.ds.merge(another_ds) would allow both to be accessible from the DataTree interface.

from datatree.

TomNicholas avatar TomNicholas commented on September 27, 2024

Another, probably much better, approach to inheritance would be to make benign changes to xarray.Dataset's inheritance pattern upstream that make it easier to pick and choose the correct methods to inherit on DataTree. For example, we currently have to awkwardly create our own DatasetPropertiesMixin by wrapping properties in Dataset, but we could instead factor out those property definitions in xarray into a xarray.core.Dataset.DatasetPropertiesMixin class, then inherit something like this:

xarray.core.dataset.py:

class DatasetPropertiesMixin:
    @property
    def dims(self):
        ...  # etc.

class DatasetMethodsMixin:
    def isel(self, indexers, ...):
        ...  # etc.

class Dataset(DatasetPropertiesMixin, DatasetMethodsMixin, DataWithCoords, DatasetArithmetic):
    ...

datatree.datatree.py:

from xarray.core.dataset import DatasetPropertiesMixin, DatasetMethodsMixin

DataTreeMethodsMixin = map_all_methods_over_subtree(DatasetMethodsMixin)
DataTreeArithmetic = map_all_methods_over_subtree(DatasetArithmetic)

class Dataset(DatasetPropertiesMixin, DataTreeMethodsMixin, DataWithCoords, DataTreeArithmetic):
    ...

(where that example also has another mixin class for mapped Dataset methods too.)

This would not make isinstance(DataTree, Dataset) return True, but it would probably be the shortest and least error-prone way to get all the right methods defined on all the right classes.

from datatree.

TomNicholas avatar TomNicholas commented on September 27, 2024

@shoyer or @max-sixty, I would appreciate either of your opinions on the above fundamental design question for DataTree, because this is practically the most core question, and at the moment I'm just talking to myself 😅

from datatree.

TomNicholas avatar TomNicholas commented on September 27, 2024

This also came up in #26, in the context of whether an empty netCDF group is better represented by an empty Dataset or None (a question also related to #4.)

from datatree.

max-sixty avatar max-sixty commented on September 27, 2024

this is practically the most core question, and at the moment I'm just talking to myself

eventually software ends up at philosophy...


I do think it's an interesting question. I'm really not sure where we'll end up.

Without having thought as deeply as you, I would probably start with composing them. Basically for the reasons you describe — inheritance a) couples the classes together and b) stepping back, it relies on a known ontology — so mixins like DatasetMethodsMixin & DatasetPropertiesMixin might seem like a good structure now — but then for another case we're going to need a slightly different structure, and the number of Mixins propagate until they're no longer useful groupings.

Having everything behind a .ds property does seem like a worse API, but it might be a reasonable way of getting some cool examples going (I already see some good ones!). Or to reduce the amount of boilerplate to get started, you could intercept methods with __getattribute__ and decide where to route them?

And this doesn't preclude a Dataset becoming a DataTree in the future — it's much easier to go composition -> inheritance (or even integration) than the other way around.

To confirm, I haven't thought about this sufficiently, so please weigh my view appropriately!

from datatree.

shoyer avatar shoyer commented on September 27, 2024

I would also start with composition. It's much more verbose, but otherwise you'll never be quite sure when DataTree methods will work properly. It should not be too painful to use a DataTree.dataset property for any missing methods.

Making issubclass(DataTree, Dataset) true is also problematic, unless a DataTree is in fact 100% substitutable for a Dataset. Even if this can be made true (which I doubt), most of us have a hard time reasoning about inheritance.

Reusing Xarray's mixins would be reasonable, but only if/when DataTree lives inside the Xarray codebase. Otherwise it would be too easy to introduce errors.

from datatree.

TomNicholas avatar TomNicholas commented on September 27, 2024

Thanks both for your input.

I did in fact start with composition, so I'll keep it that way for now.

Reusing Xarray's mixins would be reasonable, but only if/when DataTree lives inside the Xarray codebase. Otherwise it would be too easy to introduce errors.

I'm not sure if this is quite what you meant, but I think what I'll do for now is define my own mixins internally, but make sure none of them actually inherit from any of xarray's mixins. That way I can keep the internal organisational difference between "methods on Dataset that need wrapping", "methods on Dataset that need wrapping and mapping over child nodes", "properties on Dataset that need wrapping" etc., but without actually depending on any private xarray API.

For example, this mixin for Datatree to inherit from does not itself inherit from (or even reference) any of xarray's mixins:

class MappedDatasetMethodsMixin:
    """
    Mixin to add Dataset methods like .mean(), but wrapped to map over all nodes in the subtree.
    """

    __slots__ = ()
    _wrap_then_attach_to_cls(
        cls_dict=vars(), 
        copy_methods_from=Dataset, 
        methods_to_copy=_ALL_DATASET_METHODS_TO_MAP, 
        wrap_func=map_over_subtree,
    )


class DataTree(
    TreeNode,
    DatasetPropertiesMixin,
    MappedDatasetMethodsMixin,
    MappedDataWithCoordsMixin,
    DataTreeArithmeticMixin,
):

from datatree.

jhamman avatar jhamman commented on September 27, 2024

It seems that the current discussion may be hedging toward inheritance over composition to address issues like name collisions (#38) and mutability. However, it may be worth considering the possibilities of a third option that doesn't use the Dataset as a container at the node level. The concept would basically define each node as either a DataArray or a TreeNode (subtree). TreeNodes would be responsible for insuring alignment, coordinating ops among children, and would be directly convertible to Datasets with 0 or more variables.

This would solve the name collisions problem but would require much more logic here in datatree. Its possible this would be much harder than a direct inheritance implementation but I thought I would through the idea out there.

from datatree.

shoyer avatar shoyer commented on September 27, 2024

The concept would basically define each node as either a DataArray or a TreeNode (subtree). TreeNodes would be responsible for insuring alignment, coordinating ops among children, and would be directly convertible to Datasets with 0 or more variables.

I like this idea a lot, but let me suggest a slight variation: store Variable rather than DataArray objects in the dict.

The core data model for xarray.Dataset includes:

  1. attrs: dict of attributes
  2. variables: dict of xarray.Variable
  3. coord_names: set of coordinate names
  4. dims: dict of dimension names to sizes
  5. indexes: dict from dimension name to xarray.Index objects (currently optional and can be left as None, but will be required after the explicit indexes refactor).

My suggestion would be to make DataTree quite similar, except that variables can also hold DataTree objects.

To implement methods on DataTree, you would filter DataTree objects out of variables, and then convert the remainder directly into an xarray.Dataset.

from datatree.

TomNicholas avatar TomNicholas commented on September 27, 2024

Thanks @jhamman and @shoyer - I started typing this before Stephan's comment just now so I will finish the thought - sounds like we're on the same page though:

What you're suggesting is basically what I was thinking of as a last-resort solution to solve #38 . Basically Dataset is very roughly like a dict whose values are Variables, whereas you could imagine re-implementing DataTree as a similar dict whose values can be Variables but also can be more DataTree nodes. This would get around #38 because you would no longer have a "dict storing a dict", so to speak, instead treating variables and nodes at the same level.

You could implement that with inheritance, which sounds conceptually neat but Stephan is probably right that it would cause more problems than it would solve. The alternative as you say Joe is "integration": essentially writing something that reimplements all the high-level aligning behaviour of Dataset, but also allows for storing DataTrees. More effort but basically guaranteed to work in the intended manner. Would probably also mean more direct importing from/coupling with xarray's internals.

You would then have some kind of DataTree.to_dataset() method defined for when you really need the data to be stored in an instance of Dataset.

from datatree.

TomNicholas avatar TomNicholas commented on September 27, 2024

I would do this by having an API something like:

  • .items: can be either Variables or DataTrees
  • .variables: only the Variables
  • .nodes: only the DataTrees

from datatree.

TomNicholas avatar TomNicholas commented on September 27, 2024

This idea is also closely related to https://github.com/TomNicholas/datatree/issues/3, and would re-open the question of whether to expose only local dims etc. like was discussed in https://github.com/TomNicholas/datatree/issues/36.

from datatree.

TomNicholas avatar TomNicholas commented on September 27, 2024

If we did implement it like this I personally think there would then be a much stronger case for eventually integrating DataTree into xarray, rather than it just living in xarray-contrib. The current approach simply wraps (and sometimes emulates) Dataset, so it makes sense for it to live in a separate library that lies atop xarray, but this new suggestion is a Dataset-like object which shares a huge amount of functionality and probably internal code with xarray.Dataset. A lot of the functionality could probably be tested by the same tests that are currently used for xarray Dataset too... Not sure what your thoughts on that are Stephan.

from datatree.

TomNicholas avatar TomNicholas commented on September 27, 2024

(@alexamici you will probably find this whole discussion interesting, and I would like to hear if you do have any thoughts!)

from datatree.

TomNicholas avatar TomNicholas commented on September 27, 2024

One disadvantage of the "integration" approach is a lack of clear separation for users between operations that act on just that single node, and operations which map over all nodes in that subtree. For example whilst .coords should return only the coordinates stored in that node, what should assign_coords do? It could assign coords to that node alone, or attempt to assign the same coords to every node in the subtree. Sure, the method could have a new kwarg map_over_subtree, but it's not going to be intuitively obvious to the user whether the pattern is that that would default to True or False.

Compare to the current implementation, where to act on a single node's dataset you always have to pull out .ds, and every method on the DataTree object maps over the whole subtree. That is definitely clearer.

from datatree.

shoyer avatar shoyer commented on September 27, 2024

from datatree.

TomNicholas avatar TomNicholas commented on September 27, 2024

I think we could probably still make the interface using .ds work. This would require modifying xarray.Dataset so it can support a custom mapping object for storing variables, but that seems doable?

I'm sorry but I'm very much not following 😅 Could you elaborate?

What would the custom mapping object do? How would that help with the original name collision problem? If .ds is Dataset, even a modified one, how can it check what children are contained in the class which wraps it? Or are you suggesting some kind of FrozenDataset?

Or alternatively are you imagining that variables are still stored privately in the DataTree level, but only accessed via DataTree.ds? That makes more sense, but what would the .ds property return? An actual (mutable) Dataset? A frozen Dataset, with a mutable one only via .to_dataset()? Some kind of DatasetProxy? And would that require sending operations on .ds back up to DataTree, e.g. dt.ds.assign_coords() would look at the .children of dt before altering anything? That would almost be like having a .ds accessor on the DataTree object...

from datatree.

TomNicholas avatar TomNicholas commented on September 27, 2024

I asked Stephan about his suggestion in the xarray dev meeting just now, and I'm going to write down my understanding of it here now (by answering my own questions), because I think it's a cool idea and I don't want to forget it!

What would the custom mapping object do?

It would be a dict-like container storing variables and possibly children, but the key point is that it would be accessible separately from the Dataset object, so that both a DataTree node and the wrapped Dataset would be able to point to a common record of variables/children stored.

How would that help with the original name collision problem?

If someone tries to add a variable to the wrapped Dataset (or a child to the wrapping DataTree), then by consulting the custom mapping object you would be able to check all variables and children for any name collisions.

And would that require sending operations on .ds back up to DataTree?

That is essentially what the custom mapping object would allow - any change to the variables in the Dataset would then be automatically reflected in the wrapping DataTree.

what would the .ds property return? An actual (mutable) Dataset? A frozen Dataset, with a mutable one only via .to_dataset()?

This is still a valid question though I think - if .ds returns a Dataset that is still linked to the wrapping DataTree, then you could end up with some behaviour like

ds = dt.ds
ds = many_inplace_operations_on_ds_later(ds)

dt['new_variable'] = blah
# but now ds has also been changed

Maybe that's fine, but it's also perhaps unintuitive - something to consider.

from datatree.

TomNicholas avatar TomNicholas commented on September 27, 2024

I wanted to convince myself that this was the only way to solve this problem, so I wrote out the problem and all the possible approaches in this gist.

I think storing a custom mapping object under ._variables is probably the best way overall though.

from datatree.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.