Comments (20)
from datatree.
I think this really comes down to two questions:
-
Do we want
isinstance(DataTree, Dataset)
to returnTrue
? -
Do we want to automatically inherit all Dataset methods, or only wrap the methods we have specifically chosen?
Pros of automatic inheritance:
- We would not have to write out the entire massive API of
xarray.Dataset
again within the definition ofDataTree
. - Methods which rely on other Dataset methods to work should then automatically work, for example lots of methods internally rely on
Dataset._construct_direct()
to rebuild the result before returning. These don't need to be wrapped withmap_over_subtree
, so inheriting them undecorated is what we want. - We also get all property methods defined immediately.
Pros of manual wrapping:
- There won't be any unexpected behaviour resulting from inheriting a method we hadn't thought about. Any method we didn't explicitly consider would not be defined on
DataTree
. - We still have to decorate most of the Dataset methods with
@map_over_subtree
, and this is easier to control without mistakes if they are wrapped rather than inherited. - We have a separate namespace for Dataset methods vs DataTree methods. This is helpful if we want to make identically-named methods on both classes easily available to the user. For example
DataTree.merge(another_dt)
andDataset.merge(another_ds)
clearly need different implementations, but would collide if both defined onDataTree
. Calling the latter viaDataTree.ds.merge(another_ds)
would allow both to be accessible from the DataTree interface.
from datatree.
Another, probably much better, approach to inheritance would be to make benign changes to xarray.Dataset
's inheritance pattern upstream that make it easier to pick and choose the correct methods to inherit on DataTree
. For example, we currently have to awkwardly create our own DatasetPropertiesMixin
by wrapping properties in Dataset
, but we could instead factor out those property definitions in xarray into a xarray.core.Dataset.DatasetPropertiesMixin
class, then inherit something like this:
xarray.core.dataset.py
:
class DatasetPropertiesMixin:
@property
def dims(self):
... # etc.
class DatasetMethodsMixin:
def isel(self, indexers, ...):
... # etc.
class Dataset(DatasetPropertiesMixin, DatasetMethodsMixin, DataWithCoords, DatasetArithmetic):
...
datatree.datatree.py
:
from xarray.core.dataset import DatasetPropertiesMixin, DatasetMethodsMixin
DataTreeMethodsMixin = map_all_methods_over_subtree(DatasetMethodsMixin)
DataTreeArithmetic = map_all_methods_over_subtree(DatasetArithmetic)
class Dataset(DatasetPropertiesMixin, DataTreeMethodsMixin, DataWithCoords, DataTreeArithmetic):
...
(where that example also has another mixin class for mapped Dataset methods too.)
This would not make isinstance(DataTree, Dataset)
return True, but it would probably be the shortest and least error-prone way to get all the right methods defined on all the right classes.
from datatree.
@shoyer or @max-sixty, I would appreciate either of your opinions on the above fundamental design question for DataTree
, because this is practically the most core question, and at the moment I'm just talking to myself 😅
from datatree.
This also came up in #26, in the context of whether an empty netCDF group is better represented by an empty Dataset or None (a question also related to #4.)
from datatree.
this is practically the most core question, and at the moment I'm just talking to myself
eventually software ends up at philosophy...
I do think it's an interesting question. I'm really not sure where we'll end up.
Without having thought as deeply as you, I would probably start with composing them. Basically for the reasons you describe — inheritance a) couples the classes together and b) stepping back, it relies on a known ontology — so mixins like DatasetMethodsMixin
& DatasetPropertiesMixin
might seem like a good structure now — but then for another case we're going to need a slightly different structure, and the number of Mixins propagate until they're no longer useful groupings.
Having everything behind a .ds
property does seem like a worse API, but it might be a reasonable way of getting some cool examples going (I already see some good ones!). Or to reduce the amount of boilerplate to get started, you could intercept methods with __getattribute__
and decide where to route them?
And this doesn't preclude a Dataset
becoming a DataTree
in the future — it's much easier to go composition -> inheritance (or even integration) than the other way around.
To confirm, I haven't thought about this sufficiently, so please weigh my view appropriately!
from datatree.
I would also start with composition. It's much more verbose, but otherwise you'll never be quite sure when DataTree
methods will work properly. It should not be too painful to use a DataTree.dataset
property for any missing methods.
Making issubclass(DataTree, Dataset)
true is also problematic, unless a DataTree is in fact 100% substitutable for a Dataset. Even if this can be made true (which I doubt), most of us have a hard time reasoning about inheritance.
Reusing Xarray's mixins would be reasonable, but only if/when DataTree
lives inside the Xarray codebase. Otherwise it would be too easy to introduce errors.
from datatree.
Thanks both for your input.
I did in fact start with composition, so I'll keep it that way for now.
Reusing Xarray's mixins would be reasonable, but only if/when DataTree lives inside the Xarray codebase. Otherwise it would be too easy to introduce errors.
I'm not sure if this is quite what you meant, but I think what I'll do for now is define my own mixins internally, but make sure none of them actually inherit from any of xarray's mixins. That way I can keep the internal organisational difference between "methods on Dataset that need wrapping", "methods on Dataset that need wrapping and mapping over child nodes", "properties on Dataset that need wrapping" etc., but without actually depending on any private xarray API.
For example, this mixin for Datatree to inherit from does not itself inherit from (or even reference) any of xarray's mixins:
class MappedDatasetMethodsMixin:
"""
Mixin to add Dataset methods like .mean(), but wrapped to map over all nodes in the subtree.
"""
__slots__ = ()
_wrap_then_attach_to_cls(
cls_dict=vars(),
copy_methods_from=Dataset,
methods_to_copy=_ALL_DATASET_METHODS_TO_MAP,
wrap_func=map_over_subtree,
)
class DataTree(
TreeNode,
DatasetPropertiesMixin,
MappedDatasetMethodsMixin,
MappedDataWithCoordsMixin,
DataTreeArithmeticMixin,
):
from datatree.
It seems that the current discussion may be hedging toward inheritance over composition to address issues like name collisions (#38) and mutability. However, it may be worth considering the possibilities of a third option that doesn't use the Dataset
as a container at the node level. The concept would basically define each node as either a DataArray
or a TreeNode
(subtree). TreeNode
s would be responsible for insuring alignment, coordinating ops among children, and would be directly convertible to Datasets
with 0 or more variables.
This would solve the name collisions problem but would require much more logic here in datatree. Its possible this would be much harder than a direct inheritance implementation but I thought I would through the idea out there.
from datatree.
The concept would basically define each node as either a
DataArray
or aTreeNode
(subtree).TreeNode
s would be responsible for insuring alignment, coordinating ops among children, and would be directly convertible toDatasets
with 0 or more variables.
I like this idea a lot, but let me suggest a slight variation: store Variable
rather than DataArray
objects in the dict.
The core data model for xarray.Dataset
includes:
attrs
: dict of attributesvariables
: dict of xarray.Variablecoord_names
: set of coordinate namesdims
: dict of dimension names to sizesindexes
: dict from dimension name toxarray.Index
objects (currently optional and can be left asNone
, but will be required after the explicit indexes refactor).
My suggestion would be to make DataTree
quite similar, except that variables
can also hold DataTree
objects.
To implement methods on DataTree
, you would filter DataTree
objects out of variables
, and then convert the remainder directly into an xarray.Dataset
.
from datatree.
Thanks @jhamman and @shoyer - I started typing this before Stephan's comment just now so I will finish the thought - sounds like we're on the same page though:
What you're suggesting is basically what I was thinking of as a last-resort solution to solve #38 . Basically Dataset
is very roughly like a dict whose values are Variables
, whereas you could imagine re-implementing DataTree
as a similar dict whose values can be Variables
but also can be more DataTree
nodes. This would get around #38 because you would no longer have a "dict storing a dict", so to speak, instead treating variables and nodes at the same level.
You could implement that with inheritance, which sounds conceptually neat but Stephan is probably right that it would cause more problems than it would solve. The alternative as you say Joe is "integration": essentially writing something that reimplements all the high-level aligning behaviour of Dataset
, but also allows for storing DataTrees
. More effort but basically guaranteed to work in the intended manner. Would probably also mean more direct importing from/coupling with xarray's internals.
You would then have some kind of DataTree.to_dataset()
method defined for when you really need the data to be stored in an instance of Dataset
.
from datatree.
I would do this by having an API something like:
.items
: can be eitherVariables
orDataTrees
.variables
: only theVariables
.nodes
: only theDataTrees
from datatree.
This idea is also closely related to https://github.com/TomNicholas/datatree/issues/3, and would re-open the question of whether to expose only local dims
etc. like was discussed in https://github.com/TomNicholas/datatree/issues/36.
from datatree.
If we did implement it like this I personally think there would then be a much stronger case for eventually integrating DataTree into xarray, rather than it just living in xarray-contrib. The current approach simply wraps (and sometimes emulates) Dataset, so it makes sense for it to live in a separate library that lies atop xarray, but this new suggestion is a Dataset-like object which shares a huge amount of functionality and probably internal code with xarray.Dataset
. A lot of the functionality could probably be tested by the same tests that are currently used for xarray Dataset too... Not sure what your thoughts on that are Stephan.
from datatree.
(@alexamici you will probably find this whole discussion interesting, and I would like to hear if you do have any thoughts!)
from datatree.
One disadvantage of the "integration" approach is a lack of clear separation for users between operations that act on just that single node, and operations which map over all nodes in that subtree. For example whilst .coords
should return only the coordinates stored in that node, what should assign_coords
do? It could assign coords
to that node alone, or attempt to assign the same coords to every node in the subtree. Sure, the method could have a new kwarg map_over_subtree
, but it's not going to be intuitively obvious to the user whether the pattern is that that would default to True or False.
Compare to the current implementation, where to act on a single node's dataset you always have to pull out .ds
, and every method on the DataTree
object maps over the whole subtree. That is definitely clearer.
from datatree.
from datatree.
I think we could probably still make the interface using .ds work. This would require modifying xarray.Dataset so it can support a custom mapping object for storing variables, but that seems doable?
I'm sorry but I'm very much not following 😅 Could you elaborate?
What would the custom mapping object do? How would that help with the original name collision problem? If .ds
is Dataset
, even a modified one, how can it check what children are contained in the class which wraps it? Or are you suggesting some kind of FrozenDataset
?
Or alternatively are you imagining that variables are still stored privately in the DataTree
level, but only accessed via DataTree.ds
? That makes more sense, but what would the .ds
property return? An actual (mutable) Dataset
? A frozen Dataset
, with a mutable one only via .to_dataset()
? Some kind of DatasetProxy
? And would that require sending operations on .ds
back up to DataTree, e.g. dt.ds.assign_coords()
would look at the .children
of dt
before altering anything? That would almost be like having a .ds
accessor on the DataTree object...
from datatree.
I asked Stephan about his suggestion in the xarray dev meeting just now, and I'm going to write down my understanding of it here now (by answering my own questions), because I think it's a cool idea and I don't want to forget it!
What would the custom mapping object do?
It would be a dict-like container storing variables and possibly children, but the key point is that it would be accessible separately from the Dataset
object, so that both a DataTree
node and the wrapped Dataset
would be able to point to a common record of variables/children stored.
How would that help with the original name collision problem?
If someone tries to add a variable to the wrapped Dataset (or a child to the wrapping DataTree), then by consulting the custom mapping object you would be able to check all variables and children for any name collisions.
And would that require sending operations on
.ds
back up toDataTree
?
That is essentially what the custom mapping object would allow - any change to the variables in the Dataset
would then be automatically reflected in the wrapping DataTree
.
what would the
.ds
property return? An actual (mutable) Dataset? A frozen Dataset, with a mutable one only via.to_dataset()
?
This is still a valid question though I think - if .ds
returns a Dataset that is still linked to the wrapping DataTree, then you could end up with some behaviour like
ds = dt.ds
ds = many_inplace_operations_on_ds_later(ds)
dt['new_variable'] = blah
# but now ds has also been changed
Maybe that's fine, but it's also perhaps unintuitive - something to consider.
from datatree.
I wanted to convince myself that this was the only way to solve this problem, so I wrote out the problem and all the possible approaches in this gist.
I think storing a custom mapping object under ._variables
is probably the best way overall though.
from datatree.
Related Issues (20)
- setting node name breaks tree linkage HOT 8
- When creating a DataTree from a Dataset with path-like variable, subgroups are expected to be created HOT 8
- Actually you're right, I don't know if the docs currently mention anywhere that assigning to `.ds` is allowed! HOT 1
- Typing: `DataTree[Unknown]` HOT 1
- `drop_vars` issue? HOT 1
- Creating DataTree from DataArrays HOT 1
- Opening a datatree from S3 bucket HOT 3
- Typing issue: Pylance complains with DataTree inequalities HOT 3
- open_datatree() keeps the hdf file open preventing writes HOT 2
- open_datatree() from zarr creates issues with `kwargs` HOT 4
- `open_datatree` performance HOT 2
- decision analysis in datatree? HOT 1
- Collapse by default the "Attributes" section in rich display HOT 1
- Add an `attrs` keyword argument to the constructor: `DataTree(attrs={})` HOT 1
- Describe a DataTree: adidng visualization and summarization capabilities HOT 1
- Rich display width is broken HOT 4
- Auto-plotting capabilities HOT 1
- Loosing attributes with .chunk and .pad HOT 1
- Losing top level name attribute when saving and then reopening using h5netcdf HOT 4
- Supporting Excel Spreadsheets? HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from datatree.