xarray-contrib / datatree
WIP implementation of a tree-like hierarchical data structure for xarray.
Home Page: https://xarray-datatree.readthedocs.io
License: Apache License 2.0
Thanks for putting this library together!
I wonder if it would be useful / make sense to allow writing zarr regions of specific nodes / datasets with to_zarr.
Hi there!
At B-Open we would like to start testing and using this datatree implementation for a couple of projects.
I think it would be better if we work with a released version. Are you planning to make a release soon?
If there's anything we can do to help with it, feel free to ping @aurghs, @alexamici, and myself.
cc: @joshmoore
We need an assert_tree_equal function, which would allow us to simplify several tests.
Currently I have the root node of the tree being a DataTree, a subclass of DatasetNode, where DatasetNode is used for all the nodes apart from the root. We could instead have every node be an instance of the same class, with the root node distinguished only by having .parent=None.
The motivation for different classes was:
1. To have a different __init__ function for creating a whole tree as opposed to creating a single node. The former accepts a dictionary of (path, object) pairs to store in the tree, while the latter accepts only the information needed to create that specific node (i.e. DatasetNode(name, ds, parent, children)).
2. Wanting to have a .attrs dict only on the root node, so that there is only one set of attrs for the entire tree.
However:
1. We could instead add a second init method as a private classmethod on the same class (similar to xarray.Dataset._construct_direct()), i.e. DataTree._init_single_node(name, data, parent, children).
2. We might just want to allow a different .attrs dictionary on every node of the tree anyway.
So should all nodes just instead be instances of the same class?
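As a sketch of what the single-class design could look like (all names here are illustrative stand-ins modeled on the proposed DataTree._init_single_node; only single-level paths are handled, and no xarray machinery is included):

```python
class DataTree:
    """Single-class sketch: every node is a DataTree; the root just has parent=None."""

    def __init__(self, data_objects=None):
        # Public constructor: build a whole tree from {path: data} pairs.
        # (Only single-level paths are handled in this sketch.)
        self.name = None
        self.data = None
        self.parent = None
        self.children = {}
        for path, data in (data_objects or {}).items():
            name = path.strip("/").split("/")[-1]
            self.children[name] = self._init_single_node(name, data=data, parent=self)

    @classmethod
    def _init_single_node(cls, name, data=None, parent=None, children=None):
        # Private constructor: create one node directly, without walking paths
        # (analogous in spirit to xarray.Dataset._construct_direct).
        node = cls()
        node.name = name
        node.data = data
        node.parent = parent
        node.children = children or {}
        return node
```

With this layout the public `__init__` stays tree-oriented while internal code creates nodes one at a time through the classmethod.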
The code currently has a hard dependency on the anytree library, which it uses primarily to implement the actual .parent & .children structure of the TreeNode class, through inheritance from anytree.NodeMixin.
We don't need this dependency: anytree is not a big project, and we can simply reimplement the classes we need within this project instead. We should give due credit in the code, because anytree has an Apache 2.0 license (plus Scout's honor).
Reimplementing anytree's functionality would also allow us to change various parts of it. For example, we don't need support for Python 2; we can standardize the error types (e.g. raising KeyError instead of anytree.ResolverError); and we can get rid of the mis-spelled .anchestors property. We shouldn't need to worry about breaking the code, because the tree functionality is already covered by the unit tests in test_treenode.py.
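A reimplementation could be quite small. Here is a rough, pure-Python sketch of the parent/children machinery we'd need in place of anytree.NodeMixin (attribute names are assumptions; the real version would also need validation and the various iterators anytree provides):

```python
class TreeNode:
    """Minimal sketch of a parent/children structure replacing anytree.NodeMixin."""

    def __init__(self, name, parent=None):
        self.name = name
        self._parent = None
        self._children = []
        self.parent = parent  # goes through the setter below

    @property
    def parent(self):
        return self._parent

    @parent.setter
    def parent(self, new_parent):
        # Detach from the old parent before attaching to the new one,
        # so a node can never appear under two parents at once.
        if self._parent is not None:
            self._parent._children.remove(self)
        self._parent = new_parent
        if new_parent is not None:
            new_parent._children.append(self)

    @property
    def children(self):
        return tuple(self._children)

    @property
    def ancestors(self):
        # Correctly-spelled replacement for anytree's .anchestors:
        # all nodes from the root down to (but excluding) self.
        node, result = self, []
        while node.parent is not None:
            node = node.parent
            result.append(node)
        return tuple(reversed(result))
```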
Currently this works
dt["a"] = xr.DataArray(0)
but this fails
dt["a"] = 0
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 dt["a"] = 0
File ~/Documents/Work/Code/datatree/datatree/datatree.py:704, in DataTree.__setitem__(self, key, value)
700 elif isinstance(key, str):
701 # TODO should possibly deal with hashables in general?
702 # path-like: a name of a node/variable, or path to a node/variable
703 path = NodePath(key)
--> 704 return self._set_item(path, value, new_nodes_along_path=True)
705 else:
706 raise ValueError("Invalid format for key")
File ~/Documents/Work/Code/datatree/datatree/treenode.py:444, in TreeNode._set_item(self, path, item, new_nodes_along_path, allow_overwrite)
442 raise KeyError(f"Already a node object at path {path}")
443 else:
--> 444 current_node._set(name, item)
File ~/Documents/Work/Code/datatree/datatree/datatree.py:684, in DataTree._set(self, key, val)
682 self.update({key: val})
683 else:
--> 684 raise TypeError(f"Type {type(val)} cannot be assigned to a DataTree")
TypeError: Type <class 'int'> cannot be assigned to a DataTree
The latter syntax is convenient and intuitive, so we should relax this to accept any type, and raise an error if the DataArray constructor can't parse it.
However it's not totally trivial to implement because it breaks some assumptions in the code that walks the tree.
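One way to relax it is sketched below, under the assumption that a helper like this would be called from DataTree._set before assignment (the helper name is hypothetical, not part of the codebase):

```python
import xarray as xr


def coerce_to_dataarray(value):
    """Sketch: coerce an arbitrary value to a DataArray before assignment,
    raising a clear error if the DataArray constructor can't parse it."""
    if isinstance(value, xr.DataArray):
        # Pass DataArrays through unchanged.
        return value
    try:
        # Let the DataArray constructor decide what it can handle
        # (scalars, lists, numpy arrays, ...).
        return xr.DataArray(value)
    except Exception as err:
        raise TypeError(
            f"Type {type(value)} cannot be assigned to a DataTree"
        ) from err
```

With this in place, `dt["a"] = 0` would wrap the int into `xr.DataArray(0)` exactly as `dt["a"] = xr.DataArray(0)` does today.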
xarray.Dataset.to_netcdf() allows you to save a dataset into a netcdf file as a group, but to make a DataTree.to_netcdf() we need to save many groups into the same file. I'm not sure how to do this with xarray.Dataset.to_netcdf(), and xarray's backends code is pretty complicated. It could probably be done at a lower level with the netCDF4 python library, but I haven't really tried that yet, and it would be nice not to have to use it.
The hard bit is saving multiple groups to the same file; actually iterating over the groups is easy, just something like

def to_netcdf(dt, filepath):
    for node in dt.subtree_nodes:
        group_name = node.pathstr
        node.ds.to_netcdf(filepath, group=group_name)
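For the hard part, one plausible approach is to write the first group with mode="w" and append every subsequent group with mode="a"; this is an untested sketch reusing the attribute names above (.subtree_nodes, .pathstr, .ds are assumptions about the node API):

```python
def to_netcdf(dt, filepath):
    """Sketch: write every node's dataset into one file, one group per node,
    by appending after the first write."""
    first = True
    for node in dt.subtree_nodes:
        mode = "w" if first else "a"  # create the file once, then append groups
        node.ds.to_netcdf(filepath, group=node.pathstr, mode=mode)
        first = False
```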
Currently, trying to get the version of datatree interactively raises an AttributeError:
import datatree
datatree.__version__
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 datatree.__version__
AttributeError: module 'datatree' has no attribute '__version__'
That's weird, because we do have code to set the version number in __init__.py.
See: https://github.com/xarray-contrib/datatree/runs/7306127762?check_suite_focus=true
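For reference, a common pattern that guarantees the attribute always exists is to read the installed package metadata (this is an assumption about how it could be done, not a description of what datatree's __init__.py currently does):

```python
from importlib.metadata import PackageNotFoundError, version


def get_version(package="datatree"):
    """Sketch: resolve __version__ from installed package metadata,
    with a fallback so the attribute never goes missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        # e.g. running from a source checkout that was never pip-installed
        return "9999"
```

__init__.py could then simply do `__version__ = get_version()`.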
What would help me enormously with writing documentation would be a killer example datatree, which I could open and use to demonstrate all types of methods, just like the "air_temperature" example dataset used in the main xarray documentation.
To be as useful as possible, this example tree should hit a few criteria:
A really good inspiration is this pseudo-structure provided in pydata/xarray#4118:
This would hit all of the criteria above, if it actually existed somewhere I could find!
What I would like is for people who have more familiarity with real geo-science data products to help me make this killer example tree, or at least point me towards data that I might use.
If we have multiple good suggestions I could make several different examples, but I think I would prefer one really good one to multiple quite good ones. Any extras could end up being used in future example notebooks, though.
Question
Is there an easy/convenient way to set the compression level for all DataArrays in a datatree?
I've been using the solution here when saving xarray.Datasets, but it doesn't seem to map conveniently to datatree. Maybe I'm missing something?
Background
I often deal with multiple datasets, each on the order of 10 GB, and this fills up hard drives fast. Typically when I write xarray.Datasets to file, I like to use gzip compression with a level of around 5, which tends to reduce the file size by around 50%.
Thanks!
Edit: Or is it recommended practice to update the compression info for each DataArray? (The second answer in the link above.)
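In the meantime, one workaround is to build the nested encoding dict programmatically rather than by hand. A sketch, assuming nodes expose .path and .ds as in datatree's subtree iteration (the helper name is hypothetical):

```python
def tree_encoding(dt, comp=None):
    """Sketch: build a {group_path: {var: settings}} encoding dict that applies
    the same compression settings to every data variable in every node."""
    comp = comp or {"zlib": True, "complevel": 5}
    return {
        node.path: {var: dict(comp) for var in node.ds.data_vars}
        for node in dt.subtree
    }
```

The result could then be passed as `dt.to_netcdf(path, encoding=tree_encoding(dt))`, assuming DataTree.to_netcdf accepts per-group encodings.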
Following up on #26, it would be great to have Zarr support in datatree. The API would be the same as to_netcdf, and I think we can actually plug into xarray's open_dataset and its engine argument to get most of the way to open_datatree.
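By analogy with the to_netcdf loop sketched in #26, a first cut of the writer could simply put each node's dataset into the same store under its group path (an untested sketch; the .subtree, .has_data, .path and .ds attribute names are assumptions about the node API):

```python
def to_zarr(dt, store):
    """Sketch: write every data-carrying node of the tree into one zarr store,
    one zarr group per node."""
    for node in dt.subtree:
        if node.has_data:
            # mode="a" so successive groups accumulate in the same store
            node.ds.to_zarr(store, group=node.path, mode="a")
```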
Hello. I've recently been introduced to this repository. This is a feature that xarray definitely needs. Thanks for creating it.
Issue
I can create a datatree, save it to file using the h5netcdf engine, but then I cannot open it. I need to use the h5netcdf engine for my work because I deal with complex numbers.
Error
runfile('C:/Users/jwbrooks/python/untitled0.py', wdir='C:/Users/jwbrooks/python')
Reloaded modules: imp, _strptime, encodings.ascii, stringprep, encodings.idna, xmlrpc, xml.parsers, pyexpat, xml.parsers.expat, xmlrpc.client, socketserver, http.server, xmlrpc.server, plistlib, tarfile, xml.etree, xml.etree.ElementPath, _elementtree, xml.etree.ElementTree
Traceback (most recent call last):
File "C:\Users\jwbrooks\python\untitled0.py", line 44, in <module>
data2 = dt.open_datatree('example_data_1.hdf5',
File "C:\Users\jwbrooks\python\datatree\datatree\io.py", line 68, in open_datatree
return _open_datatree_netcdf(filename_or_obj, engine=engine, **kwargs)
File "C:\Users\jwbrooks\python\datatree\datatree\io.py", line 72, in _open_datatree_netcdf
ncDataset = _get_nc_dataset_class(kwargs.get("engine", None))
File "C:\Users\jwbrooks\python\datatree\datatree\io.py", line 36, in _get_nc_dataset_class
from h5netcdf import Dataset
ImportError: cannot import name 'Dataset' from 'h5netcdf' (C:\Users\jwbrooks\Environments\python38_spyder515\lib\site-packages\h5netcdf\__init__.py)
initial investigation
In your datatree\io.py file you use the command from h5netcdf import Dataset, but the h5netcdf library does not contain anything called Dataset (as far as I can tell, its netCDF4-compatible API lives in the h5netcdf.legacyapi submodule). I started digging around in the xarray library trying to figure out how they handle it, but quickly got lost.
code to reproduce the error
Below, I create fake data, convert it to a datatree, write it to file, and then try to read the file.
import numpy as np
import xarray as xr
import datatree as dt
## create example datatree
t1 = np.arange(0, 1e0, 1e-4)
t1 = xr.DataArray(t1, dims='t', coords=[t1], attrs={'units': 's', 'long_name': 'time'})
t2 = np.arange(0, 1e-1, 1e-6)
t2 = xr.DataArray(t2, dims='t', coords=[t2], attrs={'units': 's', 'long_name': 'time'})
a1 = xr.DataArray(np.random.rand(len(t1)), dims='t', coords=[t1], name='data_A1', attrs={'units': 'au', 'long_name': 'data A1'})
a2 = xr.DataArray(np.random.rand(len(t1)), dims='t', coords=[t1], name='data_A2', attrs={'units': 'au', 'long_name': 'data A2'})
a = xr.Dataset({a1.name: a1, a2.name: a2})
b1 = xr.DataArray(np.random.rand(len(t2)), dims='t', coords=[t2], name='data_B', attrs={'units': 'au', 'long_name': 'data B'})
b = b1.to_dataset()
c1 = xr.DataArray(np.random.rand(len(t2)), dims='t', coords=[t2], name='data_C1', attrs={'units': 'au', 'long_name': 'data C1'})
c2 = xr.DataArray(np.random.rand(len(t2)), dims='t', coords=[t2], name='data_C2', attrs={'units': 'au', 'long_name': 'data C2'})
c = xr.Dataset({c1.name: c1, c2.name: c2})
data = dt.DataTree.from_dict({'data1': a, 'data2/b': b, 'data2/c': c})
## write data to file
data.to_netcdf(
    'example_data_1.hdf5',
    mode='w',
    # format='NETCDF4',
    engine='h5netcdf',
    # encoding=encoding,
    invalid_netcdf=True,
)
## read data from file
data2 = dt.open_datatree('example_data_1.hdf5', engine='h5netcdf')
Details on my setup
virtualenv
Should we prefer inheritance or composition when making the node of a datatree behave like an xarray Dataset?
We really want the data-containing nodes of the datatree to behave as much like xarray datasets as possible, as we will likely be calling functions/methods on them, assigning them, extracting from them and saving them as if they were actually xarray.Dataset objects. We could imagine a tree node class which directly inherits from xarray.Dataset:

class DatasetNode(xarray.Dataset, NodeMixin):
    ...

This would have all the attributes and API of a Dataset, and pass isinstance() checks, but would also have the attributes and methods needed to function as a node in a tree (e.g. .children, .parent). We would still need to decorate most inherited methods in order to apply them to all child nodes in the tree, though.
Mostly these don't collide, except in the important case of getting/setting children of a node. xarray.Datasets already use __getitem__ for variable selection (i.e. ds[var]), as well as the .some_variable namespace via property-like access. This means we can't immediately have an API allowing operations like dt.weather = dt.weather.mean('time'), because .weather is a child of the node, not a dataset variable. (It's possible we could have both behaviours simultaneously by overriding __getitem__, but then we might restrict the possible names of children/variables.)
I think this approach would also have the side-effect that accessor methods registered with @register_dataset_accessor would be callable on the tree nodes.
The alternative is that instead of each node being a Dataset, each node merely wraps a Dataset. This has the advantage of keeping the Data class and the Node class separate, though they would still share a large API to allow applying a method (e.g. .mean()) to all child nodes in a tree.
The disadvantage is that all the variables and dataset attributes then sit behind a .ds property.
The syntax dt.weather = dt.weather.mean('time') would then be possible (at least if we didn't allow the tree objects to have their own .attrs, else it would have to be dt['weather'] = dt['weather'].mean('time')), because we would be calling the method of a DatasetNode (rather than a Dataset) and then assigning to a DatasetNode.
Selecting a particular variable from a dataset stored at a particular node would then look like dt['weather'].ds['pressure'], which has the advantage of clarifying which one is the variable, but the disadvantage of breaking up the path-like structure from the root down to the variable. EDIT: As there is no problem with collisions between names of groups and variables, we can actually just override __getitem__ to check both the data variables and the children, giving access like dt['weather']['pressure'].
(There is also a possible third option described in #4)
For now the second approach seemed better, but I'm looking for other opinions!
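A minimal sketch of the second (composition) approach, including the __getitem__ behaviour from the EDIT above. This is a plain-Python stand-in: the class name matches the discussion, but the internals are illustrative and no xarray machinery is included:

```python
class DatasetNode:
    """Sketch of composition: the node wraps a dataset rather than inheriting
    from it, and __getitem__ checks data variables first, then children."""

    def __init__(self, ds=None, children=None):
        self.ds = ds  # the wrapped dataset (any mapping-like object here)
        self.children = children or {}

    def __getitem__(self, key):
        # No collisions are allowed between variable names and child names,
        # so checking both namespaces is unambiguous.
        if self.ds is not None and key in self.ds:
            return self.ds[key]
        return self.children[key]
```

This is what makes access like `dt['weather']['pressure']` possible without a `.ds` in the middle.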
Inspired by this example in the stackstac documentation
lowcloud = stack[stack["eo:cloud_cover"] < 20]
we should ensure that you can index a datatree with another (isomorphic) datatree, so that the above operation would work even if stack were a DataTree instance.
This is another map_over_subtree-type operation, but it needs careful testing, because the __getitem__ method on xarray objects already does so many different things. It won't work with the code as-is, because at the moment DataTree naively dispatches the __getitem__ call down to the wrapped dataset.
Line 238 in cd06951
Opening this so I can close #90.
@jhamman I'm not sure exactly what you mean - dataless groups are represented by nodes with empty datasets, so this seems like reasonable code to me?
Are you saying you would prefer to get rid of the try/except and only iterate over paths that we know exist?
Originally posted by @jhamman in #90 (comment)
Xarray Datasets don't alter the names of the objects they store:
In [1]: import xarray as xr
In [2]: da = xr.DataArray(name="b")
In [3]: ds = xr.Dataset()
In [4]: ds['a'] = da
In [5]: ds["a"].name
Out[5]: 'a'
In [6]: da.name
Out[6]: 'b'
After #41 (and #115, which ensures it), DataTree objects behave similarly for the data variables they store:
In [7]: from datatree import DataTree
In [8]: root = DataTree()
In [9]: root["a"] = da
In [10]: root["a"].name
Out[10]: 'a'
In [11]: da.name
Out[11]: 'b'
However, DataTree objects currently do alter the name of child DataTree nodes that they store:
In [12]: subtree = DataTree(name="b")
In [13]: root = DataTree()
In [14]: root["a"] = subtree
In [15]: root["a"].name
Out[15]: 'a'
In [16]: subtree.name
Out[16]: 'a'
I noticed this in #115, but fixing it might be a bit complex.
I'm failing to get DataTree.to_netcdf() to compress data when saving. The structure of my DataTree dt looks like this:
dt.groups:
('/',
'/0',
'/0/lp',
'/0/hp',
'/1',
'/1/lp',
'/1/hp',
'/2',
'/2/lp',
'/2/hp',
'/3',
'/3/lp',
'/3/hp',
'/4',
'/4/lp',
'/4/hp',
'/5',
'/5/lp',
'/5/hp',
'/6',
'/6/lp',
'/6/hp',
'/7',
'/7/lp',
'/7/hp',
...
'/15/lp',
'/15/hp',
'/16',
'/16/lp',
'/16/hp')
The encoding dictionary looks like this:
{'0/lp': {'gain': {'zlib': True, 'complevel': 5},
'irl': {'zlib': True, 'complevel': 5},
'nf': {'zlib': True, 'complevel': 5},
'oip2': {'zlib': True, 'complevel': 5},
'oip3': {'zlib': True, 'complevel': 5},
'op1db': {'zlib': True, 'complevel': 5},
'opsat': {'zlib': True, 'complevel': 5},
'orl': {'zlib': True, 'complevel': 5},
'pmax': {'zlib': True, 'complevel': 5}},
'0/hp': {'gain': {'zlib': True, 'complevel': 5},
'irl': {'zlib': True, 'complevel': 5},
'nf': {'zlib': True, 'complevel': 5},
'oip2': {'zlib': True, 'complevel': 5},
'oip3': {'zlib': True, 'complevel': 5},
'op1db': {'zlib': True, 'complevel': 5},
'opsat': {'zlib': True, 'complevel': 5},
'orl': {'zlib': True, 'complevel': 5},
'pmax': {'zlib': True, 'complevel': 5}},
'1/lp': {'gain': {'zlib': True, 'complevel': 5},
'irl': {'zlib': True, 'complevel': 5},
'nf': {'zlib': True, 'complevel': 5},
'oip2': {'zlib': True, 'complevel': 5},
'oip3': {'zlib': True, 'complevel': 5},
'op1db': {'zlib': True, 'complevel': 5},
'opsat': {'zlib': True, 'complevel': 5},
...
'oip3': {'zlib': True, 'complevel': 5},
'op1db': {'zlib': True, 'complevel': 5},
'opsat': {'zlib': True, 'complevel': 5},
'orl': {'zlib': True, 'complevel': 5},
'pmax': {'zlib': True, 'complevel': 5}}}
Finally, saving to an HDF5/NetCDF file:
dt.to_netcdf("myhdf5file.hdf5", mode="w", encoding=encoding) # no compression: 3.33 MB
The file is the same size whether I pass the encoding dictionary or not.
We need some basic documentation, which needs to be set up from scratch.
It would be nice if the tree could support internal symbolic links between nodes.
anytree has a SymLinkNodeMixin class to complement its NodeMixin class and provide this sort of functionality. What I'm not sure about is whether this approach would allow for loops within the tree, or merely allow the same node to appear multiple times in it.
A tree-like structure must have nodes, each of which can contain multiple children, and those children have to be selectable via some kind of name. However, those names can either be inherent properties of the child objects, or keys used to access them.
In the former case we would have node.children = (child1, child2), where child1.name = 'steve', child2.name = 'mary', etc. In the latter case we would have node.children = {'steve': child1, 'mary': child2}, where each child need not have a name. It's not clear to me which of these approaches is better in our case.
It's easy to ensure that all nodes have names (and if we make nodes inherit from Dataset they will inherit a name), but storing children in tuples leads to annoying code like child_we_want = next(c for c in node.children if c.name == name_we_want), instead of just child_we_want = node[name_we_want]. A DataTree is also quite intuitively represented by a nested dictionary whose keys are parts of a path and whose values are either datasets or child nodes, and in that description we would not say that the name key is an inherent property of the value.
Using a dictionary also means that the path to an object is distinct from the name of that object.
This also means that a node doesn't need a name at all, and becomes defined only in terms of its parent and children. In effect, the name of the node would be the key for which self.parent.children[key] returns self. Parentless nodes would be nameless.
A disadvantage of this is that a stored Dataset object has no idea who its parent is.
None of the tree implementations I've seen work like this, and it appears to deviate from the way that a "tree" is defined mathematically.
The anytree library uses named nodes and tuples to store the children, so to use dictionaries we would need to reimplement the NodeMixin class to use a dictionary instead.
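A sketch of the dict-based design, in which a node's name is simply derived from the key its parent stores it under (illustrative names only; the real NodeMixin replacement would need much more):

```python
class TreeNode:
    """Sketch: children stored in a dict, so a node's name is just the key
    under which its parent holds it; parentless nodes are nameless."""

    def __init__(self):
        self.parent = None
        self.children = {}

    def add_child(self, key, child):
        child.parent = self
        self.children[key] = child

    @property
    def name(self):
        # The name is not stored on the node; it is looked up in the parent.
        if self.parent is None:
            return None
        for key, child in self.parent.children.items():
            if child is self:
                return key
```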
We want to support both path-like access (e.g. 'simulation/highres/temperature') and tag-like access (e.g. ('simulation', 'highres', 'temperature')) to nodes, but at the moment the implementation of this is incomplete and probably buggy. For example, right now you can't select a node further up the tree using '../', and there is no real .relative_path(target) method.
A better implementation might use Python's built-in pathlib. If our paths were all instances of PurePosixPath then they would always use forward slashes, but would have no actual connection to the filesystem. If we wanted to convert paths to tags and vice versa, we could probably do it neatly by subclassing PurePosixPath:
from pathlib import PurePosixPath

class NodePath(PurePosixPath):
    def as_tags(self):
        # note: for an absolute path the first element will be an empty string
        return tuple(str(self).split('/'))
It might only make sense to do this after #7 , so that we have full control over the method namespace.
xr.Dataset implements a bunch of dask-specific methods, such as __dask_tokenize__ and __dask_graph__. It also obviously has public methods that involve dask, such as .compute() and .load().
In DataTree, on the other hand, I haven't yet implemented any methods like these, or even written any tests that involve dask! You can probably still use dask with datatree right now, but from dask's perspective the datatree is presumably merely a set of unconnected Dataset objects.
We could choose to implement methods like .load() as just a mapping over the tree, i.e.
def load(self):
    for node in self.subtree:
        if node.has_data:
            node.ds.load()
but really this should probably wait for #41, or be done as part of that refactor.
I don't really understand what the double-underscore methods do yet, though, so I would appreciate input on that.
This tree has a dimension present in some nodes and not others (the "people" dimension).
DataTree('root', parent=None)
│   Dimensions:  (people: 2)
│   Coordinates:
│     * people   (people) <U5 'alice' 'bob'
│       species  <U5 'human'
│   Data variables:
│       heights  (people) float64 1.57 1.82
└── DataTree('simulation')
    ├── DataTree('coarse')
    │       Dimensions:  (x: 2, y: 3)
    │       Coordinates:
    │         * x        (x) int64 10 20
    │       Dimensions without coordinates: y
    │       Data variables:
    │           foo      (x, y) float64 0.1242 -0.2324 0.2469 0.5168 0.8391 0.8686
    │           bar      (x) int64 1 2
    │           baz      float64 3.142
    └── DataTree('fine')
            Dimensions:  (x: 6, y: 3)
            Coordinates:
              * x        (x) int64 10 12 14 16 18 20
            Dimensions without coordinates: y
            Data variables:
                foo      (x, y) float64 0.1242 -0.2324 0.2469 ... 0.5168 0.8391 0.8686
                bar      (x) float64 1.0 1.2 1.4 1.6 1.8 2.0
                baz      float64 3.142
If a user calls dt.mean(dim='people'), at the moment this will raise an error. That's because it maps the .mean call over each group, and when it gets to either the 'coarse' group or the 'fine' group it will not find a dimension called 'people'.
However, the user might want to take the mean only over the groups where this makes sense, and ignore the rest.
I think the best solution is a missing_dims argument, like xarray's .isel already has. Then the user could do dt.mean(dim='people', missing_dims='ignore').
To actually implement this I think only requires changes in xarray, not here, because those changes should propagate down to datatree. pydata/xarray#5030
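Until that lands upstream, the behaviour can be emulated by mapping over the tree manually. A sketch (the helper name is hypothetical, and .subtree/.path/.ds are assumed node attributes):

```python
def mean_over_tree(dt, dim, missing_dims="raise"):
    """Sketch of the proposed missing_dims='ignore' behaviour: apply .mean to
    each node's dataset, skipping nodes that lack the dimension."""
    results = {}
    for node in dt.subtree:
        if dim in node.ds.dims:
            results[node.path] = node.ds.mean(dim=dim)
        elif missing_dims == "raise":
            raise ValueError(f"Dimension {dim!r} not found in node {node.path!r}")
        else:
            # 'ignore': pass the dataset through unchanged
            results[node.path] = node.ds
    return results
```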
xarray.Dataset has a ginormous API, and eventually the entire thing should also be available on DataTree. However, we probably still need to test this copied API, because the majority of it will be altered via the @map_over_subtree decorator. How can we do that without also copying thousands of lines of tests from xarray.tests?
Raises AttributeError right now.
It would be awesome to have a rich html repr that can show off both the data in each node and the nested structure of the nodes in the whole tree.
Ideally it would be collapsible at each level, else the amount of information could get overwhelming.
We would want to combine xarray's html repr with some method of displaying the tree, (see xgcm/xgcm#474 for a similar problem).
I have never done much with html so would appreciate help with this feature.
Datatree needs some documentation, even if it has to change in future.
I think most of the documentation would remain relevant even after some changes, as long as we keep the same basic data model (e.g. DataTree vs DataGroups, with no hierarchy).
I really like this breakdown for documentation, which theorizes that there are 4 types of documentation, along two axes, as shown in this diagram:
Another thing to consider is how the documentation we write now might eventually be incorporated into xarray's documentation upstream. We don't need to duplicate anything, and we want things we write to neatly slot into sections in xarray's existing documentation.
Some ideas:
(Some of these could possibly go in xarray's Gallery section)
(A lot of this could be grouped under one page on "Working with hierarchical data".)
This should be pretty much covered by ensuring that the auto-generated API docs work properly. The hard bit will be copying/duplicating the large API of xarray.Dataset that DataTree inherits.
In #76 I refactored the tree structure to use a path-like syntax. This includes referring to the root of a tree as "/", the same as cd / in a unix-like filesystem.
This makes accessing nodes and variables of nodes quite neat, because you can reference nodes via absolute or relative paths:
In [23]: from datatree.tests.test_datatree import create_test_datatree
In [24]: dt = create_test_datatree()
In [25]: dt['set2/a']
Out[25]:
<xarray.DataArray 'a' (x: 2)>
array([2, 3])
Dimensions without coordinates: x
In [26]: dt['/set2/a']
Out[26]:
<xarray.DataArray 'a' (x: 2)>
array([2, 3])
Dimensions without coordinates: x
In [27]: dt['./set2/a']
Out[27]:
<xarray.DataArray 'a' (x: 2)>
array([2, 3])
Dimensions without coordinates: x
This refactor also made DataTree objects only optionally have a name, as opposed to before, when they were required to have one. (They still have a .name attribute; it just can be None.)
In [28]: dt.name
Normally this doesn't matter, because once a node is assigned a .parent, its .name property will just point to the key under which it is stored as a child. This echoes the way an unnamed DataArray can be stored in a Dataset.
In [29]: import xarray as xr
In [30]: ds = xr.Dataset()
In [31]: da = xr.DataArray(0)
In [32]: ds['foo'] = da
In [33]: ds['foo'].name
Out[33]: 'foo'
However, this means that the root node of a tree is no longer required to have a name in general.
This is good because:
- As a user you normally don't care about the name of the root when manipulating the tree, only the names of the nodes,
- It makes the __init__ signature simpler, as name is no longer a required arg,
- It most closely echoes how filepaths work (the filesystem root "/" doesn't have another name),
- Roundtripping through Zarr/netCDF files still seems to work (see test_io.py),
- Roundtripping through dictionaries still works if the root node is unnamed:
In [35]: d = {node.path: node.ds for node in dt.subtree}
In [36]: roundtrip = DataTree.from_dict(d)
In [37]: roundtrip
Out[37]:
DataTree('None', parent=None)
│   Dimensions:  (y: 3, x: 2)
│   Dimensions without coordinates: y, x
│   Data variables:
│       a        (y) int64 6 7 8
│       set0     (x) int64 9 10
├── DataTree('set1')
│   │   Dimensions:  ()
│   │   Data variables:
│   │       a        int64 0
│   │       b        int64 1
│   ├── DataTree('set1')
│   └── DataTree('set2')
├── DataTree('set2')
│   │   Dimensions:  (x: 2)
│   │   Dimensions without coordinates: x
│   │   Data variables:
│   │       a        (x) int64 2 3
│   │       b        (x) float64 0.1 0.2
│   └── DataTree('set1')
└── DataTree('set3')
In [38]: dt.equals(roundtrip)
Out[38]: True
But it's bad because:
- Roundtripping through dictionaries no longer works if the root node is named:
In [39]: dt2 = dt
In [40]: dt2.name = "root"
In [41]: d2 = {node.path: node.ds for node in dt2.subtree}
In [42]: roundtrip2 = DataTree.from_dict(d2)
In [43]: roundtrip2
Out[43]:
DataTree('None', parent=None)
│   Dimensions:  (y: 3, x: 2)
│   Dimensions without coordinates: y, x
│   Data variables:
│       a        (y) int64 6 7 8
│       set0     (x) int64 9 10
├── DataTree('set1')
│   │   Dimensions:  ()
│   │   Data variables:
│   │       a        int64 0
│   │       b        int64 1
│   ├── DataTree('set1')
│   └── DataTree('set2')
├── DataTree('set2')
│   │   Dimensions:  (x: 2)
│   │   Dimensions without coordinates: x
│   │   Data variables:
│   │       a        (x) int64 2 3
│   │       b        (x) float64 0.1 0.2
│   └── DataTree('set1')
└── DataTree('set3')
In [44]: dt2.equals(roundtrip2)
Out[44]: False
The signature of DataTree.from_dict becomes a bit weird, because if you want to name the root node, the only way to do it is to pass a separate name argument, i.e.
In [45]: dt3 = DataTree.from_dict(d, name='root')
In [46]: dt3
Out[46]:
DataTree('root', parent=None)
├── DataTree('set1')
│   │   Dimensions:  ()
│   │   Data variables:
│   │       a        int64 0
│   │       b        int64 1
│   ├── DataTree('set1')
│   └── DataTree('set2')
├── DataTree('set2')
│   │   Dimensions:  (x: 2)
│   │   Dimensions without coordinates: x
│   │   Data variables:
│   │       a        (x) int64 2 3
│   │       b        (x) float64 0.1 0.2
│   └── DataTree('set1')
└── DataTree('set3')
What do we think about this behaviour? Does this seem like a good design, or annoyingly finicky?
@jhamman I notice that in the code you wrote for the io you put a note about not being able to specify a root group for the tree. Is that related to this question? Do you have any other thoughts on this?
Currently, calling a numpy ufunc on a DataTree (e.g. np.sin(dt)) fails with a recursion error (below).
That's probably because I just naively included Dataset.__array_ufunc__ in the list of methods to decorate with @map_over_subtree, and it needs a smarter definition.
________________________________________TestUFuncs.test_root___________________________________________
self = <test_dataset_api.TestUFuncs object at 0x7f486a542100>
def test_root(self):
da = xr.DataArray(name="a", data=[1, 2, 3])
dt = DataNode("root", data=da)
expected_ds = np.sin(da.to_dataset())
> result_ds = np.sin(dt).ds
datatree/tests/test_dataset_api.py:179:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
datatree/datatree.py:92: in _map_over_subtree
out_tree.ds = func(out_tree.ds, *args, **kwargs)
../xarray/xarray/core/arithmetic.py:79: in __array_ufunc__
return apply_ufunc(
../xarray/xarray/core/computation.py:1184: in apply_ufunc
return apply_array_ufunc(func, *args, dask=dask)
../xarray/xarray/core/computation.py:811: in apply_array_ufunc
return func(*args)
datatree/datatree.py:92: in _map_over_subtree
out_tree.ds = func(out_tree.ds, *args, **kwargs)
../xarray/xarray/core/arithmetic.py:79: in __array_ufunc__
return apply_ufunc(
../xarray/xarray/core/computation.py:1184: in apply_ufunc
return apply_array_ufunc(func, *args, dask=dask)
../xarray/xarray/core/computation.py:811: in apply_array_ufunc
return func(*args)
datatree/datatree.py:92: in _map_over_subtree
out_tree.ds = func(out_tree.ds, *args, **kwargs)
datatree/datatree.py:92: in _map_over_subtree
out_tree.ds = func(out_tree.ds, *args, **kwargs)
../xarray/xarray/core/arithmetic.py:79: in __array_ufunc__
return apply_ufunc(
../xarray/xarray/core/computation.py:1184: in apply_ufunc
return apply_array_ufunc(func, *args, dask=dask)
../xarray/xarray/core/computation.py:793: in apply_array_ufunc
if any(is_duck_dask_array(arg) for arg in args):
../xarray/xarray/core/computation.py:793: in <genexpr>
if any(is_duck_dask_array(arg) for arg in args):
../xarray/xarray/core/pycompat.py:47: in is_duck_dask_array
if DuckArrayModule("dask").available:
../xarray/xarray/core/pycompat.py:21: in __init__
duck_array_module = import_module(mod)
/home/tom/miniconda3/envs/py38-mamba/lib/python3.8/importlib/__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
<frozen importlib._bootstrap>:1011: in _gcd_import
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
name = 'dask', package = None, level = 0
> ???
E RecursionError: maximum recursion depth exceeded while calling a Python object
<frozen importlib._bootstrap>:939: RecursionError
!!! Recursion error detected, but an error occurred locating the origin of recursion.
The following exception happened when comparing locals in the stack frame:
ValueError: dimensions ('dim_0',) must have the same length as the number of data dimensions, ndim=0
Displaying first and last 10 stack frames out of 767.
We want to be able to load all groups from a netCDF file as nodes in a DataTree, simply via
dt = open_datatree('some_data.nc')
To populate this tree we need to know the structure of all the groups in the netCDF file, but currently in xarray the open_dataset
function makes it pretty difficult to see this information. Instead it only allows you to open one group (whose name you have to know in advance!) via the group
keyword argument.
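One way around this limitation would be to first walk the file's group hierarchy ourselves, then call open_dataset once per group. A minimal sketch of the discovery step, assuming the netCDF4-python convention that Dataset and Group objects expose a .groups mapping of name -> subgroup (list_all_groups itself is a hypothetical helper, not part of datatree):

```python
from typing import List

def list_all_groups(nc_group, path: str = "/") -> List[str]:
    """Recursively collect the paths of every group in an open netCDF file.

    Assumes `nc_group` exposes a `.groups` mapping of name -> subgroup, as
    netCDF4-python does. Each returned path could then be passed to
    `xr.open_dataset(filename, group=path)` to populate one node of the tree.
    """
    paths = [path]
    for name, subgroup in nc_group.groups.items():
        child_path = path.rstrip("/") + "/" + name
        paths.extend(list_all_groups(subgroup, child_path))
    return paths
```

With something like this, open_datatree would no longer need the user to know any group names in advance.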
Having come back to this project for the first time in a while, I want to propose a design for the .ds
property that I think will solve a few problems at once.
Accessing data specific to a node via .ds
is intuitive for our tree design, and if the .ds
object has a dataset-like API it also neatly solves the ambiguity of whether a method should act on one node or the whole tree.
But returning an actual xr.Dataset as we do now causes a couple of problems:
1. .ds.__setitem__ causes consistency headaches,
2. there is no way to refer to variables stored in other nodes (e.g. via .ds['../common_var']), which limits the usefulness of map_over_subtree and of the tree concept in general.
After we refactor DataTree to store Variable objects under ._variables directly, instead of storing a Dataset object under ._ds, .ds will have to reconstruct a dataset from these private attributes anyway.
I propose that rather than constructing a Dataset
object we instead construct and return a NodeDataView
object, which has mostly the same API as xr.Dataset
, but with a couple of key differences:
No mutation allowed (for now)
Whilst it could be nice if e.g. dt.ds[var] = da
actually updated dt
, that is really finicky, and for now at least it's probably fine to just forbid it, and point users towards dt[var] = da
instead.
Allow path-like access to DataArray
objects stored in other nodes via __getitem__
One of the primary motivations of a tree is to allow computations on leaf nodes to refer to common variables stored further up the tree. For instance imagine I have heterogeneous datasets but I want to refer to a common "reference_pressure":
print(dt)
DataTree(parent=None)
│ Dimensions: ()
│ Coordinates:
│ * reference_pressure () 100.0
├── DataTree('observations')
│ └── DataTree('satellite')
│ Dimensions: (x: 6, y: 3)
│ Dimensions without coordinates: x, y
│ Data variables:
│ p (x, y) float64 0.2337 -1.755 0.5762 ... -0.4241 0.7463 -0.4298
└── DataTree('simulations')
├── DataTree('coarse')
│ Dimensions: (x: 2, y: 3)
│ Coordinates:
│ * x (x) int64 10 20
│ Dimensions without coordinates: y
│ Data variables:
│ p (x, y) float64 0.2337 -1.755 0.5762 -0.4241 0.7463 -0.4298
└── DataTree('fine')
Dimensions: (x: 6, y: 3)
Coordinates:
* x (x) int64 10 12 14 16 18 20
Dimensions without coordinates: y
Data variables:
p (x, y) float64 0.2337 -1.755 0.5762 ... -0.4241 0.7463 -0.4298
I have a function which accepts and consumes datasets:
def normalize_pressure(ds):
ds['p'] = ds['p'] / ds['/reference_pressure']
return ds
then map it over the tree:
result_tree = dt.map_over_subtree(normalize_pressure)
If we allowed path-like access to data in other nodes from .ds
then this would work because map_over_subtree
applies normalize_pressure
to the .ds
attribute of every node, and '/reference_pressure'
means "look for the variable 'reference_pressure'
in the root node of the tree".
(In this case referring to the reference pressure with ds['../../reference_pressure']
would also have worked.)
(PPS if we chose to support the CF conventions' upwards proximity search behaviour then ds['reference_pressure']
would have worked too, because then __getitem__
would search upwards through the nodes for the first with a variable matching the desired name.)
A simple implementation could then just subclass xr.Dataset
:
class NodeDataView(Dataset):
    _wrapping_node: DataTree

    @classmethod
    def _construct_direct(cls, wrapping_node, variables, coord_names, ...) -> NodeDataView:
        ...

    def __setitem__(self, key, val):
        raise AttributeError(
            "Mutation is not allowed, please use __setitem__ on the wrapping DataTree node, "
            "or use `DataTree.to_dataset()` if you want a mutable dataset"
        )

    def __getitem__(self, key) -> DataArray:
        # calling the `_get_item` method of DataTree allows path-like access to contents of other nodes
        obj = self._wrapping_node._get_item(key)
        if isinstance(obj, DataArray):
            return obj
        else:
            raise KeyError("NodeDataView is only allowed to return variables, not entire tree nodes")

    # all API that doesn't modify state in-place can just be inherited from Dataset
    ...


class DataTree:
    _variables: Mapping[Hashable, Variable]
    _coord_names: set[Hashable]
    ...

    @property
    def ds(self) -> NodeDataView:
        return NodeDataView._construct_direct(self, self._variables, self._coord_names, ...)

    def to_dataset(self) -> Dataset:
        return Dataset._construct_direct(self._variables, self._coord_names, ...)
    ...
If we don't like subclassing Dataset
then we could cook up something similar using getattr
instead.
(This idea is probably what @shoyer, @jhamman and others were already thinking but I'm mostly just writing it out here for my own benefit.)
For some reason taking a .mean()
over part of the tree obtained by opening this netCDF file creates a new tree with the wrong structure.
Bug looks like this:
from datatree import open_datatree
dt = open_datatree('../../Code/datatree/datafiles/epm044463.nc')
print(dt)
DataTree('root')
│ Dimensions: ()
│ Data variables:
│ *empty*
│ Attributes: (12/15)
│ generator: IDAM PutData VERSION 1 (Apr 20 2021)
│ Conventions: Fusion-1.1
│ class: analysed data
│ title: EFIT++ Equilbrium Reconstruction epm
│ date: 2021-07-20
│ time: 13:51:58
│ ... ...
│ comment: magnetics
│ runcmd: efit++ -p 44463 -t mastu --pass_number 0 --runmod...
│ executableGitCommitId: c9c55b35e653f64af97db6f4a06e0181898b7a9b
│ executableGitRepo: https://git.ccfe.ac.uk/EFIT_PP/efitpp.git
│ runScriptsGitCommitId: 922855671166ea5b8aac2c95858cb9cedf5795d7
│ runScriptsGitRepo: https://git.ccfe.ac.uk/EFIT_PP/efitpp.git
└── DataTree('epm')
│ Dimensions: (time: 56)
│ Coordinates:
│ * time (time) timedelta64[ns] 00:00:00.020000 ... 00:0...
│ Data variables:
│ equilibriumStatusInteger (time) int32 ...
│ Attributes:
│ badChi2Flag: 1
│ badChi2MagneticProbe: b_nl2_p05,b_nu1_n08,b_nu1_p09
│ badChi2MagneticProbeId: [ 69 75 239]
├── DataTree('input')
│ │ Dimensions: (time: 56)
│ │ Dimensions without coordinates: time
│ │ Data variables:
│ │ bVacRadiusProduct (time) float64 ...
│ ├── DataTree('constraints')
│ │ ├── DataTree('pfCircuits')
│ │ │ Dimensions: (pfCircuitsDim: 64, time: 56)
│ │ │ Dimensions without coordinates: pfCircuitsDim, time
│ │ │ Data variables:
│ │ │ shortName (pfCircuitsDim) |S13 ...
│ │ │ id (pfCircuitsDim) int32 ...
│ │ │ computed (time, pfCircuitsDim) float64 ...
│ │ │ weights (time, pfCircuitsDim) float64 ...
│ │ │ target (time, pfCircuitsDim) float64 ...
│ │ │ sigmas (time, pfCircuitsDim) float64 ...
│ │ │ timeSliceSource (pfCircuitsDim) int8 ...
│ │ ├── DataTree('plasmaCurrent')
│ │ │ Dimensions: (time: 56)
│ │ │ Dimensions without coordinates: time
│ │ │ Data variables:
│ │ │ weights (time) float64 ...
│ │ │ computed (time) float64 ...
│ │ │ sigma (time) float64 ...
│ │ │ target (time) float64 ...
│ │ ├── DataTree('fluxLoops')
│ │ │ Dimensions: (fluxLoopDim: 102, fluxLoopElementDim: 1, time: 56)
│ │ │ Dimensions without coordinates: fluxLoopDim, fluxLoopElementDim, time
│ │ │ Data variables:
│ │ │ shortName (fluxLoopDim) |S8 ...
│ │ │ toroidalAngleEnd (fluxLoopDim, fluxLoopElementDim) float64 ...
│ │ │ zValues (fluxLoopDim, fluxLoopElementDim) float64 ...
│ │ │ id (fluxLoopDim) int32 ...
│ │ │ computed (time, fluxLoopDim) float64 ...
│ │ │ weights (time, fluxLoopDim) float64 ...
│ │ │ target (time, fluxLoopDim) float64 ...
│ │ │ rValues (fluxLoopDim, fluxLoopElementDim) float64 ...
│ │ │ sigmas (time, fluxLoopDim) float64 ...
│ │ │ toroidalAngleBegin (fluxLoopDim, fluxLoopElementDim) float64 ...
│ │ ├── DataTree('magneticProbes')
│ │ │ Dimensions: (magneticProbeDim: 354, time: 56)
│ │ │ Dimensions without coordinates: magneticProbeDim, time
│ │ │ Data variables: (12/14)
│ │ │ shortName (magneticProbeDim) |S9 ...
│ │ │ axialLength (magneticProbeDim) float64 ...
│ │ │ poloidalOrientation (magneticProbeDim) float64 ...
│ │ │ rCentre (magneticProbeDim) float64 ...
│ │ │ toroidalSector (magneticProbeDim) int32 ...
│ │ │ area (magneticProbeDim) float64 ...
│ │ │ ... ...
│ │ │ turnCount (magneticProbeDim) float64 ...
│ │ │ computed (time, magneticProbeDim) float64 ...
│ │ │ weights (time, magneticProbeDim) float64 ...
│ │ │ target (time, magneticProbeDim) float64 ...
│ │ │ sigmas (time, magneticProbeDim) float64 ...
│ │ │ zCentre (magneticProbeDim) float64 ...
│ │ ├── DataTree('diamagneticFlux')
│ │ │ Dimensions: (time: 56)
│ │ │ Dimensions without coordinates: time
│ │ │ Data variables:
│ │ │ weights (time) float64 ...
│ │ │ computed (time) float64 ...
│ │ │ sigma (time) float64 ...
│ │ │ target (time) float64 ...
│ │ └── DataTree('q0')
│ │ Dimensions: (time: 56)
│ │ Dimensions without coordinates: time
│ │ Data variables:
│ │ weights (time) float64 ...
│ │ computed (time) float64 ...
│ │ sigma (time) float64 ...
│ │ target (time) float64 ...
│ ├── DataTree('limiter')
│ │ Dimensions: (limiterCoord: 91, unityDim: 1)
│ │ Dimensions without coordinates: limiterCoord, unityDim
│ │ Data variables:
│ │ rValues (limiterCoord) float64 ...
│ │ zValues (limiterCoord) float64 ...
│ │ id (unityDim) int32 ...
│ ├── DataTree('numericalControls')
│ │ │ Dimensions: (time: 56)
│ │ │ Dimensions without coordinates: time
│ │ │ Data variables: (12/14)
│ │ │ relaxFactor (time) float64 ...
│ │ │ mxiter (time) int32 ...
│ │ │ shiftPressureWhenNegative (time) int32 ...
│ │ │ saimin (time) float64 ...
│ │ │ suppressRevCurr (time) int32 ...
│ │ │ SOLCurrentsOption (time) int32 ...
│ │ │ ... ...
│ │ │ initialZCoord (time) float64 ...
│ │ │ fieldLineResol (time) float64 ...
│ │ │ scalea (time) int32 ...
│ │ │ useBin (time) int32 ...
│ │ │ conditionFactor (time) float64 ...
│ │ │ error (time) float64 ...
│ │ ├── DataTree('pp')
│ │ │ Dimensions: (time: 56, r: 65)
│ │ │ Dimensions without coordinates: time, r
│ │ │ Data variables:
│ │ │ knt (time, r) float64 ...
│ │ │ bdry (time, r) float64 ...
│ │ │ ndeg (time) int32 ...
│ │ │ func (time) int32 ...
│ │ │ bdry2 (time, r) float64 ...
│ │ │ tens (time) float64 ...
│ │ │ kbdry2 (time, r) int32 ...
│ │ │ minPsin (time) float64 ...
│ │ │ maxPsin (time) float64 ...
│ │ │ kbdry (time, r) int32 ...
│ │ │ kknt (time) int32 ...
│ │ │ edge (time) int32 ...
│ │ ├── DataTree('ffp')
│ │ │ Dimensions: (time: 56, z: 65)
│ │ │ Dimensions without coordinates: time, z
│ │ │ Data variables:
│ │ │ knt (time, z) float64 ...
│ │ │ bdry (time, z) float64 ...
│ │ │ ndeg (time) int32 ...
│ │ │ func (time) int32 ...
│ │ │ bdry2 (time, z) float64 ...
│ │ │ tens (time) float64 ...
│ │ │ kbdry2 (time, z) int32 ...
│ │ │ minPsin (time) float64 ...
│ │ │ maxPsin (time) float64 ...
│ │ │ kbdry (time, z) int32 ...
│ │ │ kknt (time) int32 ...
│ │ │ edge (time) int32 ...
│ │ ├── DataTree('ne')
│ │ │ Dimensions: (time: 56, radialCoord: 65)
│ │ │ Dimensions without coordinates: time, radialCoord
│ │ │ Data variables:
│ │ │ knt (time, radialCoord) float64 ...
│ │ │ bdry (time, radialCoord) float64 ...
│ │ │ ndeg (time) int32 ...
│ │ │ func (time) int32 ...
│ │ │ bdry2 (time, radialCoord) float64 ...
│ │ │ tens (time) float64 ...
│ │ │ kbdry2 (time, radialCoord) int32 ...
│ │ │ minPsin (time) float64 ...
│ │ │ maxPsin (time) float64 ...
│ │ │ kbdry (time, radialCoord) int32 ...
│ │ │ kknt (time) int32 ...
│ │ │ edge (time) int32 ...
│ │ └── DataTree('ww')
│ │ Dimensions: (time: 56, normalizedPoloidalFlux: 65)
│ │ Dimensions without coordinates: time, normalizedPoloidalFlux
│ │ Data variables:
│ │ knt (time, normalizedPoloidalFlux) float64 ...
│ │ bdry (time, normalizedPoloidalFlux) float64 ...
│ │ ndeg (time) int32 ...
│ │ func (time) int32 ...
│ │ bdry2 (time, normalizedPoloidalFlux) float64 ...
│ │ tens (time) float64 ...
│ │ kbdry2 (time, normalizedPoloidalFlux) int32 ...
│ │ minPsin (time) float64 ...
│ │ maxPsin (time) float64 ...
│ │ kbdry (time, normalizedPoloidalFlux) int32 ...
│ │ kknt (time) int32 ...
│ │ edge (time) int32 ...
│ └── DataTree('pfSystem')
│ ├── DataTree('pfCoils')
│ │ Dimensions: (pfCoilElements: 1566)
│ │ Dimensions without coordinates: pfCoilElements
│ │ Data variables:
│ │ coilId (pfCoilElements) int32 ...
│ │ circuitId (pfCoilElements) int64 ...
│ │ turnCount (pfCoilElements) float64 ...
│ │ rCentre (pfCoilElements) float64 ...
│ │ zCentre (pfCoilElements) float64 ...
│ │ dR (pfCoilElements) float64 ...
│ │ dZ (pfCoilElements) float64 ...
│ │ angle1 (pfCoilElements) float64 ...
│ │ angle2 (pfCoilElements) float64 ...
│ └── DataTree('passiveStructures')
│ Dimensions: (passiveStructureElements: 70)
│ Dimensions without coordinates: passiveStructureElements
│ Data variables:
│ coilId (passiveStructureElements) int32 ...
│ circuitId (passiveStructureElements) int32 ...
│ turnCount (passiveStructureElements) float64 ...
│ rCentre (passiveStructureElements) float64 ...
│ zCentre (passiveStructureElements) float64 ...
│ dR (passiveStructureElements) float64 ...
│ dZ (passiveStructureElements) float64 ...
│ angle1 (passiveStructureElements) float64 ...
│ angle2 (passiveStructureElements) float64 ...
├── DataTree('regularGrid')
│ Dimensions: (unityDim: 1)
│ Dimensions without coordinates: unityDim
│ Data variables:
│ rMin (unityDim) float64 ...
│ zMax (unityDim) float64 ...
│ zMin (unityDim) float64 ...
│ rMax (unityDim) float64 ...
│ nz (unityDim) int32 ...
│ nr (unityDim) int32 ...
└── DataTree('output')
├── DataTree('profiles2D')
│ Dimensions: (r: 65, z: 65, time: 56)
│ Coordinates:
│ * r (r) float64 0.06 0.09031 0.1206 0.1509 ... 1.939 1.97 2.0
│ * z (z) float64 -2.2 -2.131 -2.062 -1.994 ... 2.062 2.131 2.2
│ Dimensions without coordinates: time
│ Data variables:
│ jphi (time, r, z) float64 ...
│ Bphi (time, r, z) float64 ...
│ Bpol (time, r, z) float64 ...
│ Br (time, r, z) float64 ...
│ poloidalFlux (time, r, z) float64 ...
│ Bz (time, r, z) float64 ...
│ psiNorm (time, r, z) float64 ...
├── DataTree('globalParameters')
│ │ Dimensions: (time: 56)
│ │ Dimensions without coordinates: time
│ │ Data variables: (12/32)
│ │ rt (time) float64 ...
│ │ q2Radius (time) float64 ...
│ │ plasmaEnergy (time) float64 ...
│ │ s3 (time) float64 ...
│ │ btorVacuumEnergy (time) float64 ...
│ │ q0 (time) float64 ...
│ │ ... ...
│ │ btorEnergy (time) float64 ...
│ │ plasmaVolume (time) float64 ...
│ │ psiBoundary (time) float64 ...
│ │ li1 (time) float64 ...
│ │ li2 (time) float64 ...
│ │ li3 (time) float64 ...
│ ├── DataTree('currentCentroid')
│ │ Dimensions: (time: 56)
│ │ Dimensions without coordinates: time
│ │ Data variables:
│ │ R (time) float64 ...
│ │ Z (time) float64 ...
│ └── DataTree('magneticAxis')
│ Dimensions: (time: 56)
│ Dimensions without coordinates: time
│ Data variables:
│ R (time) float64 ...
│ Z (time) float64 ...
├── DataTree('radialProfiles')
│ Dimensions: (time: 56, radialCoord: 65)
│ Dimensions without coordinates: time, radialCoord
│ Data variables: (12/17)
│ rotationalPressure (time, radialCoord) float64 ...
│ Bt (time, radialCoord) float64 ...
│ q (time, radialCoord) float64 ...
│ Br (time, radialCoord) float64 ...
│ poloidalArea (time, radialCoord) float64 ...
│ ffPrime (time, radialCoord) float64 ...
│ ... ...
│ jphi (time, radialCoord) float64 ...
│ staticPressure (time, radialCoord) float64 ...
│ r (time, radialCoord) float64 ...
│ Bz (time, radialCoord) float64 ...
│ plasmaVolume (time, radialCoord) float64 ...
│ staticPPrime (time, radialCoord) float64 ...
├── DataTree('fluxFunctionProfiles')
│ Dimensions: (normalizedPoloidalFlux: 65, time: 56)
│ Coordinates:
│ * normalizedPoloidalFlux (normalizedPoloidalFlux) float64 0.0 0.01562 ... 1.0
│ Dimensions without coordinates: time
│ Data variables: (12/25)
│ rotationalPressure (time, normalizedPoloidalFlux) float64 ...
│ gOutside (time, normalizedPoloidalFlux) float64 ...
│ javrg (time, normalizedPoloidalFlux) float64 ...
│ IOutside (time, normalizedPoloidalFlux) float64 ...
│ q (time, normalizedPoloidalFlux) float64 ...
│ poloidalFlux (time, normalizedPoloidalFlux) float64 ...
│ ... ...
│ plasmaFluxVolume (time, normalizedPoloidalFlux) float64 ...
│ elongation (time, normalizedPoloidalFlux) float64 ...
│ staticPPrime (time, normalizedPoloidalFlux) float64 ...
│ lowerTriangularity (time, normalizedPoloidalFlux) float64 ...
│ rInboard (time, normalizedPoloidalFlux) float64 ...
│ iota (time, normalizedPoloidalFlux) float64 ...
├── DataTree('separatrixGeometry')
│ Dimensions: (time: 56, strikepointDim: 4, xpointDim: 2, boundaryCoordsDim: 361)
│ Dimensions without coordinates: time, strikepointDim, xpointDim, boundaryCoordsDim
│ Data variables: (12/33)
│ dndXpointCount (time) float64 ...
│ dndXpoint1InnerStrikepointR (time) float64 ...
│ limiterZ (time) float64 ...
│ dndXpoint2InnerStrikepointZ (time) float64 ...
│ dndXpoint2OuterStrikepointZ (time) float64 ...
│ strikepointR (time, strikepointDim) float64 ...
│ ... ...
│ dndXpoint2OuterStrikepointR (time) float64 ...
│ rBoundary (time, boundaryCoordsDim) float64 ...
│ rmidplaneIn (time) float64 ...
│ dndXpoint2InnerStrikepointR (time) float64 ...
│ zBoundary (time, boundaryCoordsDim) float64 ...
│ drsepIn (time) float64 ...
├── DataTree('numericalDetails')
│ Dimensions: (time: 56, maximumIterationCount: 30)
│ Dimensions without coordinates: time, maximumIterationCount
│ Data variables:
│ chiSquared (time, maximumIterationCount) float64 ...
│ iterationCount (time) int32 ...
│ poloidalFluxError (time, maximumIterationCount) float64 ...
│ finalChiSquared (time) float64 ...
│ finalPoloidalFluxError (time) float64 ...
└── DataTree('degreesOfFreedom')
Dimensions: (time: 56, pfCircuitDim: 64, fFPrimeDim: 2, pPrimeDim: 2)
Dimensions without coordinates: time, pfCircuitDim, fFPrimeDim, pPrimeDim
Data variables: (12/14)
freePfCurrents (time, pfCircuitDim) int32 ...
countAllPfCurrent (time) int32 ...
offsetPressure (time) float64 ...
offsetRotationalPressure (time) float64 ...
count (time) int32 ...
ffprimeCoeffs (time, fFPrimeDim) float64 ...
... ...
pprimeCoeffs (time, pPrimeDim) float64 ...
pfCurrents (time, pfCircuitDim) float64 ...
countWPrime (time) int32 ...
countFreePfCurrent (time) int32 ...
cdelz (time) float64 ...
countFFPrime (time) int32 ...
constraints = dt['epm/input/constraints']
print(constraints)
DataTree('constraints')
├── DataTree('pfCircuits')
│ Dimensions: (pfCircuitsDim: 64, time: 56)
│ Dimensions without coordinates: pfCircuitsDim, time
│ Data variables:
│ shortName (pfCircuitsDim) |S13 ...
│ id (pfCircuitsDim) int32 ...
│ computed (time, pfCircuitsDim) float64 ...
│ weights (time, pfCircuitsDim) float64 ...
│ target (time, pfCircuitsDim) float64 ...
│ sigmas (time, pfCircuitsDim) float64 ...
│ timeSliceSource (pfCircuitsDim) int8 ...
├── DataTree('plasmaCurrent')
│ Dimensions: (time: 56)
│ Dimensions without coordinates: time
│ Data variables:
│ weights (time) float64 ...
│ computed (time) float64 ...
│ sigma (time) float64 ...
│ target (time) float64 ...
├── DataTree('fluxLoops')
│ Dimensions: (fluxLoopDim: 102, fluxLoopElementDim: 1, time: 56)
│ Dimensions without coordinates: fluxLoopDim, fluxLoopElementDim, time
│ Data variables:
│ shortName (fluxLoopDim) |S8 ...
│ toroidalAngleEnd (fluxLoopDim, fluxLoopElementDim) float64 ...
│ zValues (fluxLoopDim, fluxLoopElementDim) float64 ...
│ id (fluxLoopDim) int32 ...
│ computed (time, fluxLoopDim) float64 ...
│ weights (time, fluxLoopDim) float64 ...
│ target (time, fluxLoopDim) float64 ...
│ rValues (fluxLoopDim, fluxLoopElementDim) float64 ...
│ sigmas (time, fluxLoopDim) float64 ...
│ toroidalAngleBegin (fluxLoopDim, fluxLoopElementDim) float64 ...
├── DataTree('magneticProbes')
│ Dimensions: (magneticProbeDim: 354, time: 56)
│ Dimensions without coordinates: magneticProbeDim, time
│ Data variables: (12/14)
│ shortName (magneticProbeDim) |S9 ...
│ axialLength (magneticProbeDim) float64 ...
│ poloidalOrientation (magneticProbeDim) float64 ...
│ rCentre (magneticProbeDim) float64 ...
│ toroidalSector (magneticProbeDim) int32 ...
│ area (magneticProbeDim) float64 ...
│ ... ...
│ turnCount (magneticProbeDim) float64 ...
│ computed (time, magneticProbeDim) float64 ...
│ weights (time, magneticProbeDim) float64 ...
│ target (time, magneticProbeDim) float64 ...
│ sigmas (time, magneticProbeDim) float64 ...
│ zCentre (magneticProbeDim) float64 ...
├── DataTree('diamagneticFlux')
│ Dimensions: (time: 56)
│ Dimensions without coordinates: time
│ Data variables:
│ weights (time) float64 ...
│ computed (time) float64 ...
│ sigma (time) float64 ...
│ target (time) float64 ...
└── DataTree('q0')
Dimensions: (time: 56)
Dimensions without coordinates: time
Data variables:
weights (time) float64 ...
computed (time) float64 ...
sigma (time) float64 ...
target (time) float64 ...
avg_constraints = constraints.mean(dim='time')
print(avg_constraints)
DataTree('constraints')
└── DataTree('root')
    └── DataTree('epm')
        └── DataTree('input')
            └── DataTree('constraints')
├── DataTree('pfCircuits')
│ Dimensions: (pfCircuitsDim: 64)
│ Dimensions without coordinates: pfCircuitsDim
│ Data variables:
│ shortName (pfCircuitsDim) |S13 b'p1' b'pc' ... b'VS6U' b'VS7U'
│ id (pfCircuitsDim) int32 1 2 3 4 5 6 7 ... 15 16 17 18 19 20
│ computed (pfCircuitsDim) float64 3.387e+03 7.398e-06 ... 9.465e+03
│ weights (pfCircuitsDim) float64 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0
│ target (pfCircuitsDim) float64 3.193e+03 0.0 ... 1.112e+04
│ sigmas (pfCircuitsDim) float64 254.5 1.0 ... 5.422e+03 5.246e+03
│ timeSliceSource (pfCircuitsDim) int8 2 2 2 2 2 2 2 2 2 ... 3 3 3 3 3 3 3 3
├── DataTree('plasmaCurrent')
│ Dimensions: ()
│ Data variables:
│ weights float64 1.0
│ computed float64 5.564e+05
│ sigma float64 1.084e+05
│ target float64 5.418e+05
├── DataTree('fluxLoops')
│ Dimensions: (fluxLoopDim: 102, fluxLoopElementDim: 1)
│ Dimensions without coordinates: fluxLoopDim, fluxLoopElementDim
│ Data variables:
│ shortName (fluxLoopDim) |S8 b'f_nu_01' b'f_nu_02' ... b'f_c_b12'
│ toroidalAngleEnd (fluxLoopDim, fluxLoopElementDim) float64 6.283 ... 6...
│ zValues (fluxLoopDim, fluxLoopElementDim) float64 1.358 ... -...
│ id (fluxLoopDim) int32 1 2 3 4 5 6 ... 97 98 99 100 101 102
│ computed (fluxLoopDim) float64 -0.08992 -0.08818 ... 0.05828
│ weights (fluxLoopDim) float64 1.0 1.0 1.0 1.0 ... 1.0 0.0 1.0
│ target (fluxLoopDim) float64 -0.09163 -0.09075 ... 0.0 0.05939
│ rValues (fluxLoopDim, fluxLoopElementDim) float64 0.901 ... 0...
│ sigmas (fluxLoopDim) float64 0.02 0.02 0.02 ... 0.02 0.02 0.02
│ toroidalAngleBegin (fluxLoopDim, fluxLoopElementDim) float64 0.0 ... 0.0
├── DataTree('magneticProbes')
│ Dimensions: (magneticProbeDim: 354)
│ Dimensions without coordinates: magneticProbeDim
│ Data variables: (12/14)
│ shortName (magneticProbeDim) |S9 b'b_bl1_n02' ... b'b_xu2_p36'
│ axialLength (magneticProbeDim) float64 0.0 0.0 0.0 ... 0.0 0.0 0.0
│ poloidalOrientation (magneticProbeDim) float64 1.614 1.614 ... 5.501 5.501
│ rCentre (magneticProbeDim) float64 1.1 1.175 ... 1.681 1.735
│ toroidalSector (magneticProbeDim) int32 0 0 0 0 0 0 0 ... 0 0 0 0 0 0
│ area (magneticProbeDim) float64 0.0 0.0 0.0 ... 0.0 0.0 0.0
│ ... ...
│ turnCount (magneticProbeDim) float64 0.0 0.0 0.0 ... 0.0 0.0 0.0
│ computed (magneticProbeDim) float64 -0.03454 ... -0.02538
│ weights (magneticProbeDim) float64 1.0 1.0 1.0 ... 0.0 1.0 1.0
│ target (magneticProbeDim) float64 -0.03664 ... -0.02521
│ sigmas (magneticProbeDim) float64 0.015 0.015 ... 0.015 0.015
│ zCentre (magneticProbeDim) float64 -1.569 -1.566 ... 1.749
├── DataTree('diamagneticFlux')
│ Dimensions: ()
│ Data variables:
│ weights float64 0.0
│ computed float64 -0.1356
│ sigma float64 0.0
│ target float64 0.0
└── DataTree('q0')
Dimensions: ()
Data variables:
weights float64 0.0
computed float64 -2.497
sigma float64 0.0
target float64 0.0
It will probably be easier to debug this and prevent it happening again after we have an assert_tree_equal
function like in #27.
Currently we have an open_datatree
function which opens a single netcdf file (or zarr store). We could imagine an open_mfdatatree
function which is analogous to open_mfdataset
, which can open multiple files at once.
As DataTree
has a structure essentially the same as that of a filesystem, I'm imagining a use case where the user has a bunch of data files stored in nested directories, e.g.
project/
    experimental/
        data.nc
    simulation/
        highres/
            output.nc
        lowres/
            output.nc
We could look through all of these folders recursively, open any files found of the correct format, and store them in a single tree.
We could even allow for multiple data files in each folder if we called open_mfdataset
on all the files found in each folder.
EDIT: We could also save a tree out to multiple folders like this using a save_mfdatatree
method.
This might be particularly useful for users who want the benefit of a tree-like structure but are using a file format that doesn't support groups.
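The discovery half of such an open_mfdatatree could be sketched like this (collect_datafile_paths is a hypothetical helper: each file list found would then be opened with xr.open_dataset, or xr.open_mfdataset when a folder holds several files, and stored at the corresponding node path):

```python
import os

def collect_datafile_paths(root_dir: str, suffix: str = ".nc") -> dict:
    """Map tree node paths to lists of data file paths by walking a
    directory hierarchy, mirroring the filesystem layout in the tree.
    """
    mapping = {}
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        rel = os.path.relpath(dirpath, root_dir)
        # "/" for the root directory, "/simulation/highres" for nested ones
        node_path = "/" if rel == "." else "/" + rel.replace(os.sep, "/")
        matches = sorted(f for f in filenames if f.endswith(suffix))
        if matches:
            mapping[node_path] = [os.path.join(dirpath, f) for f in matches]
    return mapping
```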
In xarray it's possible to automatically close a dataset after opening by opening it using a context manager. From the documentation:
Datasets have a Dataset.close() method to close the associated netCDF file. However, itโs often cleaner to use a with statement:
# this automatically closes the dataset after use
In [5]: with xr.open_dataset("saved_on_disk.nc") as ds:
   ...:     print(ds.keys())
   ...:
We currently don't have a DataTree.close()
method, or any context manager behaviour for open_datatree
. To add them presumably we would need to iterate over all file handles (i.e. groups) and close them one by one.
So far we've only really implemented dictionary-like __getitem__/__setitem__ syntax, but we should add a variety of other ways to select nodes from a tree too. Here are some suggestions:
class DataTree:
...
def __getitem__(self, key: str) -> DataTree | DataArray:
"""
Accepts node/variable names, or file-like paths to nodes/variables (inc. '../var').
(Also needs to accommodate indexing somehow.)
"""
...
def subset(self, keys: Sequence[str]) -> DataTree:
"""
Return new tree containing only nodes with names matching keys.
(Could probably be combined with `__getitem__`.
Also unsure what the return type should be.)
"""
...
@property
def subtree(self) -> Iterator[DataTree]:
"""An iterator over all nodes in this tree, including both self and all descendants."""
...
def filter(self, filterfunc: Callable) -> Iterator[DataTree]:
"""Filters subtree by returning only nodes for which `filterfunc(node)` is True."""
...
Are there other types of access that we're missing here? Filtering by regex match? Getting nodes where at least one part of the path matches ("tag-like" access)? Glob?
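The subtree and filter pieces of the sketch above are easy to prototype; here is one way they could fit together, assuming only that nodes expose an anytree-style `children` iterable (iter_subtree and filter_tree are hypothetical names, not datatree's API):

```python
from typing import Callable, Iterator

def iter_subtree(node) -> Iterator:
    """Depth-first iterator over a node and all of its descendants."""
    yield node
    for child in getattr(node, "children", ()):
        yield from iter_subtree(child)

def filter_tree(root, filterfunc: Callable) -> Iterator:
    """Yield only the nodes for which `filterfunc(node)` is True."""
    return (node for node in iter_subtree(root) if filterfunc(node))
```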
There were no release notes for v0.0.8 and v0.0.9. I bumped the current version in #123 but I assume there is more content missing?
Currently I have exposed the properties of the dataset which a node (optionally) wraps, so that a user can do
dt = DataNode(data=xr.Dataset())
dt.dims # returns .dims property of underlying dataset
However as @jbusecke pointed out this behaviour is inconsistent with the way that all other dataset methods are dispatched over all datasets in the tree - this forwarded property only looks at the Dataset in the node it was called on, ignoring all child nodes.
I see 4 ways to think about this:
1. Keep forwarding the properties from the node's own dataset. This is consistent with accessing .ds first, but is also possibly counterintuitive.
2. Collect the property from every node instead, e.g. dims_on_all_nodes = {node.pathstr: node.ds.dims for node in dt.subtree}.
3. Remove the forwarded properties entirely, requiring users to go through dt.ds.dims to access the properties. This would reinforce the distinction between a DataTree and a Dataset, even though you could still call a lot of the dataset API on the datatree object (e.g. dt.mean()). (This would also make #17 obsolete.)
4. Make DataTree inherit directly from Dataset, as discussed in #2. This would effectively be the same API as (1) though, just with automatic inheritance of the properties instead of manual wrapping of them.
I'm now leaning quite heavily towards (3).
It would be nice to have plotting methods that allow for easily comparing variables in different parts of a tree.
One simple way would be to have a dt.plot(variable, **kwargs)
method which uses a legend to distinguish between variables, i.e.
import matplotlib.pyplot as plt

def plot(self, variable, **kwargs):
    fig, ax = plt.subplots()
    for node in self.subtree:
        da = node[variable]
        da.plot(ax=ax, label=node.name, **kwargs)
    ax.legend()
    ax.set_title(variable)
    plt.show()
The use cases for this would be multi-resolution datasets, or multi-model ensembles, where the user wants to see how a single variable varies across different datasets.
Care would have to be taken to ensure that incompatible dimensions or coordinates were handled smoothly.
The tests shouldn't fail in a local environment that doesn't have zarr/netcdf, it should just skip those tests.
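With pytest this is typically done with pytest.importorskip or a pytest.mark.skipif decorator; the same effect with only the standard library looks like this (has_module and RoundtripTests are illustrative names, not from datatree's test suite):

```python
import importlib.util
import unittest

def has_module(name: str) -> bool:
    """True if the optional dependency is importable in this environment."""
    return importlib.util.find_spec(name) is not None

class RoundtripTests(unittest.TestCase):
    # skipped (not failed) in environments without the zarr backend
    @unittest.skipUnless(has_module("zarr"), "requires zarr")
    def test_roundtrip_zarr(self):
        ...  # would write a DataTree to zarr and read it back
```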
@andersy005 and @malmans2 whilst we (you) managed to fix the release version on pypi I think we might still be stuck in the past on conda-forge?
by prepending repeated symbol(s) (dot, cross, etc.) to those names so that it informs about their hierarchical level.
Makes sense benbovy - that is basically how anytree achieves this:
DataTree('root', parent=None)
│ Dimensions: (y: 3, x: 2)
│ Dimensions without coordinates: y, x
│ Data variables:
│ a (y) int64 6 7 8
│ set0 (x) int64 9 10
├── DataTree('set1')
│ │ Dimensions: ()
│ │ Data variables:
│ │ a int64 0
│ │ b int64 1
│ ├── DataTree('set1')
│ └── DataTree('set2')
├── DataTree('set2')
│ │ Dimensions: (x: 2)
│ │ Dimensions without coordinates: x
│ │ Data variables:
│ │ a (x) int64 2 3
│ │ b (x) float64 0.1 0.2
│ └── DataTree('set1')
└── DataTree('set3')
I imagine I could use similar logic to anytree's RenderTree, which just uses 3 different characters (vertical=│, continuation=├──, end=└──).
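The rendering logic with those three glyphs is short enough to sketch in full (render and the simple node class are illustrative, not anytree's actual implementation):

```python
def render(node, prefix: str = "", lines=None) -> list:
    """Render a tree as a list of lines using the three anytree-style
    glyphs. Assumes each node has `name` and `children` attributes.
    """
    if lines is None:
        lines = [str(node.name)]
    children = list(node.children)
    for i, child in enumerate(children):
        last = i == len(children) - 1
        connector = "└── " if last else "├── "
        lines.append(prefix + connector + str(child.name))
        # descendants of a non-last child carry a vertical bar at this level
        extension = "    " if last else "│   "
        render(child, prefix + extension, lines)
    return lines
```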
Originally posted by @TomNicholas in #78 (comment)
Hi @TomNicholas !
I am one of the core devs of satpy (https://github.com/pytroll/satpy), which makes use of xarray/dask to handle satellite data for earth-observing satellites.
In this context, we often have satellite data with different resolutions for the same dataset, so xarray's Dataset can't really be used for these data, as the coords of the different variables don't match, and DataTree makes a lot of sense for us.
The satellite data is more often than not in some binary format; we read it and convert it to xarray.DataArrays, and I've now started experimenting with placing them in a DataTree by hand.
So it would be really nice if there was an interface for adding custom engines to read that data (multiple files). Did you already consider that? Do you maybe already have an idea on how this would work?
We have been wanting to stick closer to the data model of xarray in our library, and datatree looks like something we could really use :) let's hope we can contribute here, at least with ideas in the future.
Currently a DatasetNode
is both a node of the tree (so can have children) and can wrap a single xarray.Dataset
. If we followed #3 , we could instead choose to make TreeNodes unable to wrap Datasets directly, in favour of instead storing Dataset objects as children.
This would mean that multiple Datasets could be stored as the children of a single node, but I'm not sure if that's desirable or not. It would also ensure that the class representing a node of the tree is totally distinct from an xarray.Dataset (neither inheriting from xarray.Dataset nor wrapping it, only pointing to it as a child).
In conjunction with #3 this would mean that the syntax for selecting the variable 'pressure'
from the dataset stored under the node 'weather'
would be simply dt['weather']['pressure']
. (Although then selecting via dt['weather/pressure']
would become trickier.)
I realised that it is currently possible to get a tree into a state which (a) cannot be represented as a netCDF file, and (b) means __getitem__
becomes ambiguous.
See this example:
In [3]: dt = DataNode('root', data=xr.Dataset({'a': [0], 'b': 1}))
In [4]: child = DataNode('a', data=None, parent=dt)
In [5]: print(dt)
DataNode('root')
│ Dimensions: (a: 1)
│ Coordinates:
│ * a (a) int64 0
│ Data variables:
│ b int64 1
└── DataNode('a')
In [6]: dt['a']
Out[6]:
<xarray.DataArray 'a' (a: 1)>
array([0])
Coordinates:
* a (a) int64 0
In [7]: dt.get_node('a')
Out[7]: DataNode(name='a', parent='root', children=[],data=None)
Here `print(dt)` shows that `dt` is in a form forbidden by netCDF, because we have a child node and a variable with the same name (equivalent to having a group and a variable with the same name at the same level in netCDF).
Furthermore, when choosing an item via `DataTree.__getitem__` it merrily picks out the DataArray, even though the situation is ambiguous and I might have intended to pick out the child node `'a'` instead.
The node is still accessible via `.get_node`, but only because `.get_node` is inherited from `TreeNode`, which has no concept of data variables.
Contrast this silent collision of variable and child names with what happens if you try to assign two children with the same name:
In [8]: child = DataNode('a', data=None, parent=dt)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-8-8c4a956bcdd5> in <module>
----> 1 child = DataNode('a', data=None, parent=dt)
~/Documents/Work/Code/datatree/datatree/datatree.py in _init_single_datatree_node(cls, name, data, parent, children)
162 # This approach was inspired by xarray.Dataset._construct_direct()
163 obj = object.__new__(cls)
--> 164 obj = _init_single_treenode(obj, name=name, parent=parent, children=children)
165 obj.ds = data
166 return obj
~/Documents/Work/Code/datatree/datatree/treenode.py in _init_single_treenode(obj, name, parent, children)
13 obj.name = name
14
---> 15 obj.parent = parent
16 if children:
17 obj.children = children
~/Documents/Work/Code/anytree/anytree/node/nodemixin.py in parent(self, value)
133 self.__check_loop(value)
134 self.__detach(parent)
--> 135 self.__attach(value)
136
137 def __check_loop(self, node):
~/Documents/Work/Code/anytree/anytree/node/nodemixin.py in __attach(self, parent)
157 def __attach(self, parent):
158 if parent is not None:
--> 159 self._pre_attach(parent)
160 parentchildren = parent.__children_or_empty
161 assert not any(child is self for child in parentchildren), "Tree is corrupt." # pragma: no cover
~/Documents/Work/Code/datatree/datatree/treenode.py in _pre_attach(self, parent)
84 """
85 if self.name in list(c.name for c in parent.children):
---> 86 raise KeyError(
87 f"parent {parent.name} already has a child named {self.name}"
88 )
KeyError: 'parent root already has a child named a'
To prevent this we need better checks on assignment between variables and children. For example, `TreeNode.set_node(key, new_child)` currently checks for any existing children with name `key`, but it also needs to check for any variables in the dataset with name `key`. (That's not too hard to implement; it could be done by overloading `set_node` on `DataTree` to check against variables as well as children, for example.)
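That overload might look roughly like this. A hedged sketch only: `set_node(key, node)` and the `ds_variables` set are stand-ins for the real `TreeNode` API and the wrapped Dataset's variable names:

```python
class TreeNode:
    """Minimal stand-in for the real TreeNode (hypothetical)."""

    def __init__(self):
        self.children = {}

    def set_node(self, key, new_child):
        # Existing check: no two children may share a name
        if key in self.children:
            raise KeyError(f"parent already has a child named {key!r}")
        self.children[key] = new_child


class DataTree(TreeNode):
    def __init__(self, ds_variables=()):
        super().__init__()
        # Stand-in for the variable names of the wrapped xarray.Dataset
        self.ds_variables = set(ds_variables)

    def set_node(self, key, new_child):
        # Extra check: reject names that collide with data variables,
        # so the netCDF-forbidden state in the example above cannot arise
        if key in self.ds_variables:
            raise KeyError(
                f"cannot add child {key!r}: a variable of that name exists"
            )
        super().set_node(key, new_child)
```

With `dt = DataTree(ds_variables={"a"})`, `dt.set_node("a", ...)` would now raise, while `dt.set_node("c", ...)` succeeds.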
What is more difficult is if a child with name `key` exists, but the user tries to assign a variable with name `key` to the wrapped dataset. If the user does this via `node.ds.assign(key=new_da)` then that's manageable: in that case `assign()` has a return value, which they need to assign back to the node via `node.ds = node.ds.assign(key=new_da)`. We could check for name conflicts with children in the `.ds` property setter method.
However, if the user adds a variable via `node.ds[key] = new_da` then I think `node.ds` will be updated in place without its wrapping `DataTree` class ever having a chance to intervene. A similar issue with `node[key] = new_da` is preventable by improving the checking in `DataTree.__setitem__`, but I don't know how we can prevent this happening when all that is being called is `Dataset.__setitem__`.
I don't really know what to do about this, other than having a much more complicated class design which is no longer simple composition. Any ideas @dcherian maybe?
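For the manageable route, a check in the `.ds` property setter could look roughly like this. A minimal sketch with hypothetical names; the dict of variables stands in for an `xarray.Dataset`:

```python
class Node:
    """Hypothetical node: ``children`` is a set of child names, ``ds`` a
    dict of variable names to values (standing in for an xarray.Dataset)."""

    def __init__(self, children=()):
        self.children = set(children)
        self._ds = {}

    @property
    def ds(self):
        return self._ds

    @ds.setter
    def ds(self, new_ds):
        # Reject an assignment whose variable names collide with child names,
        # catching the node.ds = node.ds.assign(...) route described above
        clashes = set(new_ds) & self.children
        if clashes:
            raise KeyError(
                f"variables {sorted(clashes)} clash with existing child nodes"
            )
        self._ds = new_ds
```

Note that this still cannot intercept an in-place `node.ds[key] = new_da`, which is exactly the hard case described above.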
Currently I'm adding the `xarray.Dataset` methods to `DataTree` via a pattern basically like this:

```python
_DATASET_API_TO_COPY = ['isel', '__add__', ...]

class DatasetAPIMixin:
    def _add_dataset_api(self):
        for method_name in _DATASET_API_TO_COPY:
            ds_method = getattr(xarray.Dataset, method_name)
            # Decorate method so that when called it acts over the whole subtree
            mapped_method = map_over_subtree(ds_method)
            setattr(self, method_name, mapped_method)

class DataTree(DatasetAPIMixin):
    def __init__(self, *args):
        self._add_dataset_api()
```
The idea was that the use of mixins would echo how these methods were defined on `xarray.Dataset` originally, and also keep a distinction between methods that are actually unique to `DataTree` objects (such as `.groups`) and methods that are merely copied over from `xarray.Dataset`, like `.isel` (albeit with modifications such as mapping over child nodes).
I like my mixin idea, but one weird thing about this pattern is that the Dataset methods are only added to the `DataTree` once a `dt` instance is instantiated, not when the `DataTree` class is defined. I don't know if this is likely to cause problems, but at the very least it seems inefficient, because we run the code to loop through and attach all these methods every single time we create a new `DataTree` object. It's also not really an example of class inheritance right now: the mixins aren't actually doing anything other than being a different place to put the definition of `_add_dataset_api()`.
What would be better is if the dataset methods were added at class definition time rather than at object instantiation time, and ideally fully defined on the mixin before it is inherited. Then we wouldn't need to call any `_add_dataset_api()` method on the `dt` instance, because the methods would already be there.
The only way I can think of to do this within the class definitions is using a metaclass.
I could also set the attributes outside of the mixin definition but before the definition of `DataTree`, like this:

```python
class DatasetAPIMixin:
    pass

for method_name in _DATASET_API_TO_COPY:
    ds_method = getattr(xarray.Dataset, method_name)
    # Decorate method so that when called it acts over the whole subtree
    mapped_method = map_over_subtree(ds_method)
    setattr(DatasetAPIMixin, method_name, mapped_method)

class DataTree(DatasetAPIMixin):
    ...
```

I feel like I'm missing something, but this is the best I could come up with.
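A lighter alternative to a full metaclass could be Python's `__init_subclass__` hook, which runs once when each subclass is defined. A sketch with a stand-in method table and a stand-in `map_over_subtree` (both assumptions, not datatree's real API):

```python
def map_over_subtree(func):
    """Stand-in for the real decorator: just wraps the function here."""
    def mapped(self, *args, **kwargs):
        # The real version would apply func to every node's dataset
        return func(self, *args, **kwargs)
    return mapped


def upper_name(self):
    return self.name.upper()


# Stand-in for the list of xarray.Dataset methods to copy over
_METHODS_TO_COPY = {"upper_name": upper_name}


class DatasetAPIMixin:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once, at class definition time, for each subclass
        for method_name, method in _METHODS_TO_COPY.items():
            setattr(cls, method_name, map_over_subtree(method))


class DataTree(DatasetAPIMixin):
    def __init__(self, name):
        self.name = name
```

The attachment loop then runs exactly once, when `class DataTree(DatasetAPIMixin)` is executed, rather than on every instantiation.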
```python
a05["natre"] = datatree.DataTree.from_dict({"natre": xarray_dataset})["natre"]
```

Is this right?
I realised that part of the reason that arithmetic (#24) and ufuncs (#25) don't yet work is that the `map_over_subtree` decorator currently only maps over a single subtree.
This works fine for mapping unary functions such as `.isel`, because they only accept one tree-like argument (i.e. `self` for the `.isel` method). However, for any binary function such as `add(dt1, dt2)`, pairs of corresponding nodes in each tree need to be operated on together, as `result_ds = add(dt1[node].ds, dt2[node].ds)`, before the output tree is built up from the results.
In the most general case we need to be able to map functions like

```python
def func(*args, **kwargs):
    # do stuff involving multiple Dataset objects
    return output_trees
```

where any number of the args and kwargs could be DataTrees, and `output_trees` could be a list of any number of DataTrees.
To implement this the `map_over_subtree` decorator has to become a lot more general. It needs to:

1. check which of the `args` and `kwargs` are DataTree objects,
2. pass their node datasets to `func` as Datasets, without losing their position in `*args` and `**kwargs`,
3. use the outputs of `func` to rebuild M DataTree objects (which all have the same structure as the input trees), and return them.

We therefore have to decide what we mean by "isomorphic". The strictest definition would be that all node names are the same, so that
dt_1:
DataNode('foo')
| Data A
+---DataNode('bar')
+ Data B
could be mapped alongside
dt_2:
DataNode('foo')
| Data C
+---DataNode('bar')
+ Data D
but not alongside
dt_3:
DataNode('baz')
| Data C
+---DataNode('woz')
+ Data D
A more lenient definition would be that each node must have the same number of children as its counterpart in the other tree. (In other words, the tree structure must be the same, but the node names need not be. This requires the children to be ordered, to avoid ambiguities.) This definition would allow `dt_3` to be mapped over alongside `dt_1` or `dt_2` (or both simultaneously, for a `func` that accepts 3 Dataset arguments).
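Either definition can be checked with a simple recursive walk. A sketch over a hypothetical node class with an ordered list of children (not datatree's actual classes):

```python
class Node:
    """Hypothetical node with a name and an ordered list of children."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def isomorphic(a, b, require_names=False):
    # Strict mode additionally requires matching node names
    if require_names and a.name != b.name:
        return False
    if len(a.children) != len(b.children):
        return False
    # Children are ordered, so compare them pairwise
    return all(
        isomorphic(ca, cb, require_names) for ca, cb in zip(a.children, b.children)
    )


dt_1 = Node("foo", [Node("bar")])
dt_3 = Node("baz", [Node("woz")])
assert isomorphic(dt_1, dt_3)                          # lenient: same structure
assert not isomorphic(dt_1, dt_3, require_names=True)  # strict: names differ
```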
There is a bug in the `_check_loop` method, which meant that a tree containing a cycle could potentially be created; that should never be allowed.
I think this bug was introduced in #76, which made this check datatree's responsibility instead of anytree's.
Originally posted by @TomNicholas in #103 (comment)
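A correct loop check can simply walk the proposed parent's ancestry before attaching. A sketch of the idea (hypothetical names, not the actual datatree or anytree code):

```python
class TreeNode:
    """Minimal node with a settable parent (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self._parent = None

    def _check_loop(self, new_parent):
        # Walk up from the proposed parent; if we ever meet self,
        # attaching would create a cycle
        node = new_parent
        while node is not None:
            if node is self:
                raise ValueError(
                    f"setting parent would create a cycle on {self.name!r}"
                )
            node = node._parent

    def set_parent(self, new_parent):
        self._check_loop(new_parent)
        self._parent = new_parent
```

Because the walk always terminates at a `None` parent in an acyclic tree, the check is O(depth) and rejects any attachment that would make a node its own ancestor.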