xarray-contrib / datatree
WIP implementation of a tree-like hierarchical data structure for xarray.
Home Page: https://xarray-datatree.readthedocs.io
License: Apache License 2.0
Thanks for putting this library together!
I wonder if it would be useful / make sense to allow writing zarr regions of specific nodes / datasets with to_zarr.
Hi there!
At B-Open we would like to start testing and using this datatree implementation for a couple of projects.
I think it would be better if we work with a released version. Are you planning to make a release soon?
If there's anything we can do to help with it, feel free to ping @aurghs, @alexamici, and myself.
cc: @joshmoore
We need an assert_tree_equal function, which would allow us to simplify several tests.
Currently I have the root node of the tree being a DataTree, a subclass of DatasetNode, where DatasetNode is used for all the nodes apart from the root. We could instead have every node be an instance of the same class, with the root node distinguished only by having .parent=None.
The motivation for different classes was:
1. To have a different __init__ function for creating a whole tree as opposed to creating a single node. The former accepts a dictionary of (path, object) pairs to store in the tree, while the latter accepts only the information needed to create that specific node (i.e. DatasetNode(name, ds, parent, children)).
2. Wanting to have a .attrs dict only on the root node, so that there is only one set of attrs for the entire tree.
However:
1. We could instead add a second init method as a private classmethod on the same class (similar to xarray.Dataset._construct_direct()), i.e. DataTree._init_single_node(name, data, parent, children).
2. We might just want to allow a different .attrs dictionary on every node of the tree anyway.
So should all nodes just instead be instances of the same class?
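As a sketch of what the single-class design could look like (all names here are illustrative stand-ins modeled on the proposed DataTree._init_single_node; only single-level paths are handled, and no xarray machinery is included):

```python
class DataTree:
    """Single-class sketch: every node is a DataTree; the root just has parent=None."""

    def __init__(self, data_objects=None):
        # Public constructor: build a whole tree from {path: data} pairs.
        # (Only single-level paths are handled in this sketch.)
        self.name = None
        self.data = None
        self.parent = None
        self.children = {}
        for path, data in (data_objects or {}).items():
            name = path.strip("/").split("/")[-1]
            self.children[name] = self._init_single_node(name, data=data, parent=self)

    @classmethod
    def _init_single_node(cls, name, data=None, parent=None, children=None):
        # Private constructor: create one node directly, without walking paths
        # (analogous in spirit to xarray.Dataset._construct_direct).
        node = cls()
        node.name = name
        node.data = data
        node.parent = parent
        node.children = children or {}
        return node
```

With this layout the public `__init__` stays tree-oriented while internal code creates nodes one at a time through the classmethod.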
The code currently has a hard dependency on the anytree library, which it uses primarily to implement the actual .parent & .children structure of the TreeNode class, through inheritance from anytree.NodeMixin.
We don't need this dependency: anytree is not a big project, and we can simply reimplement the classes we need within this project instead. We should give due credit in the code, because anytree has an Apache 2.0 license (plus Scout's honor).
Reimplementing anytree's functionality would also allow us to change various parts of it. For example, we don't need support for Python 2; we can standardize the error types (e.g. raising KeyError instead of anytree.ResolverError); and we can get rid of the mis-spelled .anchestors property. We shouldn't need to worry about breaking the code, because the tree functionality is already covered by the unit tests in test_treenode.py.
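A reimplementation could be quite small. Here is a rough, pure-Python sketch of the parent/children machinery we'd need in place of anytree.NodeMixin (attribute names are assumptions; the real version would also need validation and the various iterators anytree provides):

```python
class TreeNode:
    """Minimal sketch of a parent/children structure replacing anytree.NodeMixin."""

    def __init__(self, name, parent=None):
        self.name = name
        self._parent = None
        self._children = []
        self.parent = parent  # goes through the setter below

    @property
    def parent(self):
        return self._parent

    @parent.setter
    def parent(self, new_parent):
        # Detach from the old parent before attaching to the new one,
        # so a node can never appear under two parents at once.
        if self._parent is not None:
            self._parent._children.remove(self)
        self._parent = new_parent
        if new_parent is not None:
            new_parent._children.append(self)

    @property
    def children(self):
        return tuple(self._children)

    @property
    def ancestors(self):
        # Correctly-spelled replacement for anytree's .anchestors:
        # all nodes from the root down to (but excluding) self.
        node, result = self, []
        while node.parent is not None:
            node = node.parent
            result.append(node)
        return tuple(reversed(result))
```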
Currently this works
dt["a"] = xr.DataArray(0)
but this fails
dt["a"] = 0
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 dt["a"] = 0
File ~/Documents/Work/Code/datatree/datatree/datatree.py:704, in DataTree.__setitem__(self, key, value)
700 elif isinstance(key, str):
701 # TODO should possibly deal with hashables in general?
702 # path-like: a name of a node/variable, or path to a node/variable
703 path = NodePath(key)
--> 704 return self._set_item(path, value, new_nodes_along_path=True)
705 else:
706 raise ValueError("Invalid format for key")
File ~/Documents/Work/Code/datatree/datatree/treenode.py:444, in TreeNode._set_item(self, path, item, new_nodes_along_path, allow_overwrite)
442 raise KeyError(f"Already a node object at path {path}")
443 else:
--> 444 current_node._set(name, item)
File ~/Documents/Work/Code/datatree/datatree/datatree.py:684, in DataTree._set(self, key, val)
682 self.update({key: val})
683 else:
--> 684 raise TypeError(f"Type {type(val)} cannot be assigned to a DataTree")
TypeError: Type <class 'int'> cannot be assigned to a DataTree
The latter syntax is convenient and intuitive, so we should relax this to accept any type, and raise an error if the DataArray constructor can't parse it.
However it's not totally trivial to implement because it breaks some assumptions in the code that walks the tree.
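One way to relax it is sketched below, under the assumption that a helper like this would be called from DataTree._set before assignment (the helper name is hypothetical, not part of the codebase):

```python
import xarray as xr


def coerce_to_dataarray(value):
    """Sketch: coerce an arbitrary value to a DataArray before assignment,
    raising a clear error if the DataArray constructor can't parse it."""
    if isinstance(value, xr.DataArray):
        # Pass DataArrays through unchanged.
        return value
    try:
        # Let the DataArray constructor decide what it can handle
        # (scalars, lists, numpy arrays, ...).
        return xr.DataArray(value)
    except Exception as err:
        raise TypeError(
            f"Type {type(value)} cannot be assigned to a DataTree"
        ) from err
```

With this in place, `dt["a"] = 0` would wrap the int into `xr.DataArray(0)` exactly as `dt["a"] = xr.DataArray(0)` does today.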
xarray.Dataset.to_netcdf() allows you to save a dataset into a netcdf file as a group, but to make a DataTree.to_netcdf() we need to save many groups into the same file. I'm not sure how to do this with xarray.Dataset.to_netcdf(), and xarray's backends code is pretty complicated. It could probably be done at a lower level with the netCDF4 python library, but I haven't really tried that yet, and it would be nice not to have to use it.
The hard bit is saving multiple groups to the same file; actually iterating over the groups is easy, just something like

def to_netcdf(dt, filepath):
    for node in dt.subtree_nodes:
        group_name = node.pathstr
        node.ds.to_netcdf(filepath, group=group_name)
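For the hard part, one plausible approach is to write the first group with mode="w" and append every subsequent group with mode="a"; this is an untested sketch reusing the attribute names above (.subtree_nodes, .pathstr, .ds are assumptions about the node API):

```python
def to_netcdf(dt, filepath):
    """Sketch: write every node's dataset into one file, one group per node,
    by appending after the first write."""
    first = True
    for node in dt.subtree_nodes:
        mode = "w" if first else "a"  # create the file once, then append groups
        node.ds.to_netcdf(filepath, group=node.pathstr, mode=mode)
        first = False
```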
Currently, trying to get the version of datatree interactively raises an AttributeError:
import datatree
datatree.__version__
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 datatree.__version__
AttributeError: module 'datatree' has no attribute '__version__'
That's weird, because we do have code to set the version number in __init__.py.
See: https://github.com/xarray-contrib/datatree/runs/7306127762?check_suite_focus=true
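For reference, a common pattern that guarantees the attribute always exists is to read the installed package metadata (this is an assumption about how it could be done, not a description of what datatree's __init__.py currently does):

```python
from importlib.metadata import PackageNotFoundError, version


def get_version(package="datatree"):
    """Sketch: resolve __version__ from installed package metadata,
    with a fallback so the attribute never goes missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        # e.g. running from a source checkout that was never pip-installed
        return "9999"
```

__init__.py could then simply do `__version__ = get_version()`.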
What would help me enormously with writing documentation would be a killer example datatree, which I could open and use to demonstrate all types of methods, just like the "air_temperature" example dataset used in the main xarray documentation.
To be as useful as possible, this example tree should hit a few criteria:
A really good inspiration is this pseudo-structure provided in pydata/xarray#4118:
This would hit all of the criteria above, if it actually existed somewhere I could find!
What I would like is for people who have more familiarity with real geo-science data products to help me make this killer example tree, or at least point me towards data that I might use.
If we have multiple good suggestions I could make several different examples, but I think I would prefer one really good one to multiple quite good ones. Any extras could end up being used in future example notebooks, though.
Question
Is there an easy/convenient way to set the compression level for all DataArrays in a datatree?
I've been using the solution here when saving xarray.Datasets, but it doesn't seem to map conveniently to datatree. Maybe I'm missing something?
Background
I often deal with multiple datasets, each on the order of 10 GB, and this fills up hard drives fast. Typically when I write xarray.Datasets to file, I like to use gzip compression with a level of around 5, which tends to reduce the file size by around 50%.
Thanks!
Edit: Or is it recommended practice to update the compression info for each DataArray? (The second answer in the link above.)
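In the meantime, one workaround is to build the nested encoding dict programmatically rather than by hand. A sketch, assuming nodes expose .path and .ds as in datatree's subtree iteration (the helper name is hypothetical):

```python
def tree_encoding(dt, comp=None):
    """Sketch: build a {group_path: {var: settings}} encoding dict that applies
    the same compression settings to every data variable in every node."""
    comp = comp or {"zlib": True, "complevel": 5}
    return {
        node.path: {var: dict(comp) for var in node.ds.data_vars}
        for node in dt.subtree
    }
```

The result could then be passed as `dt.to_netcdf(path, encoding=tree_encoding(dt))`, assuming DataTree.to_netcdf accepts per-group encodings.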
Following up on #26, it would be great to have Zarr support in datatree. The API would be the same as to_netcdf, and I think we can actually plug into xarray's open_dataset and its engine argument to get most of the way to open_datatree.
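By analogy with the to_netcdf loop sketched in #26, a first cut of the writer could simply put each node's dataset into the same store under its group path (an untested sketch; the .subtree, .has_data, .path and .ds attribute names are assumptions about the node API):

```python
def to_zarr(dt, store):
    """Sketch: write every data-carrying node of the tree into one zarr store,
    one zarr group per node."""
    for node in dt.subtree:
        if node.has_data:
            # mode="a" so successive groups accumulate in the same store
            node.ds.to_zarr(store, group=node.path, mode="a")
```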
Hello. I've recently been introduced to this repository. This is a feature that xarray definitely needs. Thanks for creating it.
Issue
I can create a datatree, save it to file using the h5netcdf engine, but then I cannot open it. I need to use the h5netcdf engine for my work because I deal with complex numbers.
Error
runfile('C:/Users/jwbrooks/python/untitled0.py', wdir='C:/Users/jwbrooks/python')
Reloaded modules: imp, _strptime, encodings.ascii, stringprep, encodings.idna, xmlrpc, xml.parsers, pyexpat, xml.parsers.expat, xmlrpc.client, socketserver, http.server, xmlrpc.server, plistlib, tarfile, xml.etree, xml.etree.ElementPath, _elementtree, xml.etree.ElementTree
Traceback (most recent call last):
File "C:\Users\jwbrooks\python\untitled0.py", line 44, in <module>
data2 = dt.open_datatree('example_data_1.hdf5',
File "C:\Users\jwbrooks\python\datatree\datatree\io.py", line 68, in open_datatree
return _open_datatree_netcdf(filename_or_obj, engine=engine, **kwargs)
File "C:\Users\jwbrooks\python\datatree\datatree\io.py", line 72, in _open_datatree_netcdf
ncDataset = _get_nc_dataset_class(kwargs.get("engine", None))
File "C:\Users\jwbrooks\python\datatree\datatree\io.py", line 36, in _get_nc_dataset_class
from h5netcdf import Dataset
ImportError: cannot import name 'Dataset' from 'h5netcdf' (C:\Users\jwbrooks\Environments\python38_spyder515\lib\site-packages\h5netcdf\__init__.py)
initial investigation
In your datatree\io.py file you use the command from h5netcdf import Dataset, but the h5netcdf library does not contain anything called Dataset (as far as I can tell, its netCDF4-compatible API lives in the h5netcdf.legacyapi submodule). I started digging around in the xarray library trying to figure out how they handle it, but quickly got lost.
code to reproduce the error
Below, I create fake data, convert it to a datatree, write it to file, and then try to read the file.
import numpy as np
import xarray as xr
import datatree as dt
## create example datatree
t1 = np.arange(0, 1e0, 1e-4)
t1 = xr.DataArray(t1, dims='t', coords=[t1], attrs={'units': 's', 'long_name': 'time'})
t2 = np.arange(0, 1e-1, 1e-6)
t2 = xr.DataArray(t2, dims='t', coords=[t2], attrs={'units': 's', 'long_name': 'time'})
a1 = xr.DataArray(np.random.rand(len(t1)), dims='t', coords=[t1], name='data_A1', attrs={'units': 'au', 'long_name': 'data A1'})
a2 = xr.DataArray(np.random.rand(len(t1)), dims='t', coords=[t1], name='data_A2', attrs={'units': 'au', 'long_name': 'data A2'})
a = xr.Dataset({a1.name: a1, a2.name: a2})
b1 = xr.DataArray(np.random.rand(len(t2)), dims='t', coords=[t2], name='data_B', attrs={'units': 'au', 'long_name': 'data B'})
b = b1.to_dataset()
c1 = xr.DataArray(np.random.rand(len(t2)), dims='t', coords=[t2], name='data_C1', attrs={'units': 'au', 'long_name': 'data C1'})
c2 = xr.DataArray(np.random.rand(len(t2)), dims='t', coords=[t2], name='data_C2', attrs={'units': 'au', 'long_name': 'data C2'})
c = xr.Dataset({c1.name: c1, c2.name: c2})
data = dt.DataTree.from_dict({'data1': a, 'data2/b': b, 'data2/c': c})
## write data to file
data.to_netcdf(
    'example_data_1.hdf5',
    mode='w',
    # format='NETCDF4',
    engine='h5netcdf',
    # encoding=encoding,
    invalid_netcdf=True,
)
## read data from file
data2 = dt.open_datatree('example_data_1.hdf5', engine='h5netcdf')
Details on my setup
virtualenv
Should we prefer inheritance or composition when making the node of a datatree behave like an xarray Dataset?
We really want the data-containing nodes of the datatree to behave as much like xarray datasets as possible, as we will likely be calling functions/methods on them, assigning them, extracting from them and saving them as if they were actually xarray.Dataset objects. We could imagine a tree node class which directly inherits from xarray.Dataset:

class DatasetNode(xarray.Dataset, NodeMixin):
    ...

This would have all the attributes and API of a Dataset, and pass isinstance() checks, but would also have the attributes and methods needed to function as a node in a tree (e.g. .children, .parent). We would still need to decorate most inherited methods in order to apply them to all child nodes in the tree, though.
Mostly these don't collide, except in the important case of getting/setting children of a node. xarray.Datasets already use __getitem__ for variable selection (i.e. ds[var]), as well as the .some_variable namespace via property-like access. This means we can't immediately have an API allowing operations like dt.weather = dt.weather.mean('time'), because .weather is a child of the node, not a dataset variable. (It's possible we could have both behaviours simultaneously by overriding __getitem__, but then we might restrict the possible names of children/variables.)
I think this approach would also have the side-effect that accessor methods registered with @register_dataset_accessor would be callable on the tree nodes.
The alternative is that instead of each node being a Dataset, each node merely wraps a Dataset. This has the advantage of keeping the Data class and the Node class separate, though they would still share a large API to allow applying a method (e.g. .mean()) to all child nodes in a tree.
The disadvantage is that all the variables and dataset attributes then sit behind a .ds property.
The syntax dt.weather = dt.weather.mean('time') would then be possible (at least if we didn't allow the tree objects to have their own .attrs, else it would have to be dt['weather'] = dt['weather'].mean('time')), because we would be calling the method of a DatasetNode (rather than a Dataset) and then assigning to a DatasetNode.
Selecting a particular variable from a dataset stored at a particular node would then look like dt['weather'].ds['pressure'], which has the advantage of clarifying which one is the variable, but the disadvantage of breaking up the path-like structure from the root down to the variable. EDIT: As there is no problem with collisions between names of groups and variables, we can actually just override __getitem__ to check both the data variables and the children, giving access like dt['weather']['pressure'].
(There is also a possible third option described in #4)
For now the second approach seemed better, but I'm looking for other opinions!
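A minimal sketch of the second (composition) approach, including the __getitem__ behaviour from the EDIT above. This is a plain-Python stand-in: the class name matches the discussion, but the internals are illustrative and no xarray machinery is included:

```python
class DatasetNode:
    """Sketch of composition: the node wraps a dataset rather than inheriting
    from it, and __getitem__ checks data variables first, then children."""

    def __init__(self, ds=None, children=None):
        self.ds = ds  # the wrapped dataset (any mapping-like object here)
        self.children = children or {}

    def __getitem__(self, key):
        # No collisions are allowed between variable names and child names,
        # so checking both namespaces is unambiguous.
        if self.ds is not None and key in self.ds:
            return self.ds[key]
        return self.children[key]
```

This is what makes access like `dt['weather']['pressure']` possible without a `.ds` in the middle.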
Inspired by this example in the stackstac documentation
lowcloud = stack[stack["eo:cloud_cover"] < 20]
we should ensure that you can index a datatree with another (isomorphic) datatree, so that the above operation would work even if stack were a DataTree instance.
This is another map_over_subtree-type operation, but it needs careful testing, because the __getitem__ method on xarray objects already does so many different things. It won't work with the code as-is, because at the moment DataTree naively dispatches the __getitem__ call down to the wrapped dataset.
Line 238 in cd06951
Opening this so I can close #90.
@jhamman I'm not sure exactly what you mean - dataless groups are represented by nodes with empty datasets, so this seems like reasonable code to me?
Are you saying you would prefer to get rid of the try/except and only iterate over paths that we know exist?
Originally posted by @jhamman in #90 (comment)
Xarray Datasets don't alter the names of the objects they store:
In [1]: import xarray as xr
In [2]: da = xr.DataArray(name="b")
In [3]: ds = xr.Dataset()
In [4]: ds['a'] = da
In [5]: ds["a"].name
Out[5]: 'a'
In [6]: da.name
Out[6]: 'b'
After #41 (and #115, which ensures it), DataTree objects behave similarly for the data variables they store:
In [7]: from datatree import DataTree
In [8]: root = DataTree()
In [9]: root["a"] = da
In [10]: root["a"].name
Out[10]: 'a'
In [11]: da.name
Out[11]: 'b'
However, DataTree objects currently do alter the name of child DataTree nodes that they store:
In [12]: subtree = DataTree(name="b")
In [13]: root = DataTree()
In [14]: root["a"] = subtree
In [15]: root["a"].name
Out[15]: 'a'
In [16]: subtree.name
Out[16]: 'a'
I noticed this in #115, but fixing it might be a bit complex.
I'm failing to get DataTree.to_netcdf() to compress data when saving. The structure of my DataTree dt looks like this:
dt.groups:
('/',
'/0',
'/0/lp',
'/0/hp',
'/1',
'/1/lp',
'/1/hp',
'/2',
'/2/lp',
'/2/hp',
'/3',
'/3/lp',
'/3/hp',
'/4',
'/4/lp',
'/4/hp',
'/5',
'/5/lp',
'/5/hp',
'/6',
'/6/lp',
'/6/hp',
'/7',
'/7/lp',
'/7/hp',
...
'/15/lp',
'/15/hp',
'/16',
'/16/lp',
'/16/hp')
The encoding dictionary looks like this:
{'0/lp': {'gain': {'zlib': True, 'complevel': 5},
'irl': {'zlib': True, 'complevel': 5},
'nf': {'zlib': True, 'complevel': 5},
'oip2': {'zlib': True, 'complevel': 5},
'oip3': {'zlib': True, 'complevel': 5},
'op1db': {'zlib': True, 'complevel': 5},
'opsat': {'zlib': True, 'complevel': 5},
'orl': {'zlib': True, 'complevel': 5},
'pmax': {'zlib': True, 'complevel': 5}},
'0/hp': {'gain': {'zlib': True, 'complevel': 5},
'irl': {'zlib': True, 'complevel': 5},
'nf': {'zlib': True, 'complevel': 5},
'oip2': {'zlib': True, 'complevel': 5},
'oip3': {'zlib': True, 'complevel': 5},
'op1db': {'zlib': True, 'complevel': 5},
'opsat': {'zlib': True, 'complevel': 5},
'orl': {'zlib': True, 'complevel': 5},
'pmax': {'zlib': True, 'complevel': 5}},
'1/lp': {'gain': {'zlib': True, 'complevel': 5},
'irl': {'zlib': True, 'complevel': 5},
'nf': {'zlib': True, 'complevel': 5},
'oip2': {'zlib': True, 'complevel': 5},
'oip3': {'zlib': True, 'complevel': 5},
'op1db': {'zlib': True, 'complevel': 5},
'opsat': {'zlib': True, 'complevel': 5},
...
'oip3': {'zlib': True, 'complevel': 5},
'op1db': {'zlib': True, 'complevel': 5},
'opsat': {'zlib': True, 'complevel': 5},
'orl': {'zlib': True, 'complevel': 5},
'pmax': {'zlib': True, 'complevel': 5}}}
Finally, saving to an HDF5/NetCDF file:
dt.to_netcdf("myhdf5file.hdf5", mode="w", encoding=encoding) # no compression: 3.33 MB
The file is the same size whether I pass the encoding dictionary or not.
We need some basic documentation, which needs to be set up from scratch.
It would be nice if the tree could support internal symbolic links between nodes.
anytree has a SymLinkNodeMixin class to complement its NodeMixin class and provide this sort of functionality. What I'm not sure about is whether this approach would allow for loops within the tree, or merely allow the same node to appear multiple times in it.
A tree-like structure must have nodes, each of which can contain multiple children, and those children have to be selectable via some kind of name. However, those names can either be inherent properties of the child objects, or keys used to access them.
In the former case we would have node.children = (child1, child2), where child1.name = 'steve', child2.name = 'mary', etc. In the latter case we would have node.children = {'steve': child1, 'mary': child2}, where each child need not have a name. It's not clear to me which of these approaches is better in our case.
It's easy to ensure that all nodes have names (and if we make nodes inherit from Dataset they will inherit a name), but storing children in tuples leads to annoying code like child_we_want = next(c for c in node.children if c.name == name_we_want), instead of just child_we_want = node[name_we_want]. A DataTree is also quite intuitively represented by a nested dictionary whose keys are parts of a path and whose values are either datasets or child nodes, and in that description we would not say that the name key is an inherent property of the value.
Using a dictionary also means that the path to an object is distinct from the name of that object.
This also means that a node doesn't need a name at all, and becomes defined only in terms of its parent and children. In effect, the name of the node would be the key for which self.parent.children[key] returns self. Parentless nodes would be nameless.
A disadvantage of this is that a stored Dataset object has no idea who its parent is.
None of the tree implementations I've seen work like this, and it appears to deviate from the way that a "tree" is defined mathematically.
The anytree library uses named nodes and tuples to store the children, so to use dictionaries we would need to reimplement the NodeMixin class to use a dictionary instead.
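A sketch of the dict-based design, in which a node's name is simply derived from the key its parent stores it under (illustrative names only; the real NodeMixin replacement would need much more):

```python
class TreeNode:
    """Sketch: children stored in a dict, so a node's name is just the key
    under which its parent holds it; parentless nodes are nameless."""

    def __init__(self):
        self.parent = None
        self.children = {}

    def add_child(self, key, child):
        child.parent = self
        self.children[key] = child

    @property
    def name(self):
        # The name is not stored on the node; it is looked up in the parent.
        if self.parent is None:
            return None
        for key, child in self.parent.children.items():
            if child is self:
                return key
```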
We want to support both path-like access (e.g. 'simulation/highres/temperature') and tag-like access (e.g. ('simulation', 'highres', 'temperature')) to nodes, but at the moment the implementation of this is incomplete and probably buggy. For example, right now you can't select a node further up the tree using '../', and there is no real .relative_path(target) method.
A better implementation might use Python's built-in pathlib. If our paths were all instances of PurePosixPath then they would always use forward slashes, but would have no actual connection to the filesystem. If we wanted to convert paths to tags and vice versa, we could probably do it neatly by subclassing PurePosixPath:
from pathlib import PurePosixPath

class NodePath(PurePosixPath):
    def as_tags(self):
        # note: for an absolute path the first element will be an empty string
        return tuple(str(self).split('/'))
It might only make sense to do this after #7 , so that we have full control over the method namespace.
xr.Dataset implements a bunch of dask-specific methods, such as __dask_tokenize__ and __dask_graph__. It also obviously has public methods that involve dask, such as .compute() and .load().
In DataTree, on the other hand, I haven't yet implemented any methods like these, or even written any tests that involve dask! You can probably still use dask with datatree right now, but from dask's perspective the datatree is presumably merely a set of unconnected Dataset objects.
We could choose to implement methods like .load() as just a mapping over the tree, i.e.
def load(self):
    for node in self.subtree:
        if node.has_data:
            node.ds.load()
but really this should probably wait for #41, or be done as part of that refactor.
I don't really understand what the double-underscore methods do yet, though, so I would appreciate input on that.
This tree has a dimension present in some nodes and not others (the "people" dimension).
DataTree('root', parent=None)
│   Dimensions:  (people: 2)
│   Coordinates:
│     * people   (people) <U5 'alice' 'bob'
│       species  <U5 'human'
│   Data variables:
│       heights  (people) float64 1.57 1.82
└── DataTree('simulation')
    ├── DataTree('coarse')
    │       Dimensions:  (x: 2, y: 3)
    │       Coordinates:
    │         * x        (x) int64 10 20
    │       Dimensions without coordinates: y
    │       Data variables:
    │           foo      (x, y) float64 0.1242 -0.2324 0.2469 0.5168 0.8391 0.8686
    │           bar      (x) int64 1 2
    │           baz      float64 3.142
    └── DataTree('fine')
            Dimensions:  (x: 6, y: 3)
            Coordinates:
              * x        (x) int64 10 12 14 16 18 20
            Dimensions without coordinates: y
            Data variables:
                foo      (x, y) float64 0.1242 -0.2324 0.2469 ... 0.5168 0.8391 0.8686
                bar      (x) float64 1.0 1.2 1.4 1.6 1.8 2.0
                baz      float64 3.142
If a user calls dt.mean(dim='people'), at the moment this will raise an error. That's because it maps the .mean call over each group, and when it gets to either the 'coarse' group or the 'fine' group it will not find a dimension called 'people'.
However, the user might want to take the mean only over the groups where this makes sense, and ignore the rest.
I think the best solution is a missing_dims argument, like xarray's .isel already has. Then the user could do dt.mean(dim='people', missing_dims='ignore').
To actually implement this I think only requires changes in xarray, not here, because those changes should propagate down to datatree. pydata/xarray#5030
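Until that lands upstream, the behaviour can be emulated by mapping over the tree manually. A sketch (the helper name is hypothetical, and .subtree/.path/.ds are assumed node attributes):

```python
def mean_over_tree(dt, dim, missing_dims="raise"):
    """Sketch of the proposed missing_dims='ignore' behaviour: apply .mean to
    each node's dataset, skipping nodes that lack the dimension."""
    results = {}
    for node in dt.subtree:
        if dim in node.ds.dims:
            results[node.path] = node.ds.mean(dim=dim)
        elif missing_dims == "raise":
            raise ValueError(f"Dimension {dim!r} not found in node {node.path!r}")
        else:
            # 'ignore': pass the dataset through unchanged
            results[node.path] = node.ds
    return results
```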
xarray.Dataset has a ginormous API, and eventually the entire thing should also be available on DataTree. However, we probably still need to test this copied API, because the majority of it will be altered via the @map_over_subtree decorator. How can we do that without also copying thousands of lines of tests from xarray.tests?
Raises AttributeError right now.
It would be awesome to have a rich html repr that can show off both the data in each node and the nested structure of the nodes in the whole tree.
Ideally it would be collapsible at each level, else the amount of information could get overwhelming.
We would want to combine xarray's html repr with some method of displaying the tree, (see xgcm/xgcm#474 for a similar problem).
I have never done much with html so would appreciate help with this feature.
Datatree needs some documentation, even if it has to change in future.
I think most of the documentation would remain relevant even after some changes, as long as we keep the same basic data model (e.g. DataTree vs DataGroups, with no hierarchy).
I really like this breakdown for documentation, which theorizes that there are 4 types of documentation, along two axes, as shown in this diagram:
Another thing to consider is how the documentation we write now might eventually be incorporated into xarray's documentation upstream. We don't need to duplicate anything, and we want things we write to neatly slot into sections in xarray's existing documentation.
Some ideas:
(Some of these could possibly go in xarray's Gallery section)
(A lot of this could be grouped under one page on "Working with hierarchical data".)
This should be pretty much covered by ensuring that the auto-generated API docs work properly. The hard bit will be copying/duplicating the large API of xarray.Dataset that DataTree inherits.
In #76 I refactored the tree structure to use a path-like syntax. This includes referring to the root of a tree as "/", the same as cd / in a unix-like filesystem.
This makes accessing nodes and variables of nodes quite neat, because you can reference nodes via absolute or relative paths:
In [23]: from datatree.tests.test_datatree import create_test_datatree
In [24]: dt = create_test_datatree()
In [25]: dt['set2/a']
Out[25]:
<xarray.DataArray 'a' (x: 2)>
array([2, 3])
Dimensions without coordinates: x
In [26]: dt['/set2/a']
Out[26]:
<xarray.DataArray 'a' (x: 2)>
array([2, 3])
Dimensions without coordinates: x
In [27]: dt['./set2/a']
Out[27]:
<xarray.DataArray 'a' (x: 2)>
array([2, 3])
Dimensions without coordinates: x
This refactor also made DataTree objects only optionally have a name, as opposed to before, when they were required to have one. (They still have a .name attribute; it just can be None.)
In [28]: dt.name
Normally this doesn't matter, because once a node is assigned a .parent, its .name property will just point to the key under which it is stored as a child. This echoes the way an unnamed DataArray can be stored in a Dataset.
In [29]: import xarray as xr
In [30]: ds = xr.Dataset()
In [31]: da = xr.DataArray(0)
In [32]: ds['foo'] = da
In [33]: ds['foo'].name
Out[33]: 'foo'
However, this means that the root node of a tree is no longer required to have a name in general.
This is good because:
- As a user you normally don't care about the name of the root when manipulating the tree, only the names of the nodes,
- It makes the __init__ signature simpler, as name is no longer a required arg,
- It most closely echoes how filepaths work (the filesystem root "/" doesn't have another name),
- Roundtripping through Zarr/netCDF files still seems to work (see test_io.py),
- Roundtripping through dictionaries still works if the root node is unnamed:
In [35]: d = {node.path: node.ds for node in dt.subtree}
In [36]: roundtrip = DataTree.from_dict(d)
In [37]: roundtrip
Out[37]:
DataTree('None', parent=None)
│   Dimensions:  (y: 3, x: 2)
│   Dimensions without coordinates: y, x
│   Data variables:
│       a        (y) int64 6 7 8
│       set0     (x) int64 9 10
├── DataTree('set1')
│   │   Dimensions:  ()
│   │   Data variables:
│   │       a        int64 0
│   │       b        int64 1
│   ├── DataTree('set1')
│   └── DataTree('set2')
├── DataTree('set2')
│   │   Dimensions:  (x: 2)
│   │   Dimensions without coordinates: x
│   │   Data variables:
│   │       a        (x) int64 2 3
│   │       b        (x) float64 0.1 0.2
│   └── DataTree('set1')
└── DataTree('set3')
In [38]: dt.equals(roundtrip)
Out[38]: True
But it's bad because:
- Roundtripping through dictionaries no longer works if the root node is named:
In [39]: dt2 = dt
In [40]: dt2.name = "root"
In [41]: d2 = {node.path: node.ds for node in dt2.subtree}
In [42]: roundtrip2 = DataTree.from_dict(d2)
In [43]: roundtrip2
Out[43]:
DataTree('None', parent=None)
│   Dimensions:  (y: 3, x: 2)
│   Dimensions without coordinates: y, x
│   Data variables:
│       a        (y) int64 6 7 8
│       set0     (x) int64 9 10
├── DataTree('set1')
│   │   Dimensions:  ()
│   │   Data variables:
│   │       a        int64 0
│   │       b        int64 1
│   ├── DataTree('set1')
│   └── DataTree('set2')
├── DataTree('set2')
│   │   Dimensions:  (x: 2)
│   │   Dimensions without coordinates: x
│   │   Data variables:
│   │       a        (x) int64 2 3
│   │       b        (x) float64 0.1 0.2
│   └── DataTree('set1')
└── DataTree('set3')
In [44]: dt2.equals(roundtrip2)
Out[44]: False
The signature of DataTree.from_dict becomes a bit weird, because if you want to name the root node, the only way to do it is to pass a separate name argument, i.e.
In [45]: dt3 = DataTree.from_dict(d, name='root')
In [46]: dt3
Out[46]:
DataTree('root', parent=None)
├── DataTree('set1')
│   │   Dimensions:  ()
│   │   Data variables:
│   │       a        int64 0
│   │       b        int64 1
│   ├── DataTree('set1')
│   └── DataTree('set2')
├── DataTree('set2')
│   │   Dimensions:  (x: 2)
│   │   Dimensions without coordinates: x
│   │   Data variables:
│   │       a        (x) int64 2 3
│   │       b        (x) float64 0.1 0.2
│   └── DataTree('set1')
└── DataTree('set3')
What do we think about this behaviour? Does this seem like a good design, or annoyingly finicky?
@jhamman I notice that in the code you wrote for the io you put a note about not being able to specify a root group for the tree. Is that related to this question? Do you have any other thoughts on this?
Currently, calling a numpy ufunc on a DataTree (e.g. np.sin(dt)) fails with a recursion error (below).
That's probably because I just naively included Dataset.__array_ufunc__ in the list of methods to decorate with @map_over_subtree, and it needs a smarter definition.
________________________________________TestUFuncs.test_root___________________________________________
self = <test_dataset_api.TestUFuncs object at 0x7f486a542100>
def test_root(self):
da = xr.DataArray(name="a", data=[1, 2, 3])
dt = DataNode("root", data=da)
expected_ds = np.sin(da.to_dataset())
> result_ds = np.sin(dt).ds
datatree/tests/test_dataset_api.py:179:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
datatree/datatree.py:92: in _map_over_subtree
out_tree.ds = func(out_tree.ds, *args, **kwargs)
../xarray/xarray/core/arithmetic.py:79: in __array_ufunc__
return apply_ufunc(
../xarray/xarray/core/computation.py:1184: in apply_ufunc
return apply_array_ufunc(func, *args, dask=dask)
../xarray/xarray/core/computation.py:811: in apply_array_ufunc
return func(*args)
datatree/datatree.py:92: in _map_over_subtree
out_tree.ds = func(out_tree.ds, *args, **kwargs)
../xarray/xarray/core/arithmetic.py:79: in __array_ufunc__
return apply_ufunc(
../xarray/xarray/core/computation.py:1184: in apply_ufunc
return apply_array_ufunc(func, *args, dask=dask)
../xarray/xarray/core/computation.py:811: in apply_array_ufunc
return func(*args)
datatree/datatree.py:92: in _map_over_subtree
out_tree.ds = func(out_tree.ds, *args, **kwargs)
datatree/datatree.py:92: in _map_over_subtree
out_tree.ds = func(out_tree.ds, *args, **kwargs)
../xarray/xarray/core/arithmetic.py:79: in __array_ufunc__
return apply_ufunc(
../xarray/xarray/core/computation.py:1184: in apply_ufunc
return apply_array_ufunc(func, *args, dask=dask)
../xarray/xarray/core/computation.py:793: in apply_array_ufunc
if any(is_duck_dask_array(arg) for arg in args):
../xarray/xarray/core/computation.py:793: in <genexpr>
if any(is_duck_dask_array(arg) for arg in args):
../xarray/xarray/core/pycompat.py:47: in is_duck_dask_array
if DuckArrayModule("dask").available:
../xarray/xarray/core/pycompat.py:21: in __init__
duck_array_module = import_module(mod)
/home/tom/miniconda3/envs/py38-mamba/lib/python3.8/importlib/__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
<frozen importlib._bootstrap>:1011: in _gcd_import
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
name = 'dask', package = None, level = 0
> ???
E RecursionError: maximum recursion depth exceeded while calling a Python object
<frozen importlib._bootstrap>:939: RecursionError
!!! Recursion error detected, but an error occurred locating the origin of recursion.
The following exception happened when comparing locals in the stack frame:
ValueError: dimensions ('dim_0',) must have the same length as the number of data dimensions, ndim=0
Displaying first and last 10 stack frames out of 767.
We want to be able to load all groups from a netCDF file as nodes in a DataTree, simply via
dt = open_datatree('some_data.nc')
To populate this tree we need to know the structure of all the groups in the netCDF file, but currently in xarray the open_dataset
function makes it pretty difficult to see this information. Instead it only allows you to open one group (whose name you have to know in advance!) via the group
keyword argument.
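One way around this limitation would be to first walk the file's group hierarchy ourselves, then call open_dataset once per group. A minimal sketch of the discovery step, assuming the netCDF4-python convention that Dataset and Group objects expose a .groups mapping of name -> subgroup (list_all_groups itself is a hypothetical helper, not part of datatree):

```python
from typing import List

def list_all_groups(nc_group, path: str = "/") -> List[str]:
    """Recursively collect the paths of every group in an open netCDF file.

    Assumes `nc_group` exposes a `.groups` mapping of name -> subgroup, as
    netCDF4-python does. Each returned path could then be passed to
    `xr.open_dataset(filename, group=path)` to populate one node of the tree.
    """
    paths = [path]
    for name, subgroup in nc_group.groups.items():
        child_path = path.rstrip("/") + "/" + name
        paths.extend(list_all_groups(subgroup, child_path))
    return paths
```

With something like this, open_datatree would no longer need the user to know any group names in advance.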
Having come back to this project for the first time in a while, I want to propose a design for the .ds
property that I think will solve a few problems at once.
Accessing data specific to a node via .ds
is intuitive for our tree design, and if the .ds
object has a dataset-like API it also neatly solves the ambiguity of whether a method should act on one node or the whole tree.
But returning an actual xr.Dataset as we do now causes a couple of problems:
1. .ds.__setitem__ causes consistency headaches,
2. there is no way to refer to variables stored in other nodes (e.g. via .ds['../common_var']), which limits the usefulness of map_over_subtree and of the tree concept in general.
After we refactor DataTree to store Variable objects under ._variables directly, instead of storing a Dataset object under ._ds, .ds will have to reconstruct a dataset from these private attributes anyway.
I propose that rather than constructing a Dataset
object we instead construct and return a NodeDataView
object, which has mostly the same API as xr.Dataset
, but with a couple of key differences:
No mutation allowed (for now)
Whilst it could be nice if e.g. dt.ds[var] = da
actually updated dt
, that is really finicky, and for now at least it's probably fine to just forbid it, and point users towards dt[var] = da
instead.
Allow path-like access to DataArray
objects stored in other nodes via __getitem__
One of the primary motivations of a tree is to allow computations on leaf nodes to refer to common variables stored further up the tree. For instance imagine I have heterogeneous datasets but I want to refer to a common "reference_pressure":
print(dt)
DataTree(parent=None)
│ Dimensions: ()
│ Coordinates:
│ * reference_pressure () 100.0
├── DataTree('observations')
│ └── DataTree('satellite')
│ Dimensions: (x: 6, y: 3)
│ Dimensions without coordinates: x, y
│ Data variables:
│ p (x, y) float64 0.2337 -1.755 0.5762 ... -0.4241 0.7463 -0.4298
└── DataTree('simulations')
├── DataTree('coarse')
│ Dimensions: (x: 2, y: 3)
│ Coordinates:
│ * x (x) int64 10 20
│ Dimensions without coordinates: y
│ Data variables:
│ p (x, y) float64 0.2337 -1.755 0.5762 -0.4241 0.7463 -0.4298
└── DataTree('fine')
Dimensions: (x: 6, y: 3)
Coordinates:
* x (x) int64 10 12 14 16 18 20
Dimensions without coordinates: y
Data variables:
p (x, y) float64 0.2337 -1.755 0.5762 ... -0.4241 0.7463 -0.4298
I have a function which accepts and consumes datasets:
def normalize_pressure(ds):
ds['p'] = ds['p'] / ds['/reference_pressure']
return ds
then map it over the tree:
result_tree = dt.map_over_subtree(normalize_pressure)
If we allowed path-like access to data in other nodes from .ds
then this would work because map_over_subtree
applies normalize_pressure
to the .ds
attribute of every node, and '/reference_pressure'
means "look for the variable 'reference_pressure'
in the root node of the tree".
(In this case referring to the reference pressure with ds['../../reference_pressure']
would also have worked.)
(PPS if we chose to support the CF conventions' upwards proximity search behaviour then ds['reference_pressure']
would have worked too, because then __getitem__
would search upwards through the nodes for the first with a variable matching the desired name.)
A simple implementation could then just subclass xr.Dataset
:
class NodeDataView(Dataset):
    _wrapping_node: DataTree

    @classmethod
    def _construct_direct(cls, wrapping_node, variables, coord_names, ...) -> NodeDataView:
        ...

    def __setitem__(self, key, val):
        raise AttributeError(
            "Mutation is not allowed, please use __setitem__ on the wrapping DataTree node, "
            "or use `DataTree.to_dataset()` if you want a mutable dataset"
        )

    def __getitem__(self, key) -> DataArray:
        # calling the `_get_item` method of DataTree allows path-like access to contents of other nodes
        obj = self._wrapping_node._get_item(key)
        if isinstance(obj, DataArray):
            return obj
        else:
            raise KeyError("NodeDataView is only allowed to return variables, not entire tree nodes")

    # all API that doesn't modify state in-place can just be inherited from Dataset
    ...


class DataTree:
    _variables: Mapping[Hashable, Variable]
    _coord_names: set[Hashable]
    ...

    @property
    def ds(self) -> NodeDataView:
        return NodeDataView._construct_direct(self, self._variables, self._coord_names, ...)

    def to_dataset(self) -> Dataset:
        return Dataset._construct_direct(self._variables, self._coord_names, ...)
    ...
If we don't like subclassing Dataset
then we could cook up something similar using getattr
instead.
(This idea is probably what @shoyer, @jhamman and others were already thinking but I'm mostly just writing it out here for my own benefit.)
For some reason taking a .mean()
over part of the tree obtained by opening this netCDF file creates a new tree with the wrong structure.
Bug looks like this:
from datatree import open_datatree
dt = open_datatree('../../Code/datatree/datafiles/epm044463.nc')
print(dt)
DataTree('root')
│ Dimensions: ()
│ Data variables:
│ *empty*
│ Attributes: (12/15)
│ generator: IDAM PutData VERSION 1 (Apr 20 2021)
│ Conventions: Fusion-1.1
│ class: analysed data
│ title: EFIT++ Equilbrium Reconstruction epm
│ date: 2021-07-20
│ time: 13:51:58
│ ... ...
│ comment: magnetics
│ runcmd: efit++ -p 44463 -t mastu --pass_number 0 --runmod...
│ executableGitCommitId: c9c55b35e653f64af97db6f4a06e0181898b7a9b
│ executableGitRepo: https://git.ccfe.ac.uk/EFIT_PP/efitpp.git
│ runScriptsGitCommitId: 922855671166ea5b8aac2c95858cb9cedf5795d7
│ runScriptsGitRepo: https://git.ccfe.ac.uk/EFIT_PP/efitpp.git
└── DataTree('epm')
│ Dimensions: (time: 56)
│ Coordinates:
│ * time (time) timedelta64[ns] 00:00:00.020000 ... 00:0...
│ Data variables:
│ equilibriumStatusInteger (time) int32 ...
│ Attributes:
│ badChi2Flag: 1
│ badChi2MagneticProbe: b_nl2_p05,b_nu1_n08,b_nu1_p09
│ badChi2MagneticProbeId: [ 69 75 239]
├── DataTree('input')
│ │ Dimensions: (time: 56)
│ │ Dimensions without coordinates: time
│ │ Data variables:
│ │ bVacRadiusProduct (time) float64 ...
│ ├── DataTree('constraints')
│ │ ├── DataTree('pfCircuits')
│ │ │ Dimensions: (pfCircuitsDim: 64, time: 56)
│ │ │ Dimensions without coordinates: pfCircuitsDim, time
│ │ │ Data variables:
│ │ │ shortName (pfCircuitsDim) |S13 ...
│ │ │ id (pfCircuitsDim) int32 ...
│ │ │ computed (time, pfCircuitsDim) float64 ...
│ │ │ weights (time, pfCircuitsDim) float64 ...
│ │ │ target (time, pfCircuitsDim) float64 ...
│ │ │ sigmas (time, pfCircuitsDim) float64 ...
│ │ │ timeSliceSource (pfCircuitsDim) int8 ...
│ │ ├── DataTree('plasmaCurrent')
│ │ │ Dimensions: (time: 56)
│ │ │ Dimensions without coordinates: time
│ │ │ Data variables:
│ │ │ weights (time) float64 ...
│ │ │ computed (time) float64 ...
│ │ │ sigma (time) float64 ...
│ │ │ target (time) float64 ...
│ │ ├── DataTree('fluxLoops')
│ │ │ Dimensions: (fluxLoopDim: 102, fluxLoopElementDim: 1, time: 56)
│ │ │ Dimensions without coordinates: fluxLoopDim, fluxLoopElementDim, time
│ │ │ Data variables:
│ │ │ shortName (fluxLoopDim) |S8 ...
│ │ │ toroidalAngleEnd (fluxLoopDim, fluxLoopElementDim) float64 ...
│ │ │ zValues (fluxLoopDim, fluxLoopElementDim) float64 ...
│ │ │ id (fluxLoopDim) int32 ...
│ │ │ computed (time, fluxLoopDim) float64 ...
│ │ │ weights (time, fluxLoopDim) float64 ...
│ │ │ target (time, fluxLoopDim) float64 ...
│ │ │ rValues (fluxLoopDim, fluxLoopElementDim) float64 ...
│ │ │ sigmas (time, fluxLoopDim) float64 ...
│ │ │ toroidalAngleBegin (fluxLoopDim, fluxLoopElementDim) float64 ...
│ │ ├── DataTree('magneticProbes')
│ │ │ Dimensions: (magneticProbeDim: 354, time: 56)
│ │ │ Dimensions without coordinates: magneticProbeDim, time
│ │ │ Data variables: (12/14)
│ │ │ shortName (magneticProbeDim) |S9 ...
│ │ │ axialLength (magneticProbeDim) float64 ...
│ │ │ poloidalOrientation (magneticProbeDim) float64 ...
│ │ │ rCentre (magneticProbeDim) float64 ...
│ │ │ toroidalSector (magneticProbeDim) int32 ...
│ │ │ area (magneticProbeDim) float64 ...
│ │ │ ... ...
│ │ │ turnCount (magneticProbeDim) float64 ...
│ │ │ computed (time, magneticProbeDim) float64 ...
│ │ │ weights (time, magneticProbeDim) float64 ...
│ │ │ target (time, magneticProbeDim) float64 ...
│ │ │ sigmas (time, magneticProbeDim) float64 ...
│ │ │ zCentre (magneticProbeDim) float64 ...
│ │ ├── DataTree('diamagneticFlux')
│ │ │ Dimensions: (time: 56)
│ │ │ Dimensions without coordinates: time
│ │ │ Data variables:
│ │ │ weights (time) float64 ...
│ │ │ computed (time) float64 ...
│ │ │ sigma (time) float64 ...
│ │ │ target (time) float64 ...
│ │ └── DataTree('q0')
│ │ Dimensions: (time: 56)
│ │ Dimensions without coordinates: time
│ │ Data variables:
│ │ weights (time) float64 ...
│ │ computed (time) float64 ...
│ │ sigma (time) float64 ...
│ │ target (time) float64 ...
│ ├── DataTree('limiter')
│ │ Dimensions: (limiterCoord: 91, unityDim: 1)
│ │ Dimensions without coordinates: limiterCoord, unityDim
│ │ Data variables:
│ │ rValues (limiterCoord) float64 ...
│ │ zValues (limiterCoord) float64 ...
│ │ id (unityDim) int32 ...
│ ├── DataTree('numericalControls')
│ │ │ Dimensions: (time: 56)
│ │ │ Dimensions without coordinates: time
│ │ │ Data variables: (12/14)
│ │ │ relaxFactor (time) float64 ...
│ │ │ mxiter (time) int32 ...
│ │ │ shiftPressureWhenNegative (time) int32 ...
│ │ │ saimin (time) float64 ...
│ │ │ suppressRevCurr (time) int32 ...
│ │ │ SOLCurrentsOption (time) int32 ...
│ │ │ ... ...
│ │ │ initialZCoord (time) float64 ...
│ │ │ fieldLineResol (time) float64 ...
│ │ │ scalea (time) int32 ...
│ │ │ useBin (time) int32 ...
│ │ │ conditionFactor (time) float64 ...
│ │ │ error (time) float64 ...
│ │ ├── DataTree('pp')
│ │ │ Dimensions: (time: 56, r: 65)
│ │ │ Dimensions without coordinates: time, r
│ │ │ Data variables:
│ │ │ knt (time, r) float64 ...
│ │ │ bdry (time, r) float64 ...
│ │ │ ndeg (time) int32 ...
│ │ │ func (time) int32 ...
│ │ │ bdry2 (time, r) float64 ...
│ │ │ tens (time) float64 ...
│ │ │ kbdry2 (time, r) int32 ...
│ │ │ minPsin (time) float64 ...
│ │ │ maxPsin (time) float64 ...
│ │ │ kbdry (time, r) int32 ...
│ │ │ kknt (time) int32 ...
│ │ │ edge (time) int32 ...
│ │ ├── DataTree('ffp')
│ │ │ Dimensions: (time: 56, z: 65)
│ │ │ Dimensions without coordinates: time, z
│ │ │ Data variables:
│ │ │ knt (time, z) float64 ...
│ │ │ bdry (time, z) float64 ...
│ │ │ ndeg (time) int32 ...
│ │ │ func (time) int32 ...
│ │ │ bdry2 (time, z) float64 ...
│ │ │ tens (time) float64 ...
│ │ │ kbdry2 (time, z) int32 ...
│ │ │ minPsin (time) float64 ...
│ │ │ maxPsin (time) float64 ...
│ │ │ kbdry (time, z) int32 ...
│ │ │ kknt (time) int32 ...
│ │ │ edge (time) int32 ...
│ │ ├── DataTree('ne')
│ │ │ Dimensions: (time: 56, radialCoord: 65)
│ │ │ Dimensions without coordinates: time, radialCoord
│ │ │ Data variables:
│ │ │ knt (time, radialCoord) float64 ...
│ │ │ bdry (time, radialCoord) float64 ...
│ │ │ ndeg (time) int32 ...
│ │ │ func (time) int32 ...
│ │ │ bdry2 (time, radialCoord) float64 ...
│ │ │ tens (time) float64 ...
│ │ │ kbdry2 (time, radialCoord) int32 ...
│ │ │ minPsin (time) float64 ...
│ │ │ maxPsin (time) float64 ...
│ │ │ kbdry (time, radialCoord) int32 ...
│ │ │ kknt (time) int32 ...
│ │ │ edge (time) int32 ...
│ │ └── DataTree('ww')
│ │ Dimensions: (time: 56, normalizedPoloidalFlux: 65)
│ │ Dimensions without coordinates: time, normalizedPoloidalFlux
│ │ Data variables:
│ │ knt (time, normalizedPoloidalFlux) float64 ...
│ │ bdry (time, normalizedPoloidalFlux) float64 ...
│ │ ndeg (time) int32 ...
│ │ func (time) int32 ...
│ │ bdry2 (time, normalizedPoloidalFlux) float64 ...
│ │ tens (time) float64 ...
│ │ kbdry2 (time, normalizedPoloidalFlux) int32 ...
│ │ minPsin (time) float64 ...
│ │ maxPsin (time) float64 ...
│ │ kbdry (time, normalizedPoloidalFlux) int32 ...
│ │ kknt (time) int32 ...
│ │ edge (time) int32 ...
│ └── DataTree('pfSystem')
│ ├── DataTree('pfCoils')
│ │ Dimensions: (pfCoilElements: 1566)
│ │ Dimensions without coordinates: pfCoilElements
│ │ Data variables:
│ │ coilId (pfCoilElements) int32 ...
│ │ circuitId (pfCoilElements) int64 ...
│ │ turnCount (pfCoilElements) float64 ...
│ │ rCentre (pfCoilElements) float64 ...
│ │ zCentre (pfCoilElements) float64 ...
│ │ dR (pfCoilElements) float64 ...
│ │ dZ (pfCoilElements) float64 ...
│ │ angle1 (pfCoilElements) float64 ...
│ │ angle2 (pfCoilElements) float64 ...
│ └── DataTree('passiveStructures')
│ Dimensions: (passiveStructureElements: 70)
│ Dimensions without coordinates: passiveStructureElements
│ Data variables:
│ coilId (passiveStructureElements) int32 ...
│ circuitId (passiveStructureElements) int32 ...
│ turnCount (passiveStructureElements) float64 ...
│ rCentre (passiveStructureElements) float64 ...
│ zCentre (passiveStructureElements) float64 ...
│ dR (passiveStructureElements) float64 ...
│ dZ (passiveStructureElements) float64 ...
│ angle1 (passiveStructureElements) float64 ...
│ angle2 (passiveStructureElements) float64 ...
├── DataTree('regularGrid')
│ Dimensions: (unityDim: 1)
│ Dimensions without coordinates: unityDim
│ Data variables:
│ rMin (unityDim) float64 ...
│ zMax (unityDim) float64 ...
│ zMin (unityDim) float64 ...
│ rMax (unityDim) float64 ...
│ nz (unityDim) int32 ...
│ nr (unityDim) int32 ...
└── DataTree('output')
├── DataTree('profiles2D')
│ Dimensions: (r: 65, z: 65, time: 56)
│ Coordinates:
│ * r (r) float64 0.06 0.09031 0.1206 0.1509 ... 1.939 1.97 2.0
│ * z (z) float64 -2.2 -2.131 -2.062 -1.994 ... 2.062 2.131 2.2
│ Dimensions without coordinates: time
│ Data variables:
│ jphi (time, r, z) float64 ...
│ Bphi (time, r, z) float64 ...
│ Bpol (time, r, z) float64 ...
│ Br (time, r, z) float64 ...
│ poloidalFlux (time, r, z) float64 ...
│ Bz (time, r, z) float64 ...
│ psiNorm (time, r, z) float64 ...
├── DataTree('globalParameters')
│ │ Dimensions: (time: 56)
│ │ Dimensions without coordinates: time
│ │ Data variables: (12/32)
│ │ rt (time) float64 ...
│ │ q2Radius (time) float64 ...
│ │ plasmaEnergy (time) float64 ...
│ │ s3 (time) float64 ...
│ │ btorVacuumEnergy (time) float64 ...
│ │ q0 (time) float64 ...
│ │ ... ...
│ │ btorEnergy (time) float64 ...
│ │ plasmaVolume (time) float64 ...
│ │ psiBoundary (time) float64 ...
│ │ li1 (time) float64 ...
│ │ li2 (time) float64 ...
│ │ li3 (time) float64 ...
│ ├── DataTree('currentCentroid')
│ │ Dimensions: (time: 56)
│ │ Dimensions without coordinates: time
│ │ Data variables:
│ │ R (time) float64 ...
│ │ Z (time) float64 ...
│ └── DataTree('magneticAxis')
│ Dimensions: (time: 56)
│ Dimensions without coordinates: time
│ Data variables:
│ R (time) float64 ...
│ Z (time) float64 ...
├── DataTree('radialProfiles')
│ Dimensions: (time: 56, radialCoord: 65)
│ Dimensions without coordinates: time, radialCoord
│ Data variables: (12/17)
│ rotationalPressure (time, radialCoord) float64 ...
│ Bt (time, radialCoord) float64 ...
│ q (time, radialCoord) float64 ...
│ Br (time, radialCoord) float64 ...
│ poloidalArea (time, radialCoord) float64 ...
│ ffPrime (time, radialCoord) float64 ...
│ ... ...
│ jphi (time, radialCoord) float64 ...
│ staticPressure (time, radialCoord) float64 ...
│ r (time, radialCoord) float64 ...
│ Bz (time, radialCoord) float64 ...
│ plasmaVolume (time, radialCoord) float64 ...
│ staticPPrime (time, radialCoord) float64 ...
├── DataTree('fluxFunctionProfiles')
│ Dimensions: (normalizedPoloidalFlux: 65, time: 56)
│ Coordinates:
│ * normalizedPoloidalFlux (normalizedPoloidalFlux) float64 0.0 0.01562 ... 1.0
│ Dimensions without coordinates: time
│ Data variables: (12/25)
│ rotationalPressure (time, normalizedPoloidalFlux) float64 ...
│ gOutside (time, normalizedPoloidalFlux) float64 ...
│ javrg (time, normalizedPoloidalFlux) float64 ...
│ IOutside (time, normalizedPoloidalFlux) float64 ...
│ q (time, normalizedPoloidalFlux) float64 ...
│ poloidalFlux (time, normalizedPoloidalFlux) float64 ...
│ ... ...
│ plasmaFluxVolume (time, normalizedPoloidalFlux) float64 ...
│ elongation (time, normalizedPoloidalFlux) float64 ...
│ staticPPrime (time, normalizedPoloidalFlux) float64 ...
│ lowerTriangularity (time, normalizedPoloidalFlux) float64 ...
│ rInboard (time, normalizedPoloidalFlux) float64 ...
│ iota (time, normalizedPoloidalFlux) float64 ...
├── DataTree('separatrixGeometry')
│ Dimensions: (time: 56, strikepointDim: 4, xpointDim: 2, boundaryCoordsDim: 361)
│ Dimensions without coordinates: time, strikepointDim, xpointDim, boundaryCoordsDim
│ Data variables: (12/33)
│ dndXpointCount (time) float64 ...
│ dndXpoint1InnerStrikepointR (time) float64 ...
│ limiterZ (time) float64 ...
│ dndXpoint2InnerStrikepointZ (time) float64 ...
│ dndXpoint2OuterStrikepointZ (time) float64 ...
│ strikepointR (time, strikepointDim) float64 ...
│ ... ...
│ dndXpoint2OuterStrikepointR (time) float64 ...
│ rBoundary (time, boundaryCoordsDim) float64 ...
│ rmidplaneIn (time) float64 ...
│ dndXpoint2InnerStrikepointR (time) float64 ...
│ zBoundary (time, boundaryCoordsDim) float64 ...
│ drsepIn (time) float64 ...
├── DataTree('numericalDetails')
│ Dimensions: (time: 56, maximumIterationCount: 30)
│ Dimensions without coordinates: time, maximumIterationCount
│ Data variables:
│ chiSquared (time, maximumIterationCount) float64 ...
│ iterationCount (time) int32 ...
│ poloidalFluxError (time, maximumIterationCount) float64 ...
│ finalChiSquared (time) float64 ...
│ finalPoloidalFluxError (time) float64 ...
└── DataTree('degreesOfFreedom')
Dimensions: (time: 56, pfCircuitDim: 64, fFPrimeDim: 2, pPrimeDim: 2)
Dimensions without coordinates: time, pfCircuitDim, fFPrimeDim, pPrimeDim
Data variables: (12/14)
freePfCurrents (time, pfCircuitDim) int32 ...
countAllPfCurrent (time) int32 ...
offsetPressure (time) float64 ...
offsetRotationalPressure (time) float64 ...
count (time) int32 ...
ffprimeCoeffs (time, fFPrimeDim) float64 ...
... ...
pprimeCoeffs (time, pPrimeDim) float64 ...
pfCurrents (time, pfCircuitDim) float64 ...
countWPrime (time) int32 ...
countFreePfCurrent (time) int32 ...
cdelz (time) float64 ...
countFFPrime (time) int32 ...
constraints = dt['epm/input/constraints']
print(constraints)
DataTree('constraints')
├── DataTree('pfCircuits')
│ Dimensions: (pfCircuitsDim: 64, time: 56)
│ Dimensions without coordinates: pfCircuitsDim, time
│ Data variables:
│ shortName (pfCircuitsDim) |S13 ...
│ id (pfCircuitsDim) int32 ...
│ computed (time, pfCircuitsDim) float64 ...
│ weights (time, pfCircuitsDim) float64 ...
│ target (time, pfCircuitsDim) float64 ...
│ sigmas (time, pfCircuitsDim) float64 ...
│ timeSliceSource (pfCircuitsDim) int8 ...
├── DataTree('plasmaCurrent')
│ Dimensions: (time: 56)
│ Dimensions without coordinates: time
│ Data variables:
│ weights (time) float64 ...
│ computed (time) float64 ...
│ sigma (time) float64 ...
│ target (time) float64 ...
├── DataTree('fluxLoops')
│ Dimensions: (fluxLoopDim: 102, fluxLoopElementDim: 1, time: 56)
│ Dimensions without coordinates: fluxLoopDim, fluxLoopElementDim, time
│ Data variables:
│ shortName (fluxLoopDim) |S8 ...
│ toroidalAngleEnd (fluxLoopDim, fluxLoopElementDim) float64 ...
│ zValues (fluxLoopDim, fluxLoopElementDim) float64 ...
│ id (fluxLoopDim) int32 ...
│ computed (time, fluxLoopDim) float64 ...
│ weights (time, fluxLoopDim) float64 ...
│ target (time, fluxLoopDim) float64 ...
│ rValues (fluxLoopDim, fluxLoopElementDim) float64 ...
│ sigmas (time, fluxLoopDim) float64 ...
│ toroidalAngleBegin (fluxLoopDim, fluxLoopElementDim) float64 ...
├── DataTree('magneticProbes')
│ Dimensions: (magneticProbeDim: 354, time: 56)
│ Dimensions without coordinates: magneticProbeDim, time
│ Data variables: (12/14)
│ shortName (magneticProbeDim) |S9 ...
│ axialLength (magneticProbeDim) float64 ...
│ poloidalOrientation (magneticProbeDim) float64 ...
│ rCentre (magneticProbeDim) float64 ...
│ toroidalSector (magneticProbeDim) int32 ...
│ area (magneticProbeDim) float64 ...
│ ... ...
│ turnCount (magneticProbeDim) float64 ...
│ computed (time, magneticProbeDim) float64 ...
│ weights (time, magneticProbeDim) float64 ...
│ target (time, magneticProbeDim) float64 ...
│ sigmas (time, magneticProbeDim) float64 ...
│ zCentre (magneticProbeDim) float64 ...
├── DataTree('diamagneticFlux')
│ Dimensions: (time: 56)
│ Dimensions without coordinates: time
│ Data variables:
│ weights (time) float64 ...
│ computed (time) float64 ...
│ sigma (time) float64 ...
│ target (time) float64 ...
└── DataTree('q0')
Dimensions: (time: 56)
Dimensions without coordinates: time
Data variables:
weights (time) float64 ...
computed (time) float64 ...
sigma (time) float64 ...
target (time) float64 ...
avg_constraints = constraints.mean(dim='time')
print(avg_constraints)
DataTree('constraints')
└── DataTree('root')
    └── DataTree('epm')
        └── DataTree('input')
            └── DataTree('constraints')
├── DataTree('pfCircuits')
│ Dimensions: (pfCircuitsDim: 64)
│ Dimensions without coordinates: pfCircuitsDim
│ Data variables:
│ shortName (pfCircuitsDim) |S13 b'p1' b'pc' ... b'VS6U' b'VS7U'
│ id (pfCircuitsDim) int32 1 2 3 4 5 6 7 ... 15 16 17 18 19 20
│ computed (pfCircuitsDim) float64 3.387e+03 7.398e-06 ... 9.465e+03
│ weights (pfCircuitsDim) float64 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0
│ target (pfCircuitsDim) float64 3.193e+03 0.0 ... 1.112e+04
│ sigmas (pfCircuitsDim) float64 254.5 1.0 ... 5.422e+03 5.246e+03
│ timeSliceSource (pfCircuitsDim) int8 2 2 2 2 2 2 2 2 2 ... 3 3 3 3 3 3 3 3
├── DataTree('plasmaCurrent')
│ Dimensions: ()
│ Data variables:
│ weights float64 1.0
│ computed float64 5.564e+05
│ sigma float64 1.084e+05
│ target float64 5.418e+05
├── DataTree('fluxLoops')
│ Dimensions: (fluxLoopDim: 102, fluxLoopElementDim: 1)
│ Dimensions without coordinates: fluxLoopDim, fluxLoopElementDim
│ Data variables:
│ shortName (fluxLoopDim) |S8 b'f_nu_01' b'f_nu_02' ... b'f_c_b12'
│ toroidalAngleEnd (fluxLoopDim, fluxLoopElementDim) float64 6.283 ... 6...
│ zValues (fluxLoopDim, fluxLoopElementDim) float64 1.358 ... -...
│ id (fluxLoopDim) int32 1 2 3 4 5 6 ... 97 98 99 100 101 102
│ computed (fluxLoopDim) float64 -0.08992 -0.08818 ... 0.05828
│ weights (fluxLoopDim) float64 1.0 1.0 1.0 1.0 ... 1.0 0.0 1.0
│ target (fluxLoopDim) float64 -0.09163 -0.09075 ... 0.0 0.05939
│ rValues (fluxLoopDim, fluxLoopElementDim) float64 0.901 ... 0...
│ sigmas (fluxLoopDim) float64 0.02 0.02 0.02 ... 0.02 0.02 0.02
│ toroidalAngleBegin (fluxLoopDim, fluxLoopElementDim) float64 0.0 ... 0.0
├── DataTree('magneticProbes')
│ Dimensions: (magneticProbeDim: 354)
│ Dimensions without coordinates: magneticProbeDim
│ Data variables: (12/14)
│ shortName (magneticProbeDim) |S9 b'b_bl1_n02' ... b'b_xu2_p36'
│ axialLength (magneticProbeDim) float64 0.0 0.0 0.0 ... 0.0 0.0 0.0
│ poloidalOrientation (magneticProbeDim) float64 1.614 1.614 ... 5.501 5.501
│ rCentre (magneticProbeDim) float64 1.1 1.175 ... 1.681 1.735
│ toroidalSector (magneticProbeDim) int32 0 0 0 0 0 0 0 ... 0 0 0 0 0 0
│ area (magneticProbeDim) float64 0.0 0.0 0.0 ... 0.0 0.0 0.0
│ ... ...
│ turnCount (magneticProbeDim) float64 0.0 0.0 0.0 ... 0.0 0.0 0.0
│ computed (magneticProbeDim) float64 -0.03454 ... -0.02538
│ weights (magneticProbeDim) float64 1.0 1.0 1.0 ... 0.0 1.0 1.0
│ target (magneticProbeDim) float64 -0.03664 ... -0.02521
│ sigmas (magneticProbeDim) float64 0.015 0.015 ... 0.015 0.015
│ zCentre (magneticProbeDim) float64 -1.569 -1.566 ... 1.749
├── DataTree('diamagneticFlux')
│ Dimensions: ()
│ Data variables:
│ weights float64 0.0
│ computed float64 -0.1356
│ sigma float64 0.0
│ target float64 0.0
└── DataTree('q0')
Dimensions: ()
Data variables:
weights float64 0.0
computed float64 -2.497
sigma float64 0.0
target float64 0.0
It will probably be easier to debug this and prevent it happening again after we have an assert_tree_equal
function like in #27.
Currently we have an open_datatree
function which opens a single netcdf file (or zarr store). We could imagine an open_mfdatatree
function which is analogous to open_mfdataset
, which can open multiple files at once.
As DataTree
has a structure essentially the same as that of a filesystem, I'm imagining a use case where the user has a bunch of data files stored in nested directories, e.g.
project/
    experimental/
        data.nc
    simulation/
        highres/
            output.nc
        lowres/
            output.nc
We could look through all of these folders recursively, open any files found of the correct format, and store them in a single tree.
We could even allow for multiple data files in each folder if we called open_mfdataset
on all the files found in each folder.
EDIT: We could also save a tree out to multiple folders like this using a save_mfdatatree
method.
This might be particularly useful for users who want the benefit of a tree-like structure but are using a file format that doesn't support groups.
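The discovery half of such an open_mfdatatree could be sketched like this (collect_datafile_paths is a hypothetical helper: each file list found would then be opened with xr.open_dataset, or xr.open_mfdataset when a folder holds several files, and stored at the corresponding node path):

```python
import os

def collect_datafile_paths(root_dir: str, suffix: str = ".nc") -> dict:
    """Map tree node paths to lists of data file paths by walking a
    directory hierarchy, mirroring the filesystem layout in the tree.
    """
    mapping = {}
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        rel = os.path.relpath(dirpath, root_dir)
        # "/" for the root directory, "/simulation/highres" for nested ones
        node_path = "/" if rel == "." else "/" + rel.replace(os.sep, "/")
        matches = sorted(f for f in filenames if f.endswith(suffix))
        if matches:
            mapping[node_path] = [os.path.join(dirpath, f) for f in matches]
    return mapping
```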
In xarray it's possible to automatically close a dataset after opening by opening it using a context manager. From the documentation:
Datasets have a Dataset.close() method to close the associated netCDF file. However, itโs often cleaner to use a with statement:
# this automatically closes the dataset after use
In [5]: with xr.open_dataset("saved_on_disk.nc") as ds:
   ...:     print(ds.keys())
   ...:
We currently don't have a DataTree.close()
method, or any context manager behaviour for open_datatree
. To add them presumably we would need to iterate over all file handles (i.e. groups) and close them one by one.
So far we've only really implemented dictionary-like __getitem__/__setitem__ syntax, but we should add a variety of other ways to select nodes from a tree too. Here are some suggestions:
class DataTree:
...
def __getitem__(self, key: str) -> DataTree | DataArray:
"""
Accepts node/variable names, or file-like paths to nodes/variables (inc. '../var').
(Also needs to accommodate indexing somehow.)
"""
...
def subset(self, keys: Sequence[str]) -> DataTree:
"""
Return new tree containing only nodes with names matching keys.
(Could probably be combined with `__getitem__`.
Also unsure what the return type should be.)
"""
...
@property
def subtree(self) -> Iterator[DataTree]:
"""An iterator over all nodes in this tree, including both self and all descendants."""
...
def filter(self, filterfunc: Callable) -> Iterator[DataTree]:
"""Filters subtree by returning only nodes for which `filterfunc(node)` is True."""
...
Are there other types of access that we're missing here? Filtering by regex match? Getting nodes where at least one part of the path matches ("tag-like" access)? Glob?
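The subtree and filter pieces of the sketch above are easy to prototype; here is one way they could fit together, assuming only that nodes expose an anytree-style `children` iterable (iter_subtree and filter_tree are hypothetical names, not datatree's API):

```python
from typing import Callable, Iterator

def iter_subtree(node) -> Iterator:
    """Depth-first iterator over a node and all of its descendants."""
    yield node
    for child in getattr(node, "children", ()):
        yield from iter_subtree(child)

def filter_tree(root, filterfunc: Callable) -> Iterator:
    """Yield only the nodes for which `filterfunc(node)` is True."""
    return (node for node in iter_subtree(root) if filterfunc(node))
```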
There were no release notes for v0.0.8 and v0.0.9. I bumped the current version in #123 but I assume there is more content missing?
Currently I have exposed the properties of the dataset which a node (optionally) wraps, so that a user can do
dt = DataNode(data=xr.Dataset())
dt.dims # returns .dims property of underlying dataset
However as @jbusecke pointed out this behaviour is inconsistent with the way that all other dataset methods are dispatched over all datasets in the tree - this forwarded property only looks at the Dataset in the node it was called on, ignoring all child nodes.
I see 4 ways to think about this:
1. Keep forwarding the properties from the node's own dataset. This is consistent with accessing .ds first, but is also possibly counterintuitive.
2. Collect the property from every node instead, e.g. dims_on_all_nodes = {node.pathstr: node.ds.dims for node in dt.subtree}.
3. Remove the forwarded properties entirely, requiring users to go through dt.ds.dims to access the properties. This would reinforce the distinction between a DataTree and a Dataset, even though you could still call a lot of the dataset API on the datatree object (e.g. dt.mean()). (This would also make #17 obsolete.)
4. Make DataTree inherit directly from Dataset, as discussed in #2. This would effectively be the same API as (1) though, just with automatic inheritance of the properties instead of manual wrapping of them.
I'm now leaning quite heavily towards (3).
It would be nice to have plotting methods that allow for easily comparing variables in different parts of a tree.
One simple way would be to have a dt.plot(variable, **kwargs)
method which uses a legend to distinguish between variables, i.e.
import matplotlib.pyplot as plt

def plot(self, variable, **kwargs):
    fig, ax = plt.subplots()
    for node in self.subtree:
        da = node[variable]
        da.plot(ax=ax, label=node.name, **kwargs)
    ax.legend()
    ax.set_title(variable)
    plt.show()
The use cases for this would be multi-resolution datasets, or multi-model ensembles, where the user wants to see how a single variable varies across different datasets.
Care would have to be taken to ensure that incompatible dimensions or coordinates were handled smoothly.
The tests shouldn't fail in a local environment that doesn't have zarr/netcdf, it should just skip those tests.
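With pytest this is typically done with pytest.importorskip or a pytest.mark.skipif decorator; the same effect with only the standard library looks like this (has_module and RoundtripTests are illustrative names, not from datatree's test suite):

```python
import importlib.util
import unittest

def has_module(name: str) -> bool:
    """True if the optional dependency is importable in this environment."""
    return importlib.util.find_spec(name) is not None

class RoundtripTests(unittest.TestCase):
    # skipped (not failed) in environments without the zarr backend
    @unittest.skipUnless(has_module("zarr"), "requires zarr")
    def test_roundtrip_zarr(self):
        ...  # would write a DataTree to zarr and read it back
```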
@andersy005 and @malmans2 whilst we (you) managed to fix the release version on pypi I think we might still be stuck in the past on conda-forge?
by prepending repeated symbol(s) (dot, cross, etc.) to those names so that it informs about their hierarchical level.
Makes sense benbovy - that is basically how anytree achieves this:
DataTree('root', parent=None)
│ Dimensions: (y: 3, x: 2)
│ Dimensions without coordinates: y, x
│ Data variables:
│ a (y) int64 6 7 8
│ set0 (x) int64 9 10
├── DataTree('set1')
│ │ Dimensions: ()
│ │ Data variables:
│ │ a int64 0
│ │ b int64 1
│ ├── DataTree('set1')
│ └── DataTree('set2')
├── DataTree('set2')
│ │ Dimensions: (x: 2)
│ │ Dimensions without coordinates: x
│ │ Data variables:
│ │ a (x) int64 2 3
│ │ b (x) float64 0.1 0.2
│ └── DataTree('set1')
└── DataTree('set3')
I imagine I could use similar logic to anytree's RenderTree, which just uses 3 different characters (vertical=│, continuation=├──, end=└──).
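The rendering logic with those three glyphs is short enough to sketch in full (render and the simple node class are illustrative, not anytree's actual implementation):

```python
def render(node, prefix: str = "", lines=None) -> list:
    """Render a tree as a list of lines using the three anytree-style
    glyphs. Assumes each node has `name` and `children` attributes.
    """
    if lines is None:
        lines = [str(node.name)]
    children = list(node.children)
    for i, child in enumerate(children):
        last = i == len(children) - 1
        connector = "└── " if last else "├── "
        lines.append(prefix + connector + str(child.name))
        # descendants of a non-last child carry a vertical bar at this level
        extension = "    " if last else "│   "
        render(child, prefix + extension, lines)
    return lines
```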
Originally posted by @TomNicholas in #78 (comment)
Hi @TomNicholas !
I am one of the core devs of satpy (https://github.com/pytroll/satpy), which makes use of xarray/dask to handle satellite data for earth-observing satellites.
In this context, we often have satellite data with different resolutions for the same dataset, so xarray's Dataset can't really be used for these data, as the coords of the different variables don't match, and DataTree makes a lot of sense for us.
The satellite data is more often than not in some binary format; we read it and convert it to xarray.DataArrays, and I've now started experimenting with placing them in a DataTree by hand.
So it would be really nice if there was an interface for adding custom engines to read that data (multiple files). Did you already consider that? Do you maybe already have an idea on how this would work?
We have been wanting to stick closer to the data model of xarray in our library, and datatree looks like something we could really use :) let's hope we can contribute here, at least with ideas in the future.
Currently a DatasetNode
is both a node of the tree (so can have children) and can wrap a single xarray.Dataset
. If we followed #3 , we could instead choose to make TreeNodes unable to wrap Datasets directly, in favour of instead storing Dataset objects as children.
This would mean that multiple Datasets could be stored as the children of a single node, but I'm not sure if that's desirable or not. It would also ensure that the class representing a node of the tree is totally distinct from an xarray.Dataset (neither inheriting from xarray.Dataset nor wrapping it, only pointing to it as a child).
In conjunction with #3 this would mean that the syntax for selecting the variable 'pressure'
from the dataset stored under the node 'weather'
would be simply dt['weather']['pressure']
. (Although then selecting via dt['weather/pressure']
would become trickier.)
I realised that it is currently possible to get a tree into a state which (a) cannot be represented as a netCDF file, and (b) means __getitem__
becomes ambiguous.
See this example:
In [3]: dt = DataNode('root', data=xr.Dataset({'a': [0], 'b': 1}))
In [4]: child = DataNode('a', data=None, parent=dt)
In [5]: print(dt)
DataNode('root')
│ Dimensions: (a: 1)
│ Coordinates:
│ * a (a) int64 0
│ Data variables:
│ b int64 1
└── DataNode('a')
In [6]: dt['a']
Out[6]:
<xarray.DataArray 'a' (a: 1)>
array([0])
Coordinates:
* a (a) int64 0
In [7]: dt.get_node('a')
Out[7]: DataNode(name='a', parent='root', children=[],data=None)
Here `print(dt)` shows that `dt` is in a form forbidden by netCDF, because we have a child node and a variable with the same name (equivalent to having a group and a variable with the same name at the same level in netCDF).
Furthermore, when choosing an item via `DataTree.__getitem__` it merrily picks out the DataArray, even though the situation is ambiguous and I might have intended to pick out the child node `'a'` instead.
The node is still accessible via `.get_node`, but only because `.get_node` is inherited from `TreeNode`, which has no concept of data variables.
Contrast this silent collision of variable and child names with what happens if you try to assign two children with the same name:
In [8]: child = DataNode('a', data=None, parent=dt)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-8-8c4a956bcdd5> in <module>
----> 1 child = DataNode('a', data=None, parent=dt)
~/Documents/Work/Code/datatree/datatree/datatree.py in _init_single_datatree_node(cls, name, data, parent, children)
162 # This approach was inspired by xarray.Dataset._construct_direct()
163 obj = object.__new__(cls)
--> 164 obj = _init_single_treenode(obj, name=name, parent=parent, children=children)
165 obj.ds = data
166 return obj
~/Documents/Work/Code/datatree/datatree/treenode.py in _init_single_treenode(obj, name, parent, children)
13 obj.name = name
14
---> 15 obj.parent = parent
16 if children:
17 obj.children = children
~/Documents/Work/Code/anytree/anytree/node/nodemixin.py in parent(self, value)
133 self.__check_loop(value)
134 self.__detach(parent)
--> 135 self.__attach(value)
136
137 def __check_loop(self, node):
~/Documents/Work/Code/anytree/anytree/node/nodemixin.py in __attach(self, parent)
157 def __attach(self, parent):
158 if parent is not None:
--> 159 self._pre_attach(parent)
160 parentchildren = parent.__children_or_empty
161 assert not any(child is self for child in parentchildren), "Tree is corrupt." # pragma: no cover
~/Documents/Work/Code/datatree/datatree/treenode.py in _pre_attach(self, parent)
84 """
85 if self.name in list(c.name for c in parent.children):
---> 86 raise KeyError(
87 f"parent {parent.name} already has a child named {self.name}"
88 )
KeyError: 'parent root already has a child named a'
To prevent this we need better checks on assignment between variables and children. For example, `TreeNode.set_node(key, new_child)` currently checks for any existing children with name `key`, but it also needs to check for any variables in the dataset with name `key`. (That's not too hard to implement; it could be done by overloading `set_node` on `DataTree` to check against variables as well as children, for example.)
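That overload might look roughly like this. A hedged sketch only: `set_node(key, node)` and the `ds_variables` set are stand-ins for the real `TreeNode` API and the wrapped Dataset's variable names:

```python
class TreeNode:
    """Minimal stand-in for the real TreeNode (hypothetical)."""

    def __init__(self):
        self.children = {}

    def set_node(self, key, new_child):
        # Existing check: no two children may share a name
        if key in self.children:
            raise KeyError(f"parent already has a child named {key!r}")
        self.children[key] = new_child


class DataTree(TreeNode):
    def __init__(self, ds_variables=()):
        super().__init__()
        # Stand-in for the variable names of the wrapped xarray.Dataset
        self.ds_variables = set(ds_variables)

    def set_node(self, key, new_child):
        # Extra check: reject names that collide with data variables,
        # so the netCDF-forbidden state in the example above cannot arise
        if key in self.ds_variables:
            raise KeyError(
                f"cannot add child {key!r}: a variable of that name exists"
            )
        super().set_node(key, new_child)
```

With `dt = DataTree(ds_variables={"a"})`, `dt.set_node("a", ...)` would now raise, while `dt.set_node("c", ...)` succeeds.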
What is more difficult is if a child with name `key` exists, but the user tries to assign a variable with name `key` to the wrapped dataset. If the user does this via `node.ds.assign(key=new_da)` then that's manageable: in that case `assign()` has a return value, which they need to assign back to the node via `node.ds = node.ds.assign(key=new_da)`. We could check for name conflicts with children in the `.ds` property setter method.
However, if the user adds a variable via `node.ds[key] = new_da` then I think `node.ds` will be updated in place without its wrapping `DataTree` class ever having a chance to intervene. A similar issue with `node[key] = new_da` is preventable by improving the checking in `DataTree.__setitem__`, but I don't know how we can prevent this happening when all that is being called is `Dataset.__setitem__`.
I don't really know what to do about this, other than having a much more complicated class design which is no longer simple composition. Any ideas @dcherian maybe?
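For the manageable route, a check in the `.ds` property setter could look roughly like this. A minimal sketch with hypothetical names; the dict of variables stands in for an `xarray.Dataset`:

```python
class Node:
    """Hypothetical node: ``children`` is a set of child names, ``ds`` a
    dict of variable names to values (standing in for an xarray.Dataset)."""

    def __init__(self, children=()):
        self.children = set(children)
        self._ds = {}

    @property
    def ds(self):
        return self._ds

    @ds.setter
    def ds(self, new_ds):
        # Reject an assignment whose variable names collide with child names,
        # catching the node.ds = node.ds.assign(...) route described above
        clashes = set(new_ds) & self.children
        if clashes:
            raise KeyError(
                f"variables {sorted(clashes)} clash with existing child nodes"
            )
        self._ds = new_ds
```

Note that this still cannot intercept an in-place `node.ds[key] = new_da`, which is exactly the hard case described above.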
Currently I'm adding the `xarray.Dataset` methods to `DataTree` via a pattern basically like this:

```python
_DATASET_API_TO_COPY = ['isel', '__add__', ...]

class DatasetAPIMixin:
    def _add_dataset_api(self):
        for method_name in _DATASET_API_TO_COPY:
            ds_method = getattr(xarray.Dataset, method_name)
            # Decorate method so that when called it acts over the whole subtree
            mapped_method = map_over_subtree(ds_method)
            setattr(self, method_name, mapped_method)

class DataTree(DatasetAPIMixin):
    def __init__(self, *args):
        self._add_dataset_api()
```
The idea was that the use of mixins would echo how these methods were defined on `xarray.Dataset` originally, and also keep a distinction between methods that are actually unique to `DataTree` objects (such as `.groups`) and methods that are merely copied over from `xarray.Dataset`, like `.isel` (albeit with modifications such as mapping over child nodes).
I like my mixin idea, but one weird thing about this pattern is that the Dataset methods are only added to the `DataTree` once a `dt` instance is instantiated, not when the `DataTree` class is defined. I don't know if this is likely to cause problems, but at the very least it seems inefficient, because we run the code to loop through and attach all these methods every single time we create a new `DataTree` object. It's also not really an example of class inheritance right now: the mixins aren't actually doing anything other than being a different place to put the definition of `_add_dataset_api()`.
What would be better is if the dataset methods were added at class definition time rather than at object instantiation time, and ideally fully defined on the mixin before it is inherited. Then we wouldn't need to call any `_add_dataset_api()` method on the `dt` instance, because the methods would already be there.
The only way I can think of to do this within the class definitions is using a metaclass.
I could also set the attributes outside of the mixin definition but before the definition of `DataTree`, like this:

```python
class DatasetAPIMixin:
    pass

for method_name in _DATASET_API_TO_COPY:
    ds_method = getattr(xarray.Dataset, method_name)
    # Decorate method so that when called it acts over the whole subtree
    mapped_method = map_over_subtree(ds_method)
    setattr(DatasetAPIMixin, method_name, mapped_method)

class DataTree(DatasetAPIMixin):
    ...
```

I feel like I'm missing something, but this is the best I could come up with.
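A lighter alternative to a full metaclass could be Python's `__init_subclass__` hook, which runs once when each subclass is defined. A sketch with a stand-in method table and a stand-in `map_over_subtree` (both assumptions, not datatree's real API):

```python
def map_over_subtree(func):
    """Stand-in for the real decorator: just wraps the function here."""
    def mapped(self, *args, **kwargs):
        # The real version would apply func to every node's dataset
        return func(self, *args, **kwargs)
    return mapped


def upper_name(self):
    return self.name.upper()


# Stand-in for the list of xarray.Dataset methods to copy over
_METHODS_TO_COPY = {"upper_name": upper_name}


class DatasetAPIMixin:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once, at class definition time, for each subclass
        for method_name, method in _METHODS_TO_COPY.items():
            setattr(cls, method_name, map_over_subtree(method))


class DataTree(DatasetAPIMixin):
    def __init__(self, name):
        self.name = name
```

The attachment loop then runs exactly once, when `class DataTree(DatasetAPIMixin)` is executed, rather than on every instantiation.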
```python
a05["natre"] = datatree.DataTree.from_dict({"natre": xarray_dataset})["natre"]
```

Is this right?
I realised that part of the reason that arithmetic (#24) and ufuncs (#25) don't yet work is that the `map_over_subtree` decorator currently only maps over a single subtree.
This works fine for mapping unary functions such as `.isel`, because they only accept one tree-like argument (i.e. `self` for the `.isel` method). However, for any binary function such as `add(dt1, dt2)`, pairs of corresponding nodes in each tree need to be operated on together, as `result_ds = add(dt1[node].ds, dt2[node].ds)`, before the output tree is built up from the results.
In the most general case we need to be able to map functions like

```python
def func(*args, **kwargs):
    # do stuff involving multiple Dataset objects
    return output_trees
```

where any number of the args and kwargs could be DataTrees, and `output_trees` could be a list of any number of DataTrees.
To implement this the `map_over_subtree` decorator has to become a lot more general. It needs to:

1. check which of the `args` and `kwargs` are DataTree objects,
2. pass their node datasets to `func` as Datasets, without losing their position in `*args` and `**kwargs`,
3. use the outputs of `func` to rebuild M DataTree objects (which all have the same structure as the input trees), and return them.

We therefore have to decide what we mean by "isomorphic". The strictest definition would be that all node names are the same, so that
dt_1:
DataNode('foo')
| Data A
+---DataNode('bar')
+ Data B
could be mapped alongside
dt_2:
DataNode('foo')
| Data C
+---DataNode('bar')
+ Data D
but not alongside
dt_3:
DataNode('baz')
| Data C
+---DataNode('woz')
+ Data D
A more lenient definition would be that each node must have the same number of children as its counterpart in the other tree. (In other words, the tree structure must be the same, but the node names need not be. This requires the children to be ordered, to avoid ambiguities.) This definition would allow `dt_3` to be mapped over alongside `dt_1` or `dt_2` (or both simultaneously, for a `func` that accepts 3 Dataset arguments).
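Either definition can be checked with a simple recursive walk. A sketch over a hypothetical node class with an ordered list of children (not datatree's actual classes):

```python
class Node:
    """Hypothetical node with a name and an ordered list of children."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def isomorphic(a, b, require_names=False):
    # Strict mode additionally requires matching node names
    if require_names and a.name != b.name:
        return False
    if len(a.children) != len(b.children):
        return False
    # Children are ordered, so compare them pairwise
    return all(
        isomorphic(ca, cb, require_names) for ca, cb in zip(a.children, b.children)
    )


dt_1 = Node("foo", [Node("bar")])
dt_3 = Node("baz", [Node("woz")])
assert isomorphic(dt_1, dt_3)                          # lenient: same structure
assert not isomorphic(dt_1, dt_3, require_names=True)  # strict: names differ
```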
There is a bug in the `_check_loop` method, which meant that a tree containing a cycle could potentially be created; that should never be allowed.
I think this bug was introduced in #76, which made this check datatree's responsibility instead of anytree's.
Originally posted by @TomNicholas in #103 (comment)
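A correct loop check can simply walk the proposed parent's ancestry before attaching. A sketch of the idea (hypothetical names, not the actual datatree or anytree code):

```python
class TreeNode:
    """Minimal node with a settable parent (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self._parent = None

    def _check_loop(self, new_parent):
        # Walk up from the proposed parent; if we ever meet self,
        # attaching would create a cycle
        node = new_parent
        while node is not None:
            if node is self:
                raise ValueError(
                    f"setting parent would create a cycle on {self.name!r}"
                )
            node = node._parent

    def set_parent(self, new_parent):
        self._check_loop(new_parent)
        self._parent = new_parent
```

Because the walk always terminates at a `None` parent in an acyclic tree, the check is O(depth) and rejects any attachment that would make a node its own ancestor.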