We need some programmatic way to create datasets, to update their metadata and to delete them. Currently people need to manage this manually by writing TOML, which clearly isn't great.
API musings
One possibility is to overload the dataset() function itself with the ability to create a dataset. For example, adding a create=true flag:
dataset("SomeData", create=true, tags=[...], description="some desc", other_args...)
dataset(project, "SomeData", create=true, tags=[...], description="some desc", other_args...)
Another idea would be to pass a verb along as a positional argument, such as
dataset("SomeData", :create; description="some desc", other_args...)
dataset("SomeData", :delete)
dataset("SomeData", :update, description="new desc")
with :read being the default verb. This allows us to reuse the exported dataset() function for all dataset-related CRUD operations.
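To make the verb idea concrete, here is a rough sketch of how one entry point could forward to per-verb methods. Everything in it is hypothetical: toy_dataset, _toy_op and the returned tuples are placeholders for illustration, not anything DataSets currently provides.

```julia
# Hypothetical sketch: a single entry point that forwards to per-verb methods.
# The verb defaults to :read, as suggested above.
toy_dataset(name::AbstractString, verb::Symbol=:read; kws...) =
    _toy_op(Val(verb), name; kws...)

# Each verb gets its own method; the return values are placeholders only.
_toy_op(::Val{:read},   name; kws...) = (:read, name)
_toy_op(::Val{:create}, name; kws...) = (:create, name, Dict{Symbol,Any}(kws))
_toy_op(::Val{:update}, name; kws...) = (:update, name, Dict{Symbol,Any}(kws))
_toy_op(::Val{:delete}, name; kws...) = (:delete, name)

toy_dataset("SomeData")                                  # read by default
toy_dataset("SomeData", :update; description="new desc")
```

An unknown verb falls through to a MethodError here, which is probably less friendly than a dedicated error message would be.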
But let's be honest, this is a little weird, and its only real benefit is being economical with exported names. Perhaps I've been doing too much REST recently :-) A better alternative would probably be to just have a function per operation:
DataSets.create("SomeData"; description="some desc", other_args...)
DataSets.delete("SomeData")
DataSets.update("SomeData", description="new desc")
update() is a bit of an odd one out among these operations: what if you wanted to delete some metadata? I guess we could pass something like description=nothing for deleting metadata items.
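As a minimal sketch of the function-per-operation shape, including the "nothing means delete this key" convention, consider the toy below. It assumes an in-memory stand-in for a data project; ToyProject, create!, update! and delete_dataset! are hypothetical names for illustration and not the DataSets API.

```julia
# Hypothetical in-memory stand-in for a data project: dataset name => metadata.
struct ToyProject
    datasets::Dict{String,Dict{String,Any}}
end
ToyProject() = ToyProject(Dict{String,Dict{String,Any}}())

# Create a dataset entry from keyword metadata; refuse to overwrite.
function create!(proj::ToyProject, name::AbstractString; metadata...)
    haskey(proj.datasets, name) && error("dataset $name already exists")
    proj.datasets[name] = Dict{String,Any}(String(k) => v for (k, v) in metadata)
    return proj.datasets[name]
end

# Update metadata; passing `key=nothing` removes that metadata item.
function update!(proj::ToyProject, name::AbstractString; metadata...)
    conf = proj.datasets[name]          # throws KeyError if the dataset is missing
    for (k, v) in metadata
        if v === nothing
            delete!(conf, String(k))
        else
            conf[String(k)] = v
        end
    end
    return conf
end

# Remove the dataset entry from the project.
function delete_dataset!(proj::ToyProject, name::AbstractString)
    delete!(proj.datasets, name)
    return proj
end

proj = ToyProject()
create!(proj, "SomeData"; description="some desc", tags=["a", "b"])
update!(proj, "SomeData"; description="new desc", tags=nothing)  # drops `tags`
```

One consequence of this convention is that nothing can no longer be stored as a metadata value, which may or may not matter in practice.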
Which data project?
When creating a dataset it needs to be created within "some" data project. Presumably this would be the topmost project in the data project stack, or within a provided project if the project is supplied as the first argument.
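Roughly, the project selection could look like the sketch below. NamedProject, ProjectStack and target_project are made-up stand-ins, and treating index 1 as the topmost project is an assumption of the sketch, not a statement about how the real stack is ordered.

```julia
# Hypothetical stand-ins; these are not the DataSets types.
struct NamedProject
    name::String
end

struct ProjectStack
    projects::Vector{NamedProject}   # assumption: index 1 is the topmost project
end

# No project given: fall back to the topmost project in the stack.
function target_project(stack::ProjectStack)
    isempty(stack.projects) && error("no data project available to create the dataset in")
    return first(stack.projects)
end

# Project supplied explicitly: use it as-is and ignore the stack.
target_project(::ProjectStack, proj::NamedProject) = proj

stack = ProjectStack([NamedProject("active"), NamedProject("global")])
target_project(stack)                           # the topmost project
target_project(stack, NamedProject("scratch"))  # the supplied project
```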
Data ownership
Creation, and especially deletion, brings up an additional problem: how do we distinguish between data which is "owned" by a data project (so that the data itself should be deleted when the dataset is removed from the project) and data which is merely linked to?
For existing data referenced on the filesystem this is particularly relevant. We don't want DataSets.delete() to delete somebody's existing data which they're referring to. But neither do we want DataSets.delete() to leave unwanted data lying around.
I think we should have an extra metadata key to distinguish between data which is managed by DataSets and data which is merely linked to. Perhaps under the key linked or managed, or some such. (Should this go within the storage section or not?)
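To illustrate the intended behaviour, here is a small sketch in which deletion consults such a flag. DatasetEntry, its managed field and remove_entry! are hypothetical, and the sketch assumes filesystem-backed data only.

```julia
# Hypothetical record for one dataset; `managed` is the proposed ownership flag.
struct DatasetEntry
    name::String
    path::String     # where the data lives on the filesystem
    managed::Bool    # true: owned by the project; false: merely linked to
end

# Remove a dataset from the project, deleting the data only when it is owned.
function remove_entry!(entries::Dict{String,DatasetEntry}, name::AbstractString)
    entry = pop!(entries, name)      # the entry always leaves the project
    if entry.managed
        # Owned data goes away together with the entry.
        rm(entry.path; force=true, recursive=true)
    end
    return entry
end
```

With this split, removing a linked dataset only forgets the reference, while removing a managed one also cleans up the underlying file or directory.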