jic-dtool / dtool-create
Dtool plugin for creating datasets and collections
License: MIT License
$ dtool readme edit ~/junk/test-unicode-ds
Traceback (most recent call last):
File "/Users/olssont/envs/dtool/bin/dtool", line 11, in <module>
sys.exit(dtool())
File "/Users/olssont/envs/dtool/lib/python2.7/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/Users/olssont/envs/dtool/lib/python2.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/Users/olssont/envs/dtool/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/olssont/envs/dtool/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/olssont/envs/dtool/lib/python2.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/olssont/envs/dtool/lib/python2.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/Users/olssont/envs/dtool/lib/python2.7/site-packages/dtool_create/dataset.py", line 278, in edit
edited_content = click.edit(readme_content)
File "/Users/olssont/envs/dtool/lib/python2.7/site-packages/click/termui.py", line 456, in edit
return editor.edit(text)
File "/Users/olssont/envs/dtool/lib/python2.7/site-packages/click/_termui_impl.py", line 425, in edit
text = text.encode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14: ordinal not in range(128)
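The traceback above is consistent with a Python 2 byte string holding UTF-8 data (0xc3 is the lead byte of a two-byte UTF-8 sequence) reaching `text.encode(encoding)`, which first triggers an implicit ASCII decode. Below is a minimal sketch of one possible guard, assuming the README content is UTF-8 encoded. The helper name `to_text` is hypothetical and not part of dtool-create.

```python
def to_text(content, encoding="utf-8"):
    """Return content as a text (unicode) string.

    Assumption: byte strings read from the README are UTF-8 encoded.
    """
    if isinstance(content, bytes):
        # Decode explicitly so that no implicit ASCII decode happens later.
        return content.decode(encoding)
    return content


readme_bytes = b"description: caf\xc3\xa9\n"
readme_text = to_text(readme_bytes)
```

Passing the decoded text, rather than the raw bytes, to the editor call would avoid the implicit decode step.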
This is needed so that one can programmatically discover where the data was copied to when transferring data between backends. Below is the current output when copying from iRODS to local file storage.
$ dtool copy irods:///jic_archive/f13ef963-37f0-4c3c-a96c-da99e036ea10 ~/junk2
Copying dataset [------------------------------------] 0%
Dataset copied to file:///Users/olssont/junk2/my-another-dataset
From this output it is difficult to work out programmatically where the dataset has been put. The behaviour below would solve this problem.
$ dtool copy -q irods:///jic_archive/f13ef963-37f0-4c3c-a96c-da99e036ea10 ~/junk2
file:///Users/olssont/junk2/my-another-dataset
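A minimal sketch of the requested behaviour, assuming the quiet option simply switches between the two output styles shown above. The function name `copy_message` is hypothetical and only illustrates the idea.

```python
def copy_message(dest_uri, quiet=False):
    """Build the text printed once a copy has finished."""
    if quiet:
        # Quiet mode: emit only the URI so that scripts can capture it.
        return dest_uri
    return "Dataset copied to {}".format(dest_uri)


print(copy_message("file:///Users/olssont/junk2/my-another-dataset", quiet=True))
```

With output like this, a calling script can read the destination URI straight from standard output without parsing a human-readable message.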
This information is now captured as administrative metadata in the .dtool/dtool file.
At the moment relative paths are not expanded, so a command such as the one below fails:
dtool create my-first-ds --symlink-path rel/path/to/data
Instead, the user currently has to write:
dtool create my-first-ds --symlink-path `pwd`/rel/path/to/data
This is not intuitive.
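One way the path could be expanded before use is sketched below, using only the standard library. The helper name `expand_symlink_path` is hypothetical; where such a call would sit in dtool-create is not specified in the issue.

```python
import os


def expand_symlink_path(path):
    """Expand '~' and turn a possibly relative path into an absolute one."""
    return os.path.abspath(os.path.expanduser(path))


absolute = expand_symlink_path("rel/path/to/data")
```

The relative path is resolved against the current working directory, which is what the `pwd` workaround does by hand.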
At the moment the dtool readme command is only used to edit or update the content of the README file; it is not possible to get the content back. This is an issue when working with datasets in remote storage locations, such as iRODS: one does not want to fetch the whole dataset just to inspect the README.
Suggest updating the behaviour of the dtool readme command to mimic that of dtool name, which can be used both to echo back and to edit the name.
Replace:
dtool copy src dest
dtool diff src dest
With a single command:
dtool copy src dest
Implementation detail:
An end user suggested that it would be useful for the dtool copy command to report which files are being copied across.
Current output:
---
description: Dataset description
project: Project name
confidential: false
personally_identifiable_information: false
owners:
- name: Your Name
email: [email protected]
username: olssont
creation_date: 2017-10-23
# links:
# - http://doi.dx.org/your_doi
# - http://github.com/your_code_repository
# budget_codes:
# - E.g. CCBS1H10S
Expected:
---
description: Dataset description
project: Project name
confidential: false
personally_identifiable_information: false
owners:
- name: Your Name
email: [email protected]
username: olssont
creation_date: 2017-10-23
# links:
# - http://doi.dx.org/your_doi
# - http://github.com/your_code_repository
# budget_codes:
# - E.g. CCBS1H10S
Current output:
dtool create my_dataset ~/junk
Created proto dataset file:///Users/olssont/junk/my_dataset
Next steps:
1. Add descriptive metadata, e.g:
dtool readme interactive file:///Users/olssont/junk/my_dataset
2. Add raw data, eg:
dtool add item my_file.txt file:///Users/olssont/junk/my_dataset
Or use your system commands, e.g:
mv my_data_directory /Users/olssont/junk/my_dataset/data/
3. Convert the proto dataset into a dataset:
dtool freeze file:///Users/olssont/junk/my_dataset
Output with desired option:
dtool create -q my_dataset ~/junk
file:///Users/olssont/junk/my_dataset
Feedback from a user:
I would like “dtool freeze” to be more circumspect about proceeding because “freeze”ing a proto is irreversible (or should be seen that way).
I think “dtool freeze” should ask for confirmation with a warning about kicking off a (potentially) long-running process.
A verbose “--dry-run” option would also be a good addition.
Why do I ask this?
From the dtool documentation (http://dtool.readthedocs.io/en/latest/philosophy.html), a proto should resemble:
project_1
But let’s say I create the proto and then copy/move files into the proto to get
project_1
After freezing, I copy the dataset to iRODS, verify it and delete my local copy.
My understanding (tell me if I’m wrong) is when I retrieve this dataset from iRODS I get:
project_1
and XYZ has disappeared.
In this case, if “dtool freeze” had exited with a message about file XYZ, then I wouldn’t have lost XYZ.
Additionally, the dataset structure suggested in the documentation would be enforced.
AFAICT dtool does not enforce the dataset structure (again please tell me if I’m wrong).
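The check this user asks for could look something like the sketch below. It assumes a proto dataset on local disk should contain only `.dtool`, `data` and `README.yml` at the top level; that set, and the function names, are illustrative assumptions rather than dtool's actual behaviour.

```python
import os

# Assumption: the only top-level entries expected in a proto dataset on disk.
EXPECTED_ENTRIES = {".dtool", "data", "README.yml"}


def find_unexpected_entries(dataset_dir):
    """Return sorted top-level names that fall outside the expected layout."""
    return sorted(set(os.listdir(dataset_dir)) - EXPECTED_ENTRIES)


def confirm_freeze(dataset_dir, dry_run=False):
    """Return True only when it looks safe to go ahead with freezing."""
    unexpected = find_unexpected_entries(dataset_dir)
    for name in unexpected:
        print("Not part of the dataset, would be left behind: {}".format(name))
    if dry_run:
        # A dry run reports problems but never proceeds.
        return False
    return not unexpected
```

In the scenario described, a stray file XYZ next to the data directory would be reported and the freeze would not go ahead.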
The dtool readme command results in a stack trace when called on a frozen dataset. It needs some sanity checking that the dataset provided is a proto dataset.
Example output below:
$ dtool readme interactive symlink:///Users/olssont/junk/my_dataset
Traceback (most recent call last):
File "/Users/olssont/envs/dtool/bin/dtool", line 11, in <module>
sys.exit(dtool())
File "/Users/olssont/envs/dtool/lib/python2.7/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/Users/olssont/envs/dtool/lib/python2.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/Users/olssont/envs/dtool/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/olssont/envs/dtool/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/olssont/envs/dtool/lib/python2.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/olssont/envs/dtool/lib/python2.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/Users/olssont/envs/dtool/lib/python2.7/site-packages/dtool_create/dataset.py", line 154, in interactive
config_path=CONFIG_PATH)
File "/Users/olssont/envs/dtool/lib/python2.7/site-packages/dtoolcore/__init__.py", line 318, in from_uri
return cls._from_uri_with_typecheck(uri, config_path, "protodataset")
File "/Users/olssont/envs/dtool/lib/python2.7/site-packages/dtoolcore/__init__.py", line 164, in _from_uri_with_typecheck
"{} is not a {}".format(uri, cls.__name__))
dtoolcore.DtoolCoreTypeError: symlink:///Users/olssont/junk/my_dataset is not a ProtoDataSet
At the moment dtool freeze is very chatty.
$ dtool freeze ~/junk/test > log
$ cat log
Generating manifest
Dataset frozen file:///Users/olssont/junk/test
It would be useful to have a -q/--quiet flag to suppress the writing of this info.
This feedback reached me via email (translated from German):
dtool readme interactive would be nice in principle, but it handles nested key-value maps poorly, as it only shows the innermost key level. It also does not seem possible to enter an array as a value, or at least I don't understand how. In addition, the whole thing crashes if a null value is given in the template, which in my opinion would be perfectly valid YAML.
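A minimal sketch of how nested templates could be walked so that the full key path is shown, with lists and null values treated as ordinary leaf values. The function names and the dotted-path prompt format are assumptions made for illustration, not the current dtool-create implementation.

```python
def iter_leaf_items(mapping, prefix=()):
    """Yield (key_path, value) pairs for every leaf in a nested mapping.

    Lists and None are treated as leaf values rather than causing an error.
    """
    for key, value in mapping.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            for item in iter_leaf_items(value, path):
                yield item
        else:
            yield path, value


def format_prompt(path):
    """Show the full key path, not only the innermost key."""
    return ".".join(str(part) for part in path)


template = {
    "project": None,
    "owner": {"name": "Your Name", "contact": {"email": None}},
    "keywords": ["a", "b"],
}
prompts = [(format_prompt(p), v) for p, v in iter_leaf_items(template)]
```

Each prompt label then identifies where the value sits in the template, for example `owner.contact.email` rather than just `email`.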
Using dtool create, I got an error saying that a datetime.date object has no attribute date.
The bug is caused by the following lines:
dtool-create/dtool_create/dataset.py
Lines 89 to 98 in 80563dd
The default happens to be a date object, not a datetime object, but the parse_date function returns a datetime object.
Here is the proposed fix:
elif isinstance(value, datetime.date):
    def parse_date(value):
        try:
            date = datetime.datetime.strptime(value, "%Y-%m-%d")
        except ValueError as e:
            raise click.BadParameter(
                "Could not parse date, {}".format(e), param=value)
        return date.date()
    new_value = click.prompt(key, default=value, value_proc=parse_date)
    d[key] = new_value.isoformat()
This might be related to #22, since something has been changed from datetime to date.
I wonder why this error comes up now, since the code dates from May.
The current Python version is 3.8.
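The mismatch described above can be illustrated with the standard library alone: `datetime.datetime` is a subclass of `datetime.date`, and only the former has a `.date()` method. The helper `to_iso_date` below is a hypothetical sketch that accepts either type.

```python
import datetime


def to_iso_date(value):
    """Return an ISO 8601 date string for a date or datetime value."""
    # datetime.datetime is a subclass of datetime.date, so check it first.
    if isinstance(value, datetime.datetime):
        value = value.date()
    # Plain datetime.date objects have no .date() method; use them directly.
    return value.isoformat()


print(to_iso_date(datetime.date(2017, 10, 23)))              # 2017-10-23
print(to_iso_date(datetime.datetime(2017, 10, 23, 12, 30)))  # 2017-10-23
```

Normalising to a date object in one place means that neither the default value nor the parsed user input needs `.date()` called on it afterwards.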
Ability to sync the content between two base URIs. Requested by @jotelha
Currently:
dtool copy --help
Usage: dtool copy [OPTIONS] DATASET_URI [PREFIX] [STORAGE]
Desired:
dtool copy --help
Usage: dtool copy [OPTIONS] SRC_DATASET_URI DEST_DATASET_URI
Many tools and users think of data in terms of file names. The current command for fetching an item:
dtool item fetch
does not return the item under its original relpath; instead it uses the UUID of the dataset and the item ID.
To make life easier for users, and for tools that rely on information in the file path, it would be useful to be able to fetch an item and have it returned with its original relpath. A nice solution for this might be to implement the command:
dtool item cp
Note that this may also enable us to remove the "hack" of appending the file suffix to the end of the abspaths created by dtool item fetch.
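A sketch of what the copy step behind such a command might do, assuming the item has already been fetched to some local absolute path and its relpath is known. The function name and arguments are hypothetical.

```python
import os
import shutil


def copy_item_with_relpath(fetched_abspath, relpath, dest_dir):
    """Copy a fetched item into dest_dir under its original relpath."""
    dest_path = os.path.join(dest_dir, relpath)
    parent = os.path.dirname(dest_path)
    if parent and not os.path.isdir(parent):
        # Recreate any sub-directories that are part of the relpath.
        os.makedirs(parent)
    shutil.copyfile(fetched_abspath, dest_path)
    return dest_path
```

The file name and extension then come from the relpath itself, so no suffix needs to be appended to the fetched path.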
The original README should not be deleted but rather renamed to something along the lines of README.yml.timestamp
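A minimal sketch of such a rename, assuming the README lives on local disk. The timestamp format and the function name `backup_readme` are illustrative choices, not something specified in the issue.

```python
import datetime
import os


def backup_readme(readme_path, now=None):
    """Rename an existing README to <name>.<timestamp>; return the new path."""
    now = now or datetime.datetime.now()
    # Timestamp format is an assumption; any sortable format would do.
    backup_path = "{}.{}".format(readme_path, now.strftime("%Y%m%dT%H%M%S"))
    os.rename(readme_path, backup_path)
    return backup_path
```

The new README can then be written to the original path while the previous version remains next to it.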
Feedback from a user that it would be better to incorporate the functionality of the dtool_publish command line tool from dtool-http into the client as a command named dtool publish.
If a dtool copy command fails because the connection is broken, one can end up with a broken file in the dtool cache. If one then tries to resume the copy, the broken file in the dtool cache may be picked up rather than fetched again.
Dataset items that end up in the cache should never be corrupted. Some validation should therefore occur before they are put into it.
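One common way to meet this requirement is to download to a temporary file, verify its hash, and only then move it into place. The sketch below assumes that an expected hash is available for each item and uses SHA-256 for illustration; the `download` callable and both function names are hypothetical.

```python
import hashlib
import os
import tempfile


def file_hash(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_cache(download, cache_path, expected_hash):
    """Write via a temp file and move it into place only once verified.

    `download` is a hypothetical callable that writes the item to the
    path it is given.
    """
    cache_dir = os.path.dirname(cache_path)
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    os.close(fd)
    try:
        download(tmp_path)
        if file_hash(tmp_path) != expected_hash:
            raise IOError("Hash mismatch for {}".format(cache_path))
        # Rename within one directory, so a partial file never appears
        # under the final cache name.
        os.replace(tmp_path, cache_path)
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return cache_path
```

If the connection drops mid-transfer, the temporary file is removed and nothing is left under the final cache name, so a resumed copy fetches the item again.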
It is currently possible to create dataset names containing / and newlines. This is not good when copying data from cloud to disk.
There should also be some limit on the length of the dataset name.
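A sketch of what such validation could look like. The allowed character set and the maximum length of 80 are assumptions chosen for illustration, and the function name is hypothetical.

```python
import re

# Assumptions for illustration: allowed characters and maximum length.
# \Z is used instead of $ because $ also matches before a trailing newline.
VALID_NAME_REGEX = re.compile(r"^[0-9a-zA-Z._-]+\Z")
MAX_NAME_LENGTH = 80


def is_valid_dataset_name(name):
    """Return True if the name uses only safe characters and is not too long."""
    if len(name) > MAX_NAME_LENGTH:
        return False
    return VALID_NAME_REGEX.match(name) is not None
```

Names such as `my-dataset_1.0` pass, while names containing `/`, whitespace or newlines are rejected before any storage location is created.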