Comments (3)
At some point last year, I put together a simple "dtool-sync" plugin for comparing and syncing base URIs for my own purposes at https://github.com/jotelha/dtool-sync. Would that be worth turning into a PyPI package?
from dtool-create.
@jotelha looks pretty cool. :)
Queries and feedback:
- What does "changed" mean? Is it referring to metadata (I would assume that the datasets themselves cannot change)?
- I would be keen to remove "compare" from the top level. It is a very generic term, and I don't think it would be clear that it acts on two base URIs. I.e. I think it would be difficult to understand what the difference between "dtool diff" and "dtool compare" is. Perhaps one could consider using "dtool sync --dryrun" as a replacement for "dtool compare". This is what "aws s3 sync" uses.
- It looks like there are quite a few options, e.g. verbose, raw, json. I realise that these are to do with how the output is represented. However, I wonder to what extent all of them are needed. To that effect it would be great if the readme introduction had some more explanation of what the use cases are. It would then be easier to see what minimal functionality would be needed to satisfy those use cases. (You may already have been working in this way, but it is currently not clear to me from the readme what the purpose/use case of each of the different output options is.)
- What should the behaviour be when a dataset has had metadata updated? "dtool cp" does not do anything if the dataset already exists in the destination URI. I would expect the behaviour to be the same with "dtool sync". I.e. dtool sync would effectively be a bit like "dtool cp --recursive" (that option does not exist). However, I'm willing to be persuaded that "dtool sync" could include an "--update-metadata" option. But this raises the question of what that behaviour should be, because the metadata on the destination dataset could have been updated by something else. The fact that metadata can be updated is a strength/feature, but it also makes things difficult...
@tjelvar-olsson thanks for your detailed feedback here!
The original purpose of this script was to transfer large amounts of datasets from one base URI to the other and pick up interrupted transfers, i.e. batch dtool cp. I have condensed the interface; there is now only one command left, dtool sync, with a --dry-run option.
- Let's assume I want to compare and sync two base URIs, left-hand side (lhs) and right-hand side (rhs):
  dtool sync --dry-run file://lhs s3://rhs
  "changed" can mean two things. One is a dataset being frozen on lhs but still a proto dataset on rhs. That's the common case when some earlier transfer has been interrupted. Or it can actually be differing metadata between copies of the same dataset at lhs and rhs, e.g. differing frozen time stamps before dtoolcore 3.18.1 (https://github.com/jic-dtool/dtoolcore/blob/master/CHANGELOG.rst#3181---2021-09-27). We could replace "changed" with something else, e.g. "differing", or treat the two cases separately. What do you think?
- Done.
- I have condensed the README.md to treat --dry-run, --verbose, --quiet, and --json only. I see these as the necessary basic set. I would like to keep the --json option to generate machine-readable output of the differing datasets. I got rid of the "raw" option as well as "uuid". I started with a simple copy-paste from dtool-info (https://github.com/jic-dtool/dtool-info/blob/27ab6f3823cdbf44242c7cc88f8f9b8ce2a35099/dtool_info/dataset.py#L132-L151), used that as the basis for both comparison and output, then gradually added these options to alter the comparison and display to my needs; they are quite obsolete now.
- Exactly, it doesn't do anything for differing metadata in datasets frozen on both sides (and it only compares admin_metadata by the keys specified at https://github.com/jotelha/dtool-sync/blob/622e7b4c652327d7baa33ce4715df36e89fe8070/dtool_sync/cli.py#L23). More precisely, it will throw an error and exit, or print an exception and continue with the next dataset if --ignore-errors is specified. Let's keep it simple and not include any mechanism for updating modified metadata here for now.
I have placed the above changes in jotelha/dtool-sync#2.
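The comparison described above can be sketched roughly as follows. This is a minimal illustration and not the code in dtool-sync: the function name classify, the plain-dict representation of admin metadata, and the choice of compared keys are all assumptions made for the example.

```python
# Keys compared between the two copies. Hypothetical choice for this sketch;
# the keys actually used by dtool-sync are defined in its cli.py.
COMPARED_KEYS = ("name", "type", "creator_username")


def classify(lhs_datasets, rhs_datasets, keys=COMPARED_KEYS):
    """Sort lhs datasets into equal, changed, and missing relative to rhs.

    Both arguments are lists of admin metadata dicts that carry a "uuid".
    """
    rhs_by_uuid = {d["uuid"]: d for d in rhs_datasets}
    result = {"equal": [], "changed": [], "missing": []}
    for lhs in lhs_datasets:
        rhs = rhs_by_uuid.get(lhs["uuid"])
        if rhs is None:
            # No copy with this UUID on the right-hand side.
            result["missing"].append(lhs)
        elif all(lhs.get(k) == rhs.get(k) for k in keys):
            result["equal"].append([lhs, rhs])
        else:
            # Covers both cases: frozen on lhs but proto on rhs,
            # and differing metadata between two frozen copies.
            result["changed"].append([lhs, rhs])
    return result
```

With this sketch, a dataset of type "dataset" on lhs and "protodataset" on rhs lands in "changed", which matches the interrupted-transfer case.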
Also added the possibility to compare against a lookup server instead of another base URI, e.g.
$ dtool sync --dry-run --json --verbose ~/dtool lookup://server
{
  "equal": [
    [
      {
        "uuid": "387305e3-e603-4551-8ed5-1fd56f8fa911",
        "dtoolcore_version": "3.18.1",
        "name": "noch-ein-neuer-datensatz-mit-umlaut-im-dateinamen",
        "type": "dataset",
        "creator_username": "jotelha",
        "created_at": 1645842296.531141,
        "frozen_at": 1645842336.447319,
        "uri": "file://jotelha-fujitsu-ubuntu-20/home/jotelha/dtool/noch-ein-neuer-datensatz-mit-umlaut-im-dateinamen"
      },
      {
        "base_uri": "s3://test-bucket",
        "created_at": 1645842296.531,
        "creator_username": "jotelha",
        "dtoolcore_version": "3.18.1",
        "frozen_at": 1645842336.447,
        "name": "noch-ein-neuer-datensatz-mit-umlaut-im-dateinamen",
        "tags": [],
        "type": "dataset",
        "uri": "s3://test-bucket/387305e3-e603-4551-8ed5-1fd56f8fa911",
        "uuid": "387305e3-e603-4551-8ed5-1fd56f8fa911"
      }
    ],
    [
      {
        "uuid": "4ad55490-24bf-4cc6-a155-f8ea4d98b74d",
        "dtoolcore_version": "3.18.1",
        "name": "2022-02-28-another-demo-dataset",
        "type": "dataset",
        "creator_username": "jotelha",
        "created_at": 1646042527.551805,
        "frozen_at": 1646042593.176174,
        "uri": "file://jotelha-fujitsu-ubuntu-20/home/jotelha/dtool/2022-02-28-another-demo-dataset"
      },
      {
        "base_uri": "s3://test-bucket",
        "created_at": 1646042527.551,
        "creator_username": "jotelha",
        "dtoolcore_version": "3.18.1",
        "frozen_at": 1646042593.176,
        "name": "2022-02-28-another-demo-dataset",
        "tags": [],
        "type": "dataset",
        "uri": "s3://test-bucket/4ad55490-24bf-4cc6-a155-f8ea4d98b74d",
        "uuid": "4ad55490-24bf-4cc6-a155-f8ea4d98b74d"
      }
    ]
  ],
  "changed": [],
  "missing": [
    {
      "created_at": 1641051385.653753,
      "creator_username": "hoermann4",
      "dtoolcore_version": "3.17.0",
      "frozen_at": 1641266471.748893,
      "name": "2021-12-31-20-47-16-239794-wrapjoinanddpdequilibration-wrapjoindatafileJHUA",
      "type": "dataset",
      "uuid": "a3452e3a-3ca8-4795-950f-534223fcf916",
      "uri": "file://jotelha-fujitsu-ubuntu-20/home/jotelha/dtool/2021-12-31-20-47-16-239794-wrapjoinanddpdequilibration-wrapjoindatafileJHUA"
    },
    {
      "uuid": "d4d299f1-6550-4483-bc5d-014023226516",
      "dtoolcore_version": "3.18.1",
      "name": "ein-neuer-datensatz-mit-umlaut-im-item-namen",
      "type": "protodataset",
      "creator_username": "jotelha",
      "created_at": 1645842221.224571,
      "uri": "file://jotelha-fujitsu-ubuntu-20/home/jotelha/dtool/ein-neuer-datensatz-mit-umlaut-im-item-namen"
    },
    {
      "uuid": "d656d394-6a28-4273-bcfb-d050df31f4a3",
      "dtoolcore_version": "3.18.1",
      "name": "2022-02-09-test-dataset-with-umlaut-items",
      "type": "dataset",
      "creator_username": "jotelha",
      "created_at": 1644403564.250988,
      "frozen_at": 1644403787.24659,
      "uri": "file://jotelha-fujitsu-ubuntu-20/home/jotelha/dtool/2022-02-09-test-dataset-with-umlaut-items"
    }
  ]
}
Resume sync of changed datasets, assuming their transfer might have been interrupted earlier.
Copy missing datasets.
Dry run, would copy file://jotelha-fujitsu-ubuntu-20/home/jotelha/dtool/2021-12-31-20-47-16-239794-wrapjoinanddpdequilibration-wrapjoindatafileJHUA to lookup://server now.
Dry run, would copy file://jotelha-fujitsu-ubuntu-20/home/jotelha/dtool/ein-neuer-datensatz-mit-umlaut-im-item-namen to lookup://server now.
Dry run, would copy file://jotelha-fujitsu-ubuntu-20/home/jotelha/dtool/2022-02-09-test-dataset-with-umlaut-items to lookup://server now.
It is also possible to compare against one base URI but sync to another. That allows comparing against a lookup server while transferring to an actual base URI, e.g.
dtool sync --dry-run --json --verbose ~/dtool lookup://server s3://test-bucket
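The separation between the comparison source and the copy target can be sketched as below. This is an illustration under assumptions, not the dtool-sync implementation: sync_missing and its copy_func parameter are hypothetical names, with copy_func standing in for whatever performs the actual transfer (for instance a thin wrapper around dtool cp). The list of missing datasets is taken as already computed against the lookup server; only the target differs.

```python
def sync_missing(missing, target_base_uri, copy_func,
                 dry_run=True, ignore_errors=False):
    """Copy each missing dataset to target_base_uri, or only report it.

    missing: list of admin metadata dicts, each with a "uri" entry.
    copy_func: hypothetical callable taking (src_uri, target_base_uri).
    Returns the list of messages that would be shown to the user.
    """
    messages = []
    for dataset in missing:
        src_uri = dataset["uri"]
        if dry_run:
            # Nothing is transferred in a dry run.
            messages.append(
                f"Dry run, would copy {src_uri} to {target_base_uri} now."
            )
            continue
        try:
            copy_func(src_uri, target_base_uri)
            messages.append(f"Copied {src_uri} to {target_base_uri}.")
        except Exception as exc:
            if not ignore_errors:
                # Default: stop at the first failing dataset.
                raise
            # With ignore_errors, report and continue with the next one.
            messages.append(f"Failed to copy {src_uri}: {exc}")
    return messages
```

Passing the copy routine in as an argument keeps the sketch independent of any particular storage broker and makes the dry-run path easy to check.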
If you look at the JSON output above, the timestamps differ slightly between a direct storage broker and the lookup server: the entries coming from the lookup server carry only three decimal places. I have therefore added a tolerance to the timestamp comparison here, https://github.com/jotelha/dtool-sync/blob/1b60f8e62e252ac7bdabfb4b5c710f5c249ca135/dtool_sync/compare.py#L20-L29, but that is not a great solution.
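A tolerance-based comparison of this kind might look like the following sketch. It is not the code at the linked location: the function name metadata_equal, the set of timestamp keys, and the 1 ms default tolerance are assumptions chosen to fit the sample output.

```python
import math

# Keys treated as floating point timestamps (assumption for this sketch).
TIMESTAMP_KEYS = ("created_at", "frozen_at")


def metadata_equal(lhs, rhs, keys, tolerance=0.001):
    """Compare two admin metadata dicts on the given keys.

    Timestamps are compared with an absolute tolerance in seconds,
    all other values must match exactly.
    """
    for key in keys:
        a, b = lhs.get(key), rhs.get(key)
        if key in TIMESTAMP_KEYS and a is not None and b is not None:
            if not math.isclose(a, b, rel_tol=0.0, abs_tol=tolerance):
                return False
        elif a != b:
            return False
    return True
```

An absolute tolerance is used because the discrepancy is a fixed loss of precision rather than something that scales with the size of the timestamp. Truncating both values to milliseconds before comparing would be one possible alternative.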