
Comments (3)

jotelha avatar jotelha commented on August 12, 2024

At some point last year, I put together a simple "dtool-sync" plugin for comparing and syncing base URIs for my own purposes at https://github.com/jotelha/dtool-sync. Would it be worth turning that into a PyPI package?

from dtool-create.

tjelvar-olsson avatar tjelvar-olsson commented on August 12, 2024

@jotelha looks pretty cool. :)

Queries and feedback:

  1. What does "changed" mean? Is it referring to metadata (I would assume that the datasets themselves cannot change)?
  2. I would be keen to remove "compare" from the top level. It is a very generic term and I don't think it would be clear that it acts on two base URIs. I.e. I think it would be difficult to understand what the difference between "dtool diff" and "dtool compare" is. Perhaps one could consider using "dtool sync --dryrun" as a replacement for "dtool compare". This is what "aws s3 sync" uses.
  3. It looks like there are quite a few options, e.g. verbose, raw, json. I realise that these are to do with how the output is represented. However, I wonder to what extent all of them are needed. To that effect it would be great if the readme introduction had some more explanation of what the use cases are. It would then be easier to see what minimal functionality would be needed to satisfy those use cases. (You may already have been working in this way, but it is currently not clear to me from the readme what the purpose/use case of each of the different output options is).
  4. What should the behaviour be when a dataset has had metadata updated? "dtool cp" does not do anything if the dataset already exists in the destination URI. I would expect the behaviour to be the same with "dtool sync". I.e. dtool sync would effectively be a bit like "dtool cp --recursive" (that option does not exist). However, I'm willing to be persuaded that "dtool sync" could include a "--update-metadata" option. But this raises the question of what that behaviour should be, because the metadata on the destination dataset could have been updated by something else. The fact that metadata can be updated is a strength/feature, but it also makes things difficult....


jotelha avatar jotelha commented on August 12, 2024

@tjelvar-olsson thanks for your detailed feedback here!

The original purpose of this script was to transfer large numbers of datasets from one base URI to another and pick up interrupted transfers, i.e. a batch dtool cp. I have condensed the interface: there is now only one command left, dtool sync, with a --dry-run option.

  1. Let's assume I want to compare and sync two base URIs, left hand side (lhs) and right hand side (rhs):

    dtool sync --dry-run file://lhs s3://rhs

    "changed" can mean two things. One is a dataset being frozen on lhs, but proto on rhs. That's the common case when some earlier transfer has been interrupted. Or it can mean actually differing metadata between copies of the same dataset on lhs and rhs, e.g. differing frozen time stamps before dtoolcore 3.18.1 (https://github.com/jic-dtool/dtoolcore/blob/master/CHANGELOG.rst#3181---2021-09-27). We could replace "changed" with something else, e.g. "differing", or treat the two cases separately. What do you think?

  2. Done.

  3. I have condensed the README.md to cover --dry-run, --verbose, --quiet, and --json only. I see these as the necessary basic set. I would like to keep the --json option to generate machine-readable output of the differing datasets. I got rid of the "raw" option as well as "uuid". I started with a simple copy-paste from dtool-info (https://github.com/jic-dtool/dtool-info/blob/27ab6f3823cdbf44242c7cc88f8f9b8ce2a35099/dtool_info/dataset.py#L132-L151), used that as the basis for both comparison and output, then gradually added these options to alter the comparison and display to my needs. They are quite obsolete now.

  4. Exactly, it doesn't do anything for differing metadata in datasets frozen on both sides (and it only compares admin_metadata by the keys specified at https://github.com/jotelha/dtool-sync/blob/622e7b4c652327d7baa33ce4715df36e89fe8070/dtool_sync/cli.py#L23). More precisely, it will throw an error and exit, or print an exception and continue with the next dataset if --ignore-errors is specified. Let's keep it simple and not include any mechanism for updating modified metadata here for now.
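
To make the three categories concrete, here is a minimal sketch of how datasets on lhs could be sorted into "equal", "changed" and "missing" by comparing admin metadata on a fixed set of keys. This is not the code in dtool_sync: the `classify` function, the `COMPARE_KEYS` tuple and the sample dictionaries are all made up for illustration, and the real key list lives in cli.py.

```python
# Hypothetical key list for illustration; the real one is defined in
# dtool_sync/cli.py and may differ.
COMPARE_KEYS = ("uuid", "name", "type", "creator_username", "frozen_at")


def classify(lhs, rhs, keys=COMPARE_KEYS):
    """Sort lhs datasets into equal, changed and missing relative to rhs.

    lhs and rhs are lists of admin metadata dicts, each with a "uuid" entry.
    """
    rhs_by_uuid = {entry["uuid"]: entry for entry in rhs}
    result = {"equal": [], "changed": [], "missing": []}
    for left in lhs:
        right = rhs_by_uuid.get(left["uuid"])
        if right is None:
            # Not present on rhs at all: a candidate for a fresh copy.
            result["missing"].append(left)
        elif all(left.get(key) == right.get(key) for key in keys):
            result["equal"].append((left, right))
        else:
            # Covers both cases discussed above: frozen on lhs but proto
            # on rhs, and genuinely differing metadata.
            result["changed"].append((left, right))
    return result


# Made-up sample data, not taken from a real base URI.
lhs = [
    {"uuid": "a", "name": "one", "type": "dataset", "frozen_at": 1.0},
    {"uuid": "b", "name": "two", "type": "dataset", "frozen_at": 2.0},
    {"uuid": "c", "name": "three", "type": "dataset", "frozen_at": 3.0},
]
rhs = [
    {"uuid": "a", "name": "one", "type": "dataset", "frozen_at": 1.0},
    {"uuid": "b", "name": "two", "type": "protodataset"},
]
result = classify(lhs, rhs)
```

Here dataset "b" lands in "changed" because it is still a proto dataset on rhs, which is the interrupted-transfer case. Treating the two meanings of "changed" separately would only need one extra branch on the "type" key.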

I have placed the above changes in jotelha/dtool-sync#2.

I also added the possibility to compare against a lookup server instead of another base URI, i.e.

$ dtool sync --dry-run --json --verbose ~/dtool lookup://server
{
    "equal": [
        [
            {
                "uuid": "387305e3-e603-4551-8ed5-1fd56f8fa911",
                "dtoolcore_version": "3.18.1",
                "name": "noch-ein-neuer-datensatz-mit-umlaut-im-dateinamen",
                "type": "dataset",
                "creator_username": "jotelha",
                "created_at": 1645842296.531141,
                "frozen_at": 1645842336.447319,
                "uri": "file://jotelha-fujitsu-ubuntu-20/home/jotelha/dtool/noch-ein-neuer-datensatz-mit-umlaut-im-dateinamen"
            },
            {
                "base_uri": "s3://test-bucket",
                "created_at": 1645842296.531,
                "creator_username": "jotelha",
                "dtoolcore_version": "3.18.1",
                "frozen_at": 1645842336.447,
                "name": "noch-ein-neuer-datensatz-mit-umlaut-im-dateinamen",
                "tags": [],
                "type": "dataset",
                "uri": "s3://test-bucket/387305e3-e603-4551-8ed5-1fd56f8fa911",
                "uuid": "387305e3-e603-4551-8ed5-1fd56f8fa911"
            }
        ],
        [
            {
                "uuid": "4ad55490-24bf-4cc6-a155-f8ea4d98b74d",
                "dtoolcore_version": "3.18.1",
                "name": "2022-02-28-another-demo-dataset",
                "type": "dataset",
                "creator_username": "jotelha",
                "created_at": 1646042527.551805,
                "frozen_at": 1646042593.176174,
                "uri": "file://jotelha-fujitsu-ubuntu-20/home/jotelha/dtool/2022-02-28-another-demo-dataset"
            },
            {
                "base_uri": "s3://test-bucket",
                "created_at": 1646042527.551,
                "creator_username": "jotelha",
                "dtoolcore_version": "3.18.1",
                "frozen_at": 1646042593.176,
                "name": "2022-02-28-another-demo-dataset",
                "tags": [],
                "type": "dataset",
                "uri": "s3://test-bucket/4ad55490-24bf-4cc6-a155-f8ea4d98b74d",
                "uuid": "4ad55490-24bf-4cc6-a155-f8ea4d98b74d"
            }
        ]
    ],
    "changed": [],
    "missing": [
        {
            "created_at": 1641051385.653753,
            "creator_username": "hoermann4",
            "dtoolcore_version": "3.17.0",
            "frozen_at": 1641266471.748893,
            "name": "2021-12-31-20-47-16-239794-wrapjoinanddpdequilibration-wrapjoindatafileJHUA",
            "type": "dataset",
            "uuid": "a3452e3a-3ca8-4795-950f-534223fcf916",
            "uri": "file://jotelha-fujitsu-ubuntu-20/home/jotelha/dtool/2021-12-31-20-47-16-239794-wrapjoinanddpdequilibration-wrapjoindatafileJHUA"
        },
        {
            "uuid": "d4d299f1-6550-4483-bc5d-014023226516",
            "dtoolcore_version": "3.18.1",
            "name": "ein-neuer-datensatz-mit-umlaut-im-item-namen",
            "type": "protodataset",
            "creator_username": "jotelha",
            "created_at": 1645842221.224571,
            "uri": "file://jotelha-fujitsu-ubuntu-20/home/jotelha/dtool/ein-neuer-datensatz-mit-umlaut-im-item-namen"
        },
        {
            "uuid": "d656d394-6a28-4273-bcfb-d050df31f4a3",
            "dtoolcore_version": "3.18.1",
            "name": "2022-02-09-test-dataset-with-umlaut-items",
            "type": "dataset",
            "creator_username": "jotelha",
            "created_at": 1644403564.250988,
            "frozen_at": 1644403787.24659,
            "uri": "file://jotelha-fujitsu-ubuntu-20/home/jotelha/dtool/2022-02-09-test-dataset-with-umlaut-items"
        }
    ]
}
Resume sync of changed datasets, assuming their transfer might have been interrupted earlier.
Copy missing datasets.
Dry run, would copy file://jotelha-fujitsu-ubuntu-20/home/jotelha/dtool/2021-12-31-20-47-16-239794-wrapjoinanddpdequilibration-wrapjoindatafileJHUA to lookup://server now.
Dry run, would copy file://jotelha-fujitsu-ubuntu-20/home/jotelha/dtool/ein-neuer-datensatz-mit-umlaut-im-item-namen to lookup://server now.
Dry run, would copy file://jotelha-fujitsu-ubuntu-20/home/jotelha/dtool/2022-02-09-test-dataset-with-umlaut-items to lookup://server now.
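
Since the point of --json is machine-readable output, a consumer might want to pick the JSON report out of such output. The following is only a sketch: it assumes the report and the verbose messages arrive in the same text stream, as shown above, which may not be the case in practice. The `parse_sync_report` helper and the sample string are hypothetical.

```python
import json


def parse_sync_report(output):
    """Decode the first JSON object in output; return it and the trailing text."""
    start = output.index("{")
    # raw_decode stops at the end of the first JSON document and reports
    # where it stopped, so trailing log lines do not cause an error.
    report, end = json.JSONDecoder().raw_decode(output, start)
    return report, output[end:].strip()


# Made-up sample mimicking the layout above, not real tool output.
sample = """{
    "equal": [],
    "changed": [],
    "missing": [
        {"uuid": "example-uuid", "type": "dataset", "uri": "file:///tmp/example"}
    ]
}
Copy missing datasets.
"""
report, trailing = parse_sync_report(sample)
missing_uris = [entry["uri"] for entry in report["missing"]]
```

If the messages turn out to go to a separate stream, a plain `json.loads` on the report is enough.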

and to compare against one base URI but sync to another, which allows comparing against a lookup server while transferring to an actual base URI, i.e.

dtool sync --dry-run --json --verbose ~/dtool lookup://server s3://test-bucket

If you look at the output above, there is some difference in how the timestamps come out of a direct storage broker and the lookup server. I have thus put a tolerance for timestamp comparison here, https://github.com/jotelha/dtool-sync/blob/1b60f8e62e252ac7bdabfb4b5c710f5c249ca135/dtool_sync/compare.py#L20-L29, but that is not a great solution.
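
For illustration, a tolerance-based comparison along these lines could look like the sketch below. It is not the code in compare.py: the `timestamps_match` name and the 1 ms default are assumptions, chosen because the lookup server values in the output above appear to be cut to millisecond precision.

```python
import math


def timestamps_match(lhs, rhs, tolerance=0.001):
    """Return True if two timestamps agree within tolerance seconds.

    A missing timestamp (None, e.g. no frozen_at on a proto dataset)
    only matches another missing timestamp.
    """
    if lhs is None or rhs is None:
        return lhs is rhs
    # rel_tol=0.0 makes this a purely absolute comparison.
    return math.isclose(lhs, rhs, rel_tol=0.0, abs_tol=tolerance)


# Values taken from the "equal" pairs in the output above.
same_created = timestamps_match(1645842296.531141, 1645842296.531)
same_frozen = timestamps_match(1645842336.447319, 1645842336.447)
```

The weakness remains that any fixed tolerance is arbitrary: two timestamps that really differ by less than the tolerance would be reported as equal.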

