automation-tools's Issues

Delete on complete doesn't work in transfer.py

transfer.py has an option to delete the transfer source files, but it doesn't seem to work. The issue appears to be that it tries to use shutil.rmtree with the path from transfers.db, but that path is only relative to the transfer source location, so the files can't actually be located. I think transfers.db might need to store an absolute path for this to work.
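
A hedged sketch of the suspected fix (the function and variable names are assumptions, not the repo's actual code): resolve the stored relative path against the transfer source location before deleting.

import os
import shutil

def delete_transfer(transfer_source_path, relative_path):
    # Rebuild the absolute path that transfers.db only stores relative to
    # the transfer source location.
    absolute_path = os.path.join(transfer_source_path, relative_path)
    if os.path.isdir(absolute_path):
        shutil.rmtree(absolute_path)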

Problem: Consider exposing status codes and response.text to calling functions

The work in #48 introduced an error handler, errors.py, to provide more information to the calling client; previously we'd return None for each failed request.

As we know, there is more information in the HTTP response that we might be able to take advantage of (see the authorisation example below). This might provide more useful feedback for users.

Curl example for authorisation:

curl -v -X GET "Authorization: ApiKey test:test" -H "Content-Type: application/json" -d '{u'username': 'demo', u'path': 'U2FtcGxlVHJhbnNmZXJz', u'api_key': 'dne'}' 'http://127.0.0.1:62081/api/v2/location/2a3d8d39-9cee-495e-b7ee-5e629254934d/browse/'
* Rebuilt URL to: Authorization: ApiKey test:test/
* Port number ended with ' '
* Closing connection -1
curl: (3) Port number ended with ' '
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 62081 (#0)
> GET /api/v2/location/2a3d8d39-9cee-495e-b7ee-5e629254934d/browse/ HTTP/1.1
> Host: 127.0.0.1:62081
> User-Agent: curl/7.55.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 61
> 
* upload completely sent off: 61 out of 61 bytes
< HTTP/1.1 401 UNAUTHORIZED
< Server: nginx/1.12.2
< Date: Fri, 02 Mar 2018 04:39:19 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Vary: Accept-Language, Cookie
< X-Frame-Options: SAMEORIGIN
< WWW-Authenticate: Basic Realm="django-tastypie"
< Content-Language: en
< 
* Connection #0 to host 127.0.0.1 left intact

NB: HTTP/1.1 401 UNAUTHORIZED could provide the client with a response more useful than None, or, following PR #48, an Invalid Response (ERR_INVALID_RESPONSE).

Similarly, by simplifying the response from the request handler, we also lose the opportunity to get more specific information from responses without DEBUG logging. For example, for a reingest where the transfer is already in progress we see:

WARNING   2018-04-06 15:40:58 utils.py:48   POST Request to http://127.0.0.1:62081/api/v2/file/6b0d6839-eed5-4b17-ab68-521a98b73f1e/reingest/ returned 409 CONFLICT
DEBUG     2018-04-06 15:40:58 utils.py:49   Response: {"error": true, "message": "This AIP is already being reingested on {pipeline}", "status_code": 409}
INFO      2018-04-06 15:40:58 reingest.py:192  Reingest UUID to work with None
Invalid response from server, check amclient log

The status code and message are ultimately more descriptive than Invalid response from server, check amclient log.
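
A minimal sketch (assumed names, not the actual errors.py API) of carrying the HTTP status and body back to the caller instead of a bare error constant:

class RequestError(Exception):
    def __init__(self, status_code, message):
        super(RequestError, self).__init__(message)
        self.status_code = status_code
        self.message = message

The request handler could then raise RequestError(response.status_code, response.text) on failure, letting the caller report 401 UNAUTHORIZED or 409 CONFLICT rather than None.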

Problem: this repo has no releases and is not using semver

Archivematica tries to use semver in all of our repositories, but this repo has not had any tagged releases.

For the Archivematica 1.7.0 release, we would like to specify which version of each associated repository is known to work with the release. Following semver would be particularly helpful in deployment repos like https://github.com/artefactual/deploy-pub, https://github.com/artefactual-labs/am and https://github.com/JiscRDSS/rdss-archivematica.

Link: http://semver.org/

Problem: transfers/amclient.py is not documented in the README

transfers/amclient.py is a nice module and CLI that holds functionality for interacting with the various Archivematica APIs (for example, it can be used to automate hiding of completed transfers/ingests in the dashboard). The repository's README does not mention it; it would be good to add information about it there (amclient.py's source code has docstrings that could be used for this purpose).

Problem: Turn transfers/utils.py into a self-contained library/class

The fix for #51 will likely be a workaround to get it up and running in good time. Currently, utils.py is called by two separate module scripts, but it contains its own logging.

It seems more appropriate for the scripts calling utils.py to manage their own logging, retrieving what they need to report from member variables within utils.py, by turning it into one or many classes. E.g.:

Instead of utils.py logging LOGGER.debug(response['content-type']), it should store the response in a member variable. When an exception is raised, amclient.py or transfers.py can opt to look into the response attributes on their own terms.

To improve this component of the transfers module, we will need to:

  • Remove logging calls
  • Raise exceptions instead of returning error codes when something goes wrong
  • Assign components of the request and response to member variables where required
  • Decide on which getters/setters will be most appropriate

This work could probably do with some discussion before it is implemented, however.
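
A rough sketch of the idea (hypothetical names throughout, not a definitive implementation): the handler stores the response on the instance and raises on failure, leaving logging decisions to amclient.py or transfers.py.

import requests

class APIRequest(object):
    def __init__(self):
        self.response = None

    def call_url_json(self, method, url, **kwargs):
        self.response = requests.request(method, url, **kwargs)
        # Raise requests.exceptions.HTTPError instead of returning an error
        # code; callers can inspect self.response on their own terms.
        self.response.raise_for_status()
        return self.response.json()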

Problem: installation instructions

When following the installation instructions to use the automation tools with a Docker install, I ran into a few problems. I think these are more general problems with the installation instructions, and it might be a good idea to update them. Before doing that I would like to know if you agree.

Proposed changes:

  1. Add sudo to the commands for creating the automation directories. All parent directories are owned by root, not the user.
  2. Add instructions for creating the archivematica group and user. In my case they didn't exist, but this might be because of the Docker installation.
  3. Add instructions for changing ownership of the new directories from root to the user.

If you think that this is a good idea I'll go ahead and create a PR for changing this.

Problem: amclient does not use log file location specified by `--log-file`

I tried to run the amclient script with the following bash script:

# cat /etc/archivematica/automation-tools/amclient-close-completed-transfers.sh
#!/bin/bash
cd /usr/lib/archivematica/automation-tools/
/usr/share/python/automation-tools/bin/python -m transfers.amclient --log-file /var/log/archivematica/automation-tools/amclient.log  close-completed-transfers --am-user-name autobot <API_KEY>

(note that /usr/lib/archivematica/automation-tools is a symlink to /opt/archivematica/automation-tools)

However, I get the following error:

# sudo -u archivematica /etc/archivematica/automation-tools/amclient-close-completed-transfers.sh
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/opt/archivematica/automation-tools/transfers/amclient.py", line 33, in <module>
    from transfers import utils
  File "transfers/utils.py", line 17, in <module>
    LOGGER = get_logger(defaults.AMCLIENT_LOG_FILE, defaults.DEFAULT_LOG_LEVEL)
  File "transfers/utils.py", line 13, in get_logger
    return loggingconfig.setup(log_level, log_file_name, "amclient")
  File "transfers/loggingconfig.py", line 54, in setup
    logging.config.dictConfig(CONFIG)
  File "/usr/lib/python2.7/logging/config.py", line 794, in dictConfig
    dictConfigClass(config).configure()
  File "/usr/lib/python2.7/logging/config.py", line 576, in configure
    '%r: %s' % (name, e))
ValueError: Unable to configure handler 'file': [Errno 13] Permission denied: '/opt/archivematica/automation-tools/transfers/amclient.log'

So it seems that it is still using the default log file location (os.path.join(THIS_DIR, 'amclient.log')).
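
Judging by the traceback, utils.py configures the logger at import time from defaults.AMCLIENT_LOG_FILE, before --log-file is ever parsed. A hedged sketch of one possible fix (not an actual patch): attach the file handler after argument parsing.

import logging

LOGGER = logging.getLogger('amclient')  # safe at import: no handlers yet

def configure_logging(log_file_name, log_level):
    handler = logging.FileHandler(log_file_name)
    LOGGER.addHandler(handler)
    LOGGER.setLevel(log_level)

main() would then call configure_logging(args.log_file, args.log_level) once the CLI arguments are known.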

Problem: processing hangs because the UUID for an ingest is missing in the SQLite database

From time to time the UUID is missing in the SQLite database. Automation Tools will never get a status update for such a unit, since without the UUID it constructs an API call that is not valid.

When running the API call manually, I got status "Completed", so updating the uuid field finishes this unit in the database.

It seems to me that something goes wrong when updating the db when changing from transfer to ingest as the unit type.

sqlite> select * from unit where id > 33;
id          uuid        path                                unit_type   status      microservice  current   
----------  ----------  ----------------------------------  ----------  ----------  ------------  ----------
34          7c7a75b5-9  auto/pliktmusikk_EAN_7071245336653  ingest      COMPLETE                  0         
35                      auto/pliktmusikk_EAN_7071245339920  ingest      PROCESSING                1 

From the log:
INFO 2019-05-15 09:51:01 transfer.py:579 Automation tools waking up
INFO 2019-05-15 09:51:01 transfer.py:69 Configuration values read for pidfile: /var/archivematica/automation-tools/transfers-pid.lck
INFO 2019-05-15 09:51:01 transfer.py:69 Configuration values read for databasefile: /var/archivematica/automation-tools/transfers.db
INFO 2019-05-15 09:51:01 transfer.py:618 Current unit: <Unit(id=35, uuid=None, unit_type=ingest, path=auto/pliktmusikk_EAN_7071245339920, status=PROCESSING, current=True)>
WARNING 2019-05-15 09:51:01 utils.py:56 GET Request to http://xxxxx:81/api/ingest/status/None/ returned 404 NOT FOUND
INFO 2019-05-15 09:51:01 transfer.py:632 Status info: Invalid response from server, check amclient log
ERROR 2019-05-15 09:51:01 transfer.py:641 Cannot read response from server for <Unit(id=35, uuid=None, unit_type=ingest, path=auto/pliktmusikk_EAN_7071245339920, status=PROCESSING, current=True)>: 'str' object has no attribute 'get'
INFO 2019-05-15 09:51:01 transfer.py:49 Running post-execution clean-up. Exiting script
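
For reference, a sketch of the manual workaround described above, with a placeholder for the UUID returned by the manual API call:

import sqlite3

conn = sqlite3.connect('/var/archivematica/automation-tools/transfers.db')
# Fill in the missing uuid so the stuck unit (id=35 above) can finish.
conn.execute('UPDATE unit SET uuid = ? WHERE id = ?', ('<uuid-from-api>', 35))
conn.commit()
conn.close()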

Problem: Apply ambv/black across AM codebase

Line-wrapping is a subjective part of code review (CR). There are ways to meet PEP8 compliance (79 characters) that look nicer than others. To make CR smoother, @sevein has suggested using black to add some auto-formatting capability to our workflow. The automation tools provide a good place to start and comment on. We will submit a PR against them and see how it looks to folks.

NB: the line-length flag (e.g. black --line-length 79) will need to be set to 79 to comply with PEP8 guidelines: https://github.com/ambv/black#line-length - the default in black is 88 (their rationale being that files come out shorter with the slightly wider lines).

Problem: test_create_dip.py test may fail because of DST

I'm getting a failure at https://github.com/artefactual/automation-tools/blob/master/tests/test_create_dip.py#L143 when I run test_create_dip_success locally:

>               assert aip_lastmodified - 1 <= lastmodified <= aip_lastmodified + 1
E               AssertionError: assert 1510853052 <= (1510849451 + 1)

This looks like a DST/timezone issue, because the dates are off by one hour (plus the expected one second):

  • 1510849451 is 2017-11-16 08:24:11 (local time)
  • 1510853052 is 2017-11-16 09:24:12 (local time)
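
A sketch of the likely mechanism (an assumption on my part, not confirmed in the test code): converting a UTC struct_time with time.mktime() interprets it as local time, so the result shifts by the local UTC offset, including the extra hour across a DST boundary.

import calendar
import time

utc_time = time.gmtime(1510849451)   # 2017-11-16 16:24:11 UTC
print(calendar.timegm(utc_time))     # 1510849451: round-trips correctly
print(int(time.mktime(utc_time)))    # shifted by the local UTC offset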

Problem: Need a script to enable bulk re-ingest

In some cases it is desirable to re-ingest AIPs against a common processing configuration so that they are all more or less consistent with each other. One use case put to the Archivematica team by CCA is the desire to have uncompressed AIPs in storage: a processing configuration could be used to re-ingest the packages so that, in the end, all AIPs are decompressed and stored that way.

Archivematica has a corresponding task in Redmine: CCA:11686

Problem: Dataverse pre-transfer hook doesn't have unit tests

As a developer I need to be able to understand the current state of legacy branches and their capability. Unit tests help to provide a baseline from which to understand that and then launch the development of new functionality and possible refactoring. Unit tests should be developed for the Dataverse branch to provide that.

The three commits that make up #15 are:

  1. 271e1df
  2. 5af9594
  3. 89e301e

SQLite db connection held open

transfers.db gets locked whenever transfer.py is running. That normally isn't very long, but in the event of large transfers it means transfers.db can be locked for hours (because transfer.py keeps running until the rsync completes).

Ideally the connection to transfers.db would only be open when it is really needed…

To reproduce: try to write to transfers.db while transfer.py is running.
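
A minimal sketch (not the repo's current code) of holding the connection open only around the statements that need it:

import contextlib
import sqlite3

@contextlib.contextmanager
def transfers_db(path='transfers.db'):
    conn = sqlite3.connect(path)
    try:
        yield conn
        conn.commit()
    finally:
        conn.close()  # release the lock as soon as the block exits

Long-running work such as the rsync would then happen outside any with transfers_db() as db: block, so transfers.db is only locked for the brief moments it is actually being written.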

Problem: Adding headers to requests.request() in utils.py will result in re-recording many vcrpy fixtures

In a recent pull request (#57) we added the ability to authenticate by header, which for some Archivematica API calls is the only possible mechanism. As noted by @jrwdunham, we can pass headers into the requests.request() call whether or not we also want to use parameters. The problem with this is that the request signature changes, and with it the vcrpy fixtures will need to change as well.

Because of the number of fixtures we have adopted and the range of unit tests created for many different scenarios, this will take some time to configure and run.

When we do this, we might also want to make an additional change in _call_url_json() outlined by Joel in the code snippet below to provide some nice logic depending on the type of request being sent.

    try:
        if method == METHOD_GET:
            response = requests.request(method, url=url, params=params,
                                        headers=headers)
        else:
            response = requests.request(method, url=url, data=params,
                                        headers=headers)
        LOGGER.debug('Response: %s', response)
        LOGGER.debug('type(response.text): %s', type(response.text))
        LOGGER.debug('Response content-type: %s',
                     response.headers['content-type'])
    except requests.exceptions.RequestException:
        # error handling would continue as in the existing _call_url_json()
        pass

Problem: The Automation Tools AMClient script could be easier to jump into and add functionality

As a developer I like to be able to see a clear division of responsibilities in code: for example, being able to see clearly where I can add new CLI args or new logic.

The amclient.py script in Automation Tools currently tries to do a lot in a single script: args, logging, and functional logic. I think there is an opportunity to pull those three pieces apart to make it easier for developers to work with, and to open the opportunity for smaller refactors in the future.

amclient.py is a good candidate for this attention, as its functionality is only likely to grow as it wraps the AM API capabilities in a consistent way.

Proposal:

  • Create a logging configuration script that can provide a model for sharing across the other automation tools (transfer.py would be the first to make use of this)
  • Separate the CLI argument definitions into a script of their own, so that it becomes an entry point for new functionality being added by developers.
  • Leave the core features of the script in place (API calls and handling) and import the new scripts at the module level.

This will reduce the size of each of the scripts and hopefully make it a bit easier for folk to delve into; a rough sketch of the argument split follows.
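
A hedged sketch of the proposed separation (the module contents here are illustrative): the CLI definitions live in their own module, imported by amclient.py, which keeps the API calls.

import argparse

def get_parser():
    parser = argparse.ArgumentParser(description='Archivematica API client')
    parser.add_argument('--log-file', help='location of the log file')
    # Each subcommand becomes a single registration here, making it an
    # obvious entry point for new functionality.
    subparsers = parser.add_subparsers(dest='subcommand')
    subparsers.add_parser('close-completed-transfers')
    return parser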

Potential dependency conflicts between automation-tools and urllib3

Hi, as shown in the following full dependency graph of automation-tools, automation-tools requires urllib3 (the latest version), while the installed version of requests (2.22.0) requires urllib3>=1.21.1,<1.26.

According to pip's "first found wins" installation strategy, urllib3 1.25.3 is the version actually installed.

Although the first-found version, urllib3 1.25.3, happens to satisfy the later dependency constraint (urllib3>=1.21.1,<1.26), it will lead to a build failure once developers release a newer version of urllib3.

Dependency tree:

automation-tools-master
| +-amclient(version range:==1.0.0rc2)
| +-enum34(version range:*)
| +-flake8(version range:==3.4.1)
| +-flake8-import-order(version range:==0.13)
| +-metsrw(version range:==0.3.8)
| +-mock(version range:*)
| +-pytest(version range:*)
| +-requests(version range:<3.0)
| | +-chardet(version range:>=3.0.2,<3.1.0)
| | +-idna(version range:>=2.5,<2.9)
| | +-urllib3(version range:>=1.21.1,<1.26)
| | +-certifi(version range:>=2017.4.17)
| +-six(version range:*)
| +-sqlalchemy(version range:*)
| +-urllib3(version range:*)
| +-vcrpy(version range:>=1.0.0)
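
One way to avoid the conflict (a suggestion, not a confirmed fix) would be to align the top-level pin with requests' own constraint, e.g. in requirements.txt:

urllib3>=1.21.1,<1.26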

Thanks for your attention.
Best,
Neolith

Automation error

I am trying to automate Archivematica and am getting the following error:

/etc/archivematica/automation-tools# sudo -u archivematica ./transfer-script.sh
/usr/share/python/automation-tools/bin/python: No module named transfers

Problem: no issue template

This repo should have an issue template that refers users to the archivematica/issues repo if the issue they are filing is specific to Archivematica.

Problem: Automation Tools has no GUI

The automation tools run as cron jobs. They have no user interface, neither command line nor GUI. For Archivematica users who want to troubleshoot a system without using ssh, there is no way to tell what is going on in the automation tools.

It might be possible to use Datasette (https://simonwillison.net/2017/Nov/13/datasette/) to partially resolve this problem. At least this would give users without ssh access the ability to get information out of the SQLite database used by the automation tools.

Problem: There are more filetypes than we can look up in dataverse pre-transfer

This portion of code provides a lookup for extensions that we might find when parsing dataverse.json files:

# https://github.com/IQSS/dataverse/blob/4.0.1/src/main/java/MimeTypeDisplay.properties
# https://github.com/IQSS/dataverse/blob/4.0.1/src/main/java/META-INF/mime.types
EXTENSION_MAPPING = {
    'Comma Separated Values': '.csv',
    'MS Excel (XLSX)': '.xlsx',
    'R Data': '.RData',
    'SPSS Portable': '.por',
    'SPSS SAV': '.sav',
    'Stata Binary': '.dta',
    'Stata 13 Binary': '.dta',
    'UNKNOWN': '',
}

If we follow those comments through to the development branch of dataverse, we see the two lists are much longer:

https://github.com/IQSS/dataverse/blob/cc2909ca18faaf74c3ac459cc812851828c57042/src/main/java/META-INF/mime.types

https://github.com/IQSS/dataverse/blob/cc2909ca18faaf74c3ac459cc812851828c57042/src/main/java/MimeTypeDisplay.properties

We may want to take these lists and convert them into more comprehensive lookup tables.
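
A hedged sketch of how such a table could be generated (the upstream files are assumed here to be simple 'key=value' Java properties lines, which would need verifying):

def parse_properties(lines):
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        key, _, value = line.partition('=')
        mapping[key.strip()] = value.strip()
    return mapping

Joining MimeTypeDisplay.properties (MIME type to display name) with mime.types (MIME type to extensions) would then yield a fuller display-name-to-extension mapping.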

Problem: this repo has no contributing.md

Would it be worthwhile to have a contributing.md for this repo? Questions to consider are how issues are filed in the Archivematica vs. AtoM vs. Binder projects, and whether we want this repo to follow the same pattern.

Problem: python shebang in py modules that are not executable

We only need to have a shebang if the script is an executable / entry point.

699b018 introduced shebangs in modules where I believe they're not needed:

> transfers/amclientargs.py:1:#!/usr/bin/env python
> transfers/defaults.py:1:#!/usr/bin/env python
> transfers/errors.py:1:#!/usr/bin/env python
> transfers/loggingconfig.py:1:#!/usr/bin/env python
> transfers/utils.py:1:#!/usr/bin/env python

Problem: Internal functions in AMClient.py aren't clearly delineated from subcommand functions

As a developer new to amclient.py I want to be able to quickly see which functions we want to expose as subcommands and which we want to use internally only. This helps me to see that a subcommand has been configured appropriately, and should promote ease of maintenance.

E.g. def get_next_package(self, next_path) is used at the module level only and cannot be exposed as a subcommand to the client when used on the terminal.
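
A hedged sketch of one convention that would make the split visible (hypothetical code, not current amclient.py): prefix internal helpers with an underscore so that only public methods are mapped to subcommands.

class AMClient(object):
    def _get_next_package(self, next_path):
        """Internal pagination helper; never exposed on the CLI."""

    def close_completed_transfers(self):
        """Public API call, surfaced as the close-completed-transfers
        subcommand."""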

Problem: when ingesting DSpace files, transfers sometimes fail to approve ("Unable to find unapproved transfer directory.")

Sometimes the automation tools fail to approve transfers, requiring manual intervention (i.e., approving manually in the dashboard). The log shows:

...
INFO      2017-09-06 21:05:01  transfer.py:388  Ready to start
INFO      2017-09-06 21:05:01  transfer.py:414  Approving [email protected]
INFO      2017-09-06 21:05:07  transfer.py:398  Failed approve, try 1 of 3
INFO      2017-09-06 21:05:07  transfer.py:414  Approving [email protected]
INFO      2017-09-06 21:05:13  transfer.py:398  Failed approve, try 2 of 3
INFO      2017-09-06 21:05:13  transfer.py:414  Approving [email protected]
INFO      2017-09-06 21:05:19  transfer.py:398  Failed approve, try 3 of 3
WARNING   2017-09-06 21:05:19  transfer.py:400  Not approved
...

When this happens, the corresponding database entry does not update (running the automated transfer script again does not update it). So for the package that needed to be manually approved and had its ingest completed, unit_type is "transfer", status remains blank, and uuid is also blank:

$ sqlite3 /var/archivematica/automation-tools/transfers.db "select * from unit order by id DESC limit 3;"
33150||[email protected]|transfer|||0|2017-09-06 21:05:19.837795
33149|c35028bf-a2e8-47a7-8049-20dea6fa44ad|[email protected]|ingest|COMPLETE||0|2017-09-06 20:35:07.745142
33148|62ba497b-14c7-4114-b1ff-124eb550e02f|[email protected]|ingest|COMPLETE||0|2017-09-06 20:05:08.633923

It is not clear how to reproduce this problem.
(The error above occurred in the dev/dspace branch.)

Problem: Sample README.md shell scripts are all on one line

Scripts on the README.md page are mostly all on one line, e.g.:

#!/bin/bash
cd /usr/lib/archivematica/automation-tools/
/usr/share/python/automation-tools/bin/python -m transfers.transfer --user <user> --api-key <apikey> --ss-user <user> --ss-api-key <apikey> --transfer-source <transfer_source_uuid> --config-file <config_file>

This becomes easier to scroll if we do something along these lines:

#!/bin/bash
cd /usr/lib/archivematica/automation-tools/
/usr/share/python/automation-tools/bin/python -m transfers.transfer --user <user> \
                                                                    --api-key <apikey> \
                                                                    --ss-user <user> \
                                                                    --ss-api-key <apikey> \
                                                                    --transfer-source <transfer_source_uuid> \
                                                                    --config-file <config_file>

create_dip.py script changes file names from BagIt transfers

Running create_dip.py on an AIP created from a transfer of a BagIt bag causes the initial three characters to be removed from all of the file names in the DIP objects directory. The following warning is output when this occurs:

INFO 2021-08-03 15:36:40 Moving file: objects/box4b12.bin
WARNING 2021-08-03 15:36:40 premis:originalName not starting with %transferDirectory%objects/

In the case above, the final file name in the (zipped) DIP objects directory was 4b12.bin.

Problem: fits:fslastmodified is not always available as a source for setting the modified date of DIP files

DIPs produced using automation-tools are supposed to re-write the last modified dates of files to what is recorded in the PREMIS metadata during Archivematica processing. However, this currently relies on last modified dates extracted by fits, which does not run reliably on all files; without this data present, the modified date re-write fails during DIP creation.

Given that fits elements are not always present in AIPs’ PREMIS metadata, the PREMIS element dateCreatedByApplication could be a preferable source for setting the modified date of DIP files produced with automation-tools.

If using dateCreatedByApplication is a desirable solution, I have a small fix to address this issue and can open up a PR for review in a fork.
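
A hedged sketch of the fallback (not the author's actual patch; the PREMIS date is assumed here to be an ISO 8601 string):

import calendar
import os
import time

def set_modified_date(path, date_created_by_application):
    parsed = time.strptime(date_created_by_application[:19],
                           '%Y-%m-%dT%H:%M:%S')
    timestamp = calendar.timegm(parsed)      # treat the PREMIS date as UTC
    os.utime(path, (timestamp, timestamp))   # set (atime, mtime)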
