artefactual / automation-tools
Tools to aid automation of Archivematica and AtoM.
License: GNU Affero General Public License v3.0
transfer.py has an option to delete the transfer source files, but it doesn't seem to work. The issue appears to be that it tries to use shutil.rmtree with the path from transfer.db, but that path is only relative to the transfer source location, so it's unable to locate the files. I think transfer.db might need to store an absolute path for this to work.
Would it be worthwhile to have a contributing.md for this repo? Questions to consider are how issues are filed in the Archivematica vs. AtoM vs. Binder projects and whether we want this repo to follow the same pattern?
Trying to automate Archivematica.
Getting the following error:
/etc/archivematica/automation-tools# sudo -u archivematica ./transfer-script.sh
/usr/share/python/automation-tools/bin/python: No module named transfers
Here:
automation-tools/transfers/transfer.py
Line 416 in 2a79db2
The file is opened in r+ mode, but we only use it to write, so we should probably open it with w only.
As a developer I like to be able to see a clear division of responsibilities in code - for example, being able to see clearly where I can add new CLI args, or new logic.
The AMClient.py script in Automation Tools currently tries to do a lot of this in a single script - args/logging/functional logic. I think there is an opportunity to pull those three pieces apart to make it easier for developers to work with and open the opportunity for smaller refactors in the future.
AMClient.py is a good candidate to have this attention paid to it as its functionality is only likely to grow as it wraps the AM API capabilities in a consistent way.
Proposal:
This will reduce the size of each of the scripts and hopefully make it a bit easier for folks to delve into.
Make the URL update with a #link when scrolling
The defaults have changed since the last refactor of AMClient.py and now reflect those used by people working with Docker installs. We likely need to put them back to the original values, but a good question raised by @jhsimpson is whether we should also take this opportunity to add a dev mode, so that those using automation tools can easily pick between two sets of defaults.
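A minimal sketch of such a dev mode, assuming an opt-in environment variable - the variable name AM_TOOLS_DEV and the URL values shown are hypothetical:

```python
import os

# Two candidate sets of defaults: values for a package-installed
# Archivematica, and values convenient for Docker dev installs.
# Both dicts here are illustrative, not the repo's actual defaults.
PROD_DEFAULTS = {"am_url": "http://127.0.0.1", "ss_url": "http://127.0.0.1:8000"}
DEV_DEFAULTS = {"am_url": "http://127.0.0.1:62080", "ss_url": "http://127.0.0.1:62081"}

def get_defaults():
    """Pick a default set based on an opt-in dev-mode flag."""
    if os.environ.get("AM_TOOLS_DEV", "").lower() in ("1", "true", "yes"):
        return DEV_DEFAULTS
    return PROD_DEFAULTS
```

Keeping the switch explicit and opt-in means production users are never surprised by Docker-oriented values.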
As a developer I need to be able to understand the current state of legacy branches and their capability. Unit tests help to provide a baseline from which to understand that and then launch the development of new functionality and possible refactoring. Unit tests should be developed for the Dataverse branch to provide that.
The three commits that make up #15 are:
Running create_dip.py on an AIP created from a transfer of a BagIt bag causes the first three characters to be removed from all of the file names in the DIP objects directory. The following warning is output when this occurs:
INFO 2021-08-03 15:36:40 Moving file: objects/box4b12.bin
WARNING 2021-08-03 15:36:40 premis:originalName not starting with %transferDirectory%objects/
In the case above, the final file name in the (zipped) DIP objects directory was: 4b12.bin
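A plausible mechanism for the three-character loss, sketched with hypothetical helper names: BagIt transfers keep payload files under data/, so the PREMIS originalName prefix is three characters shorter than the expected %transferDirectory%objects/, and slicing unconditionally by the expected prefix length eats three extra characters:

```python
EXPECTED_PREFIX = "%transferDirectory%objects/"

def strip_prefix_buggy(original_name):
    # Unconditional slicing by the expected prefix length: when the
    # name uses a different (shorter) prefix, leading characters of
    # the file name itself are lost.
    return original_name[len(EXPECTED_PREFIX):]

def strip_prefix_fixed(original_name):
    # Only strip when the expected prefix is actually present;
    # otherwise fall back to the basename.
    if original_name.startswith(EXPECTED_PREFIX):
        return original_name[len(EXPECTED_PREFIX):]
    return original_name.rsplit("/", 1)[-1]

# A bag's originalName: "%transferDirectory%data/" is 24 characters,
# the expected prefix is 27, so "box" is eaten by the buggy version.
bag_name = "%transferDirectory%data/box4b12.bin"
```

This reproduces the exact symptom in the warning: box4b12.bin becomes 4b12.bin.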
transfers.db gets locked whenever transfer.py is running – which normally isn't very long, but in the event of large transfers, this means that transfers.db can be locked for hours (because transfer.py is running until the rsync completes).
Ideally the connection to transfers.db would only be open when it is really needed…
To reproduce: try to write to transfers.db while transfer.py is running.
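A sketch of the suggested change, holding the SQLite connection only for the duration of a single operation rather than for the whole run (path and schema here are illustrative):

```python
import contextlib
import sqlite3

@contextlib.contextmanager
def transfers_db(path="/var/archivematica/automation-tools/transfers.db"):
    """Open the database only while a single operation needs it, so the
    lock is not held for the hours an rsync-backed transfer can take."""
    connection = sqlite3.connect(path)
    try:
        yield connection
        connection.commit()
    finally:
        connection.close()
```

Long-running work such as the rsync would then happen outside the with-block, leaving transfers.db writable for other processes in between.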
Writing the README manually is time-consuming and difficult to format consistently. Comments from a previous code review point to the idea of auto-generating the portions of README.md that discuss amclient.py. Changes could then be made in bulk, and greater flexibility given to finding the right style to adopt.
E.g.
The fix for #51 will likely be a workaround to get it up and running in good time. Currently, utils.py is called by two separate module scripts but contains its own logging. It seems more appropriate for the scripts calling utils.py to manage their own logging, retrieving what they need to report on from member variables within utils.py, by turning it into one or more classes. For example, instead of utils.py logging LOGGER.debug(response['content-type']), it should store the response in a member variable; on returning an exception, amclient.py or transfers.py can opt to look into the response attributes on their own terms.
To improve this component of the transfers module, we will need to:
This work could probably do with some discussion before it is implemented however.
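The proposal above could be sketched roughly as follows - the class and method names are hypothetical, and the request function is injected so the sketch stays self-contained:

```python
class ApiCaller:
    """Hypothetical shape for the split: the request helper stores the
    raw response on the instance instead of logging it, so the calling
    script (amclient.py or transfers.py) can do its own reporting."""

    def __init__(self, fetch):
        # fetch stands in for requests.request; injecting it keeps
        # this sketch free of network calls.
        self.fetch = fetch
        self.last_response = None

    def call_url(self, method, url):
        self.last_response = self.fetch(method, url)
        return self.last_response
```

With this shape, callers can decide for themselves whether a response's content-type is worth logging, at whatever level suits them.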
See: https://github.com/artefactual/automation-tools/blob/master/transfers/transfer.py#L373
We're comparing a string that comes into the function with an fsencoded (byte) string, which will result in a match not being found in Python3.
Plain and encoded strings look as follows:
Plain: transfer_1
FSEncoded: b'transfer_1'
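A quick demonstration of the mismatch, and two ways to restore the comparison:

```python
import os

path = "transfer_1"
encoded = os.fsencode(path)  # b'transfer_1'

# In Python 3, str never compares equal to bytes, so a membership test
# against fsencoded names silently finds no match:
assert path != encoded
assert path not in [encoded]

# Decoding the fsencoded value (or encoding the comparand) fixes it:
assert os.fsdecode(encoded) == path
assert os.fsencode(path) in [encoded]
```

Either side of the comparison can be normalised; the important thing is that both operands end up as the same type before the lookup.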
Automation tools runs as cron jobs. It has no user interface - no command line or GUI. For Archivematica users who want to troubleshoot a system without using ssh, it is not possible to tell what is going on in the automation tools.
It might be possible to use Datasette (https://simonwillison.net/2017/Nov/13/datasette/) to partially resolve this problem. At least this would give users without ssh access the ability to get information out of the SQLite database used by the automation tools.
DIPs produced using automation-tools are supposed to re-write the last modified dates of files to what is recorded in the PREMIS metadata during Archivematica processing. However, this currently relies on last modified dates extracted by fits, which does not run reliably on all files, and without this data the modified-date re-write fails during DIP creation.
Given that fits elements are not always present in AIPs' PREMIS metadata, the PREMIS element dateCreatedByApplication could be a preferable source for setting the modified date of DIP files produced with automation-tools.
If using dateCreatedByApplication is a desirable solution, I have a small fix to address this issue and can open a PR for review from a fork.
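A hedged sketch of what such a fix might look like; the function name is hypothetical, and the PREMIS element layout and date format are simplified here (real values may carry a time component and sit deeper in the METS):

```python
import os
import time
import xml.etree.ElementTree as ElementTree

PREMIS_NS = "{info:lc/xmlns/premis-v2}"

def set_modified_date(premis_xml, file_path):
    """Use PREMIS dateCreatedByApplication - rather than the
    fits-extracted date, which is not reliably present - to set a DIP
    file's modified date. Returns False when no date is available."""
    root = ElementTree.fromstring(premis_xml)
    element = root.find(".//%sdateCreatedByApplication" % PREMIS_NS)
    if element is None or not element.text:
        return False  # no data: leave the filesystem date untouched
    # Only the date portion is parsed in this simplified sketch.
    timestamp = time.mktime(time.strptime(element.text[:10], "%Y-%m-%d"))
    os.utime(file_path, (timestamp, timestamp))
    return True
```

Falling back gracefully when the element is absent keeps DIP creation from failing outright, which is the core complaint above.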
@sevein suggested a setup/teardown method with the possibility of overriding custom attributes later in the code. This would make AMClient available to all member functions. What other options are available to avoid providing constants like AM_URL and SS_API_KEY each time?
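One sketch of the suggested setup method, with defaults applied once and overridable per test - the class name, attribute names, and default values are assumptions:

```python
class TestSetup:
    """Defaults are applied once; individual tests override only the
    attributes they care about, instead of repeating every constant."""

    DEFAULTS = {
        "am_url": "http://127.0.0.1:62080",
        "am_api_key": "test",
        "ss_api_key": "test",
    }

    def setup(self, **overrides):
        config = dict(self.DEFAULTS)
        config.update(overrides)
        for name, value in config.items():
            setattr(self, name, value)
        return self

# A test then builds its client context without repeating constants:
# client = TestSetup().setup(ss_api_key="secret")
```

The same pattern works whether the defaults live on a test base class or are passed to the AMClient constructor itself.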
Scripts on the README.md page are mostly all on one line, e.g.
#!/bin/bash
cd /usr/lib/archivematica/automation-tools/
/usr/share/python/automation-tools/bin/python -m transfers.transfer --user <user> --api-key <apikey> --ss-user <user> --ss-api-key <apikey> --transfer-source <transfer_source_uuid> --config-file <config_file>
This becomes easier to scroll if we do something along these lines:
#!/bin/bash
cd /usr/lib/archivematica/automation-tools/
/usr/share/python/automation-tools/bin/python -m transfers.transfer --user <user> \
--api-key <apikey> \
--ss-user <user> \
--ss-api-key <apikey> \
--transfer-source <transfer_source_uuid> \
--config-file <config_file>
Line-wrapping is a subjective part of code review (CR). There are ways to meet PEP8 compliance (79 characters) that look nicer than others. By way of making CR smoother, @sevein has suggested the use of black to provide some auto-formatting capability to our workflow. The Automation tools provides a good place to start and comment on. We will submit a PR against them and see how it looks to folks.
NB: the line-length flag will need to be set to 79 to comply with PEP 8 guidelines: https://github.com/ambv/black#line-length - the default in black is 88 (their rationale: files come out shorter with the additional width).
Per: #46 (review) and (http://python-packaging.readthedocs.io/en/latest/command-line-scripts.html)
Additional comments from @sevein here: #47 (comment)
The latest changes in automation tools only work in Python >= 3.8.
Connects to AM issue #1632
We have received a PR: #44 to add deletion on completion to Automation tools and have decided it should be merged into master following any additional code-review or enhancements.
In some cases it is desirable to re-ingest AIPs against a common processing-configuration so that they are all more-or-less consistent with each other. One use case put to the Archivematica team by CCA is the desire to have uncompressed AIPs in storage, whereby a processing configuration can be used to re-ingest the packages so that the final result is that all AIPs are decompressed and stored that way.
Archivematica has a corresponding task in Redmine: CCA:11686
I tried to run the amclient script with the following bash script:
# cat /etc/archivematica/automation-tools/amclient-close-completed-transfers.sh
#!/bin/bash
cd /usr/lib/archivematica/automation-tools/
/usr/share/python/automation-tools/bin/python -m transfers.amclient --log-file /var/log/archivematica/automation-tools/amclient.log close-completed-transfers --am-user-name autobot <API_KEY>
(note that /usr/lib/archivematica/automation-tools is a symlink to /opt/archivematica/automation-tools)
However, I'm getting the following error:
# sudo -u archivematica /etc/archivematica/automation-tools/amclient-close-completed-transfers.sh
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/opt/archivematica/automation-tools/transfers/amclient.py", line 33, in <module>
from transfers import utils
File "transfers/utils.py", line 17, in <module>
LOGGER = get_logger(defaults.AMCLIENT_LOG_FILE, defaults.DEFAULT_LOG_LEVEL)
File "transfers/utils.py", line 13, in get_logger
return loggingconfig.setup(log_level, log_file_name, "amclient")
File "transfers/loggingconfig.py", line 54, in setup
logging.config.dictConfig(CONFIG)
File "/usr/lib/python2.7/logging/config.py", line 794, in dictConfig
dictConfigClass(config).configure()
File "/usr/lib/python2.7/logging/config.py", line 576, in configure
'%r: %s' % (name, e))
ValueError: Unable to configure handler 'file': [Errno 13] Permission denied: '/opt/archivematica/automation-tools/transfers/amclient.log'
So it seems that it is still using the default log file location (os.path.join(THIS_DIR, 'amclient.log'))
This is just bad practice and, as @sevein points out, makes AMClient not thread-safe.
The utility function _call_url_json should just be _call_url, since the JSON decoding can be turned off.
Archivematica tries to use semver in all of our repositories, but this repo has not had any tagged releases.
For the Archivematica 1.7.0 release, we would like to specify which version of each associated repository is known to work with the release. Following semver would be particularly helpful in deployment repos like https://github.com/artefactual/deploy-pub, https://github.com/artefactual-labs/am and https://github.com/JiscRDSS/rdss-archivematica
link:
http://semver.org/
Considering this comment: #58 (comment), we might introduce a config file for amclient.py. An alternative might be to supply values via environment variables; an example of that approach is twarc's way of reading Twitter authentication keys: https://github.com/ruebot/twarc#install
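A sketch of the environment-variable alternative with a CLI-wins precedence; the variable names are hypothetical, not existing amclient.py settings:

```python
import os

def setting(cli_value, env_name, default=None):
    """Resolve a config value: the CLI argument wins, then the
    environment, then a built-in default."""
    if cli_value is not None:
        return cli_value
    return os.environ.get(env_name, default)

# e.g. am_url = setting(args.am_url, "AMCLIENT_AM_URL", "http://127.0.0.1")
```

This precedence order mirrors what tools like twarc do for credentials: explicit arguments always override ambient configuration.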
We only need a shebang if the script is an executable / entry point.
699b018 introduced shebangs in modules where I believe they're not needed:
> transfers/amclientargs.py:1:#!/usr/bin/env python
> transfers/defaults.py:1:#!/usr/bin/env python
> transfers/errors.py:1:#!/usr/bin/env python
> transfers/loggingconfig.py:1:#!/usr/bin/env python
> transfers/utils.py:1:#!/usr/bin/env python
From time to time the UUID is missing in the SQLite database. Automation Tools will never get a status update for such a unit, since the missing UUID leads it to construct an API call that is not valid.
When running the API call manually, I got status "Completed", so updating the uuid field finishes this unit in the database.
It seems to me that something goes wrong when updating the db, when the unit type changes from transfer to ingest.
sqlite> select * from unit where id > 33;
id uuid path unit_type status microservice current
---------- ---------- ---------------------------------- ---------- ---------- ------------ ----------
34 7c7a75b5-9 auto/pliktmusikk_EAN_7071245336653 ingest COMPLETE 0
35 auto/pliktmusikk_EAN_7071245339920 ingest PROCESSING 1
From the log:
INFO 2019-05-15 09:51:01 transfer.py:579 Automation tools waking up
INFO 2019-05-15 09:51:01 transfer.py:69 Configuration values read for pidfile: /var/archivematica/automation-tools/transfers-pid.lck
INFO 2019-05-15 09:51:01 transfer.py:69 Configuration values read for databasefile: /var/archivematica/automation-tools/transfers.db
INFO 2019-05-15 09:51:01 transfer.py:618 Current unit: <Unit(id=35, uuid=None, unit_type=ingest, path=auto/pliktmusikk_EAN_7071245339920, status=PROCESSING, current=True)>
WARNING 2019-05-15 09:51:01 utils.py:56 GET Request to http://xxxxx:81/api/ingest/status/None/ returned 404 NOT FOUND
INFO 2019-05-15 09:51:01 transfer.py:632 Status info: Invalid response from server, check amclient log
ERROR 2019-05-15 09:51:01 transfer.py:641 Cannot read response from server for <Unit(id=35, uuid=None, unit_type=ingest, path=auto/pliktmusikk_EAN_7071245339920, status=PROCESSING, current=True)>: 'str' object has no attribute 'get'
INFO 2019-05-15 09:51:01 transfer.py:49 Running post-execution clean-up. Exiting script
The 'dev/dataverse' branch is currently out of date and needs to be rebased. An initial attempt shows the only conflict is in requirements/base.txt, but I will investigate further.
The work in #48 introduced an error handler, errors.py, to provide more information to the calling client; previously we'd return None for each failed request.
As we know, there is more information in the HTTP response that we might be able to take advantage of - see the authorisation example below. This might provide more useful feedback for users.
Curl example for authorisation:
curl -v -X GET "Authorization: ApiKey test:test" -H "Content-Type: application/json" -d '{u'username': 'demo', u'path': 'U2FtcGxlVHJhbnNmZXJz', u'api_key': 'dne'}' 'http://127.0.0.1:62081/api/v2/location/2a3d8d39-9cee-495e-b7ee-5e629254934d/browse/'
* Rebuilt URL to: Authorization: ApiKey test:test/
* Port number ended with ' '
* Closing connection -1
curl: (3) Port number ended with ' '
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 62081 (#0)
> GET /api/v2/location/2a3d8d39-9cee-495e-b7ee-5e629254934d/browse/ HTTP/1.1
> Host: 127.0.0.1:62081
> User-Agent: curl/7.55.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 61
>
* upload completely sent off: 61 out of 61 bytes
< HTTP/1.1 401 UNAUTHORIZED
< Server: nginx/1.12.2
< Date: Fri, 02 Mar 2018 04:39:19 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Vary: Accept-Language, Cookie
< X-Frame-Options: SAMEORIGIN
< WWW-Authenticate: Basic Realm="django-tastypie"
< Content-Language: en
<
* Connection #0 to host 127.0.0.1 left intact
NB: HTTP/1.1 401 UNAUTHORIZED could provide the client with a response more useful than None or, after PR #48, an Invalid Response (ERR_INVALID_RESPONSE).
Similarly, by simplifying the response from the request handler, we also lose the opportunity to get more specific information from responses without DEBUG logging, e.g. for a reingest where the transfer is already in progress we see:
WARNING 2018-04-06 15:40:58 utils.py:48 POST Request to http://127.0.0.1:62081/api/v2/file/6b0d6839-eed5-4b17-ab68-521a98b73f1e/reingest/ returned 409 CONFLICT
DEBUG 2018-04-06 15:40:58 utils.py:49 Response: {"error": true, "message": "This AIP is already being reingested on {pipeline}", "status_code": 409}
INFO 2018-04-06 15:40:58 reingest.py:192 Reingest UUID to work with None
Invalid response from server, check amclient log
The status code and message are ultimately more descriptive than Invalid response from server, check amclient log.
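A sketch of how the handler could carry the server's own message forward instead of collapsing it; only the name ERR_INVALID_RESPONSE comes from #48 - the tuple result shape below is an assumption for illustration:

```python
# Error constant named after the one introduced in #48.
ERR_INVALID_RESPONSE = -1

def interpret(status_code, body):
    """On success return the decoded body; on failure keep the status
    code and the server's own message alongside the error code, so the
    caller can log something more descriptive."""
    if 200 <= status_code < 300:
        return body
    message = body.get("message") if isinstance(body, dict) else None
    return (ERR_INVALID_RESPONSE, status_code, message)
```

For the 409 reingest example above, the caller could then report "409: This AIP is already being reingested" without needing DEBUG logging turned on.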
This portion of code provides a lookup for extensions that we might find when parsing dataverse.json
files:
automation-tools/transfers/examples/pre-transfer/dataverse.py
Lines 23 to 34 in 2e828e0
If we follow the comments to the development branch of dataverse, we see the two lists are much longer:
We may want to take these lists and convert them into more comprehensive lookup tables.
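For illustration, such a lookup could become a single dict that new entries from the development branch are merged into - the pairings shown are examples, not the authoritative Dataverse lists:

```python
# Illustrative pairings only - not the authoritative Dataverse lists.
TABULAR_EXTENSION_MAPPING = {
    ".sav": ".tab",  # SPSS
    ".dta": ".tab",  # Stata
    ".csv": ".tab",
}

def mapped_name(filename):
    """Replace a recognised tabular extension, case-insensitively."""
    for extension, replacement in TABULAR_EXTENSION_MAPPING.items():
        if filename.lower().endswith(extension):
            return filename[: -len(extension)] + replacement
    return filename
```

A dict keeps the mapping in one place, so extending it to the longer development-branch lists is a data change rather than a code change.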
This repo should have an issue template that refers to the archivematica/issues repo if the issue they are filing is specific to archivematica.
Per line: https://github.com/artefactual/automation-tools/blob/master/tests/test_transfers.py#L209
Introduced in my previous commit: 699b018
When using the installation instructions for the automation tools with a Docker install, I ran into a few problems. I think these are more general problems with the installation instructions, and it might be a good idea to update them. Before doing that I would like to know if you agree.
Proposed changes:
- Add sudo to the commands for creating the automation directories. All parent directories are owned by root, not the user.
- Create the archivematica group and user. In my case they didn't exist - but this might be because of the docker installation.

If you think that this is a good idea I'll go ahead and create a PR for changing this.
Sometimes automation tools fail to approve transfers, requiring manual intervention (i.e., approve manually in the dashboard). Log shows:
...
INFO 2017-09-06 21:05:01 transfer.py:388 Ready to start
INFO 2017-09-06 21:05:01 transfer.py:414 Approving [email protected]
INFO 2017-09-06 21:05:07 transfer.py:398 Failed approve, try 1 of 3
INFO 2017-09-06 21:05:07 transfer.py:414 Approving [email protected]
INFO 2017-09-06 21:05:13 transfer.py:398 Failed approve, try 2 of 3
INFO 2017-09-06 21:05:13 transfer.py:414 Approving [email protected]
INFO 2017-09-06 21:05:19 transfer.py:398 Failed approve, try 3 of 3
WARNING 2017-09-06 21:05:19 transfer.py:400 Not approved
...
When this happens, the corresponding database entry does not update (running the automated transfer script again does not update it), so for the package that needed to be manually approved and had its ingest completed, unit_type is "transfer" and status remains blank; uuid is also blank:
$ sqlite3 /var/archivematica/automation-tools/transfers.db "select * from unit order by id DESC limit 3;"
33150||[email protected]|transfer|||0|2017-09-06 21:05:19.837795
33149|c35028bf-a2e8-47a7-8049-20dea6fa44ad|[email protected]|ingest|COMPLETE||0|2017-09-06 20:35:07.745142
33148|62ba497b-14c7-4114-b1ff-124eb550e02f|[email protected]|ingest|COMPLETE||0|2017-09-06 20:05:08.633923
It is not clear how to be able to reproduce this problem.
(Error above occurred in the dev/dspace branch)
transfers/amclient.py is a nice module and CLI that holds functionality for interacting with the various Archivematica APIs (for example, it can be used to automate hiding completed transfers/ingests in the dashboard). The repository's README does not mention it; it would be good to add information about it there (amclient.py's source code has docstrings that could be used for this purpose).
I'm getting a failure at https://github.com/artefactual/automation-tools/blob/master/tests/test_create_dip.py#L143 when I run test_create_dip_success locally:
> assert aip_lastmodified - 1 <= lastmodified <= aip_lastmodified + 1
E AssertionError: assert 1510853052 <= (1510849451 + 1)
This looks like a (DST) timezone issue because the dates are off by 1 hour (plus the expected 1 second):
1510849451 is 2017-11-16 08:24:11
1510853052 is 2017-11-16 09:24:12

We have a second call to getattr that can be triggered, resulting in most cases in a second HTTP request being made and the wrong result being sent to the client terminal: https://github.com/artefactual/automation-tools/blob/master/transfers/amclient.py#L128
Introduced here: 699b018
In a recent pull request (#57) we added the ability to authenticate by header, which for some Archivematica API calls is the only possible mechanism. As noted by @jrwdunham, we can pass headers into the requests.request() call whether we want to use parameters or not. The problem is that the request signature changes, and with it the vcrpy fixtures will need to change as well.
Because of the number of fixtures we have adopted, and the range of unit tests created for many different scenarios, this will take some time to configure and run.
When we do this, we might also want to make an additional change in _call_url_json(), outlined by Joel in the code snippet below, to provide some nice logic depending on the type of request being sent.
try:
    if method == METHOD_GET:
        response = requests.request(method, url=url, params=params,
                                    headers=headers)
    else:
        response = requests.request(method, url=url, data=params,
                                    headers=headers)
    LOGGER.debug('Response: %s', response)
    LOGGER.debug('type(response.text): %s', type(response.text))
    LOGGER.debug('Response content-type: %s',
                 response.headers['content-type'])
except requests.exceptions.RequestException:
    pass  # (exception handling elided in the original snippet)
Hi, as shown in the following full dependency graph of automation-tools, automation-tools requires urllib3 (unpinned, so resolving to the latest version), while the installed version of requests (2.22.0) requires urllib3>=1.21.1,<1.26.
According to pip's "first found wins" installation strategy, urllib3 1.25.3 is the version actually installed.
Although this first-found version, urllib3 1.25.3, happens to satisfy the later dependency constraint (urllib3>=1.21.1,<1.26), it will lead to a build failure once a newer version of urllib3 is released.
automation-tools-master
| +-amclient(version range:==1.0.0rc2)
| +-enum34(version range:*)
| +-flake8(version range:==3.4.1)
| +-flake8-import-order(version range:==0.13)
| +-metsrw(version range:==0.3.8)
| +-mock(version range:*)
| +-pytest(version range:*)
| +-requests(version range:<3.0)
| | +-chardet(version range:>=3.0.2,<3.1.0)
| | +-idna(version range:>=2.5,<2.9)
| | +-urllib3(version range:>=1.21.1,<1.26)
| | +-certifi(version range:>=2017.4.17)
| +-six(version range:*)
| +-sqlalchemy(version range:*)
| +-urllib3(version range:*)
| +-vcrpy(version range:>=1.0.0)
Thanks for your attention.
Best,
Neolith
While installing the automation tools, a person not as familiar with Python, or who does not know what virtualenv is, may get hung up on step two of the installation instructions: "Create virtualenv /usr/share/python/automation-tools and pip install requirements there".
As a developer new to amclient.py I want to be able to quickly see which functions we want to expose as subcommands and which we want to use internally only. This helps me to see that a subcommand has been configured appropriately, and should promote ease of maintenance.
E.g. def get_next_package(self, next_path) is used at the module level only and cannot be exposed as a subcommand to the client when used on the terminal.
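One way to make that split visible is the usual leading-underscore convention, so subcommand discovery can skip internal helpers automatically - this is an illustrative sketch, not the current amclient.py code:

```python
class AMClient:
    """Illustrative sketch only - not the current amclient.py class."""

    def _get_next_package(self, next_path):
        # Leading underscore: module-internal helper, never a subcommand.
        return next_path

    def completed_transfers(self):
        # Public: safe to wire up as a CLI subcommand.
        return []

def subcommands(cls):
    """Discover CLI-exposable methods by skipping underscored names."""
    return [name for name in vars(cls)
            if not name.startswith("_") and callable(getattr(cls, name))]
```

The naming convention doubles as documentation: a new developer can tell at a glance which methods are part of the CLI surface.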
Here:
automation-tools/transfers/amclient.py
Line 185 in 7efd17e
The return from utils.py can be non-None even if there is an issue trying to talk to the server, i.e. we return error codes, not None. As such, even in an error scenario a result can be added to the completed_successfully list we then return to the caller.