automation-tools's Issues

Delete on complete doesn't work in transfer.py

transfer.py has an option to delete the transfer source files, but it doesn't seem to work. The issue appears to be that it tries to use shutil.rmtree with the path from transfers.db, but that path is only relative to the transfer source location, so the files can't actually be located. I think transfers.db might need to store an absolute path for this to work.
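
A hedged sketch of the suspected fix (the function and variable names are assumptions, not the repo's actual code): resolve the stored relative path against the transfer source location before deleting.

import os
import shutil

def delete_transfer(transfer_source_path, relative_path):
    # Rebuild the absolute path that transfers.db only stores relative to
    # the transfer source location.
    absolute_path = os.path.join(transfer_source_path, relative_path)
    if os.path.isdir(absolute_path):
        shutil.rmtree(absolute_path)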

Problem: Consider exposing status codes and response.text to calling functions

The work in #48 introduced an error handler, errors.py, to provide more information to the calling client; previously we'd return None for each failed request.

As we know, there is more information in the HTTP response that we might be able to take advantage of (see the authorisation example below). This might provide more useful feedback for users.

Curl example for authorisation:

curl -v -X GET "Authorization: ApiKey test:test" -H "Content-Type: application/json" -d '{u'username': 'demo', u'path': 'U2FtcGxlVHJhbnNmZXJz', u'api_key': 'dne'}' 'http://127.0.0.1:62081/api/v2/location/2a3d8d39-9cee-495e-b7ee-5e629254934d/browse/'
* Rebuilt URL to: Authorization: ApiKey test:test/
* Port number ended with ' '
* Closing connection -1
curl: (3) Port number ended with ' '
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 62081 (#0)
> GET /api/v2/location/2a3d8d39-9cee-495e-b7ee-5e629254934d/browse/ HTTP/1.1
> Host: 127.0.0.1:62081
> User-Agent: curl/7.55.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 61
> 
* upload completely sent off: 61 out of 61 bytes
< HTTP/1.1 401 UNAUTHORIZED
< Server: nginx/1.12.2
< Date: Fri, 02 Mar 2018 04:39:19 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Vary: Accept-Language, Cookie
< X-Frame-Options: SAMEORIGIN
< WWW-Authenticate: Basic Realm="django-tastypie"
< Content-Language: en
< 
* Connection #0 to host 127.0.0.1 left intact

NB: HTTP/1.1 401 UNAUTHORIZED could provide the client with a response more useful than None, or, following PR #48, an Invalid Response (ERR_INVALID_RESPONSE).

Similarly, by simplifying the response from the request handler, we also lose the opportunity to get more specific information from responses without DEBUG logging. For example, for a reingest where the transfer is already in progress we see:

WARNING   2018-04-06 15:40:58 utils.py:48   POST Request to http://127.0.0.1:62081/api/v2/file/6b0d6839-eed5-4b17-ab68-521a98b73f1e/reingest/ returned 409 CONFLICT
DEBUG     2018-04-06 15:40:58 utils.py:49   Response: {"error": true, "message": "This AIP is already being reingested on {pipeline}", "status_code": 409}
INFO      2018-04-06 15:40:58 reingest.py:192  Reingest UUID to work with None
Invalid response from server, check amclient log

The status code and message are ultimately more descriptive than Invalid response from server, check amclient log.
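
A minimal sketch (assumed names, not the actual errors.py API) of carrying the HTTP status and body back to the caller instead of a bare error constant:

class RequestError(Exception):
    def __init__(self, status_code, message):
        super(RequestError, self).__init__(message)
        self.status_code = status_code
        self.message = message

The request handler could then raise RequestError(response.status_code, response.text) on failure, letting the caller report 401 UNAUTHORIZED or 409 CONFLICT rather than None.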

Problem: this repo has no releases and is not using semver

Archivematica tries to use semver in all of our repositories, but this repo has not had any tagged releases.

For the Archivematica 1.7.0 release, we would like to specify which version of each associated repository is known to work with the release. Following semver would be particularly helpful in deployment repos like https://github.com/artefactual/deploy-pub, https://github.com/artefactual-labs/am and https://github.com/JiscRDSS/rdss-archivematica.

Link: http://semver.org/

Problem: transfers/amclient.py is not documented in the README

transfers/amclient.py is a nice module and CLI that holds functionality for interacting with the various Archivematica APIs (for example, it can be used to automate hiding of completed transfers/ingests in the dashboard). The repository's README does not mention it; it would be good to add information about it there (amclient.py's source code has docstrings that could be used for this purpose).

Problem: Turn transfers/utils.py into a self-contained library/class

The fix for #51 will likely be a workaround to get it up and running in good time. Currently, utils.py is called by two separate module scripts, but it contains its own logging.

It seems more appropriate for the scripts calling utils.py to manage their own logging, retrieving what they need to report from member variables within utils.py, by turning it into one or many classes. E.g.:

Instead of utils.py logging LOGGER.debug(response['content-type']), it should store the response in a member variable. When an exception is raised, amclient.py or transfers.py can opt to look into the response attributes on their own terms.

To improve this component of the transfers module, we will need to:

  • Remove logging calls
  • Raise exceptions instead of returning error codes when something goes wrong
  • Assign components of the request and response to member variables where required
  • Decide on which getters/setters will be most appropriate

This work could probably do with some discussion before it is implemented, however.
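
A rough sketch of the idea (hypothetical names throughout, not a definitive implementation): the handler stores the response on the instance and raises on failure, leaving logging decisions to amclient.py or transfers.py.

import requests

class APIRequest(object):
    def __init__(self):
        self.response = None

    def call_url_json(self, method, url, **kwargs):
        self.response = requests.request(method, url, **kwargs)
        # Raise requests.exceptions.HTTPError instead of returning an error
        # code; callers can inspect self.response on their own terms.
        self.response.raise_for_status()
        return self.response.json()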

Problem: installation instructions

When following the installation instructions to use the automation tools with a Docker install, I ran into a few problems. I think these are more general problems with the installation instructions, and it might be a good idea to update them. Before doing that I would like to know if you agree.

Proposed changes:

  1. Add sudo to the commands for creating the automation directories. All parent directories are owned by root, not the user.
  2. Add instructions for creating the archivematica group and user. In my case they didn't exist, but this might be because of the Docker installation.
  3. Add instructions for changing ownership of the new directories from root to the user.

If you think that this is a good idea I'll go ahead and create a PR for changing this.

Problem: amclient does not use log file location specified by `--log-file`

I tried to run the amclient script with the following bash script:

# cat /etc/archivematica/automation-tools/amclient-close-completed-transfers.sh
#!/bin/bash
cd /usr/lib/archivematica/automation-tools/
/usr/share/python/automation-tools/bin/python -m transfers.amclient --log-file /var/log/archivematica/automation-tools/amclient.log  close-completed-transfers --am-user-name autobot <API_KEY>

(note that /usr/lib/archivematica/automation-tools is a symlink to /opt/archivematica/automation-tools)

However, I get the following error:

# sudo -u archivematica /etc/archivematica/automation-tools/amclient-close-completed-transfers.sh
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/opt/archivematica/automation-tools/transfers/amclient.py", line 33, in <module>
    from transfers import utils
  File "transfers/utils.py", line 17, in <module>
    LOGGER = get_logger(defaults.AMCLIENT_LOG_FILE, defaults.DEFAULT_LOG_LEVEL)
  File "transfers/utils.py", line 13, in get_logger
    return loggingconfig.setup(log_level, log_file_name, "amclient")
  File "transfers/loggingconfig.py", line 54, in setup
    logging.config.dictConfig(CONFIG)
  File "/usr/lib/python2.7/logging/config.py", line 794, in dictConfig
    dictConfigClass(config).configure()
  File "/usr/lib/python2.7/logging/config.py", line 576, in configure
    '%r: %s' % (name, e))
ValueError: Unable to configure handler 'file': [Errno 13] Permission denied: '/opt/archivematica/automation-tools/transfers/amclient.log'

So it seems that it is still using the default log file location (os.path.join(THIS_DIR, 'amclient.log')).
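
Judging by the traceback, utils.py configures the logger at import time from defaults.AMCLIENT_LOG_FILE, before --log-file is ever parsed. A hedged sketch of one possible fix (not an actual patch): attach the file handler after argument parsing.

import logging

LOGGER = logging.getLogger('amclient')  # safe at import: no handlers yet

def configure_logging(log_file_name, log_level):
    handler = logging.FileHandler(log_file_name)
    LOGGER.addHandler(handler)
    LOGGER.setLevel(log_level)

main() would then call configure_logging(args.log_file, args.log_level) once the CLI arguments are known.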

Problem: processing hangs because the UUID for an ingest is missing in the SQLite database

From time to time the UUID is missing in the SQLite database. Automation Tools will never get a status update for such a unit, since without the UUID it constructs an API call that is not valid.

When running the API call manually, I got status "Completed", so updating the uuid field finishes this unit in the database.

It seems to me that something goes wrong when updating the db when changing from transfer to ingest as the unit type.

sqlite> select * from unit where id > 33;
id          uuid        path                                unit_type   status      microservice  current   
----------  ----------  ----------------------------------  ----------  ----------  ------------  ----------
34          7c7a75b5-9  auto/pliktmusikk_EAN_7071245336653  ingest      COMPLETE                  0         
35                      auto/pliktmusikk_EAN_7071245339920  ingest      PROCESSING                1 

From the log:
INFO 2019-05-15 09:51:01 transfer.py:579 Automation tools waking up
INFO 2019-05-15 09:51:01 transfer.py:69 Configuration values read for pidfile: /var/archivematica/automation-tools/transfers-pid.lck
INFO 2019-05-15 09:51:01 transfer.py:69 Configuration values read for databasefile: /var/archivematica/automation-tools/transfers.db
INFO 2019-05-15 09:51:01 transfer.py:618 Current unit: <Unit(id=35, uuid=None, unit_type=ingest, path=auto/pliktmusikk_EAN_7071245339920, status=PROCESSING, current=True)>
WARNING 2019-05-15 09:51:01 utils.py:56 GET Request to http://xxxxx:81/api/ingest/status/None/ returned 404 NOT FOUND
INFO 2019-05-15 09:51:01 transfer.py:632 Status info: Invalid response from server, check amclient log
ERROR 2019-05-15 09:51:01 transfer.py:641 Cannot read response from server for <Unit(id=35, uuid=None, unit_type=ingest, path=auto/pliktmusikk_EAN_7071245339920, status=PROCESSING, current=True)>: 'str' object has no attribute 'get'
INFO 2019-05-15 09:51:01 transfer.py:49 Running post-execution clean-up. Exiting script
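
For reference, a sketch of the manual workaround described above, with a placeholder for the UUID returned by the manual API call:

import sqlite3

conn = sqlite3.connect('/var/archivematica/automation-tools/transfers.db')
# Fill in the missing uuid so the stuck unit (id=35 above) can finish.
conn.execute('UPDATE unit SET uuid = ? WHERE id = ?', ('<uuid-from-api>', 35))
conn.commit()
conn.close()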

Problem: Apply ambv/black across AM codebase

Line-wrapping is a subjective part of code review (CR). There are ways to meet PEP8 compliance (79 characters) that look nicer than others. To make CR smoother, @sevein has suggested using black to add some auto-formatting capability to our workflow. The automation tools provide a good place to start and comment on. We will submit a PR against them and see how it looks to folks.

NB: the line-length flag (e.g. black --line-length 79) will need to be set to 79 to comply with PEP8 guidelines: https://github.com/ambv/black#line-length - the default in black is 88 (their rationale being that files come out shorter with the slightly wider lines).

Problem: test_create_dip.py test may fail because of DST

I'm getting a failure at https://github.com/artefactual/automation-tools/blob/master/tests/test_create_dip.py#L143 when I run test_create_dip_success locally:

>               assert aip_lastmodified - 1 <= lastmodified <= aip_lastmodified + 1
E               AssertionError: assert 1510853052 <= (1510849451 + 1)

This looks like a DST/timezone issue, because the dates are off by one hour (plus the expected one second):

  • 1510849451 is 2017-11-16 08:24:11 (local time)
  • 1510853052 is 2017-11-16 09:24:12 (local time)
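
A sketch of the likely mechanism (an assumption on my part, not confirmed in the test code): converting a UTC struct_time with time.mktime() interprets it as local time, so the result shifts by the local UTC offset, including the extra hour across a DST boundary.

import calendar
import time

utc_time = time.gmtime(1510849451)   # 2017-11-16 16:24:11 UTC
print(calendar.timegm(utc_time))     # 1510849451: round-trips correctly
print(int(time.mktime(utc_time)))    # shifted by the local UTC offset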

Problem: Need a script to enable bulk re-ingest

In some cases it is desirable to re-ingest AIPs against a common processing configuration so that they are all more or less consistent with each other. One use case put to the Archivematica team by CCA is the desire to have uncompressed AIPs in storage: a processing configuration could be used to re-ingest the packages so that, in the end, all AIPs are decompressed and stored that way.

Archivematica has a corresponding task in Redmine: CCA:11686

Problem: Dataverse pre-transfer hook doesn't have unit tests

As a developer I need to be able to understand the current state of legacy branches and their capability. Unit tests help to provide a baseline from which to understand that and then launch the development of new functionality and possible refactoring. Unit tests should be developed for the Dataverse branch to provide that.

The three commits that make up #15 are:

  1. 271e1df
  2. 5af9594
  3. 89e301e

SQLite db connection held open

transfers.db gets locked whenever transfer.py is running. That normally isn't very long, but in the event of large transfers it means transfers.db can be locked for hours (because transfer.py keeps running until the rsync completes).

Ideally the connection to transfers.db would only be open when it is really needed…

To reproduce: try to write to transfers.db while transfer.py is running.
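
A minimal sketch (not the repo's current code) of holding the connection open only around the statements that need it:

import contextlib
import sqlite3

@contextlib.contextmanager
def transfers_db(path='transfers.db'):
    conn = sqlite3.connect(path)
    try:
        yield conn
        conn.commit()
    finally:
        conn.close()  # release the lock as soon as the block exits

Long-running work such as the rsync would then happen outside any with transfers_db() as db: block, so transfers.db is only locked for the brief moments it is actually being written.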

Problem: Adding headers to requests.request() in utils.py will result in re-recording many vcrpy fixtures

In a recent pull request (#57) we added the ability to authenticate by header, which for some Archivematica API calls is the only possible mechanism. As noted by @jrwdunham, we can pass headers into the requests.request() call whether or not we also want to use parameters. The problem with this is that the request signature changes, and with it the vcrpy fixtures will need to change as well.

Because of the number of fixtures we have adopted and the range of unit tests created for many different scenarios, this will take some time to configure and run.

When we do this, we might also want to make an additional change in _call_url_json() outlined by Joel in the code snippet below to provide some nice logic depending on the type of request being sent.

    try:
        if method == METHOD_GET:
            response = requests.request(method, url=url, params=params,
                                        headers=headers)
        else:
            response = requests.request(method, url=url, data=params,
                                        headers=headers)
        LOGGER.debug('Response: %s', response)
        LOGGER.debug('type(response.text): %s', type(response.text))
        LOGGER.debug('Response content-type: %s',
                     response.headers['content-type'])
    except requests.exceptions.RequestException:
        # error handling would continue as in the existing _call_url_json()
        pass

Problem: The Automation Tools AMClient script could be easier to jump into and add functionality

As a developer I like to be able to see a clear division of responsibilities in code: for example, being able to see clearly where I can add new CLI args or new logic.

The amclient.py script in Automation Tools currently tries to do a lot in a single script: args, logging, and functional logic. I think there is an opportunity to pull those three pieces apart to make it easier for developers to work with, and to open the opportunity for smaller refactors in the future.

amclient.py is a good candidate for this attention, as its functionality is only likely to grow as it wraps the AM API capabilities in a consistent way.

Proposal:

  • Create a logging configuration script that can provide a model for sharing across the other automation tools (transfer.py would be the first to make use of this)
  • Separate the CLI argument definitions into a script of their own, so that it becomes an entry point for new functionality being added by developers.
  • Leave the core features of the script in place (API calls and handling) and import the new scripts at the module level.

This will reduce the size of each of the scripts and hopefully make it a bit easier for folk to delve into; a rough sketch of the argument split follows.
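
A hedged sketch of the proposed separation (the module contents here are illustrative): the CLI definitions live in their own module, imported by amclient.py, which keeps the API calls.

import argparse

def get_parser():
    parser = argparse.ArgumentParser(description='Archivematica API client')
    parser.add_argument('--log-file', help='location of the log file')
    # Each subcommand becomes a single registration here, making it an
    # obvious entry point for new functionality.
    subparsers = parser.add_subparsers(dest='subcommand')
    subparsers.add_parser('close-completed-transfers')
    return parser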

Potential dependency conflicts between automation-tools and urllib3

Hi, as shown in the following full dependency graph of automation-tools, automation-tools requires urllib3 (the latest version), while the installed version of requests (2.22.0) requires urllib3>=1.21.1,<1.26.

According to pip's "first found wins" installation strategy, urllib3 1.25.3 is the version actually installed.

Although the first-found version, urllib3 1.25.3, happens to satisfy the later dependency constraint (urllib3>=1.21.1,<1.26), it will lead to a build failure once developers release a newer version of urllib3.

Dependency tree:

automation-tools-master
| +-amclient(version range:==1.0.0rc2)
| +-enum34(version range:*)
| +-flake8(version range:==3.4.1)
| +-flake8-import-order(version range:==0.13)
| +-metsrw(version range:==0.3.8)
| +-mock(version range:*)
| +-pytest(version range:*)
| +-requests(version range:<3.0)
| | +-chardet(version range:>=3.0.2,<3.1.0)
| | +-idna(version range:>=2.5,<2.9)
| | +-urllib3(version range:>=1.21.1,<1.26)
| | +-certifi(version range:>=2017.4.17)
| +-six(version range:*)
| +-sqlalchemy(version range:*)
| +-urllib3(version range:*)
| +-vcrpy(version range:>=1.0.0)
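
One way to avoid the conflict (a suggestion, not a confirmed fix) would be to align the top-level pin with requests' own constraint, e.g. in requirements.txt:

urllib3>=1.21.1,<1.26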

Thanks for your attention.
Best,
Neolith

Automation error

I am trying to automate Archivematica and am getting the following error:

/etc/archivematica/automation-tools# sudo -u archivematica ./transfer-script.sh
/usr/share/python/automation-tools/bin/python: No module named transfers

Problem: no issue template

This repo should have an issue template that refers users to the archivematica/issues repo if the issue they are filing is specific to Archivematica.

Problem: Automation Tools has no GUI

The automation tools run as cron jobs. They have no user interface, neither command line nor GUI. For Archivematica users who want to troubleshoot a system without using ssh, there is no way to tell what is going on in the automation tools.

It might be possible to use Datasette (https://simonwillison.net/2017/Nov/13/datasette/) to partially resolve this problem. At least this would give users without ssh access the ability to get information out of the SQLite database used by the automation tools.

Problem: There are more filetypes than we can look up in dataverse pre-transfer

This portion of code provides a lookup for extensions that we might find when parsing dataverse.json files:

# https://github.com/IQSS/dataverse/blob/4.0.1/src/main/java/MimeTypeDisplay.properties
# https://github.com/IQSS/dataverse/blob/4.0.1/src/main/java/META-INF/mime.types
EXTENSION_MAPPING = {
    'Comma Separated Values': '.csv',
    'MS Excel (XLSX)': '.xlsx',
    'R Data': '.RData',
    'SPSS Portable': '.por',
    'SPSS SAV': '.sav',
    'Stata Binary': '.dta',
    'Stata 13 Binary': '.dta',
    'UNKNOWN': '',
}

If we follow those comments through to the development branch of dataverse, we see the two lists are much longer:

https://github.com/IQSS/dataverse/blob/cc2909ca18faaf74c3ac459cc812851828c57042/src/main/java/META-INF/mime.types

https://github.com/IQSS/dataverse/blob/cc2909ca18faaf74c3ac459cc812851828c57042/src/main/java/MimeTypeDisplay.properties

We may want to take these lists and convert them into more comprehensive lookup tables.
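
A hedged sketch of how such a table could be generated (the upstream files are assumed here to be simple 'key=value' Java properties lines, which would need verifying):

def parse_properties(lines):
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        key, _, value = line.partition('=')
        mapping[key.strip()] = value.strip()
    return mapping

Joining MimeTypeDisplay.properties (MIME type to display name) with mime.types (MIME type to extensions) would then yield a fuller display-name-to-extension mapping.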

Problem: this repo has no contributing.md

Would it be worthwhile to have a contributing.md for this repo? Questions to consider are how issues are filed in the Archivematica vs. AtoM vs. Binder projects, and whether we want this repo to follow the same pattern.

Problem: python shebang in py modules that are not executable

We only need to have a shebang if the script is an executable / entry point.

699b018 introduced shebangs in modules where I believe they're not needed:

> transfers/amclientargs.py:1:#!/usr/bin/env python
> transfers/defaults.py:1:#!/usr/bin/env python
> transfers/errors.py:1:#!/usr/bin/env python
> transfers/loggingconfig.py:1:#!/usr/bin/env python
> transfers/utils.py:1:#!/usr/bin/env python

Problem: Internal functions in AMClient.py aren't clearly delineated from subcommand functions

As a developer new to amclient.py I want to be able to quickly see which functions we want to expose as subcommands and which we want to use internally only. This helps me to see that a subcommand has been configured appropriately, and should promote ease of maintenance.

E.g. def get_next_package(self, next_path) is used at the module level only and cannot be exposed as a subcommand to the client when used on the terminal.
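
A hedged sketch of one convention that would make the split visible (hypothetical code, not current amclient.py): prefix internal helpers with an underscore so that only public methods are mapped to subcommands.

class AMClient(object):
    def _get_next_package(self, next_path):
        """Internal pagination helper; never exposed on the CLI."""

    def close_completed_transfers(self):
        """Public API call, surfaced as the close-completed-transfers
        subcommand."""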

Problem: when ingesting DSpace files, transfers sometimes fail to approve ("Unable to find unapproved transfer directory.")

Sometimes the automation tools fail to approve transfers, requiring manual intervention (i.e., approving manually in the dashboard). The log shows:

...
INFO      2017-09-06 21:05:01  transfer.py:388  Ready to start
INFO      2017-09-06 21:05:01  transfer.py:414  Approving [email protected]
INFO      2017-09-06 21:05:07  transfer.py:398  Failed approve, try 1 of 3
INFO      2017-09-06 21:05:07  transfer.py:414  Approving [email protected]
INFO      2017-09-06 21:05:13  transfer.py:398  Failed approve, try 2 of 3
INFO      2017-09-06 21:05:13  transfer.py:414  Approving [email protected]
INFO      2017-09-06 21:05:19  transfer.py:398  Failed approve, try 3 of 3
WARNING   2017-09-06 21:05:19  transfer.py:400  Not approved
...

When this happens, the corresponding database entry does not update (running the automated transfer script again does not update it). So for the package that needed to be manually approved and had its ingest completed, unit_type is "transfer", status remains blank, and uuid is also blank:

$ sqlite3 /var/archivematica/automation-tools/transfers.db "select * from unit order by id DESC limit 3;"
33150||[email protected]|transfer|||0|2017-09-06 21:05:19.837795
33149|c35028bf-a2e8-47a7-8049-20dea6fa44ad|[email protected]|ingest|COMPLETE||0|2017-09-06 20:35:07.745142
33148|62ba497b-14c7-4114-b1ff-124eb550e02f|[email protected]|ingest|COMPLETE||0|2017-09-06 20:05:08.633923

It is not clear how to reproduce this problem.
(The error above occurred in the dev/dspace branch.)

Problem: Sample README.md shell scripts are all on one line

Scripts on the README.md page are mostly all on one line, e.g.:

#!/bin/bash
cd /usr/lib/archivematica/automation-tools/
/usr/share/python/automation-tools/bin/python -m transfers.transfer --user <user> --api-key <apikey> --ss-user <user> --ss-api-key <apikey> --transfer-source <transfer_source_uuid> --config-file <config_file>

This becomes easier to scroll if we do something along these lines:

#!/bin/bash
cd /usr/lib/archivematica/automation-tools/
/usr/share/python/automation-tools/bin/python -m transfers.transfer --user <user> \
                                                                    --api-key <apikey> \
                                                                    --ss-user <user> \
                                                                    --ss-api-key <apikey> \
                                                                    --transfer-source <transfer_source_uuid> \
                                                                    --config-file <config_file>

create_dip.py script changes file names from BagIt transfers

Running create_dip.py on an AIP created from a transfer of a BagIt bag causes the initial three characters to be removed from all of the file names in the DIP objects directory. The following warning is output when this occurs:

INFO 2021-08-03 15:36:40 Moving file: objects/box4b12.bin
WARNING 2021-08-03 15:36:40 premis:originalName not starting with %transferDirectory%objects/

In the case above, the final file name in the (zipped) DIP objects directory was 4b12.bin.

Problem: fits:fslastmodified is not always available as a source for setting the modified date of DIP files

DIPs produced using automation-tools are supposed to re-write the last modified dates of files to what is recorded in the PREMIS metadata during Archivematica processing. However, this currently relies on last modified dates extracted by fits, which does not run reliably on all files; without this data present, the modified date re-write fails during DIP creation.

Given that fits elements are not always present in AIPs’ PREMIS metadata, the PREMIS element dateCreatedByApplication could be a preferable source for setting the modified date of DIP files produced with automation-tools.

If using dateCreatedByApplication is a desirable solution, I have a small fix to address this issue and can open up a PR for review in a fork.
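
A hedged sketch of the fallback (not the author's actual patch; the PREMIS date is assumed here to be an ISO 8601 string):

import calendar
import os
import time

def set_modified_date(path, date_created_by_application):
    parsed = time.strptime(date_created_by_application[:19],
                           '%Y-%m-%dT%H:%M:%S')
    timestamp = calendar.timegm(parsed)      # treat the PREMIS date as UTC
    os.utime(path, (timestamp, timestamp))   # set (atime, mtime)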
