archivematica-storage-service's People

Contributors

ablwr, cole, dhwaniartefact, eckardm, fitnycdigitalinitiatives, hakamine, helenst, hwesta, j4bbi, jambun, jhsimpson, jraddaoui, jrwdunham, kidsseeghosts, klavman, mamedin, marktriggs, mcantelon, mistydemeo, payten, qubot, remileduc, replaceafill, ross-spencer, sallain, sbreker, scollazo, sevein, sromkey, tw4l

archivematica-storage-service's Issues

Problem: pipeline self-registration mechanism is static

When a new pipeline is registered, SS guesses the IP address by looking at the REMOTE_ADDR value of the HTTP request. This mechanism has worked fine for a long time because the pipeline was either running on the same host or on a separate host with a static IP address. In more dynamic environments IP addresses may change often, and SS eventually becomes unable to reach the pipeline again, for example for re-ingest.

In JiscRDSS this issue was temporarily worked around by adding a new environment variable, PIPELINE_REMOTE_NAME, which is defined in rdss-archivematica. This lets us use DNS names instead, which works as long as the DNS server is up to date and the TCP port is the same across all the replicas.

Problem: default osdeps file for RedHat causes fail during ss-osdeps playbook on RHEL

Currently in osdeps/, RedHat-7.json is a symlink to CentOS-7.json, which is logical. However, this causes a problem with an Ansible playbook in the archivematica-src role: ss-osdeps.yml fails when it runs a stat on that file (the one looked up when ansible_distribution = "RedHat") and 'isreg' (regular file rather than a symlink) returns false.

One obvious solution is to make RedHat-7.json a regular file as well, at the slight risk of it and CentOS-7.json drifting out of sync going forward.

Problem: there is no error raised when extract_file() fails to extract a file

Scenario: Trying to get a single file from a 7zipped AIP.

When package.extract_file is called with a relative_path argument (i.e., extract only one file), and the package was compressed with 7zip, then 7z is used to extract that file from within the package.

If the file doesn't exist (there is no file inside the package at 'relative_path') then 7z exits, with a message that says:

No files to process

Files: 0
Size:       0

but it returns exit code 0.

The problem appears to be here:
https://github.com/artefactual/archivematica-storage-service/blob/stable/0.10.x/storage_service/locations/models/package.py#L538-L541

This code assumes exit code 0 means the file was extracted.

This was uncovered when trying to extract a manifest-sha512.txt file from an AIP that did not have that file; it had a manifest-sha256.txt instead.
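
A minimal sketch of the kind of check that would surface the failure (not the actual SS fix; the exact 7z invocation is assumed): verify that the requested file really exists after extraction instead of trusting the exit code.

import os
import subprocess


def extract_single_file(package_path, relative_path, output_dir):
    """Extract one file from a 7z package and fail loudly if it is missing."""
    command = ["7z", "x", "-y", "-o{}".format(output_dir), package_path, relative_path]
    subprocess.check_call(command)  # raises CalledProcessError on a non-zero exit
    extracted_path = os.path.join(output_dir, relative_path)
    if not os.path.isfile(extracted_path):
        # 7z reports "No files to process" but still exits 0, so check explicitly.
        raise IOError("Could not extract {} from {}".format(relative_path, package_path))
    return extracted_path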

Problem: double replications being created from the first replication, not the original

To reproduce:

  • Create an AIP storage location which has two replicating locations
  • Store an AIP in that location

Result:
The second replication will be created from the first replication, rather than from the original AIP. If you look at the pointer files for the original and replicated AIPs:

  • original AIP has one replication event resulting in the first replication
  • first replication also has a replication event resulting in the second replication

Instead, the original AIP pointer file should have two replication events and two corresponding validation events.

Problem: packages tab rendering scalability

When displaying the packages/ page, the Storage Service reads all the entries of the packages table into memory before rendering (https://github.com/artefactual/archivematica-storage-service/blob/stable/0.10.x/storage_service/locations/views.py#L45-L47), even though the view shows only 10 packages at a time. On an instance with 40,000+ packages it takes several minutes to display, and there is a risk of eventually hitting out-of-memory errors and being unable to display the page at all.
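
A hedged sketch of the direction a fix could take (assuming a plain Django view, not the current views.py code): let the database do the paging through a lazy queryset so only one page of rows is ever loaded.

from django.core.paginator import Paginator

from locations.models import Package


def package_page(page_number, per_page=10):
    packages = Package.objects.all().order_by("uuid")  # lazy queryset, nothing fetched yet
    paginator = Paginator(packages, per_page)
    # Only the requested page is pulled from the database (LIMIT/OFFSET),
    # instead of materialising 40,000+ rows in memory.
    return paginator.page(page_number)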

Problem: Arkivum version 4.2 requires md5 checksums

#216 fixed an issue when storing AIPs in an Arkivum space. However, that fix has introduced a new problem.

When storing an AIP in Arkivum, the storage service first moves the AIP from its internal storage location to the Arkivum AIP Storage Location with operating system commands (rsync and a mv). Afterwards, a checksum is POSTed to the Arkivum REST API so that Arkivum can verify the contents were copied successfully before continuing with its processing.

After #216, the Arkivum model looks up the checksum for compressed AIPs in the AIP's pointer file. For uncompressed AIPs the bag manifest file is sent instead.

Arkivum version 4.2 can process a range of checksum algorithms in bag manifest files, but for individual files (e.g., compressed AIPs) only md5 is supported.

The Arkivum model needs to be updated to send only md5 for compressed AIPs.
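
A minimal sketch of the helper this implies (hypothetical, not the Arkivum model's actual code): compute an md5 checksum for the compressed AIP regardless of which algorithm the pointer file records.

import hashlib


def md5_checksum(path, chunk_size=1024 * 1024):
    """Stream the file so large AIPs do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()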

Problem: 7zip and unar deps are overlapping

We use unar once in the code; lsar is used twice.

Can we achieve the same with 7zip?
It's best to keep the number of dependencies small, so can we use 7zip where unar is used today?

Also, unar is not available in Alpine Linux, our preferred distribution when we build containers.

Shibboleth user cannot login after the first logout

In testing the archivematica/1.x and archivematica-storage-service/0.x branches, we encountered the following issue when a user authenticates with the external identity provider (Shibboleth):

  1. The user has never connected with the username
  2. The user connects to the dashboard/storage service
  3. The user is directed to the Shibboleth Identity Provider
  4. The user authenticates
  5. The user is directed back to the dashboard/storage service
  6. A local account is created for the user with the correct roles
  7. The user operates the UI
  8. The user logs out
  9. The Shibboleth service provider confirms the end of the session
  10. The user closes the browser
  11. The user opens the browser again
  12. The user repeats steps 2, 3, 4 and 5, but the expected step 7 does not occur: instead the user is presented with the native login screen of the dashboard/storage service and cannot proceed.

Problem: tests mutate source folder

Tests should generate temporary data somewhere else.

How to reproduce:

$ docker run \
    -e DJANGO_SETTINGS_MODULE=storage_service.settings.test \
    --entrypoint pytest -t ss

administration/tests/test_languages.py ....
locations/tests/test_api.py ..............F..F...F.......
locations/tests/test_arkivum.py FFFFFF
locations/tests/test_dataverse.py ...F
locations/tests/test_dspace.py ..F..F.FF
locations/tests/test_duracloud.py .......FFFFFFFFFF
locations/tests/test_fixity_log.py .
locations/tests/test_locations.py ...
locations/tests/test_lockssomatic.py .
locations/tests/test_package.py ..........
locations/tests/test_swift.py .....FFFF.
storage_service/tests/test_shibboleth.py sssss
storage_service/tests/test_startup.py ....

Running as root fixes the problem but it's not ideal:

$ docker run \
    --user=root \
    -e DJANGO_SETTINGS_MODULE=storage_service.settings.test \
    --entrypoint pytest -t ss

administration/tests/test_languages.py ....
locations/tests/test_api.py .............................
locations/tests/test_arkivum.py ......
locations/tests/test_dataverse.py ....
locations/tests/test_dspace.py .........
locations/tests/test_duracloud.py .................
locations/tests/test_fixity_log.py .
locations/tests/test_locations.py ...
locations/tests/test_lockssomatic.py .
locations/tests/test_package.py ..........
locations/tests/test_swift.py ..........
storage_service/tests/test_shibboleth.py sssss
storage_service/tests/test_startup.py ....
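
A minimal sketch of the direction the tests could take (a hypothetical test using pytest's built-in tmpdir fixture): any data a test generates goes into a per-test temporary directory instead of the checked-out source tree, so the run also works for unprivileged users.

import json


def test_writes_outside_source_tree(tmpdir):
    fixture = tmpdir.join("fixture.json")  # lives under the system temp dir, not the repo
    fixture.write('{"ok": true}')
    assert json.loads(fixture.read())["ok"] is True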

Problem: extract_file endpoint may fail to delete temporary folder

The /api/v1/file/<uuid>/extract_file/ endpoint uses the following code to extract a compressed package:

(extracted_file_path, temp_dir) = package.extract_file(relative_path_to_file)

Then it runs the following to stream the contents of the file to the client:

response = http.FileResponse(open(filepath, 'rb'))
if temp_dir and os.path.exists(temp_dir):
    shutil.rmtree(temp_dir)
return response

I guess this approach works because the interpreter keeps the file descriptor open even after the files are deleted by shutil.rmtree, meaning the interpreter can still read the contents of the file. Eventually, when the transfer completes, Django releases the file descriptor for us and the operating system can reclaim the space.

However, a client has reported a problem creating a new AIC, and the logs showed the following:

Traceback (most recent call last):
  File "/usr/share/python/archivematica-storage-service/local/lib/python2.7/site-packages/tastypie/resources.py", line 220, in wrapper
    response = callback(request, *args, **kwargs)
  File "/usr/lib/archivematica/storage-service/locations/api/resources.py", line 91, in wrapper
    result = func(resource, request, bundle, **kwargs)
  File "/usr/lib/archivematica/storage-service/locations/api/resources.py", line 655, in extract_file_request
    response = utils.download_file_stream(extracted_file_path, temp_dir)
  File "/usr/lib/archivematica/storage-service/common/utils.py", line 124, in download_file_stream
    shutil.rmtree(temp_dir)
  File "/usr/lib/python2.7/shutil.py", line 247, in rmtree
    rmtree(fullname, ignore_errors, onerror)
  File "/usr/lib/python2.7/shutil.py", line 247, in rmtree
    rmtree(fullname, ignore_errors, onerror)
  File "/usr/lib/python2.7/shutil.py", line 256, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python2.7/shutil.py", line 254, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/var/archivematica/storage_service/tmpGJtGYm/AIC_test_004-a2b1890f-c089-4d37-b96e-39486f604445/data'

In this case, shutil.rmtree failed to delete the data sub-directory, but I don't know exactly how that can happen because the code shouldn't try to remove it until it's empty (see https://github.com/python/cpython/blob/2.7/Lib/shutil.py).

I haven't been able to reproduce it. One workaround could be to pass ignore_errors=True to shutil.rmtree (best effort), or to pass an onerror callback so that at least we can log the failure.
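
A minimal sketch of the onerror idea (not the current SS code): log the failure and carry on rather than letting the whole request fail.

import logging
import shutil

LOGGER = logging.getLogger(__name__)


def _log_rmtree_error(function, path, excinfo):
    # Called by shutil.rmtree for each failure; excinfo is sys.exc_info().
    LOGGER.warning("Could not delete %s (%s): %s", path, function.__name__, excinfo[1])


def cleanup_temp_dir(temp_dir):
    shutil.rmtree(temp_dir, onerror=_log_rmtree_error)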

SWORD API bug?

Hello!

I think I spotted a bug in the communication between Islandora/Archidora and the Archivematica SWORD API.

I am getting an error and the storage service complains when:
a) any field, e.g. dc:title or its subtitle, contains non-Latin characters!
b) and/or a filename of the deposited package also contains non-Latin characters (i.e.: αβγπφ)

Here is the output of my log:

ERROR 2016-02-17 02:16:50 locations.api.sword.helpers:helpers:_fetch_content:239: Package download task encountered an error:'ascii' codec can't encode characters in position 26-32: ordinal not in range(128)

Regards,
Harry
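
The log line points at Python 2's implicit ASCII coercion; a minimal reproduction of that class of error (a hypothetical snippet, not the SS code):

# -*- coding: utf-8 -*-
filename = u"αβγπφ.zip"
try:
    str(filename)  # implicit ASCII encode of a unicode string
except UnicodeEncodeError as err:
    print(err)  # 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
# Encoding explicitly avoids the implicit ASCII step:
safe_bytes = filename.encode("utf-8")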

Problem: re-ingest error message is broken

When you try to re-ingest a package that is already being re-ingested, the web interface shows the following message:

Error re-ingesting package: This AIP is already being reingested on {pipeline}

Instead of {pipeline} I would expect to see the name or the UUID of that pipeline.
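
A minimal sketch of the likely cause (assumed, not verified against the SS source): the message template is rendered without ever calling .format(), so the placeholder leaks into the UI verbatim.

message = "This AIP is already being reingested on {pipeline}"
print(message)                                   # broken: placeholder shown as-is
print(message.format(pipeline="pipeline-uuid"))  # expected: placeholder substituted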

missing migration on qa/0.8.x

After cloning the qa/0.8.x branch, without any local modifications:

./manage.py check runs fine.
./manage.py runserver tells me: You have unapplied migrations; your app may not work properly until they are applied.

If I run "makemigrations", the following migration is created, which looks very minor to me:
class Migration(migrations.Migration):

    dependencies = [
        ('locations', '0004_v0_7'),
    ]

    operations = [
        migrations.AlterField(
            model_name='pipeline',
            name='uuid',
            field=django_extensions.db.fields.UUIDField(auto=False, validators=[django.core.validators.RegexValidator(b'\\w{8}-\\w{4}-\\w{4}-\\w{4}-\\w{12}', b'Needs to be format: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx where x is a hexadecimal digit.', b'Invalid UUID')], help_text=b'Identifier for the Archivematica pipeline', unique=True, verbose_name=b'UUID'),
            preserve_default=True,
        ),
    ]

Pipeline local filesystem: rsynced folders don't allow x permission for group

I have this issue:
I use a pipeline local filesystem to link the storage with the dashboard. In the dashboard, the services are run as the archivematica user.

For some reason, when the storage service sends files to the dashboard, it uses another user, called archivematica-storage, to SSH into the dashboard.

Both archivematica and archivematica-storage are in the group archivematica, so the folders created by the storage on the dashboard look like this:

drwxrw-rw-. 1 archivematica-storage archivematica 0 Aug  9 15:26 archive

As you can see, the group can't execute the folder (so its members can't see what's inside). So, when the dashboard wants to use it via the archivematica user, it doesn't work.

To me it makes sense that the created folders have the execute permission as long as they already have read and write, so I suggest the following PR: #227

Problem: we're using synchronous workers

This is a big issue when SS is doing I/O bound work because the workers block and wait instead of working on other requests.

I suggest using gevent (adding gevent==1.2.1 to the requirements) and running Gunicorn with --worker-class gevent. This needs to be tested.
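
A minimal Gunicorn config sketch for the gevent idea (values are illustrative, not tested SS defaults); Gunicorn would be started with -c pointing at this file.

import multiprocessing

worker_class = "gevent"                        # cooperative workers for I/O-bound requests
workers = multiprocessing.cpu_count() * 2 + 1  # common rule of thumb
worker_connections = 1000                      # max simultaneous greenlets per worker
timeout = 300                                  # SS moves large packages; keep this generous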

GPG key deletion may not truly be deleting keys

To re-create:

  1. Create a GPG key using the SS interface

  2. Delete that GPG key using the SS interface

  3. Use the gpg command-line tool to see whether the key is still present. It should not be, but it is:

     $ gpg --list-keys --homedir=/var/archivematica/storage_service/
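
One possible explanation, offered as an assumption (and assuming SS uses the python-gnupg library): GnuPG refuses to delete a public key while the matching secret key still exists, and python-gnupg reports that as a failed result rather than raising. A sketch that deletes both halves:

import gnupg

gpg = gnupg.GPG(gnupghome="/var/archivematica/storage_service/")


def delete_key_pair(fingerprint):
    gpg.delete_keys(fingerprint, True)  # secret key first (newer GnuPG may also want a passphrase)
    gpg.delete_keys(fingerprint)        # then the public key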
    

Problem: I can't run tests without installing database packages

This recent commit, 215cf18, brought in support for MySQL and PostgreSQL as database backends, instead of just sqlite3.

When running tests on this repo, sqlite3 is still used, so it should not be necessary to install mysqlclient and other database-related dependencies in the test environment.

This is important for running automated tests (Travis) as well as for testing locally (with only the source code checked out, not a fully deployed Archivematica server).
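
A sketch of what the test settings module could pin down so mysqlclient/psycopg2 never need to be importable (the real storage_service/settings/test.py may differ; the test runs above already point at it via DJANGO_SETTINGS_MODULE):

from .base import *  # noqa: F401,F403

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": ":memory:",  # throwaway in-memory database for the test run
    }
}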

Debian postinst script still not robust

After having tried to install 0.2.2ppa8.deb yesterday (before the fixes were pushed out), I attempted to apt-get purge the package and try it again today (ppa11). With some parts of the filesystem not left exactly as the script expects, it fails unnecessarily:

Setting up archivematica-storage-service (0.2.2ppa11) ...
creating archivematica user
User archivematica exists
creating django secret key
[redacted]
creating symlink in /usr/lib/archivematica
ln: failed to create symbolic link `/usr/lib/archivematica/storage-service/storage_service': File exists
mv: cannot move `/var/archivematica/storage-service/static' to `/usr/lib/archivematica/storage-service/static/static': Directory not empty
mv: cannot move `/var/archivematica/storage-service/templates' to `/usr/lib/archivematica/storage-service/templates/templates': Directory not empty
configuring django database and static files
/var/lib/dpkg/info/archivematica-storage-service.postinst: 33: /var/lib/dpkg/info/archivematica-storage-service.postinst: /usr/share/python/archivematica-storage-service/bin/python: not found
/var/lib/dpkg/info/archivematica-storage-service.postinst: 34: /var/lib/dpkg/info/archivematica-storage-service.postinst: /usr/share/python/archivematica-storage-service/bin/python: not found
updating directory permissions
rm: cannot remove `/tmp/storage-service.log': No such file or directory
dpkg: error processing archivematica-storage-service (--configure):
 subprocess installed post-installation script returned error exit status 1
E: Sub-process /usr/bin/dpkg returned an error code (1)

Problem: missing API endpoint to delete AIPs

Currently AIP deletions can only be done from the dashboard's Archival Storage tab. It is not possible to delete from the Storage Service directly (either via UI or API).

While in theory all AIPs are indexed and should appear in the Archival Storage, there are cases in which it would be useful to be able to delete an AIP not in Archival Storage.

A temporary workaround (to delete AIPs from the SS database that are not in Archival Storage) is available here.

Problem: All re-ingest is failing

When attempting to re-ingest an AIP (using any re-ingest type), the Django error "Error re-ingesting package: An unknown error occurred" shows up. (Originally filed under RM #11419, Multi-process bagit.validate breaks AIP re-ingest, but this is a distinct issue.)

Inspection of the SS logs shows Rsync failures when trying to move source files at paths like var/archivematica/sharedDirectory/www/AIPsStore/var/archivematica/storage_service/tmp4q3zkU/test1-09dc6190-1909-4cb9-8134-3081739b1f12/data. Such paths are impossible. The bad call is happening at move_to_storage_service, called from package.py::start_reingest.

Further investigation reveals that the call to self.extract_file() in start_reingest is failing to set self.local_path_location to the SS-internal location. The result of this is that the current_location var in start_reingest never gets set to self.local_path_location, which results in relative_path (cf. relative_path = local_path.replace(current_location.full_path, '', 1).lstrip('/')) failing to have the correct path prefix removed. Finally, the result of all that is that reingest_files ends up containing a bunch of nonsense paths that have relative_path prefixed to them.

The fix would seem to be restoring the following three lines at the end of package.py::extract_file.

if not relative_path:
    self.local_path_location = ss_internal
    self.local_path = output_path

These lines were removed by #224, although it is not clear to me if their removal is essential to the goal of that PR.

Problem: pipelines are not always discoverable

SS stores the remote_name of a pipeline when it's created, for subsequent API access, e.g. re-ingest.

remote_name = models.CharField(
    max_length=256,
    default=None,
    null=True,
    blank=True,
    help_text="Host or IP address of the pipeline server for making API calls.")

This field is also editable from the web interface.

When a pipeline is created via the SS API, this field is populated from REMOTE_ADDR unless the client provides a value via the remote_name property.

The problem is that the dashboard doesn't allow users to provide a custom value, so SS always falls back to the value found in REMOTE_ADDR, which is problematic under some circumstances.
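
For reference, a client registering itself could pass the name explicitly; a hedged sketch of such a request (the endpoint path, auth header and other field names are assumptions; only remote_name is confirmed by the model field above):

import requests

requests.post(
    "http://storage-service:8000/api/v2/pipeline/",       # assumed endpoint path
    headers={"Authorization": "ApiKey admin:1234abcd"},   # tastypie-style API key auth (assumed)
    json={
        "uuid": "2d4e8a55-11b4-4f9a-9bb8-50c8a3c8d9a1",   # illustrative pipeline UUID
        "description": "Archivematica dashboard on host1",
        "remote_name": "dashboard.internal.example.org",  # DNS name instead of REMOTE_ADDR
    },
)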

Problem: callback only works with some aips

The current implementation of the send_callback/post_store/ API endpoint only works with AIPs that include sha512 checksums.

See https://github.com/artefactual/archivematica-storage-service/blob/stable/0.10.x/storage_service/locations/api/resources.py#L792

Archivematica 1.6.0 and greater have a new feature allowing users to choose a different checksum algorithm. The callback should be updated to support any checksum algorithm allowed by the BagIt spec.
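
A minimal sketch of the idea (a hypothetical helper, not the resources.py code): detect whichever manifest the bag actually ships instead of hard-coding sha512.

import glob
import os


def find_manifest(bag_path):
    """Return the path of any manifest-<algorithm>.txt present in the bag."""
    manifests = glob.glob(os.path.join(bag_path, "manifest-*.txt"))
    if not manifests:
        raise ValueError("No BagIt manifest found in {}".format(bag_path))
    return manifests[0]  # e.g. manifest-sha256.txt or manifest-sha512.txt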

Problem: DJANGO_STATIC_ROOT is unused

There are two files that make use of DJANGO_STATIC_ROOT:

install/.storage-service
    export DJANGO_STATIC_ROOT=/var/archivematica/storage-service/assets

install/storage-service.gunicorn-config.py
    "DJANGO_STATIC_ROOT=/usr/lib/archivematica/storage-service/assets",

But it's not looked up from our settings, i.e. storage_service/storage_service/settings/base.py reads:

STATIC_ROOT = normpath(join(SITE_ROOT, 'assets'))

I think we should update the files under install/ to stop using it.

Problem: Create Key page is unclear

The two fields in the Create Key page are labelled "Name real" and "Name email". It would be clearer if they were labelled "Name" and "Email".

Problem: Replicators available in interface for all types of locations

Replication only works for AIP Storage locations, so there shouldn't be the option to choose replicators for any other type of location (unless/until that functionality is added).

The risk is a user mistakenly thinking that content from another type of location is being replicated when it's not.

Shibboleth integration (merge from Jisc fork)

Merge work from JiscSD#3 and JiscSD#6 into core.

This enables Shibboleth authentication (optionally) to allow login from academic institutions. This work should not concern itself with implementation of Shibboleth protocols - it will simply respond to authentication headers received from the web server and use those to create and configure users.

Related AM issue: artefactual/archivematica#666 (Shibboleth integration, merge from Jisc fork)

Problem: SQLite is the only db backend supported (not distributed)

In many situations it's desirable to use a distributed RDBMS like MySQL or PostgreSQL. There's probably not much stopping us from supporting them other than the limitations in our configuration system.

Solution

SQLite will stay as the default backend until a new major version of SS is released.

Option 1: add a new SS_DB_ENGINE environment variable. We would document which values are supported here, e.g. django.db.backends.sqlite3 (default), django.db.backends.mysql and so on.

DATABASES = {
    'default': {
        'ENGINE': get_env_variable('SS_DB_ENGINE'),  # get_env_variable should take a default value
        # ...
    },
}

Option 2: look up new environment variable SS_DB_URL and parse it using dj-database-url. When SS_DB_URL is used, the old environment strings (SS_DB_NAME, SS_DB_USER, SS_DB_PASSWORD and SS_DB_HOST) are ignored.

DATABASES = {
    'default': dj_database_url.config(env='SS_DB_URL')
}

dj-database-url is very flexible. You can find many examples in test_dj_database_url.py, e.g. mysql://bea6eb025ca0d8:[email protected]/heroku_97681db3eff7580?reconnect=true.

AIP Storage Location - the full path is not taken?

I have a really weird behavior here... the storage is on a local pipeline setup like this:

  • space (local pipeline)
    • path: /
    • staging path: /
  • location (AIP storage)
    • relative path: path/to/storage/

The ingest fails to archive the AIP with this error:

locations.models.space:space:create_local_directory:482: Could not create storage directory: [Errno 13] Permission denied: '/c5fc'

From what I saw, the location path is not taken into account and it tries to put the AIP directly in the staging path of the space.
https://github.com/remileduc/archivematica-storage-service/blob/stable/0.10.x/storage_service/locations/models/space.py#L276
with destination_path coming from https://github.com/remileduc/archivematica-storage-service/blob/stable/0.10.x/storage_service/locations/models/package.py#L412
(from what I can understand, none of them contain the location path)


So, I created a new space just for the AIP storage:

  • space (local pipeline)
    • path: /
    • staging path: /path/to/storage-SPACE/
  • location (AIP storage)
    • relative path: path/to/storage-LOCATION/

and it works, with this really weird log message:

locations.models.space:space:move_rsync:421: Moving from /path/to/storage-LOCATION/.../aip.7z to /path/to/storage-SPACE/.../aip.7z

This trick kind of works but in a really strange way... I don't understand the last move in the logs. There is something that doesn't make sense oO

Note: the lines given in the logs may be a bit different from the ones in master, as I've added some lines to log more stuff.
Also, the logs are all from /var/log/archivematica/storage-service/storage_service.log

Catch ValueError from literal_eval

Settings can be UUIDs, and some UUIDs are valid Python expressions, so running literal_eval on a settings value might raise a ValueError (to reject an expression) rather than a SyntaxError.

Done here in Jisc, needs porting to core: JiscSD#14

Related to #211
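
A minimal sketch of the catch (with a hypothetical helper name): treat both exception types from literal_eval as "not a Python literal" and keep the raw string.

from ast import literal_eval


def coerce_setting(raw_value):
    """Return the parsed literal if raw_value is one, otherwise the raw string."""
    try:
        return literal_eval(raw_value)
    except (SyntaxError, ValueError):
        # an all-digit UUID segment such as "1234-5678" parses as an
        # expression rather than a literal and raises ValueError
        return raw_value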

Problem: SS can't find p7zip binary when it's called 7z

I've learned today that in the CentOS installation instructions we ask the user to run the following command:

$ sudo ln -sf /usr/bin/7za /usr/bin/7z

SS doesn't implement a mechanism to fall back to 7za, so if that step is missed the application will fail. This is an example of how this affected one of our users: artefactual/fixity#11.

I'm not sure if this should be considered a priority but it wouldn't be hard to solve.
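
A minimal fallback sketch (not the SS implementation): prefer 7z but fall back to 7za when only that binary is installed. distutils is used here because the SS of this era runs on Python 2, which lacks shutil.which.

from distutils.spawn import find_executable


def find_7z_binary():
    for candidate in ("7z", "7za"):
        path = find_executable(candidate)
        if path:
            return path
    raise OSError("Neither 7z nor 7za was found on PATH")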

Problem: filesystem and database are not treated as injectable dependencies

The storage service contains pervasive calls to filesystem APIs (os, sys, shutil) and database APIs (Django models, Elasticsearch). If classes like package.py::Package explicitly declared dependencies on objects (say fs or db) and the APIs they required, then those dependencies could more easily be swapped out in different environments, e.g., during testing or when digital objects are accessed not from a Unix filesystem but from an S3 store.

One immediate benefit of such a refactoring would be much faster unit and functional tests: if, instead of creating test databases and test directory structures, we swapped in mock dependencies, the runtime of these tests could probably be reduced significantly.

The metsrw PR 27 illustrates a strategy for dependency injection that might be applicable here.

Relatedly, this strategy could allow us to clean up the space-related code and make spaces pluggable dependencies, which could allow users to avoid installing unneeded third party dependencies for space types that they never use.

Refactoring SS to include dependency injection could proceed via the StranglerApplication approach/pattern.
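
A minimal sketch of the dependency-injection idea (hypothetical classes, not the current Package model): the filesystem is passed in, so tests or an S3-backed deployment can supply a different implementation.

import os
import shutil


class LocalFilesystem(object):
    def exists(self, path):
        return os.path.exists(path)

    def move(self, src, dst):
        shutil.move(src, dst)


class Package(object):
    def __init__(self, current_path, fs=None):
        self.current_path = current_path
        self.fs = fs or LocalFilesystem()  # injected; swap in a mock or an S3 adapter in tests

    def store(self, destination):
        if not self.fs.exists(self.current_path):
            raise IOError("Missing package: {}".format(self.current_path))
        self.fs.move(self.current_path, destination)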

Reingest broken: SS expects pointer file which isn't there

Attempt a metadata-only re-ingest (or any other type) on an AIP in Archivematica at qa/1.x. The storeAIP micro-service will issue a PUT request against the SS's Package resource, attempting to update the AIP, which ultimately calls finish_reingest. This breaks (returns a 500 response) because finish_reingest expects a pointer file from AM which is no longer present: recent changes to Archivematica removed the pointer-file-creation micro-service, and pointer file creation has moved to the Storage Service.

See https://projects.artefactual.com/issues/11663
