artefactual / archivematica-sampledata

Archivematica sample data

Home Page: http://www.archivematica.org

Python 33.42% Makefile 0.67% TeX 0.92% Rich Text Format 64.99%

archivematica-sampledata's Issues

Add more information to stderr and/or exit status in createtransfers.py

createtransfers.py should print more information to the command line, so that it is clear whether writing the files worked at all, plus any other output that we or a user might need.
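
As a rough illustration of the kind of reporting meant here, the sketch below writes a per-file failure message and a summary to stderr, and maps the result to an exit status. The names `write_files` and `exit_status` are hypothetical and are not taken from createtransfers.py.

```python
import os
import sys


def write_files(target_dir, names):
    """Try to write each file; return (written, failed) lists.

    Hypothetical helper, not the actual createtransfers.py code.
    """
    written, failed = [], []
    for name in names:
        path = os.path.join(target_dir, name)
        try:
            with open(path, "wb") as handle:
                handle.write(b"sample data\n")
            written.append(path)
        except (IOError, OSError) as err:
            # Report each failure as it happens, so the user sees which file broke.
            failed.append((name, err))
            sys.stderr.write("FAILED %r: %s\n" % (name, err))
    # One-line summary, so a run with no failures is still visibly successful.
    sys.stderr.write("%d written, %d failed\n" % (len(written), len(failed)))
    return written, failed


def exit_status(failed):
    """Return 0 when nothing failed and 1 otherwise."""
    return 1 if failed else 0
```

A script could then end with `sys.exit(exit_status(failed))`, so that callers such as a Makefile or an Ansible task can detect a failed run.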

Problem: Create variously-encoded-directory names for testing

This is similar to the need to test filenames with different encodings. If a path we work on points to a directory rather than a file, it passes through a different portion of code (e.g. a mkdir or mv), which might trigger a different set of errors. Related: artefactual/archivematica#1104, where a directory in a zip file using cp437 causes undefined behaviour in the transfer.

Create a sample set for Bulk Extractor (BE) potential outputs

Create a set of files that can test the capability of BE inside Archivematica.

Non-UTF-8 file name creation command fails with IOError

The new createtransfers.py script fails when calling ./createtransfers.py create-variously-encoded-files with IOError: [Errno 84] Invalid or incomplete multibyte or wide character.

This failure happens on the following platforms:

Mac OS X 10.13.1 High Sierra with any version of Python I tried
Debian 8.9 (jessie) with Python 2.7.13 (i.e., in the python:2.7 Docker container created by the SS Dockerfile)

This failure does not happen with:

Ubuntu 16.04 xenial and Python 2.7.12

Problem: no issues template

An issue template is required to redirect people to the issue repo.

Problem: DemoTransfer causing invalid METS

If you run the DemoTransfer, the PREMIS rights.csv file causes invalid METS:

line 2224, column 57: cvc-type.3.1.3: The value '' of element 'premis:copyrightStatusDeterminationDate' is not valid.
line 2226, column 49: cvc-complex-type.2.4.b: The content of element 'premis:copyrightApplicableDates' is not complete. One of '{"info:lc/xmlns/premis-v2":startDate}' is expected.

Problem: Need more sample-data to test extract of packages (zips and zips within zips)

Issue artefactual/archivematica#1104 in the Archivematica repository is an interesting one: it means testing Archivematica's ability to recurse into structures like zip files and still perform the same activities. More samples with zips inside zips will help us test that.

NB. This would likely be in support of a feature file describing the behaviour as well.
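
One possible way to generate such samples is sketched below with Python's standard `zipfile` module. The function names, the member names (`payload.txt`, `level_N.zip`) and the payload are assumptions made for illustration only.

```python
import io
import zipfile


def nested_zip(depth, payload=b"sample data\n"):
    """Return the bytes of a zip nested `depth` levels deep.

    The innermost zip holds payload.txt; every outer zip holds exactly
    one member, the next zip down.
    """
    data, name = payload, "payload.txt"
    for level in range(depth, 0, -1):
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
            archive.writestr(name, data)
        data, name = buf.getvalue(), "level_%d.zip" % level
    return data


def innermost(data):
    """Unwrap single-member zips until the content is no longer a zip."""
    while zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            data = archive.read(archive.namelist()[0])
    return data
```

Writing `nested_zip(3)` to a file gives a package that an extraction step has to open three times before it reaches the payload.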

Problem: Need a zero byte file associated with virusTests

Needed to acceptance-test the fix for artefactual/archivematica#808.

Problem: No mkv samples included

I would like to test MediaConch in Archivematica, but there are no mkv samples included in this repo. I am having trouble finding some with an explicit licence that would make them suitable for inclusion in this repo.

Samples I found are:
http://jell.yfish.us/
https://www.matroska.org/downloads/test_w1.html

Issue: Need the capability to create large transfers e.g. 1000+ files, 200+ folders

As a tester I need to be able to create fairly arbitrarily sized transfers to be able to test Archivematica's limits. An issue where the limits of AM were noticed, and thus fixed is here.

Have a look at creating this sample set in the create transfers utility for testing purposes.
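
A minimal sketch of what such a generator could look like follows. The function name, the folder and file naming scheme, and the defaults (200 folders of 5 files each, i.e. 1000 files) are assumptions, not part of the existing create transfers utility.

```python
import os
import tempfile


def create_large_transfer(root, folders=200, files_per_folder=5):
    """Create `folders` directories under root, each with small text files.

    Returns the total number of files written.
    """
    created = 0
    for d in range(folders):
        folder = os.path.join(root, "folder_%04d" % d)
        os.makedirs(folder)
        for f in range(files_per_folder):
            path = os.path.join(folder, "file_%04d.txt" % f)
            with open(path, "w") as handle:
                # Distinct content per file, so checksums differ.
                handle.write("folder %d file %d\n" % (d, f))
            created += 1
    return created
```

For example, `create_large_transfer(tempfile.mkdtemp())` builds the default 200-folder, 1000-file transfer in a temporary directory. Both counts are parameters so that testers can scale the transfer up when probing limits.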

Problem: commits on master may have been lost

Namely, those from #15.

Problem: opf-format-corpus sub module should be updated

The OPF format corpus is included in this repo as a submodule. The version being pulled in is a few years old, and some valuable updates are missing.

The submodule link should be updated to openpreserve/format-corpus@5a93e3e

Problem: Create files to test issues around extension-based identification

As a digital preservation analyst I want to understand what Archivematica's format identification tools report when fed files that do not conform to any specification but are named with known file-format extensions, e.g. .jpg .gif .bin etc. Depending on what these tools determine, I may need to design a different workflow that relies on one specific tool, or seek to improve the output of one of the tools.
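
Files of this kind could be generated along the following lines. This is a sketch only: the function name, the `not_really` file name stem and the filler bytes are hypothetical choices, and the extension list simply mirrors the examples above.

```python
import os
import tempfile

# Filler that deliberately matches no image or binary format signature.
FILLER = b"this content conforms to no file format specification\n"


def create_mismatched_files(root, extensions=(".jpg", ".gif", ".bin")):
    """Write one file per extension whose content does not match its name."""
    paths = []
    for ext in extensions:
        path = os.path.join(root, "not_really" + ext)
        with open(path, "wb") as handle:
            handle.write(FILLER * 8)
        paths.append(path)
    return paths
```

Running the identification tools over the resulting directory shows which of them rely on the extension and which inspect the content.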

Problem: Add cp437 encoding to createtransfers.py

While it's likely that we can recreate the issues of artefactual/archivematica#1104 with other encodings, cp437 seems to be a popular encoding used in the past. We can pretty easily add this code page as part of the other createtransfers.py work.
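
To illustrate why cp437 names are a useful test case, the sketch below encodes a name with Python's built-in `cp437` codec and checks that the resulting bytes are not valid UTF-8. The helper names and the sample name are hypothetical.

```python
def encode_name(name, encoding="cp437"):
    """Encode a text name to bytes using the given code page."""
    return name.encode(encoding)


def is_valid_utf8(raw):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample name: "cafe" with an acute accent on the final letter.
SAMPLE = u"caf\u00e9"
RAW = encode_name(SAMPLE)  # the accented letter becomes the single byte 0x82
```

Code that assumes UTF-8 paths will fail on `RAW`, which is the class of problem the cp437 samples are meant to expose.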

Problem: Need a file over 20 MB with a virus for testing

There are 4 scenarios needed to test ClamAV effectively:

files < 20 MB that have a virus: WE HAVE
files < 20 MB that don't have a virus: WE HAVE
files > 20 MB that have a virus: WE DO NOT HAVE
files > 20 MB that don't have a virus: WE HAVE

We need a file that is over 20 MB and has a virus.

Problem: Review sampledata

It's been a while since the sampledata set was thoroughly reviewed. Many things have been added over time, and it's likely that there's unnecessary repetition within the sample data. At the same time, the feature set in Archivematica has been growing quickly, so there are features for which there is no sample data (see, for example, #31). Even some basic features are not testable with the current sample data (see #28).

We should consider how the auto-generated sample data fits in, and if it needs to be incorporated in a better/more readily apparent way - not sure how to do this, but it warrants consideration.

We could also look at how the sampledata set is deployed to sandbox and testing servers.

Finally, better documentation about the data and what the various transfers are supposed to test would be helpful (probably as part of the README here).

Problem: filenames with strange encodings are not created programmatically

Filenames with strange (non-ASCII, non-UTF8) encodings are currently stored at TestTransfers/files_with_various_encodings/. However, on certain platforms (e.g., Mac OS X 10.13.1) attempting to check out the master branch of this repo triggers an error in git:

error: unable to create file TestTransfers/files_with_various_encodings/big5/?s?{ (Illegal byte sequence)
error: unable to create file TestTransfers/files_with_various_encodings/shift_jis/?ۂ??Ղ郁?C?? (Illegal byte sequence)
error: unable to create file TestTransfers/files_with_various_encodings/windows_1252/s?ster (Illegal byte sequence)

If this issue cannot be overcome by some other means, then these transfers should be created programmatically (e.g., via make rules) in this sampledata repo.
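
A sketch of what programmatic creation could look like is given below, assuming Python 3. The sample names are hypothetical stand-ins rather than the names currently stored in the repo, and `create_encoded_files` is an illustrative helper. Paths are built as bytes so the encoded names reach the filesystem unchanged, and platforms that reject a byte sequence are recorded rather than aborting the run.

```python
import os
import tempfile

# Hypothetical sample names, one per encoding directory.
SAMPLE_NAMES = {
    "big5": u"\u4e2d\u6587",
    "shift_jis": u"\u65e5\u672c\u8a9e",
    "windows_1252": u"s\u00f8ster",
}


def create_encoded_files(root, names=SAMPLE_NAMES):
    """Create one file per encoding; return (created, failed) lists."""
    created, failed = [], []
    root_bytes = os.fsencode(root)
    for encoding, name in sorted(names.items()):
        folder = os.path.join(root_bytes, encoding.encode("ascii"))
        path = os.path.join(folder, name.encode(encoding))
        try:
            os.makedirs(folder, exist_ok=True)
            with open(path, "wb") as handle:
                handle.write(b"sample data\n")
            created.append(path)
        except OSError as err:
            # Some filesystems reject byte sequences that are not valid
            # in their expected encoding; record the failure and carry on.
            failed.append((encoding, err))
    return created, failed
```

Because the names exist only as escaped strings in source code, checking out the repo no longer requires the filesystem to accept them.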

Problem: Create a Fail Transfer Compliance Test Set

At present, all you need to fail the verify transfer compliance step is a single empty folder. We can create a structure like this so that folks can observe that behaviour:

FailTransferCompliance/
├── README.md
└── TransferThisFolder

1 directory, 1 file
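
The structure above could be generated with a few lines of Python, sketched here under the assumption that it is built at deploy time. The function name and the README wording are hypothetical.

```python
import os
import tempfile


def create_fail_transfer_compliance(root):
    """Create FailTransferCompliance/ with a README and one empty folder."""
    base = os.path.join(root, "FailTransferCompliance")
    # The empty folder is the transfer that is expected to fail compliance.
    os.makedirs(os.path.join(base, "TransferThisFolder"))
    with open(os.path.join(base, "README.md"), "w") as handle:
        handle.write("Start a transfer from the empty TransferThisFolder "
                     "directory to observe the compliance failure.\n")
    return base
```

Generating the folder rather than committing it sidesteps the fact that Git does not track empty directories.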

Problem: missing cohesive build method

The AM Ansible role has been recently updated so it runs createtransfers.py once the repo is downloaded. I think that we should introduce a Makefile in this repo so the build process details are hidden from the consumers. This would allow us greater flexibility in controlling what's in the build.