mets-reader-writer's Introduction

METS Reader & Writer

By Artefactual

METSRW is a library to help with parsing and creating METS files. It provides an API, and abstracts away the actual creation of the XML. METSRW was initially created for use in Archivematica and is managed as part of that project.

You are free to copy, modify, and distribute metsrw with attribution under the terms of the AGPL license. See the LICENSE file for details.

Installation & Dependencies

METSRW can be installed with pip.

pip install metsrw

METSRW has been tested with:

  • Python 3.8
  • Python 3.9
  • Python 3.10
  • Python 3.11
  • Python 3.12

Basic Usage

Read a METS file

import metsrw

mets = metsrw.METSDocument.fromfile('path/to/file')  # Reads a file
mets = metsrw.METSDocument.fromstring('<mets document>')  # Parses a string
mets = metsrw.METSDocument.fromtree(tree)  # Parses an lxml.etree element or element tree

Create a new METS file

mets = metsrw.METSDocument()

Contributing

METSRW is in early development and welcomes feedback on the API and overall design! Design goals, use cases, and a proposed API are in the GitHub wiki.

mets-reader-writer's People

Contributors

cole, eviau-artefactual, hwesta, jraddaoui, jrwdunham, qubot, replaceafill, ross-spencer, sallain, sevein, tw4l

mets-reader-writer's Issues

Problem: Tree or serialize() required as an arg to validate functions when METSDocument would be a nicer abstraction

Here, and in a few other places, we're asked to pass a .tree argument as mets_doc, in this case to validate the document:

def xsd_validate(mets_doc, xmlschema=METS_XSD_PATH):
    xmlschema = get_xmlschema(xmlschema, mets_doc)
    is_valid = xmlschema.validate(mets_doc)
    error_log = xmlschema.error_log
    return is_valid, error_log

I wonder if the METSDocument itself wouldn't be a nicer abstraction to pass to functions such as this when using this utility?
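As a rough illustration of that idea, a validation helper could accept either kind of argument. This is only a sketch: xsd_validate_any, FakeDocument and fake_validate are hypothetical names, and it assumes that a document object exposes a serialize() method returning the tree.

```python
def xsd_validate_any(mets_doc, validate_tree):
    """Validate either a document object or an already-built tree.

    `validate_tree` is any callable that takes a tree and returns
    (is_valid, error_log), for example the existing xsd_validate.
    """
    # Assumption: document objects expose serialize(), which returns the
    # tree; anything without that method is treated as a tree already.
    tree = mets_doc.serialize() if hasattr(mets_doc, "serialize") else mets_doc
    return validate_tree(tree)


class FakeDocument:
    """Stand-in for METSDocument, used only to exercise the wrapper."""

    def serialize(self):
        return "<tree>"


def fake_validate(tree):
    """Stand-in validator that accepts exactly one tree value."""
    return tree == "<tree>", []


print(xsd_validate_any(FakeDocument(), fake_validate))
print(xsd_validate_any("<tree>", fake_validate))
```

Callers that already hold a tree keep working, while callers that hold a document no longer need to unwrap it first.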

metsrw.METSDocument.fromfile fails over undefined "root"

This METS file is at least well formed according to xmllint, so I think it should not already fail at the amdSec stage:
A.ALF.ANT.R.091_METS.xml.gz

>>> mets = metsrw.METSDocument.fromfile('A.ALF.ANT.R.091_METS.xml')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/metsrw/mets.py", line 437, in fromfile
    i._fromfile(path)
  File "/usr/lib/python2.7/site-packages/metsrw/mets.py", line 431, in _fromfile
    self._parse_tree(self.tree)
  File "/usr/lib/python2.7/site-packages/metsrw/mets.py", line 414, in _parse_tree
    tree, structMap, normative_parent_elem=normative_struct_map)
  File "/usr/lib/python2.7/site-packages/metsrw/mets.py", line 291, in _parse_tree_structmap
    tree, elem, normative_parent_elem=normative_elem)
  File "/usr/lib/python2.7/site-packages/metsrw/mets.py", line 298, in _parse_tree_structmap
    self._add_amdsecs_to_fs_entry(fptr.amdids, fs_entry, tree)
  File "/usr/lib/python2.7/site-packages/metsrw/mets.py", line 390, in _add_amdsecs_to_fs_entry
    amdsec = metadata.AMDSec.parse(amdsec_elem)
  File "/usr/lib/python2.7/site-packages/metsrw/metadata.py", line 59, in parse
    if root.tag != utils.lxmlns('mets') + 'amdSec':
AttributeError: 'NoneType' object has no attribute 'tag'

Python 2.7 on Fedora 25, metsrw installed via pip (Downloading metsrw-0.2.0-py2.py3-none-any.whl (62kB)).

Problem: mets-reader-writer neither reads nor writes

The API should reflect the name. Without breaking the existing API, we should proxy fromfile-type class methods to read methods that can take either an XML string or a file-like object or a path. Similarly, there should be a write method where params or context can determine what is output: file on disk or XML string.
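One possible shape for such a read proxy is sketched below. The names classify_source and read are hypothetical, the "starts with <" test is only a heuristic, and doc_class is assumed to offer the fromfile and fromstring class methods shown in the README.

```python
import io


def classify_source(source):
    """Guess how `source` should be read: file-like, XML string, or path."""
    if hasattr(source, "read"):
        return "file-like"
    if isinstance(source, bytes):
        source = source.decode("utf-8", errors="replace")
    # Heuristic: XML text starts with "<" once leading whitespace is dropped.
    if str(source).lstrip().startswith("<"):
        return "xml-string"
    return "path"


def read(source, doc_class):
    """Dispatch to the existing constructors on `doc_class`."""
    kind = classify_source(source)
    if kind == "file-like":
        return doc_class.fromstring(source.read())
    if kind == "xml-string":
        return doc_class.fromstring(source)
    return doc_class.fromfile(source)


print(classify_source(io.StringIO("<mets/>")))
print(classify_source("<mets/>"))
print(classify_source("path/to/METS.xml"))
```

A matching write method could follow the same pattern, deciding between a file on disk and an XML string from its arguments.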

Problem: there are no docs

We should use the docs/ project already available and we should wire it with Travis CI so it builds the page and pushes it into the gh-pages branch or readthedocs or similar.

Problem: Characterization tool namespaces in premis:objects prevent serialization to XML

Example code such as the following throws an exception:

for premis_object in fs_entry.get_premis_objects():
    premis_object_xml = premis_object.tostring()

This appears to be due to characterization tool namespaces not being in the namespaces map:

celery-worker_1  | Traceback (most recent call last):
celery-worker_1  |   File "/usr/local/lib/python3.8/dist-packages/celery/app/trace.py", line 412, in trace_task
celery-worker_1  |     R = retval = fun(*args, **kwargs)
celery-worker_1  |   File "/src/AIPscan/celery.py", line 17, in __call__
celery-worker_1  |     return TaskBase.__call__(self, *args, **kwargs)
celery-worker_1  |   File "/src/AIPscan/celery.py", line 17, in __call__
celery-worker_1  |     return TaskBase.__call__(self, *args, **kwargs)
celery-worker_1  |   File "/usr/local/lib/python3.8/dist-packages/celery/app/trace.py", line 704, in __protected_call__
celery-worker_1  |     return self.run(*args, **kwargs)
celery-worker_1  |   File "/src/AIPscan/Aggregator/tasks.py", line 353, in get_mets
celery-worker_1  |     database_helpers.process_aip_data(aip, mets)
celery-worker_1  |   File "/src/AIPscan/Aggregator/database_helpers.py", line 397, in process_aip_data
celery-worker_1  |     create_file_object(FileType.original, file_, aip.id)
celery-worker_1  |   File "/src/AIPscan/Aggregator/database_helpers.py", line 365, in create_file_object
celery-worker_1  |     _add_characteristics_extension(fs_entry, new_file.id)
celery-worker_1  |   File "/src/AIPscan/Aggregator/database_helpers.py", line 314, in _add_characteristics_extension
celery-worker_1  |     file_.characteristics_extension = premis_object.tostring()
celery-worker_1  |   File "/usr/local/lib/python3.8/dist-packages/metsrw/plugins/premisrw/premis.py", line 139, in tostring
celery-worker_1  |     self.serialize(), pretty_print=pretty_print, encoding=encoding
celery-worker_1  |   File "/usr/local/lib/python3.8/dist-packages/metsrw/plugins/premisrw/premis.py", line 135, in serialize
celery-worker_1  |     return data_to_premis(self._data, self.premis_version)
celery-worker_1  |   File "/usr/local/lib/python3.8/dist-packages/metsrw/plugins/premisrw/premis.py", line 722, in data_to_premis
celery-worker_1  |     return _data_to_lxml_el(data, "premis", nsmap)
celery-worker_1  |   File "/usr/local/lib/python3.8/dist-packages/metsrw/plugins/premisrw/premis.py", line 608, in _data_to_lxml_el
celery-worker_1  |     _data_to_lxml_el(
celery-worker_1  |   File "/usr/local/lib/python3.8/dist-packages/metsrw/plugins/premisrw/premis.py", line 608, in _data_to_lxml_el
celery-worker_1  |     _data_to_lxml_el(
celery-worker_1  |   File "/usr/local/lib/python3.8/dist-packages/metsrw/plugins/premisrw/premis.py", line 608, in _data_to_lxml_el
celery-worker_1  |     _data_to_lxml_el(
celery-worker_1  |   [Previous line repeated 3 more times]
celery-worker_1  |   File "/usr/local/lib/python3.8/dist-packages/metsrw/plugins/premisrw/premis.py", line 620, in _data_to_lxml_el
celery-worker_1  |     ret = func(*args)
celery-worker_1  |   File "src/lxml/builder.py", line 208, in lxml.builder.ElementMaker.__call__
celery-worker_1  |   File "src/lxml/etree.pyx", line 3022, in lxml.etree.Element
celery-worker_1  |   File "src/lxml/apihelpers.pxi", line 101, in lxml.etree._makeElement
celery-worker_1  |   File "src/lxml/apihelpers.pxi", line 1734, in lxml.etree._tagValidOrRaise
celery-worker_1  | ValueError: Invalid tag name 'http://hul.harvard.edu/ois/xml/ns/fits/fitsOutput:tool'
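The rejected tag name looks like a namespace URI joined to a local name with a colon. lxml accepts namespaced tags in the "{namespace}local" form, so one conceivable direction is to convert such names before building elements. The helper below is a hypothetical sketch of that conversion only; whether it is the right fix for metsrw has not been established.

```python
def to_clark(tag):
    """Turn 'namespace-uri:local' into the '{namespace-uri}local' form."""
    namespace, sep, local = tag.rpartition(":")
    # Only rewrite when the part before the last colon looks like a URI;
    # this "/" check is a heuristic, not a full URI parser.
    if sep and "/" in namespace:
        return "{%s}%s" % (namespace, local)
    return tag


print(to_clark("http://hul.harvard.edu/ois/xml/ns/fits/fitsOutput:tool"))
```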

Problem: Cannot generate EVENTS through member variables

Hi Joel,

Looking for clarification on how to write PREMIS events using this. If I try the following:

ev = premisrw.PREMISEvent()
ev.event_type = "AM CAMP DEMO"
ev.event_outcome_detail = "SUCCESS"
ev.event_outcome_detail_note = "dag iedereen!"
ev.linking_agent_identifier_type = "python script"
ev.linking_agent_identifier_value = "1.0"
print ev.generate_data

Then the default values for the object are output and no more:

<bound method PREMISEvent.generate_data of ('event', 
    {'xsi:schema_location': 'info:lc/xmlns/premis-v2 http://www.loc.gov/standards/premis/v2/premis-v2-2.xsd', 'version': '2.2'}, 
    ('event_identifier', ('event_identifier_type', 'UUID'), 
    ('event_identifier_value', 'aa4f20ef-678e-4b07-9bb6-0f875930ddaa')), 
    ('event_date_time', '2020-09-03T14:44:33'))>

I am currently only able to create an event object and output it to XML by creating a tuple using the technique outlined here.

I guess, am I missing something?

METS is invalid according to XMLstarlet due to PREMIS - How do ye validate?

Hi,
SUMMARY:
When I validate the output of Archivematica METS against the mets.xsd schema, it says that it's invalid. When I create a custom XSD that references both METS and PREMIS schemas, then all is well. How do ye validate your XML as part of your dev process?

ISSUE:
This particularly seems to relate to PREMIS:TYPE definitions, and when I remove some of the extra namespace info for the PREMIS data, it validates just fine. Perhaps xmlstarlet isn't the best for this type of operation?

How to replicate:
I took a METS XML file from the current Archivematica sandbox, and I uploaded it here: https://gist.github.com/kieranjol/43f3d977306e3740daefaa284cc2d565
I validated it with the METS XSD from here: https://www.loc.gov/standards/mets/mets.xsd
and the result is at the end of this issue.
However eventually I found this from the PREMISv2 days, and it appears to be a similar issue: https://stackoverflow.com/questions/26712645/xml-type-definition-is-absent

I edited the example in the answer and created a new xsd which contains the following, and that validated your XML output just fine.


<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" 
           elementFormDefault="qualified"> 

  <xs:import namespace="http://www.loc.gov/METS/"
     schemaLocation="http://www.loc.gov/standards/mets/mets.xsd"
  />

  <xs:import namespace="http://www.loc.gov/premis/v3"
    schemaLocation="http://www.loc.gov/standards/premis/v3/premis.xsd"
  />
</xs:schema>

And here's the error I got when validating Archivematica METS against the original mets.xsd:

xml val -e -s mets.xsd  ..\Downloads\METS.56006c7d-77ce-462a-b20f-35650ca66e52.xml
../Downloads/METS.56006c7d-77ce-462a-b20f-35650ca66e52.xml:7.72: Element '{http://www.loc.gov/premis/v3}object', attribute '{http://www.w3.org/2001/XMLSchema-instance}type': The QName value '{http://www.loc.gov/premis/v3}intellectualEntity' of the xsi:type attribute does not resolve to a type definition.
../Downloads/METS.56006c7d-77ce-462a-b20f-35650ca66e52.xml:7.72: Element '{http://www.loc.gov/premis/v3}object': The type definition is absent.
../Downloads/METS.56006c7d-77ce-462a-b20f-35650ca66e52.xml:21.74: Element '{http://www.loc.gov/premis/v3}object', attribute '{http://www.w3.org/2001/XMLSchema-instance}type': The QName value '{http://www.loc.gov/premis/v3}file' of the xsi:type attribute does not resolve to a type definition.
../Downloads/METS.56006c7d-77ce-462a-b20f-35650ca66e52.xml:21.74: Element '{http://www.loc.gov/premis/v3}object': The type definition is absent.
../Downloads/METS.56006c7d-77ce-462a-b20f-35650ca66e52.xml:549.74: Element '{http://www.loc.gov/premis/v3}object', attribute '{http://www.w3.org/2001/XMLSchema-instance}type': The QName value '{http://www.loc.gov/premis/v3}file' of the xsi:type attribute does not resolve to a type definition.
../Downloads/METS.56006c7d-77ce-462a-b20f-35650ca66e52.xml:549.74: Element '{http://www.loc.gov/premis/v3}object': The type definition is absent.
../Downloads/METS.56006c7d-77ce-462a-b20f-35650ca66e52.xml:744.74: Element '{http://www.loc.gov/premis/v3}object', attribute '{http://www.w3.org/2001/XMLSchema-instance}type': The QName value '{http://www.loc.gov/premis/v3}file' of the xsi:type attribute does not resolve to a type definition.
../Downloads/METS.56006c7d-77ce-462a-b20f-35650ca66e52.xml:744.74: Element '{http://www.loc.gov/premis/v3}object': The type definition is absent.
../Downloads/METS.56006c7d-77ce-462a-b20f-35650ca66e52.xml:1105.74: Element '{http://www.loc.gov/premis/v3}object', attribute '{http://www.w3.org/2001/XMLSchema-instance}type': The QName value '{http://www.loc.gov/premis/v3}file' of the xsi:type attribute does not resolve to a type definition.
../Downloads/METS.56006c7d-77ce-462a-b20f-35650ca66e52.xml:1105.74: Element '{http://www.loc.gov/premis/v3}object': The type definition is absent.
../Downloads/METS.56006c7d-77ce-462a-b20f-35650ca66e52.xml:1370.74: Element '{http://www.loc.gov/premis/v3}object', attribute '{http://www.w3.org/2001/XMLSchema-instance}type': The QName value '{http://www.loc.gov/premis/v3}file' of the xsi:type attribute does not resolve to a type definition.
../Downloads/METS.56006c7d-77ce-462a-b20f-35650ca66e52.xml:1370.74: Element '{http://www.loc.gov/premis/v3}object': The type definition is absent.
../Downloads/METS.56006c7d-77ce-462a-b20f-35650ca66e52.xml:1635.74: Element '{http://www.loc.gov/premis/v3}object', attribute '{http://www.w3.org/2001/XMLSchema-instance}type': The QName value '{http://www.loc.gov/premis/v3}file' of the xsi:type attribute does not resolve to a type definition.
../Downloads/METS.56006c7d-77ce-462a-b20f-35650ca66e52.xml:1635.74: Element '{http://www.loc.gov/premis/v3}object': The type definition is absent.
..\Downloads\METS.56006c7d-77ce-462a-b20f-35650ca66e52.xml - invalid

Problem: `FSEntry` instances can circularly reference themselves in their `derived_from` attributes

It is possible for f = FSEntry(); f.derived_from = f to be true when certain METS files are parsed, cf. the strange derived_from values and lack of UUIDs in the following:

P1050152.JPG with UUID None is derived from P1050152.JPG with UUID None
P1050154.JPG with UUID None is derived from P1050152.JPG with UUID None
P1050155.JPG with UUID None is derived from P1050152.JPG with UUID None
P1050156.JPG with UUID None is derived from P1050152.JPG with UUID None

Parsing the METS file of the AIP at http://am17x.qa.archivematica.org/archival-storage/6214faf5-eab6-424c-b0f9-b1078e7c0828/ will exhibit this behaviour. This seems to be related to the presence of USE="service" type files.

<mets:div LABEL="service" TYPE="Directory" DMDID="dmdSec_2">
  <mets:div LABEL="P1050152.JPG" TYPE="Item">
    <mets:fptr FILEID="file-acabdea5-3f09-4dd4-814c-cab7cbc662dc"/>
  </mets:div>
  <mets:div LABEL="P1050154.JPG" TYPE="Item">
    <mets:fptr FILEID="file-bb7fb59a-a858-482d-b8f8-d9631356d5cc"/>
  </mets:div>
  <mets:div LABEL="P1050155.JPG" TYPE="Item">
    <mets:fptr FILEID="file-414c5cc1-f5df-4a4a-8621-c4303b82092f"/>
  </mets:div>
  <mets:div LABEL="P1050156.JPG" TYPE="Item">
    <mets:fptr FILEID="file-d5c9fcbe-284d-44f9-a6b8-f6518185518e"/>
  </mets:div>
</mets:div>

This will ultimately trigger a RuntimeError: maximum recursion depth exceeded error when attempting an AIP re-ingest. See artefactual/archivematica-storage-service#254.
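Until the root cause is found, code that walks derived_from links could guard against cycles instead of recursing without bound. The sketch below uses a minimal stand-in class, Entry, rather than the real FSEntry, and derivation_chain is a hypothetical helper.

```python
class Entry:
    """Minimal stand-in for FSEntry with only the fields used here."""

    def __init__(self, name, derived_from=None):
        self.name = name
        self.derived_from = derived_from


def derivation_chain(entry):
    """Follow derived_from links, failing as soon as an entry repeats."""
    chain = []
    seen = set()
    while entry is not None:
        if id(entry) in seen:
            raise ValueError("circular derived_from at %s" % entry.name)
        seen.add(id(entry))
        chain.append(entry.name)
        entry = entry.derived_from
    return chain


original = Entry("P1050152.tif")
derivative = Entry("P1050152.JPG", derived_from=original)
print(derivation_chain(derivative))

looped = Entry("P1050152.JPG")
looped.derived_from = looped  # the self-reference described above
try:
    derivation_chain(looped)
except ValueError as err:
    print(err)
```

Raising a clear error at the first repeated entry makes the bad METS input visible long before the recursion limit is reached.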

More investigation needed.

Problem: XML Schema Location isn't output when creating a PREMIS event

If we do something along the lines of:

import uuid
from datetime import datetime

import lxml.etree
from metsrw.plugins import premisrw


def generate_event():
    # Build a new PREMIS event, as nested tuples, to add to our METS
    return (
        'event',
        ('event_identifier',
            ('event_identifier_type', "UUID"),
            ('event_identifier_value', str(uuid.uuid4()))),
        ('event_type', "AM CAMP DEMO"),
        ('event_date_time', datetime.now().isoformat()),
        ('event_detail', "Adding new PREMIS EVENTS"),
        ('event_outcome_information',
            ('event_outcome', "SUCCESS"),
            ('event_outcome_detail',
                ('event_outcome_detail_note', "dag iedereen!"))),
        ('linking_agent_identifier',
            ('linking_agent_identifier_type', "python script"),
            ('linking_agent_identifier_value', "1.0")),
    )


# tostring() returns bytes, so decode before printing
print(lxml.etree.tostring(premisrw.data_to_premis(generate_event()),
                          pretty_print=True).decode())

The output is as follows:

<premis:event xmlns:premis="info:lc/xmlns/premis-v2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <premis:eventIdentifier>
    <premis:eventIdentifierType>UUID</premis:eventIdentifierType>
    <premis:eventIdentifierValue>8557f11f-a4c0-447d-9f01-cdae5e41e535</premis:eventIdentifierValue>
  </premis:eventIdentifier>
  <premis:eventType>AM CAMP DEMO</premis:eventType>
  <premis:eventDateTime>2018-04-05T13:44:17.283713</premis:eventDateTime>
  <premis:eventDetail>Adding new PREMIS EVENTS</premis:eventDetail>
  <premis:eventOutcomeInformation>
    <premis:eventOutcome>SUCCESS</premis:eventOutcome>
    <premis:eventOutcomeDetail>
      <premis:eventOutcomeDetailNote>dag iedereen!</premis:eventOutcomeDetailNote>
    </premis:eventOutcomeDetail>
  </premis:eventOutcomeInformation>
  <premis:linkingAgentIdentifier>
    <premis:linkingAgentIdentifierType>python script</premis:linkingAgentIdentifierType>
    <premis:linkingAgentIdentifierValue>1.0</premis:linkingAgentIdentifierValue>
  </premis:linkingAgentIdentifier>
</premis:event>

Which, when we add this to an existing METS document and validate against our schematron file, will result in:

Error: A digiprovMD mdWrap element MUST contain an XML schema location.

and related:

Unless MDTYPE is OTHER an mdRef element MUST contain an XML schema location.

example METS output:

    <mets:digiprovMD ID="digiprovMD_231587" CREATED="2018-04-05T11:12:21">
      <mets:mdWrap MDTYPE="PREMIS:EVENT">
        <mets:xmlData>
          <premis:event xmlns:premis="info:lc/xmlns/premis-v2">
            <premis:eventIdentifier>
              <premis:eventIdentifierType>UUID</premis:eventIdentifierType>
              <premis:eventIdentifierValue>65c8369f-2b47-49e0-be0c-08da6bbd8b24</premis:eventIdentifierValue>
            </premis:eventIdentifier>
            <premis:eventType>AM CAMP DEMO</premis:eventType>
            <premis:eventDateTime>2018-04-05T13:12:21.128197</premis:eventDateTime>
            <premis:eventDetail>Adding new PREMIS EVENTS</premis:eventDetail>
            <premis:eventOutcomeInformation>
              <premis:eventOutcome>SUCCESS</premis:eventOutcome>
              <premis:eventOutcomeDetail>
                <premis:eventOutcomeDetailNote>dag iedereen!</premis:eventOutcomeDetailNote>
              </premis:eventOutcomeDetail>
            </premis:eventOutcomeInformation>
            <premis:linkingAgentIdentifier>
              <premis:linkingAgentIdentifierType>python script</premis:linkingAgentIdentifierType>
              <premis:linkingAgentIdentifierValue>1.0</premis:linkingAgentIdentifierValue>
            </premis:linkingAgentIdentifier>
          </premis:event>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:digiprovMD>

I believe we need xsi:schemaLocation="info:lc/xmlns/premis-v2 http://www.loc.gov/standards/premis/v2/premis-v2-2.xsd" to persist into the output, as per the comment: https://github.com/artefactual-labs/mets-reader-writer/blob/master/metsrw/plugins/premisrw/premis.py#L586

Problem: linting is disabled

We run flake8 in Travis CI but it throws many errors and the build status is ignored. We should also look into pylint and flake8 plugins like bugbear, docstyle or import-order.

Problem: dependency declaration is overly verbose

In order to declare a dependency, e.g., in the FSEntry class, we currently write:

premis_object_class = Dependency('premis_object_class', ...)

Providing the dependency name as first argument is unnecessarily verbose given that the same name is provided as the managed class's class attribute, e.g., FSEntry.premis_object_class. It would be better if we could write:

premis_object_class = Dependency(...)
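Python 3.6 and later call __set_name__ on descriptors when the owning class is created, which lets a descriptor learn its attribute name without being told. The sketch below shows the mechanism only; the registry and the provide method are hypothetical and are not how metsrw's Dependency class is implemented.

```python
class Dependency:
    """Descriptor that learns its own name from the class body."""

    _registry = {}  # hypothetical store of provided dependencies

    def __set_name__(self, owner, name):
        # Called automatically when the owning class is created.
        self.name = name

    def __get__(self, instance, owner=None):
        return self._registry[self.name]

    @classmethod
    def provide(cls, name, value):
        """Register the object to hand out for a given dependency name."""
        cls._registry[name] = value


class FSEntry:
    premis_object_class = Dependency()  # no name argument needed


Dependency.provide("premis_object_class", dict)
print(FSEntry.premis_object_class)
```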

Problem: Archivematica METS Schematron doesn't support Archivematica 'registration' event type

Per here: https://github.com/artefactual/archivematica/blob/0c748d9f448b8d18961fc8cb764c0149e56fd11b/src/archivematicaCommon/lib/fileOperations.py#L81

The supported types: https://github.com/artefactual-labs/mets-reader-writer/blob/master/metsrw/resources/archivematica_mets_schematron.xml#L33

ingestion, 
message digest calculation, 
virus check, 
name cleanup, 
format identification, 
validation, 
normalization, 
fixity check, 
creation, 
unpacking, 
compression

If we try validating with schematron on a 1.7 METS, example PREMIS:

<premis:event xmlns:premis="info:lc/xmlns/premis-v2" xsi:schemaLocation="info:lc/xmlns/premis-v2 http://www.loc.gov/standards/premis/v2/premis-v2-2.xsd" version="2.2">
            <premis:eventIdentifier>
              <premis:eventIdentifierType>UUID</premis:eventIdentifierType>
              <premis:eventIdentifierValue>0e745fb1-171a-404a-ab71-7d845c042d25</premis:eventIdentifierValue>
            </premis:eventIdentifier>
            <premis:eventType>registration</premis:eventType>
            <premis:eventDateTime>2018-04-04T10:37:06.517039+00:00</premis:eventDateTime>
            <premis:eventDetail></premis:eventDetail>
            <premis:eventOutcomeInformation>
              <premis:eventOutcome></premis:eventOutcome>
              <premis:eventOutcomeDetail>
                <premis:eventOutcomeDetailNote>accession#am_camp_1</premis:eventOutcomeDetailNote>
              </premis:eventOutcomeDetail>
            </premis:eventOutcomeInformation>
            <premis:linkingAgentIdentifier>
              <premis:linkingAgentIdentifierType>preservation system</premis:linkingAgentIdentifierType>
              <premis:linkingAgentIdentifierValue>Archivematica-1.7</premis:linkingAgentIdentifierValue>
            </premis:linkingAgentIdentifier>
            <premis:linkingAgentIdentifier>
              <premis:linkingAgentIdentifierType>repository code</premis:linkingAgentIdentifierType>
              <premis:linkingAgentIdentifierValue>test</premis:linkingAgentIdentifierValue>
            </premis:linkingAgentIdentifier>
            <premis:linkingAgentIdentifier>
              <premis:linkingAgentIdentifierType>Archivematica user pk</premis:linkingAgentIdentifierType>
              <premis:linkingAgentIdentifierValue>1</premis:linkingAgentIdentifierValue>
            </premis:linkingAgentIdentifier>
          </premis:event>

We see:

 <svrl:failed-assert test="contains($premisEventTypes, m:xmlData/p:event/p:eventType)" location="/*[local-name()=\'mets\' and namespace-uri()=\'http://www.loc.gov/METS/\']/*[local-name()=\'amdSec\' and namespace-uri()=\'http://www.loc.gov/METS/\'][14]/*[local-name()=\'digiprovMD\' and namespace-uri()=\'http://www.loc.gov/METS/\'][2]/*[local-name()=\'mdWrap\' and namespace-uri()=\'http://www.loc.gov/METS/\']">
    <svrl:text>A PREMIS event MUST be of a recognized eventType. (registration is not in ingestion, message digest calculation, virus check, name cleanup, format identification, validation, normalization, fixity check, creation, unpacking, compression)</svrl:text>
  </svrl:failed-assert>

METS file with uppercase PHYSICAL and LOGICAL structMap

Hello,

I'm trying to parse METS files from the BnF (the French national library), which use uppercase PHYSICAL and LOGICAL as the structMap TYPE.

This library fails to read such a file, saying no structMap was found.

I suppose the easy fix would be to modify the METS files and lowercase all PHYSICAL and LOGICAL attributes, but I was wondering whether metsrw could be improved to take this corner case into account?

Thank you for your help.

PS: here is the file https://pastebin.com/DT6grTEe
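For illustration, a case-insensitive lookup of structMap elements might look like the following. It uses the standard library's ElementTree rather than metsrw's own parsing code, and find_struct_maps and SAMPLE are hypothetical names.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL"/>
  <mets:structMap TYPE="logical"/>
</mets:mets>"""


def find_struct_maps(root, wanted_type):
    """Return structMap elements whose TYPE matches, ignoring case."""
    wanted = wanted_type.lower()
    return [
        element
        for element in root.iter("{%s}structMap" % METS_NS)
        if element.get("TYPE", "").lower() == wanted
    ]


root = ET.fromstring(SAMPLE)
print(len(find_struct_maps(root, "physical")))
print(len(find_struct_maps(root, "LOGICAL")))
```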

Problem: API for defining metadata plugins is ill-defined

The following PRs introduce PREMIS-related functionality to metsrw:

The Write pointer files PR #27 introduces an informal plugin system that allows metsrw's METSDocument class to know how to work with different metadata standards, in this case PREMIS. The API for this plugin system needs to be analyzed and explicitly defined so that it can be used, e.g., to convert PR #20's PREMIS work to a plugin.

Here is how the metsrw plugin system currently works. The constructor of the METSDocument class now accepts a plugins kwarg which should be a dict mapping mdType values (e.g., 'PREMIS:OBJECT') to classes. Those classes are expected to have a class method fromtree which takes an lxml._Element instance (provided by metsrw's FSEntry) and returns an instance of the plugin's representation of the metadata element, cf. the PREMISRW plugin. This allows one to call METSDocument(plugins=my_plugins).get_file(file_uuid=my_file_uuid).get_premis_objects() and get PREMIS objects as the anticipated type of Python instance.

In the other direction, mets_fs_entry.add_premis_object(premis_object.serialize()) is how a PREMIS metadata element is currently added to an existing metsrw FSEntry instance.

In summary, under the API implicit in PR #27, metadata plugins must be classes that provide:

  1. a fromtree class method that creates an instance given an lxml._Element instance as input, and
  2. a serialize instance method that returns an lxml._Element instance.

This ^ API is definitely not written in stone. But it is a start...
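A minimal class satisfying those two requirements could look like the sketch below. NotePlugin and the 'OTHER:NOTE' key are hypothetical, and the standard library's ElementTree stands in for lxml so that the example runs on its own.

```python
import xml.etree.ElementTree as ET


class NotePlugin:
    """Hypothetical metadata plugin following the two-method contract."""

    def __init__(self, text):
        self.text = text

    @classmethod
    def fromtree(cls, element):
        """Create an instance from an XML element."""
        return cls(element.text or "")

    def serialize(self):
        """Return an XML element representing this instance."""
        element = ET.Element("note")
        element.text = self.text
        return element


# A plugins mapping keyed by mdType value, as described above.
plugins = {"OTHER:NOTE": NotePlugin}

parsed = plugins["OTHER:NOTE"].fromtree(ET.fromstring("<note>hello</note>"))
round_tripped = NotePlugin.fromtree(parsed.serialize())
print(round_tripped.text)
```

Checking that fromtree and serialize round-trip is a cheap way to test any class written against this contract.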

Problem: it is difficult to access some attributes of `PREMISElement` instances

Right now, it is difficult to access attributes of collections of elements within a PREMISElement instance. For example, to get the premis:relationshipSubType value of all premis:relationship elements in a PREMISObject with current metsrw.plugins.premisrw, the following is necessary:

>>> for relationship in premis_object.relationship:
...     try:
...         sub_type = [el for el in relationship if el[0] == 'relationship_sub_type'][0][1]
...     except IndexError:
...         sub_type = None

It would be better if something like the following were possible:

>>> for relationship in premis_object.relationship:
...     sub_type = relationship.sub_type
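One way to get that kind of access is a thin wrapper that resolves attribute names against the nested tuples. TupleView below is a hypothetical sketch, not part of premisrw, and it assumes the tuple layout implied by the first example: a tag name followed by (name, value) children.

```python
class TupleView:
    """Attribute-style access to the children of a tuple-encoded element."""

    def __init__(self, data, prefix):
        self._data = data
        self._prefix = prefix

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        wanted = "%s_%s" % (self._prefix, name)
        for child in self._data[1:]:
            if isinstance(child, tuple) and child and child[0] == wanted:
                return child[1]
        return None  # mirrors the None fallback in the loop above


relationship = (
    "relationship",
    ("relationship_type", "derivation"),
    ("relationship_sub_type", "is source of"),
)
view = TupleView(relationship, "relationship")
print(view.sub_type)
```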
