mpapi's People

Contributors

mokko


mpapi's Issues

Better tests

I have included tests in the pre-commit hook in the meantime; the next step would be more fast tests that don't require HTTP.

ISILs for relatedWorks

Frank would like ISILs for relatedWorks.

  1. I could look these up in RIA during the Python step and write them directly into the LiDO XML.

  2. Alternatively, I would have to download the relatedWorks in MpApi and define a mechanism for distinguishing them from other object data, since they should not be published automatically. That would slow down the MpApi process.

async to speed up API requests

Try out async IO to speed up big, slow requests.
Should probably be a new project or at least a branch.

Needs a replacement for requests -> aiohttp.
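
A minimal sketch of the aiohttp direction (URLs and any auth are placeholders, not MpApi's actual interface):

```python
import asyncio
import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str) -> bytes:
    """GET one URL and return the raw body."""
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.read()

async def fetch_all(urls: list[str]) -> list[bytes]:
    """Issue all requests concurrently over one shared connection pool."""
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

# usage (illustrative):
# bodies = asyncio.run(fetch_all(["https://example.com/record/1", "..."]))
```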

Records without photos

Many records have no public photos.

  • There are Arbeitsphotos (working photos) that are not yet published. Who decides which photos are published? Who is responsible for the photos?

Issues w/ duplicate moduleItems

Issues with multiple nodes. I don't know why I get this. Is it possible that I have the same records multiple times?

Perhaps test it with specially prepared data and then write a select statement that loops only through distinct records (see the sketch after the log below).

split/DE-MUS-019118/211835.lido.xml
split/DE-MUS-019118/211836.lido.xml
Error in xsl:result-document/@href on line 20 column 75 of splitLido.xsl:
  XTDE1490  Cannot write more than one result document to the same URI:
  file:/C:/m3/zml2lido/sdata/3Wege/split/DE-MUS-019118/211836.lido.xml
  In template rule with match="/lido:lidoWrap/lido:lido" on line 15 of splitLido.xsl
     invoked by xsl:apply-templates at file:/C:/m3/zml2lido/xsl/splitLido.xsl#12
  In template rule with match="/" on line 11 of splitLido.xsl
Cannot write more than one result document to the same URI: file:/C:/m3/zml2lido/sdata/3Wege/split/DE-MUS-019118/211836.lido.xml
Traceback (most recent call last):
  File "C:\m3\zml2lido\src\lido.py", line 328, in <module>
    getattr(m, args.job)()
  File "C:\m3\zml2lido\src\lido.py", line 142, in smb
    self.splitLido(input=linklido_fn)  # individual records as files
  File "C:\m3\zml2lido\src\lido.py", line 235, in splitLido
    self._saxon(input=input, xsl=xsl["splitLido"], output="o.xml")
  File "C:\m3\zml2lido\src\lido.py", line 303, in _saxon
    subprocess.run(
  File "subprocess.py", line 528, in run
subprocess.CalledProcessError: Command 'java -Xmx1450m -jar C:\m3\SaxonHE10-5J\saxon-he-10.5.jar -s:C:\m3\zml2lido\sdata\3Wege\3Wege20211129-links.onlyPub.lido.xml -xsl:C:\m3\zml2lido\xsl\splitLido.xsl -o:o.xml' returned non-zero exit status 2.
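
The error above (two records writing to the same output URI) suggests duplicate ids. A minimal dedupe sketch, assuming lxml and the Zetcom module namespace used by MpApi (names are illustrative, not the project's verified API):

```python
from lxml import etree

NS = {"m": "http://www.zetcom.com/ria/ws/module"}  # assumed namespace

def drop_duplicate_items(tree: etree._ElementTree) -> etree._ElementTree:
    """Keep only the first moduleItem per id; drop later duplicates."""
    seen: set[str] = set()
    for item in tree.xpath("//m:moduleItem", namespaces=NS):
        item_id = item.get("id")
        if item_id in seen:
            item.getparent().remove(item)  # duplicate: remove the node
        else:
            seen.add(item_id)
    return tree

# usage (illustrative file names):
# tree = etree.parse("export.xml")
# drop_duplicate_items(tree).write("export.deduped.xml", encoding="UTF-8")
```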

vocabulary examples

Ontology

An instance is what I call a vocabulary. An instance has many nodes; nodes include one or several terms:

instance (= vocabulary)
  type
  node
    terms

Todo: Find out more about termClasses

Instances

Thesis: the instance mainly provides type and termClass.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<instance xmlns="http://www.zetcom.com/ria/ws/vocabulary"
          logicalName="ObjIconographyKeywordProjectVgr"
          id="61671">
	<uuid>3951a563-351d-46c1-b20a-bd565d8e00b3</uuid>
	<version>2020-06-28T15:44:46.099Z</version>
	<lastModified>2020-06-28T15:44:46.098Z</lastModified>
	<lastModifiedUser>zetLS</lastModifiedUser>
	<orgUnit logicalName="VocabularyAdministration"/>
	<type logicalName="simpleThesaurus"/>
	<termClasses>
		<termClass logicalName="Untereintrag"/>
		<termClass logicalName="Haupteintrag"/>
	</termClasses>
</instance>

Nodes

Thesis: a node mainly provides id, hierarchy (parents), and terms.

<node xmlns="http://www.zetcom.com/ria/ws/vocabulary"
      logicalName="Europeana-Fashion##Modeobjekte##Visuelle und verbale Kommunikation##analoge Medien##Zeichnung"
      id="4254998">
	<uuid>a356c4d0-b24a-5a6a-4eaf-e1305df2b6c0</uuid>
	<version>2021-12-01T11:59:23.128Z</version>
	<lastModified>2021-12-01T11:59:23.124Z</lastModified>
	<lastModifiedUser>IFM_FvH</lastModifiedUser>
	<orgUnit logicalName="VocabularyAdministration"/>
	<status logicalName="valid"/>
	<parents>
		<parent nodeId="4254966"/>
	</parents>
	<instance logicalName="ObjIconographyKeywordProjectVgr"
	          id="61671"/>
	<terms>
		<term id="4745128">
			<uuid>9f1e3769-8bdf-ad28-b039-452caaa25916</uuid>
			<version>2020-11-30T16:03:28.682Z</version>
			<lastModified>2020-11-30T16:03:28.682Z</lastModified>
			<lastModifiedUser>MPRiaImporter</lastModifiedUser>
			<isoLanguageCode>de</isoLanguageCode>
			<content>Zeichnung</content>
			<order>1</order>
			<status logicalName="valid"/>
			<category logicalName="preferred"/>
		</term>
	</terms>
</node>

Pagination

Thoughts on a possible solution:
1st request: ask how many results there are
2nd request: get the first package
Nth request: repeat until all results are fetched

E.g. size=3333

  1. Limit=1000, Offset=0
  2. Limit=1000, Offset=1000
  3. Limit=1000, Offset=2000
  4. Limit=1000, Offset=3000

For now, the package size will be a global variable in SAR.

New method: PaginatedSearch, with the same interface as search (sketched below).
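
A hedged sketch of the PaginatedSearch idea; the offset/limit setters and the total-size accessor are assumptions about the search interface, not its real API:

```python
def paginated_search(client, query, limit: int = 1000):
    """Yield one package of results per request until everything is fetched."""
    offset = 0
    total = None
    while total is None or offset < total:
        query.offset = offset            # assumed attributes on the query
        query.limit = limit
        response = client.search(query)
        if total is None:
            total = response.total_size  # assumed: the answer reports the size
        yield response
        offset += limit
```

If the first answer already reports the total size, the counting request and the first package can be one and the same request; with size=3333 and limit=1000 that makes four requests (offsets 0, 1000, 2000, 3000), matching the list above.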

Big Exports

Currently we save all the info from one export in one big XML file. That doesn't scale well: at about 1 GB I can't process XML anymore on a regular laptop. So there is currently a limit to how big a package can be, and it is about 10,000 records.

I could change the output format to a split format where every record is saved in its own file (obj123456.xml, pk1234567.xml etc.).

While currently we have a multi-entity file as output, the new system would save atomic records as files.
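
A minimal sketch of that atomic split, assuming the Zetcom module namespace and a module parent element carrying a name attribute; the prefix map mirrors the obj/pk naming above but is a guess:

```python
from pathlib import Path
from lxml import etree

NS = {"m": "http://www.zetcom.com/ria/ws/module"}  # assumed namespace
# illustrative prefix map; pk = Person/Körperschaft is a guess
PREFIX = {"Object": "obj", "Person": "pk"}

def split_export(src: Path, out_dir: Path) -> None:
    """Save every moduleItem of one big export as its own file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    tree = etree.parse(str(src))
    for item in tree.xpath("//m:moduleItem", namespaces=NS):
        mtype = item.getparent().get("name", "module")  # e.g. "Object"
        prefix = PREFIX.get(mtype, mtype.lower())
        fn = out_dir / f"{prefix}{item.get('id')}.xml"
        etree.ElementTree(item).write(
            str(fn), encoding="UTF-8", xml_declaration=True
        )
```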

Then all consumers would have to be rewritten. Currently there are only two: SHF/npx and Lido.

Consumers would then need to read multiple XML files in XSLT. Lido and npx are more compact, so we could pack far more records into one GB, but in principle there is still a size limit.

Another alternative is to (simply) introduce pagination, i.e. download 1000 or so records at a time. Pagination is also required for the atomic file-system format, so it is independent of the choice between the atomic and the package solution.

We would then continue to have multi-entity files. Downloads might get considerably slower if MpApi has to request individual files, but that might be avoidable.

So in reality we want a configurable package size, but as a first attempt I will assume the package size doesn't exceed 1000.

zetcom’s vocabulary code points

Try out the existing code points. The goal is to find the fastest way of getting the terms associated with a set of objects.

To this end, I will write a new test.

Status display

For longer processes it would be helpful if the status were displayed, e.g. 15% of 100 ... no matter what or how.

The main thing is that it becomes clear that something is happening ...

During a longer wait (> 5 minutes), the uncertainty grows whether the program has hung ...
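
A minimal sketch of such a status display (pure illustration, not tied to any existing MpApi loop):

```python
import sys

def report_progress(done: int, total: int) -> None:
    """Overwrite one status line, e.g. ' 15% (15/100)'."""
    pct = 100 * done // total
    print(f"\r{pct:3d}% ({done}/{total})", end="", file=sys.stderr, flush=True)

# usage inside any long-running loop (illustrative):
# for i, record in enumerate(records, start=1):
#     process(record)
#     report_progress(i, len(records))
```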

Update for mpapi

  1. Test a search query w/ a last-modified date (see the sketch after this list)
  2. Add an optional param to SAR.getByExhibit, getByGroup etc.
  3. Add an optional date to mink's jobs.dsl as a fourth param
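
For item 1, a hedged sketch of a date-restricted query; the import path, the addCriterion signature, and the __lastModified field are my assumptions about the Search interface:

```python
from mpapi.search import Search  # import path is an assumption

# only Object records changed since the given date (field name assumed)
q = Search(module="Object", limit=1000, offset=0)
q.addCriterion(
    operator="greater",
    field="__lastModified",
    value="2021-12-01",
)
```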

Download Address records as well

    <moduleReference name="ObjOwnerRef" targetModule="Address" multiplicity="N:1" size="1">
      <moduleReferenceItem moduleItemId="165950">
        <formattedValue language="en">Museum für Asiatische Kunst, Staatliche Museen zu Berlin</formattedValue>
      </moduleReferenceItem>
    </moduleReference>
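
A hedged sketch of resolving such references; getItem(module=..., id=...) is my assumption about the client interface, and the namespace is assumed as elsewhere:

```python
from lxml import etree

NS = {"m": "http://www.zetcom.com/ria/ws/module"}  # assumed namespace

def referenced_address_ids(tree: etree._ElementTree) -> set[str]:
    """Collect the Address ids behind all references targeting Address."""
    return set(
        tree.xpath(
            "//m:moduleReference[@targetModule='Address']"
            "/m:moduleReferenceItem/@moduleItemId",
            namespaces=NS,
        )
    )

# usage (illustrative):
# for addr_id in referenced_address_ids(tree):
#     client.getItem(module="Address", id=addr_id)  # assumed call
```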

zml-qc

A command-line utility that runs tests against zml (= mpx) data files. It tests whether there is a Sachbegriff, Objekttyp, verwaltende Institution etc. (see the sketch below).
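
A minimal sketch of such checks; the element names below are illustrative guesses at the mpx fields, not its verified schema:

```python
from lxml import etree

# each label maps to an XPath that must match for the record to pass
CHECKS = {
    "Sachbegriff": "//sachbegriff",
    "Objekttyp": "//objekttyp",
    "verwaltende Institution": "//verwaltendeInstitution",
}

def qc(fn: str) -> list[str]:
    """Return the labels of all failed checks for one data file."""
    tree = etree.parse(fn)
    return [label for label, xpath in CHECKS.items() if not tree.xpath(xpath)]

# usage (illustrative):
# failures = qc("export.mpx.xml")
# print(failures or "all checks passed")
```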

Mypy

Better support for type hints.

use session

  • write a test to check how much time we save by using a session
  • learn the session mechanism from metaodi

I suspect that short requests like a getItem profit more from a session than long requests, so it won't help me much at the moment. That's why I didn't do it earlier.
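
A minimal timing sketch for that test (the URL is a placeholder, not a real endpoint):

```python
import time
import requests

URL = "https://example.com/ria-ws/application/module/Object/123"  # placeholder

def time_requests(n: int = 20) -> None:
    """Compare n GETs without and with a shared Session (keep-alive)."""
    t0 = time.perf_counter()
    for _ in range(n):
        requests.get(URL)             # new TCP/TLS handshake every time
    t1 = time.perf_counter()
    with requests.Session() as s:
        for _ in range(n):
            s.get(URL)                # connection is reused
    t2 = time.perf_counter()
    print(f"no session: {t1 - t0:.2f}s   with session: {t2 - t1:.2f}s")
```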

todo: touch.py

Practically, recherche.smb only shows approved (freigegebene) multimedia resources if the last change to the object happened after the approval (Freigabe) on the resource level was granted. So let's write a replacer plugin that finds exactly those records and changes them.

Clean for upload

Todo: implement a new method in Module that expects downloaded XML and returns a "cleaned" version suitable for upload (see the sketch below).

Background: sometimes (or always?) RIA doesn't accept the XML that it spits out when that XML is uploaded again.

clean_for_upload() uses a side effect, i.e. it works on the internal XML.
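
A hedged sketch of the cleaning step; which fields RIA actually rejects is an assumption, the system fields below are merely picked from the examples above:

```python
from lxml import etree

NS = {"m": "http://www.zetcom.com/ria/ws/module"}  # assumed namespace
# server-generated fields that may be refused on re-upload (assumption)
SYSTEM_FIELDS = ("uuid", "version", "lastModified", "lastModifiedUser")

def clean_for_upload(tree: etree._ElementTree) -> None:
    """Strip server-generated fields in place (side effect on internal XML)."""
    for name in SYSTEM_FIELDS:
        for el in tree.xpath(f"//m:{name}", namespaces=NS):
            el.getparent().remove(el)
```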
