mpapi's People

Contributors

mokko


mpapi's Issues

Better tests

I have included tests in the pre-commit hook in the meantime; the next step would be more fast tests that don't require HTTP.

ISILs for relatedWorks

Frank would like ISILs for relatedWorks.

  1. I could look these up in RIA during the Python step and write them directly into the LiDO XML.

  2. Alternatively, I would have to download the relatedWorks in MpApi and define a mechanism for distinguishing them from other object data, since they should not be published automatically. That would slow down the MpApi process.

async to speed up API requests

Try out async IO to speed up big, slow requests.
Should probably be a new project or at least a branch.

Needs a replacement for requests -> aiohttp.
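
A minimal sketch of the aiohttp direction (URLs and any auth are placeholders, not MpApi's actual interface):

```python
import asyncio
import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str) -> bytes:
    """GET one URL and return the raw body."""
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.read()

async def fetch_all(urls: list[str]) -> list[bytes]:
    """Issue all requests concurrently over one shared connection pool."""
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

# usage (illustrative):
# bodies = asyncio.run(fetch_all(["https://example.com/record/1", "..."]))
```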

Records without photos

Many records have no public photos.

  • There are Arbeitsphotos (working photos) that are not yet published. Who decides which photos are published? Who is responsible for the photos?

Issues w/ duplicate moduleItems

Issues with multiple nodes. I don't know why I get this. Is it possible that I have the same records multiple times?

Perhaps test it with specially prepared data and then write a select statement that loops only through distinct records (see the sketch after the log below).

split/DE-MUS-019118/211835.lido.xml
split/DE-MUS-019118/211836.lido.xml
Error in xsl:result-document/@href on line 20 column 75 of splitLido.xsl:
  XTDE1490  Cannot write more than one result document to the same URI:
  file:/C:/m3/zml2lido/sdata/3Wege/split/DE-MUS-019118/211836.lido.xml
  In template rule with match="/lido:lidoWrap/lido:lido" on line 15 of splitLido.xsl
     invoked by xsl:apply-templates at file:/C:/m3/zml2lido/xsl/splitLido.xsl#12
  In template rule with match="/" on line 11 of splitLido.xsl
Cannot write more than one result document to the same URI: file:/C:/m3/zml2lido/sdata/3Wege/split/DE-MUS-019118/211836.lido.xml
Traceback (most recent call last):
  File "C:\m3\zml2lido\src\lido.py", line 328, in <module>
    getattr(m, args.job)()
  File "C:\m3\zml2lido\src\lido.py", line 142, in smb
    self.splitLido(input=linklido_fn)  # individual records as files
  File "C:\m3\zml2lido\src\lido.py", line 235, in splitLido
    self._saxon(input=input, xsl=xsl["splitLido"], output="o.xml")
  File "C:\m3\zml2lido\src\lido.py", line 303, in _saxon
    subprocess.run(
  File "subprocess.py", line 528, in run
subprocess.CalledProcessError: Command 'java -Xmx1450m -jar C:\m3\SaxonHE10-5J\saxon-he-10.5.jar -s:C:\m3\zml2lido\sdata\3Wege\3Wege20211129-links.onlyPub.lido.xml -xsl:C:\m3\zml2lido\xsl\splitLido.xsl -o:o.xml' returned non-zero exit status 2.
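
The error above (two records writing to the same output URI) suggests duplicate ids. A minimal dedupe sketch, assuming lxml and the Zetcom module namespace used by MpApi (names are illustrative, not the project's verified API):

```python
from lxml import etree

NS = {"m": "http://www.zetcom.com/ria/ws/module"}  # assumed namespace

def drop_duplicate_items(tree: etree._ElementTree) -> etree._ElementTree:
    """Keep only the first moduleItem per id; drop later duplicates."""
    seen: set[str] = set()
    for item in tree.xpath("//m:moduleItem", namespaces=NS):
        item_id = item.get("id")
        if item_id in seen:
            item.getparent().remove(item)  # duplicate: remove the node
        else:
            seen.add(item_id)
    return tree

# usage (illustrative file names):
# tree = etree.parse("export.xml")
# drop_duplicate_items(tree).write("export.deduped.xml", encoding="UTF-8")
```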

vocabulary examples

Ontology

An instance is what I call a vocabulary. An instance has many nodes; nodes include one or several terms:

instance (= vocabulary)
  type
  node
    terms

Todo: Find out more about termClasses

Instances

Thesis: the instance mainly provides type and termClass.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<instance xmlns="http://www.zetcom.com/ria/ws/vocabulary"
          logicalName="ObjIconographyKeywordProjectVgr"
          id="61671">
	<uuid>3951a563-351d-46c1-b20a-bd565d8e00b3</uuid>
	<version>2020-06-28T15:44:46.099Z</version>
	<lastModified>2020-06-28T15:44:46.098Z</lastModified>
	<lastModifiedUser>zetLS</lastModifiedUser>
	<orgUnit logicalName="VocabularyAdministration"/>
	<type logicalName="simpleThesaurus"/>
	<termClasses>
		<termClass logicalName="Untereintrag"/>
		<termClass logicalName="Haupteintrag"/>
	</termClasses>
</instance>

Nodes

Thesis: a node mainly provides id, hierarchy (parents), and terms.

<node xmlns="http://www.zetcom.com/ria/ws/vocabulary"
      logicalName="Europeana-Fashion##Modeobjekte##Visuelle und verbale Kommunikation##analoge Medien##Zeichnung"
      id="4254998">
	<uuid>a356c4d0-b24a-5a6a-4eaf-e1305df2b6c0</uuid>
	<version>2021-12-01T11:59:23.128Z</version>
	<lastModified>2021-12-01T11:59:23.124Z</lastModified>
	<lastModifiedUser>IFM_FvH</lastModifiedUser>
	<orgUnit logicalName="VocabularyAdministration"/>
	<status logicalName="valid"/>
	<parents>
		<parent nodeId="4254966"/>
	</parents>
	<instance logicalName="ObjIconographyKeywordProjectVgr"
	          id="61671"/>
	<terms>
		<term id="4745128">
			<uuid>9f1e3769-8bdf-ad28-b039-452caaa25916</uuid>
			<version>2020-11-30T16:03:28.682Z</version>
			<lastModified>2020-11-30T16:03:28.682Z</lastModified>
			<lastModifiedUser>MPRiaImporter</lastModifiedUser>
			<isoLanguageCode>de</isoLanguageCode>
			<content>Zeichnung</content>
			<order>1</order>
			<status logicalName="valid"/>
			<category logicalName="preferred"/>
		</term>
	</terms>
</node>

Pagination

Thoughts on a possible solution:
1st request: ask how many results there are
2nd request: get the first package
Nth request: repeat until all results are fetched

E.g. size=3333

  1. Limit=1000, Offset=0
  2. Limit=1000, Offset=1000
  3. Limit=1000, Offset=2000
  4. Limit=1000, Offset=3000

For now, the package size will be a global variable in SAR.

New method: PaginatedSearch, with the same interface as search (sketched below).
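
A hedged sketch of the PaginatedSearch idea; the offset/limit setters and the total-size accessor are assumptions about the search interface, not its real API:

```python
def paginated_search(client, query, limit: int = 1000):
    """Yield one package of results per request until everything is fetched."""
    offset = 0
    total = None
    while total is None or offset < total:
        query.offset = offset            # assumed attributes on the query
        query.limit = limit
        response = client.search(query)
        if total is None:
            total = response.total_size  # assumed: the answer reports the size
        yield response
        offset += limit
```

If the first answer already reports the total size, the counting request and the first package can be one and the same request; with size=3333 and limit=1000 that makes four requests (offsets 0, 1000, 2000, 3000), matching the list above.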

Big Exports

Currently we save all the info from one export in one big XML file. That doesn't scale well: at about 1 GB I can't process XML anymore on a regular laptop. So there is currently a limit to how big a package can be, and it is about 10,000 records.

I could change the output format to a split format where every record is saved in its own file (obj123456.xml, pk1234567.xml etc.).

While currently we have a multi-entity file as output, the new system would save atomic records as files.
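
A minimal sketch of that atomic split, assuming the Zetcom module namespace and a module parent element carrying a name attribute; the prefix map mirrors the obj/pk naming above but is a guess:

```python
from pathlib import Path
from lxml import etree

NS = {"m": "http://www.zetcom.com/ria/ws/module"}  # assumed namespace
# illustrative prefix map; pk = Person/Körperschaft is a guess
PREFIX = {"Object": "obj", "Person": "pk"}

def split_export(src: Path, out_dir: Path) -> None:
    """Save every moduleItem of one big export as its own file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    tree = etree.parse(str(src))
    for item in tree.xpath("//m:moduleItem", namespaces=NS):
        mtype = item.getparent().get("name", "module")  # e.g. "Object"
        prefix = PREFIX.get(mtype, mtype.lower())
        fn = out_dir / f"{prefix}{item.get('id')}.xml"
        etree.ElementTree(item).write(
            str(fn), encoding="UTF-8", xml_declaration=True
        )
```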

Then all consumers would have to be rewritten. Currently there are only two: SHF/npx and Lido.

Consumers would then need to read multiple XML files in XSLT. Lido and npx are more compact, so we could pack far more records into one GB, but in principle there is still a size limit.

Another alternative is to (simply) introduce pagination, i.e. download 1000 or so records at a time. Pagination is also required for the atomic file-system format, so it is independent of the choice between the atomic and the package solution.

We would then continue to have multi-entity files. Downloads might get considerably slower if MpApi has to request individual files, but that might be avoidable.

So in reality we want a configurable package size, but as a first attempt I will assume the package size doesn't exceed 1000.

zetcom’s vocabulary code points

Try out the existing code points. The goal is to find the fastest way of getting the terms associated with a set of objects.

To this end, I will write a new test.

Status display

For longer processes it would be helpful if the status were displayed, e.g. 15% of 100 ... no matter what or how.

The main thing is that it becomes clear that something is happening ...

During a longer wait (> 5 minutes), the uncertainty grows whether the program has hung ...
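
A minimal sketch of such a status display (pure illustration, not tied to any existing MpApi loop):

```python
import sys

def report_progress(done: int, total: int) -> None:
    """Overwrite one status line, e.g. ' 15% (15/100)'."""
    pct = 100 * done // total
    print(f"\r{pct:3d}% ({done}/{total})", end="", file=sys.stderr, flush=True)

# usage inside any long-running loop (illustrative):
# for i, record in enumerate(records, start=1):
#     process(record)
#     report_progress(i, len(records))
```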

Update for mpapi

  1. Test a search query w/ a last-modified date (see the sketch after this list)
  2. Add an optional param to SAR.getByExhibit, getByGroup etc.
  3. Add an optional date to mink's jobs.dsl as a fourth param
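
For item 1, a hedged sketch of a date-restricted query; the import path, the addCriterion signature, and the __lastModified field are my assumptions about the Search interface:

```python
from mpapi.search import Search  # import path is an assumption

# only Object records changed since the given date (field name assumed)
q = Search(module="Object", limit=1000, offset=0)
q.addCriterion(
    operator="greater",
    field="__lastModified",
    value="2021-12-01",
)
```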

Download Address records as well

    <moduleReference name="ObjOwnerRef" targetModule="Address" multiplicity="N:1" size="1">
      <moduleReferenceItem moduleItemId="165950">
        <formattedValue language="en">Museum für Asiatische Kunst, Staatliche Museen zu Berlin</formattedValue>
      </moduleReferenceItem>
    </moduleReference>
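
A hedged sketch of resolving such references; getItem(module=..., id=...) is my assumption about the client interface, and the namespace is assumed as elsewhere:

```python
from lxml import etree

NS = {"m": "http://www.zetcom.com/ria/ws/module"}  # assumed namespace

def referenced_address_ids(tree: etree._ElementTree) -> set[str]:
    """Collect the Address ids behind all references targeting Address."""
    return set(
        tree.xpath(
            "//m:moduleReference[@targetModule='Address']"
            "/m:moduleReferenceItem/@moduleItemId",
            namespaces=NS,
        )
    )

# usage (illustrative):
# for addr_id in referenced_address_ids(tree):
#     client.getItem(module="Address", id=addr_id)  # assumed call
```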

zml-qc

A command-line utility that runs tests against zml (= mpx) data files. It tests whether there is a Sachbegriff, Objekttyp, verwaltende Institution etc. (see the sketch below).
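
A minimal sketch of such checks; the element names below are illustrative guesses at the mpx fields, not its verified schema:

```python
from lxml import etree

# each label maps to an XPath that must match for the record to pass
CHECKS = {
    "Sachbegriff": "//sachbegriff",
    "Objekttyp": "//objekttyp",
    "verwaltende Institution": "//verwaltendeInstitution",
}

def qc(fn: str) -> list[str]:
    """Return the labels of all failed checks for one data file."""
    tree = etree.parse(fn)
    return [label for label, xpath in CHECKS.items() if not tree.xpath(xpath)]

# usage (illustrative):
# failures = qc("export.mpx.xml")
# print(failures or "all checks passed")
```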

Mypy

Better support for type hints.

use session

  • write a test to check how much time we save by using a session
  • learn the session mechanism from metaodi

I suspect that short requests like a getItem profit more from a session than long requests, so it won't help me much at the moment. That's why I didn't do it earlier.
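
A minimal timing sketch for that test (the URL is a placeholder, not a real endpoint):

```python
import time
import requests

URL = "https://example.com/ria-ws/application/module/Object/123"  # placeholder

def time_requests(n: int = 20) -> None:
    """Compare n GETs without and with a shared Session (keep-alive)."""
    t0 = time.perf_counter()
    for _ in range(n):
        requests.get(URL)             # new TCP/TLS handshake every time
    t1 = time.perf_counter()
    with requests.Session() as s:
        for _ in range(n):
            s.get(URL)                # connection is reused
    t2 = time.perf_counter()
    print(f"no session: {t1 - t0:.2f}s   with session: {t2 - t1:.2f}s")
```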

todo: touch.py

Practically, recherche.smb only shows approved (freigegebene) multimedia resources if the last change to the object happened after the approval (Freigabe) on the resource level was granted. So let's write a replacer plugin that finds exactly those records and changes them.

Clean for upload

Todo: implement a new method in Module that expects downloaded XML and returns a "cleaned" version suitable for upload (see the sketch below).

Background: sometimes (or always?) RIA doesn't accept the XML that it spits out when that XML is uploaded again.

clean_for_upload() uses a side effect, i.e. it works on the internal XML.
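
A hedged sketch of the cleaning step; which fields RIA actually rejects is an assumption, the system fields below are merely picked from the examples above:

```python
from lxml import etree

NS = {"m": "http://www.zetcom.com/ria/ws/module"}  # assumed namespace
# server-generated fields that may be refused on re-upload (assumption)
SYSTEM_FIELDS = ("uuid", "version", "lastModified", "lastModifiedUser")

def clean_for_upload(tree: etree._ElementTree) -> None:
    """Strip server-generated fields in place (side effect on internal XML)."""
    for name in SYSTEM_FIELDS:
        for el in tree.xpath(f"//m:{name}", namespaces=NS):
            el.getparent().remove(el)
```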
