enrich-authority-csv's People

Contributors

svenlieber

enrich-authority-csv's Issues

Improve performance by bundling requests

Currently, API calls are sent row by row. For a file with 10k rows this means 10k API requests, which adds massive overhead. Depending on the query keywords offered by the SRU endpoint, several requests could be bundled into one.

For example, at KBR, instead of querying IDNO:123 one could query IDNO('123', '456', '789'), bundling as many identifiers per request as the maximum length of the URL query string allows.
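A minimal sketch of such bundling, assuming the IDNO(...) list syntax from the KBR example above. The function names, the example base URL, and the 2000-character limit are illustrative assumptions, and a real SRU request would carry further parameters (version, operation, record schema) that also count towards the URL length.

```python
from urllib.parse import quote


def build_query(identifiers):
    """Build one bundled query, e.g. IDNO('123', '456', '789')."""
    quoted = ", ".join(f"'{identifier}'" for identifier in identifiers)
    return f"IDNO({quoted})"


def build_url(base_url, identifiers):
    """Build the request URL; only the query parameter is shown here."""
    return f"{base_url}?query={quote(build_query(identifiers))}"


def batch_identifiers(identifiers, base_url, max_url_length=2000):
    """Yield lists of identifiers whose bundled URL fits into max_url_length.

    max_url_length is an assumed limit; the real one depends on the server.
    """
    batch = []
    for identifier in identifiers:
        candidate = batch + [identifier]
        if batch and len(build_url(base_url, candidate)) > max_url_length:
            # Adding this identifier would exceed the limit: emit the batch.
            yield batch
            batch = [identifier]
        else:
            batch = candidate
    if batch:
        yield batch


if __name__ == "__main__":
    ids = [str(number) for number in range(10_000)]
    # Hypothetical endpoint URL, used for illustration only.
    batches = list(batch_identifiers(ids, "https://sru.example.org/sru"))
    print(f"{len(ids)} identifiers in {len(batches)} requests")
```

The response of a bundled request then has to be mapped back to the individual input rows, for example by reading the identifier from each returned record.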

Use environment variable for URLs

If someone does not want to put the publicly available URL of the SRU endpoint into the config, a workaround is needed.
The current workaround is to set the connection type to authenticated and use the following configuration:

"connection": {
    "type": "authenticated",
    "url": "$userVariable",
    "userVariable": "ENV_URL",
    "passwordVariable": ""
}

However, it would be more straightforward to also allow an environment variable for the URL directly.
Authentication is a separate concern.
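One way this could look, as a sketch only: the connection block accepts either a literal "url" or a key naming an environment variable. The key name "urlVariable" and the function name are hypothetical and not part of the current config format.

```python
import os


def resolve_url(connection, environ=os.environ):
    """Return the endpoint URL from a literal "url" or from "urlVariable".

    "urlVariable" is a hypothetical key naming an environment variable.
    """
    if "url" in connection:
        return connection["url"]
    variable = connection.get("urlVariable")
    if variable is None:
        raise KeyError('connection needs either "url" or "urlVariable"')
    if variable not in environ:
        raise KeyError(f"environment variable {variable} is not set")
    return environ[variable]


if __name__ == "__main__":
    connection = {"type": "unauthenticated", "urlVariable": "ENV_URL"}
    # The environment is passed explicitly here to keep the example self-contained.
    print(resolve_url(connection, {"ENV_URL": "https://sru.example.org/sru"}))
```

With this, the connection type keeps describing authentication only, and the URL source is chosen independently.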

Add commandline parameters to control console output or logfile creation

An alternative way to do what the script does is to manually use OpenRefine: https://www.oclc.org/developer/news/2018/using-oclc-apis-in-open-refine.en.html

The added value of the script is that no code needs to be written (only XPath expressions need to be specified in the config) and that the script can run automatically without interacting with a UI.
However, for better integration of the script into automated workflows, it would be nice if one could specify a log file (e.g. via --log or --logfile).

Currently there is still a lot of console output. It would be nice if this could be silenced too (e.g. via --quiet or --silent).
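A possible shape for these options, sketched with the standard argparse and logging modules. The option names follow the suggestions above; how the script currently produces its output is not shown here, so routing everything through logging is an assumption.

```python
import argparse
import logging


def parse_args(argv=None):
    """Parse the two proposed options; the script's other options are omitted."""
    parser = argparse.ArgumentParser(description="Enrich a CSV file via SRU lookups")
    parser.add_argument("--logfile", help="write log messages to this file")
    parser.add_argument("--quiet", action="store_true", help="suppress console output")
    return parser.parse_args(argv)


def configure_logging(args):
    """Send messages to a file and/or the console, depending on the options."""
    handlers = []
    if args.logfile:
        handlers.append(logging.FileHandler(args.logfile))
    if not args.quiet:
        handlers.append(logging.StreamHandler())
    if not handlers:
        # Without any handler, logging would fall back to printing warnings.
        handlers.append(logging.NullHandler())
    logging.basicConfig(level=logging.INFO, handlers=handlers, force=True)


if __name__ == "__main__":
    configure_logging(parse_args())
    logging.info("starting enrichment")
```

A progress bar would need the same --quiet check, since it usually writes to the console directly rather than through logging.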

Correct progress bar counter

The program seems to stop in the middle of execution: the progress bar never reaches 100%.
However, the number of lines in the input and in the output match. Thus the whole input file is processed, and only the count of total rows to process is wrong.
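The cause is not identified in this issue. One possible source of such a mismatch (an assumption, not a confirmed diagnosis) is counting physical lines instead of CSV records: the header row and quoted fields containing line breaks inflate the total. A sketch of counting records with the csv module, where the function name is illustrative:

```python
import csv
import io


def count_data_rows(csv_file, has_header=True):
    """Count CSV records (not physical lines), excluding the header row."""
    reader = csv.reader(csv_file)
    if has_header:
        next(reader, None)
    # Blank lines yield empty rows; they are skipped here, which assumes
    # the processing loop skips them as well.
    return sum(1 for row in reader if row)


if __name__ == "__main__":
    sample = 'id,name\n1,"Doe,\nJohn"\n2,Jane\n'
    print(len(sample.splitlines()), "physical lines")
    print(count_data_rows(io.StringIO(sample)), "data rows")
```

Whatever the actual cause, the total passed to the progress bar should be computed with the same reader and the same skipping rules as the processing loop.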

Make it possible to retrieve data from other SRU APIs

The National Library of France (BnF) also has an SRU API from which data such as nationality (UNIMARC field 102$a) could be fetched: http://catalogue.bnf.fr/api/SRU

The current script can be generalized to also fetch data from other sources.
All we would need is a proper configuration via config file and/or commandline arguments.

  • specify the SRU API and the authentication if required
  • specify how to extract a certain element, e.g. nationality with XPath over UNIMARC or INTERMARC BnF data, KBR identifier via XPath over ISNI data
  • when calling the script, check that the given data can be fetched from the given data source, e.g. KBR identifier cannot be fetched from BnF as there is no MARC field that contains it.
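The last check could be sketched as a simple lookup, assuming a nested configuration of the form apis → API name → "data" → record schema → field, as proposed in this issue. The function name and the error messages are illustrative.

```python
def check_field_available(config, api, record_schema, field):
    """Return the field's extraction spec, or raise ValueError if not configured."""
    apis = config["apis"]
    if api not in apis:
        raise ValueError(f"unknown API {api!r}; configured: {sorted(apis)}")
    schemas = apis[api]["data"]
    if record_schema not in schemas:
        raise ValueError(f"{api} has no record schema {record_schema!r}")
    fields = schemas[record_schema]
    if field not in fields:
        raise ValueError(
            f"{field!r} cannot be fetched from {api} ({record_schema}); "
            f"available: {sorted(fields)}"
        )
    return fields[field]


if __name__ == "__main__":
    # Shortened example config; the path value is abbreviated for illustration.
    config = {
        "apis": {
            "BnF": {
                "data": {
                    "unimarcxchange": {
                        "nationality": {"type": "element", "path": "example/path"}
                    }
                }
            }
        }
    }
    print(check_field_available(config, "BnF", "unimarcxchange", "nationality"))
    try:
        check_field_available(config, "BnF", "unimarcxchange", "KBR")
    except ValueError as error:
        print(error)
```

Running this check once at startup avoids sending any requests for a field that the chosen source cannot provide.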

A possible configuration:

{
  "apis": {
    "BnF": {
      "connection": {
        "type": "unauthenticated",
        "url": "http://catalogue.bnf.fr/api/SRU"
      },
      "data": {
        "unimarcxchange": {
          "nationality": {
            "type": "element",
            "path": "srw:records/srw:record/srw:recordData/mxc:record/mxc:datafield[@tag='102']/mxc:subfield[@code='a']"
          },
          "language": {
            "type": "element",
            "path": "srw:records/srw:record/srw:recordData/mxc:record/mxc:datafield[@tag='101']/mxc:subfield[@code='a']"
          }
        }
      }
    },
    "ISNI": {
      "connection": {
        "type": "authenticated",
        "url": "http://isni-m.oclc.org/sru/username=$userVariable/password=$passwordVariable/DB=1.3",
        "userVariable": "ISNI_SRU_USERNAME",
        "passwordVariable": "ISNI_SRU_PASSWORD"
      },
      "data": {
        "isni-e": {
          "nationality": {
            "type": "element",
            "path": "srw:records/srw:record/srw:recordData/responseRecord/ISNIAssigned/ISNIMetadata/identity/personOrFiction/additionalInformation/nationality"
          },
          "gender": {
            "type": "element",
            "path": "srw:records/srw:record/srw:recordData/responseRecord/ISNIAssigned/ISNIMetadata/identity/personOrFiction/additionalInformation/gender"
          },
          "KBR": {
            "type": "identifier",
            "path": "srw:records/srw:record/srw:recordData/responseRecord/ISNIAssigned/ISNIMetadata/sources",
            "identifierCodeSubpath": "codeOfSource",
            "identifierNameSubpath": "sourceIdentifier"
          },
          "BNF": {
            "type": "identifier",
            "path": "srw:records/srw:record/srw:recordData/responseRecord/ISNIAssigned/ISNIMetadata/sources",
            "identifierCodeSubpath": "codeOfSource",
            "identifierNameSubpath": "sourceIdentifier"
          },
          "NTA": {
            "type": "identifier",
            "path": "srw:records/srw:record/srw:recordData/responseRecord/ISNIAssigned/ISNIMetadata/sources",
            "identifierCodeSubpath": "codeOfSource",
            "identifierNameSubpath": "sourceIdentifier"
          }
        }
      }
    }
  }
}

Here, ISNI_SRU_USERNAME and ISNI_SRU_PASSWORD are environment variables, as they are already used now.
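The $userVariable and $passwordVariable placeholders in the URL happen to match the syntax of Python's string.Template, so the substitution could be sketched as follows. Whether the script resolves them this way is an assumption; the function name and the URL-encoding of the credentials are additions for illustration.

```python
import os
from string import Template
from urllib.parse import quote


def build_authenticated_url(connection, environ=os.environ):
    """Fill the URL placeholders from the environment variables named in the config."""
    values = {
        # Credentials are percent-encoded so that special characters
        # cannot break the URL path (an added precaution).
        "userVariable": quote(environ[connection["userVariable"]], safe=""),
        "passwordVariable": quote(environ[connection["passwordVariable"]], safe=""),
    }
    return Template(connection["url"]).substitute(values)


if __name__ == "__main__":
    connection = {
        "type": "authenticated",
        "url": "http://isni-m.oclc.org/sru/username=$userVariable/password=$passwordVariable/DB=1.3",
        "userVariable": "ISNI_SRU_USERNAME",
        "passwordVariable": "ISNI_SRU_PASSWORD",
    }
    # Made-up credentials, passed explicitly instead of via the real environment.
    fake_environ = {"ISNI_SRU_USERNAME": "alice", "ISNI_SRU_PASSWORD": "s3cret"}
    print(build_authenticated_url(connection, fake_environ))
```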

The configuration for the different identifiers is redundant and could probably be simplified further. But for starting the implementation, this is sufficient.
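To illustrate the redundancy: the three "identifier" entries share the same path and subpaths and differ only in their key. A sketch of an extraction routine using xml.etree.ElementTree shows that one shared spec plus the source code would be enough. The sample XML is hand-written for illustration, and the srw namespace URI and the matching of the config key against codeOfSource are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed namespace binding for the "srw" prefix used in the configured paths.
NAMESPACES = {"srw": "http://www.loc.gov/zing/srw/"}

# One spec shared by all identifier types (values taken from the config above).
IDENTIFIER_SPEC = {
    "type": "identifier",
    "path": "srw:records/srw:record/srw:recordData/responseRecord/ISNIAssigned/ISNIMetadata/sources",
    "identifierCodeSubpath": "codeOfSource",
    "identifierNameSubpath": "sourceIdentifier",
}

# Hand-written sample response, not real ISNI data.
SAMPLE = """<srw:searchRetrieveResponse xmlns:srw="http://www.loc.gov/zing/srw/">
  <srw:records><srw:record><srw:recordData>
    <responseRecord><ISNIAssigned><ISNIMetadata>
      <sources><codeOfSource>KBR</codeOfSource><sourceIdentifier>123</sourceIdentifier></sources>
      <sources><codeOfSource>BNF</codeOfSource><sourceIdentifier>456</sourceIdentifier></sources>
    </ISNIMetadata></ISNIAssigned></responseRecord>
  </srw:recordData></srw:record></srw:records>
</srw:searchRetrieveResponse>"""


def extract_identifier(root, spec, code):
    """Return the identifier whose source code equals `code`, or None."""
    for source in root.findall(spec["path"], NAMESPACES):
        if source.findtext(spec["identifierCodeSubpath"]) == code:
            return source.findtext(spec["identifierNameSubpath"])
    return None


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    for code in ("KBR", "BNF", "NTA"):
        print(code, extract_identifier(root, IDENTIFIER_SPEC, code))
```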
