kbrbe / enrich-authority-csv
A Python script that uses SRU APIs to complete a CSV file with missing data based on an available identifier column
License: GNU Affero General Public License v3.0
For KBR the query keyword is q. This should be configurable for every API endpoint, with query as the default.
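A minimal sketch of how a configurable query keyword could look, assuming the keyword is passed in from the endpoint's configuration. The function name build_sru_url and the example base URL are hypothetical and not taken from the script.

```python
from urllib.parse import urlencode


def build_sru_url(base_url, cql_query, query_keyword="query", extra_params=None):
    """Build an SRU request URL with a configurable query parameter name.

    `query_keyword` defaults to "query"; an endpoint such as KBR would
    override it with "q". All names here are illustrative assumptions.
    """
    params = dict(extra_params or {})
    params[query_keyword] = cql_query
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint URL, used only to show the two variants.
default_url = build_sru_url("https://example.org/sru", "IDNO:123")
kbr_style_url = build_sru_url("https://example.org/sru", "IDNO:123", query_keyword="q")
```

With this shape, each API entry in the config would only need one optional extra key for the keyword, and endpoints that follow the default would not need to set it.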
Currently there are only a few unit tests, but we should add overall integration tests like we did for our ISNI request script: https://github.com/kbrbe/request-isni/blob/9ac408030ddab3618213ad3ea064bb681dbb1e60/request_isni/test_request_isni.py
Currently API calls are sent row by row. For a file with 10k rows this means 10k API requests, which adds massive overhead. Depending on the query keywords offered by the SRU endpoint, several requests could be bundled.
For example, at KBR, instead of IDNO:123 one could query IDNO('123', '456', '789') to bundle as many identifiers as fit within the maximum length of the URL query string.
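The bundling idea can be sketched as follows. This is only an illustration: the IDNO(...) syntax is taken from the example above, while the function names, the base_length parameter and the 2000-character limit are assumptions, since the real limit depends on the endpoint and HTTP client.

```python
from urllib.parse import quote_plus


def format_query(identifiers):
    """Format identifiers in the IDNO('a', 'b') style shown above."""
    quoted = ", ".join(f"'{identifier}'" for identifier in identifiers)
    return f"IDNO({quoted})"


def batch_identifiers(identifiers, base_length, max_url_length=2000):
    """Yield lists of identifiers whose encoded query fits the URL budget.

    `base_length` is the length of everything in the URL except the query
    value; `max_url_length` is an assumed limit, not a documented one.
    """
    batch = []
    for identifier in identifiers:
        candidate = batch + [identifier]
        encoded_length = len(quote_plus(format_query(candidate)))
        if batch and base_length + encoded_length > max_url_length:
            # Adding this identifier would exceed the budget: emit the
            # current batch and start a new one with this identifier.
            yield batch
            batch = [identifier]
        else:
            batch = candidate
    if batch:
        yield batch


batches = list(batch_identifiers(["123", "456", "789"], base_length=0, max_url_length=40))
```

The responses of a bundled request would still have to be matched back to the individual CSV rows, which is not covered by this sketch.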
If someone does not want to put the publicly available URL of the SRU endpoint into the config, a workaround has to be used.
The current workaround is to set the connection type to authenticated and use the following configuration:
"connection": {
  "type": "authenticated",
  "url": "$userVariable",
  "userVariable": "ENV_URL",
  "passwordVariable": ""
}
However, it would be more straightforward to allow an environment variable for the URL directly, because authentication is a different concern.
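One possible shape for this, as a sketch only: the config key urlVariable and the function resolve_url are hypothetical and do not exist in the script today.

```python
import os


def resolve_url(connection, env=None):
    """Return the endpoint URL, preferring an environment variable.

    `urlVariable` is a hypothetical config key naming the environment
    variable that holds the URL; `url` is the existing literal key.
    """
    env = os.environ if env is None else env
    url_variable = connection.get("urlVariable")
    if url_variable:
        if url_variable not in env:
            raise KeyError(f"Environment variable {url_variable} is not set")
        return env[url_variable]
    return connection["url"]


# The connection stays unauthenticated; only the URL comes from the environment.
connection = {"type": "unauthenticated", "urlVariable": "ENV_URL"}
url = resolve_url(connection, env={"ENV_URL": "https://example.org/sru"})
```

This keeps the connection type independent of where the URL comes from, so the authenticated workaround above would no longer be needed.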
Currently the tool can't be installed as a module using pip.
An alternative to the script is to use OpenRefine manually: https://www.oclc.org/developer/news/2018/using-oclc-apis-in-open-refine.en.html
The added value of the script is that no code needs to be written (only XPath expressions need to be specified in the config) and that the script can run automatically without interacting with a UI.
However, for better integration of the script into automatic workflows, it would be nice if one could specify a log file (e.g. via --log or --logfile).
Currently there is still a lot of console output. It would be nice if this could be silenced too (e.g. via --quiet or --silent).
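A sketch of how such options could be wired up with the standard library. The flag names --logfile and --quiet follow the suggestions above; the logger name and the function names are assumptions, and the script's actual argument handling may differ.

```python
import argparse
import logging


def build_parser():
    """Argument parser with the two proposed (not yet existing) options."""
    parser = argparse.ArgumentParser(description="Enrich a CSV file via SRU APIs")
    parser.add_argument("--logfile", default=None, help="write log messages to this file")
    parser.add_argument("--quiet", action="store_true", help="suppress console output")
    return parser


def configure_logging(logfile=None, quiet=False):
    """Route messages to a file and/or the console depending on the options."""
    logger = logging.getLogger("enrich_authority_csv")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    if logfile:
        logger.addHandler(logging.FileHandler(logfile, encoding="utf-8"))
    if not quiet:
        logger.addHandler(logging.StreamHandler())
    if not logger.handlers:
        # Quiet mode without a log file: discard all messages.
        logger.addHandler(logging.NullHandler())
    return logger


args = build_parser().parse_args(["--logfile", "run.log", "--quiet"])
```

For this to silence the console, the existing print calls and the progress bar would have to go through the logger or respect the quiet flag as well.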
The program seems to stop in the middle of the execution: the progress bar never reaches 100%.
However, the number of lines in the input and in the output match. Thus the whole input file is processed, and only the count of the total rows that can be processed is wrong.
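The cause of the wrong total is not stated here. One way to keep the total consistent with the rows actually processed is to count rows with the same CSV parser that is used for processing, rather than counting raw lines. The following sketch assumes that approach; the function name is hypothetical.

```python
import csv
import io


def count_data_rows(stream):
    """Count CSV data rows (excluding the header) using the csv module.

    Counting parsed rows instead of raw lines gives the right total even
    when quoted fields contain line breaks.
    """
    return sum(1 for _ in csv.DictReader(stream))


# A quoted field spanning two lines: four raw lines, but two data rows.
sample = io.StringIO('id,name\n1,"Doe,\nJohn"\n2,Jane\n')
total_rows = count_data_rows(sample)
```

When reading from disk, the file should be opened with newline="" as the csv module documentation recommends, so that embedded line breaks are handled by the parser.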
The National Library of France (BnF) also has an SRU API from which data such as nationality (UNIMARC field 102$a) could be fetched: http://catalogue.bnf.fr/api/SRU
The current script can be generalized to also fetch data from other sources.
All we would need is a proper configuration via a config file and/or command-line arguments.
A possible configuration:
{
  "apis": {
    "BnF": {
      "connection": {
        "type": "unauthenticated",
        "url": "http://catalogue.bnf.fr/api/SRU"
      },
      "data": {
        "unimarcxchange": {
          "nationality": {
            "type": "element",
            "path": "srw:records/srw:record/srw:recordData/mxc:record/mxc:datafield[@tag='102']/mxc:subfield[@code='a']"
          },
          "language": {
            "type": "element",
            "path": "srw:records/srw:record/srw:recordData/mxc:record/mxc:datafield[@tag='101']/mxc:subfield[@code='a']"
          }
        }
      }
    },
    "ISNI": {
      "connection": {
        "type": "authenticated",
        "url": "http://isni-m.oclc.org/sru/username=$userVariable/password=$passwordVariable/DB=1.3",
        "userVariable": "ISNI_SRU_USERNAME",
        "passwordVariable": "ISNI_SRU_PASSWORD"
      },
      "data": {
        "isni-e": {
          "nationality": {
            "type": "element",
            "path": "srw:records/srw:record/srw:recordData/responseRecord/ISNIAssigned/ISNIMetadata/identity/personOrFiction/additionalInformation/nationality"
          },
          "gender": {
            "type": "element",
            "path": "srw:records/srw:record/srw:recordData/responseRecord/ISNIAssigned/ISNIMetadata/identity/personOrFiction/additionalInformation/gender"
          },
          "KBR": {
            "type": "identifier",
            "path": "srw:records/srw:record/srw:recordData/responseRecord/ISNIAssigned/ISNIMetadata/sources",
            "identifierCodeSubpath": "codeOfSource",
            "identifierNameSubpath": "sourceIdentifier"
          },
          "BNF": {
            "type": "identifier",
            "path": "srw:records/srw:record/srw:recordData/responseRecord/ISNIAssigned/ISNIMetadata/sources",
            "identifierCodeSubpath": "codeOfSource",
            "identifierNameSubpath": "sourceIdentifier"
          },
          "NTA": {
            "type": "identifier",
            "path": "srw:records/srw:record/srw:recordData/responseRecord/ISNIAssigned/ISNIMetadata/sources",
            "identifierCodeSubpath": "codeOfSource",
            "identifierNameSubpath": "sourceIdentifier"
          }
        }
      }
    }
  }
}
where ISNI_SRU_USERNAME and ISNI_SRU_PASSWORD are environment variables, as they are already used now.
The configuration for the different identifiers is redundant and could probably be simplified further, but it is sufficient to start the implementation.
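To illustrate how a path of type element from the configuration above could be evaluated, here is a minimal sketch using the standard library. The sample response is hand-written for illustration and is not an actual BnF response; the namespace URIs for the srw and mxc prefixes are assumed to be the standard SRU and MarcXchange ones, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed prefix-to-namespace mapping for the paths in the config.
NAMESPACES = {
    "srw": "http://www.loc.gov/zing/srw/",
    "mxc": "info:lc/xmlns/marcxchange-v2",
}

# Hand-written, simplified response used only for this example.
SAMPLE_RESPONSE = """<srw:searchRetrieveResponse
    xmlns:srw="http://www.loc.gov/zing/srw/"
    xmlns:mxc="info:lc/xmlns/marcxchange-v2">
  <srw:records><srw:record><srw:recordData>
    <mxc:record>
      <mxc:datafield tag="102"><mxc:subfield code="a">FR</mxc:subfield></mxc:datafield>
    </mxc:record>
  </srw:recordData></srw:record></srw:records>
</srw:searchRetrieveResponse>"""


def extract_elements(xml_text, path, namespaces=NAMESPACES):
    """Return the text of all elements matching `path` below the response root."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall(path, namespaces) if element.text]


nationality_path = (
    "srw:records/srw:record/srw:recordData/mxc:record/"
    "mxc:datafield[@tag='102']/mxc:subfield[@code='a']"
)
nationalities = extract_elements(SAMPLE_RESPONSE, nationality_path)
```

ElementTree only supports a subset of XPath, which is enough for the paths shown in the configuration; more complex expressions would need a fuller XPath implementation.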