Comments (5)
@zazi, thanks for the bug report. Could reproduce. The DNB endpoint is in general relatively broken. I believe I saw this error before:
Your request matches to many records (>100000). The result size is 353017. Please try to restrict the request-period.
$ curl -vL "https://services.dnb.de/oai/repository?from=2008-04-05T00:00:00Z&metadataPrefix=MARC21-xml&set=authorities:person&until=2008-04-05T23:59:59Z&verb=ListRecords"
<html><head><title>Error</title></head><body>Your request matches to many records (&gt;100000). The result size is 353017. Please try to restrict the request-period.</body></html>
It is really odd, because even a daily slice (using the -daily flag) is too large. If, in theory, all records had the same timestamp, there would be no way at all to retrieve them in a windowed fashion, which in turn means that the endpoint is not fully OAI compliant.
The next thing I would try is:
$ oaicrawl -verbose -f MARC21-xml https://services.dnb.de/oai/repository
We wrote oaicrawl for the zvdd.de OAI endpoint, which calls itself OAI despite being broken. oaicrawl is a much blunter tool: it fetches all identifiers (ListIdentifiers) and then requests the records one by one (GetRecord). Let's see what happens with DNB:
$ oaicrawl -verbose -f MARC21-xml https://services.dnb.de/oai/repository
FATA[2018-07-30T14:15:52+02:00] expected element type <OAI-PMH> but have <html>
Digging into it a bit more:
<title>Error</title>Your request matches to many records (>100000). The result size is 13413063. Please try to restrict the request-period.
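For readers unfamiliar with the two-step approach described above (ListIdentifiers first, then one GetRecord per identifier), here is a minimal Python sketch. It is not oaicrawl's actual code; the function names, the sample response and the example.org base URL are assumptions made for illustration. It also rejects a non-OAI root element, similar in spirit to the error shown above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
OAI_ROOT = "{http://www.openarchives.org/OAI/2.0/}OAI-PMH"


def extract_identifiers(response_xml):
    """Return all record identifiers from a ListIdentifiers response."""
    root = ET.fromstring(response_xml)
    if root.tag != OAI_ROOT:
        # E.g. the endpoint answered with an HTML error page.
        raise ValueError(f"expected <OAI-PMH> root element, got <{root.tag}>")
    return [
        el.text.strip()
        for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)
        if el.text
    ]


def get_record_url(base_url, identifier, metadata_prefix):
    """Build the GetRecord request URL for a single identifier."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Hypothetical, shortened ListIdentifiers response for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:1</identifier></header>
    <header><identifier>oai:example.org:2</identifier></header>
  </ListIdentifiers>
</OAI-PMH>"""

for identifier in extract_identifiers(SAMPLE):
    print(get_record_url("https://example.org/oai", identifier, "MARC21-xml"))
```

The sketch leaves out the HTTP calls and resumption-token handling, so it only shows how the identifier list turns into individual record requests.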
Now, let me rant a bit. Why does OAI have so-called "resumption tokens" at all? DataCite, BASE (Bielefeld) and other huge repositories work just fine when paging through the data (tens of millions of records) for days. This is a DNB problem, and it would be best if they used their own resources to solve it.
from metha.
I cannot define the concrete set over there, or?
Yes, oaicrawl was more of a one-shot for a particular endpoint and has a minimal feature set.
Thanks a lot for your feedback, I'll forward it to DNB somehow.
I can try to do the same.
Does this sound like a solution for you @miku ?
Yes, sure, this is an option. This is also a limitation of metha that I would like to get rid of one day (it was not essential for the use cases so far, so it is not implemented): it only has monthly and daily slices, not arbitrary precision.
Thanks a lot, @miku, for your very fast reply. I was also about to try oaicrawl for this, but then I thought that fetching this rather large authorities set one by one from DNB might be a bit too much, so I skipped this approach. Furthermore, as far as I understood the arguments of oaicrawl - I cannot define the concrete set over there, or?
Thanks a lot for your feedback, I'll forward it to DNB somehow.
For our concrete use case it would probably even be enough to get the data excerpt from "Sächsische Bibliographie" via SRU. Then I "only" need to be able to define the appropriate CQL query (which is a bit beyond my knowledge so far).
While drafting an answer to DNB and reading their OAI docs again, I came across a possible solution:
Since the request returns a 413, which is a standard HTTP status code from RFC 7231, one can make use of this information and reduce the interval from daily to e.g. hourly in such cases (which requires setting both parameters, from and until, in the request).
Does this sound like a solution for you @miku ?
PS: the DNB OAI documentation also says "Depending on the OAI repository these can be either defined to the day (YYYY-MM-DD) or to the second (YYYY-MM-DDThh:mm:ssZ)", so working with hourly slices might be possible.
curl -vL "https://services.dnb.de/oai/repository?from=2008-04-05T13:00:00Z&until=2008-04-05T14:00:00Z&metadataPrefix=MARC21-xml&set=authorities:person&verb=ListRecords"
delivers at least some results (incl. a resumption token)
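As an illustration of the hourly slicing idea, the following Python sketch generates the 24 from/until pairs for one day and builds the matching ListRecords URLs. The helper names are hypothetical and this is not how metha is implemented. Since from and until are inclusive in OAI-PMH, the sketch ends each window at hh:59:59 so that neighbouring windows do not overlap, mirroring the 00:00:00 to 23:59:59 daily request shown earlier.

```python
from datetime import date, datetime, timedelta, timezone
from urllib.parse import urlencode

OAI_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # second granularity, UTC


def hourly_windows(day):
    """Yield 24 (start, end) datetime pairs covering one UTC day."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    for hour in range(24):
        lo = start + timedelta(hours=hour)
        hi = lo + timedelta(hours=1) - timedelta(seconds=1)
        yield lo, hi


def list_records_url(base_url, lo, hi, metadata_prefix, set_spec):
    """Build a ListRecords URL restricted to the window [lo, hi]."""
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
        "from": lo.strftime(OAI_TIME_FORMAT),
        "until": hi.strftime(OAI_TIME_FORMAT),
    })
    return f"{base_url}?{query}"


for lo, hi in hourly_windows(date(2008, 4, 5)):
    print(list_records_url(
        "https://services.dnb.de/oai/repository",
        lo, hi, "MARC21-xml", "authorities:person",
    ))
```

Each printed URL would still need to be followed through its resumption tokens, which the sketch does not cover.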
OK, we've sent a request to DNB asking whether they can increase the result size limit. On the other hand, we would appreciate it if you could implement the proposed fallback: when a 413 is returned, temporarily decrease the interval to hourly (and then go back to daily).
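To make the requested fallback concrete, here is a rough Python sketch of the control flow, under the assumption that the fetching code signals an HTTP 413 by raising an exception. `TooManyRecords`, `harvest_day` and the `fetch` callback are hypothetical names, not part of metha.

```python
from datetime import datetime, timedelta, timezone


class TooManyRecords(Exception):
    """Hypothetical: raised by `fetch` when the server answers with HTTP 413."""


def harvest_day(day_start, fetch):
    """Fetch one day in a single request; on a 413 retry it in hourly slices.

    `fetch(lo, hi)` is a caller-supplied function that requests the
    inclusive window [lo, hi] and returns whatever it harvested.
    """
    day_end = day_start + timedelta(days=1) - timedelta(seconds=1)
    try:
        return [fetch(day_start, day_end)]
    except TooManyRecords:
        results = []
        for hour in range(24):
            lo = day_start + timedelta(hours=hour)
            hi = lo + timedelta(hours=1) - timedelta(seconds=1)
            # A 413 on an hourly slice is not handled here and propagates.
            results.append(fetch(lo, hi))
        return results
```

Because every call starts with the daily attempt, the next day is tried as a daily slice again, which corresponds to "go back to daily" in the request above.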