crossref / cayenne
MOVED to https://gitlab.com/crossref/rest_api
Home Page: https://gitlab.com/crossref/rest_api
License: MIT License
We need a method for getting a list of Crossref Member IDs for all Sponsoring Organizations (so this is the CS assigned number that we use for Prep for example). Is there another way to get this info other than single api queries for each account name one at a time? Do we have a list of account names and member IDs somewhere that I could do a lookup on or somesuch?
A metadata user reported finding "empty reference objects" for a number of DOIs. Based on the example he provided, it looks like those correspond to references that were deposited without keys.
For example, metadata for 10.1186/2008-2231-20-88 was last updated with submission 1355381104 which included these among its references.
<citation key="10.1186/2008-2231-20-88-B20">
<doi>10.1186/1472-698X-5-5</doi>
</citation>
<citation key="-">
<doi>10.1108/09526860510612207</doi>
</citation>
<citation key="-">
<doi>10.3923/ijp.2012.586.589</doi>
</citation>
<citation key="-">
<doi>10.1016/0749-5978(91)90020-T</doi>
</citation>
<citation key="-">
<doi>10.1348/135910705X66043</doi>
</citation>
<citation key="10.1186/2008-2231-20-88-B31">
<doi>10.1111/j.1365-2753.2011.01690.x</doi>
</citation>
The results in the REST API here show all the references with citation key="-" as empty objects:
http://api.crossref.org/works/10.1186/2008-2231-20-88
{
"key": "10.1186/2008-2231-20-88-B20",
"DOI": "10.1186/1472-698X-5-5",
"doi-asserted-by": "publisher"
},
{},
{},
{},
{},
{
"key": "10.1186/2008-2231-20-88-B31",
"DOI": "10.1111/j.1365-2753.2011.01690.x",
"doi-asserted-by": "publisher"
screenshots of the above also attached.
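A minimal sketch of the mapping one would expect instead, treating key="-" as an absent key while preserving the DOI (the function name and output shape are illustrative, not Cayenne's actual code):

```python
import xml.etree.ElementTree as ET

def citation_to_reference(citation_xml):
    """Convert a deposited <citation> element into a REST API reference
    object, treating a key of "-" as absent rather than dropping everything."""
    elem = ET.fromstring(citation_xml)
    ref = {}
    key = elem.get("key")
    if key and key != "-":
        ref["key"] = key
    doi = elem.findtext("doi")
    if doi:
        ref["DOI"] = doi
        ref["doi-asserted-by"] = "publisher"
    return ref
```

With this approach the key="-" citations above would still surface their DOIs instead of serializing as {}.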
There is an 'edited-book' type which I assume corresponds with the book type 'edited book' in the input schema, but books registered with that type are appearing in the JSON output as 'book', for example:
http://api.crossref.org/works/10.1108/S2044-9968(2013)6_Part_F
We should create a JSON schema that validates the response. This can be run against production deployment as a sanity check.
Do not do this automatically. It might generate race conditions if there are replicas.
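As a sketch of what such a check could look like, here is a tiny hand-rolled subset of JSON Schema validation (a real implementation would use a full validator library; the schema fields shown are assumptions about the response envelope):

```python
def validate(schema, doc):
    """Check `doc` against a tiny subset of JSON Schema:
    'type', 'required' and 'properties' only."""
    t = schema.get("type")
    if t == "object":
        if not isinstance(doc, dict):
            return False
        for key in schema.get("required", []):
            if key not in doc:
                return False
        for key, sub in schema.get("properties", {}).items():
            if key in doc and not validate(sub, doc[key]):
                return False
        return True
    if t == "string":
        return isinstance(doc, str)
    return True  # unconstrained

# Hypothetical top-level schema for a works response.
WORK_RESPONSE_SCHEMA = {
    "type": "object",
    "required": ["status", "message-type", "message-version", "message"],
    "properties": {
        "status": {"type": "string"},
        "message-type": {"type": "string"},
    },
}
```

This would be run on demand against a production response, never automatically, per the note above.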
The tests are currently uncategorised. Some of them make external requests and it isn't clear which ones can be run in CI and which ones are for manual use only.
Label tests into one of:
unit - Tests that involve only executing code, no external services.
component - Tests that exercise a component or service. These may involve spinning up an API server, but no external network access is required, no dependencies on e.g. Elastic Search.
integration - Tests that involve external services, such as Elastic Search. These will all be contained in the Docker environment, however, and require no external network access.
manual-live - Tests that compare the behaviour to the extant API. These are only run manually during feature development.
Update README to indicate how to run them using Docker Compose.
We have two kinds of test data. The ‘corpus’ data is large but randomly chosen. The regression data is diverse, specific but out of date. Locate metadata that covers a selection of content types to give a reliable cross-section of our features.
Definition of done:
Java 11's module system no longer includes javax, which means some dependencies are missing.
Get documentation in a state where it contains all existing knowledge, is correct, and can support and be fully included in feature development. Remove doubt from the code.
Existing documentation may concern:
Definition of done:
DRAFT
Need to work out which commands must be run to bring up the full system in all its various roles: Docker service commands and configuration, changing the HAProxy endpoint, and updating the pusher. We also need to take down the cayenne staging instance, clear out that Elasticsearch machine, and reuse the elastic service, assuming we get it deployed into Docker.
Elastic allows us to store data in a field without indexing it, using the 'enabled' keyword. This was used for storing coverage information; however, the way it was configured didn't work, which caused coverage to be missing and tests to break.
The mappings._doc._all.enabled path was used in the mapping spec, when the correct one should be mappings._doc.enabled. The fix is a simple change in the mapping, plus error handling.
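Per the corrected path described above, the mapping fragment would look something like this (a sketch; surrounding mapping detail is omitted):

```json
{
  "mappings": {
    "_doc": {
      "enabled": false
    }
  }
}
```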
When DOI metadata is indexed a publisher name is associated with each solr record. Need a way to quickly re-index a member's metadata after a member name change.
https://api.crossref.org/prefixes/10.1016/works?filter=from-pub-date:2010-01,until-pub-date:2010-01
results in 55471 records
http://api-es-staging.crossref.org/prefixes/10.1016/works?filter=from-pub-date:2010-01,until-pub-date:2010-01
results in 950380
Doing the same queries with 2016 instead of 2010 results in different numbers, and the second record in my result shows that the publication date is 2016-02-01
http://api-es-staging.crossref.org/prefixes/10.1016/works?filter=from-pub-date:2016-01,until-pub-date:2016-01
The second record is
"DOI": "10.1016/j.jmii.2013.01.003"
I know the number of fields can be increased, but I'm concerned that there are more than 1000 fields in the funder index, as funders shouldn't have that many.
I suspect they're being generated incorrectly.
This happens with the live RDF data, which can be fetched using the default configuration.
java.lang.IllegalArgumentException: Limit of total fields [1000] in index [funder] has been exceeded
at org.elasticsearch.index.mapper.MapperService.checkTotalFieldsLimit(MapperService.java:626) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:450) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:353) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:285) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:313) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:230) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:643) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:273) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) ~[elasticsearch-6.2.4.jar:6.2.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Remove old branches, merge those that need merging, delete merged and unused ones. Remaining should be master, elastic-2018 and any in-progress branches.
Definition of done:
Integration tests need to be able to index a chunk of XML into the Elastic index. Currently this is done by putting into the feed directory and waiting for indexing to complete. It would be better to ingest a given chunk of files synchronously and explicitly in the test. This will pave the way for more diverse and atomic integration tests.
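A sketch of the shape this could take in a test (the in-memory index and function names are illustrative assumptions, not Cayenne's API):

```python
class InMemoryIndex:
    """Stand-in for the Elastic index, just enough for the sketch."""
    def __init__(self):
        self.docs = []
        self.visible = 0  # documents visible to searches

    def add(self, doc):
        self.docs.append(doc)

    def refresh(self):
        self.visible = len(self.docs)

def ingest_chunks(index, chunks, parse):
    """Index each XML chunk explicitly and refresh, so tests can assert
    immediately instead of polling a feed directory."""
    for chunk in chunks:
        index.add(parse(chunk))
    index.refresh()
    return index.visible
```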
Hi,
I hope this is the right repository to report such issues, if not please guide me appropriately.
We are using CrossRef for a lot of fetching functionalities inside JabRef.
In the last days (and in general) the API has led to a few problems.
For example:
http://api.crossref.org/works?query.title=A+break+in+the+clouds%3A+towards+a+cloud+definition&rows=20&offset=0
randomly responds with status 500 and this response body.
{"status":"error","message-type":"exception","message-version":"1.0.0","message":{"name":"class java.lang.RuntimeException","description":"java.lang.RuntimeException: Solr returned a partial result set","message":"Solr returned a partial result set","cause":null}}
Maybe this helps in tracking down some problems with the API.
Best regards, Stefan
Coverage is expressed as a float (i.e. a proportion), but the default zero value is an integer zero. For consistency, the zero should probably be 0.0.
There's no filter for subtitle, and at least one person would find it useful.
Would be an easy way to detect typos and short-circuit filters, etc.
The line:
:deposits-articles (or (> (get-in coverage-doc [:coverage :all :journal-article :_count]) 0) false)}}
in src/cayenne/data/coverage.clj
is causing an NPE, because > is sometimes given a nil.
I switched around the or and the > to get values to compare against, for the time being in the staging instance:
:deposits-articles (> (or (get-in coverage-doc [:coverage :all :journal-article :_count]) 0) 0)}}
The test-data directory contains 56 XML files which are not referenced anywhere else in the code. These were clearly used for manual regression testing. Having the files around with no explanation is confusing, could be misleading, and they may become irrelevant (or may already be).
Ideal case is to use each file in a regression test of some kind. Some of these will be suitable for unit tests, some may concern the feed process and more of an integration test will be useful.
If these can't be put to good use, delete them.
The CS currently has a method for manually scanning the SOLR index for missed DOIs. This will need to be updated, and the process reviewed, for the new Elastic Search version.
There should be a health check, possibly in the /heartbeat route, that can do some lightweight checks against the data. These can be checked during a re-index to sanity-check the deployment.
E.g.
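For example, a lightweight check could look like this minimal sketch (check names, inputs and thresholds are assumptions, not Cayenne's actual implementation):

```python
def heartbeat_checks(index_doc_count, known_doi_found):
    """Run lightweight data sanity checks and shape the result for
    a /heartbeat-style response."""
    checks = {
        "index-not-empty": index_doc_count > 0,
        "known-doi-present": bool(known_doi_found),
    }
    status = "ok" if all(checks.values()) else "fail"
    return {"status": status, "checks": checks}
```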
The docker-compose.yml file uses docker.elastic.co/elasticsearch/elasticsearch and then disables the commercial bits, like xpack. This was a hacky workaround to the fact that Elastic Search refused to distribute an open source docker image. Subsequently the OSS version has been made available, and it's used in Event Data: docker.elastic.co/elasticsearch/elasticsearch-oss
Update the docker-compose file.
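A hedged sketch of the updated service entry (the tag mirrors the 6.2.4 version seen in the stack trace elsewhere in these issues; the tag and settings are assumptions):

```yaml
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch-oss:6.2.4
    environment:
      - discovery.type=single-node
```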
The test code (in the Clojure project) in the current Elastic Search branch automates Docker by spinning up containers. The code runs outside Docker. This will make it tricky to include in CI. It also means that the code being tested isn't running in the target environment.
To bring the methodology into line with how we do Docker, all code should be run in the Docker container, managed with Docker Compose. The Clojure code should have no knowledge of Docker.
If the document set changes in the duration of a cursor session, are new results included or is the result set effectively frozen for the duration?
The ResourceSync draft specification assumes that the result set is frozen. If we rely on this, we should demonstrate it in an integration test.
Also, this will surely happen in real usage, so we should be able to say for sure what the behaviour is.
The fix for the cutoff date in 49b4d51 wasn't merged. This means that the cutoff date is wrong.
IOP is having some more concerns about DOIs related to that hugely hyped black hole picture that was all over the place recently. The BibTeX citation formatting isn’t working for them.
10.3847/2041-8213/ab0ec7
10.3847/2041-8213/ab0c96
10.3847/2041-8213/ab0c57
10.3847/2041-8213/ab0e85
10.3847/2041-8213/ab0f43
10.3847/2041-8213/ab1141
The problem is that the citation formatting service doesn’t work for them. If you ask for BibTeX from it, you get a blank response
curl -LH "Accept: text/bibliography; style=bibtex" http://dx.doi.org/10.3847/2041-8213/ab0c96
and if you ask for some other formats, you get an error message: curl -LH "Accept: text/bibliography; style=harvard1" http://dx.doi.org/10.3847/2041-8213/ab0c96 returns a Java stack trace.
There’s a ‘BibTeX’ button on the article home page on IOPscience that tries to use this functionality, so it’s broken for these articles.
I can also confirm this by using search.crossref.org and trying Actions>Cite for the above DOIs.
We've had interest from some research institutions/research tracking systems about being able to query the API by first or additional author, rather than just 'author' (we have those fields in the metadata).
This is useful for institutions who want to track their research outputs based on the lead author on a piece of content.
When looking up either journals or the articles contained within a journal, only an ISSN can be used; however, there are journals lacking ISSNs. Within the system, those without ISSNs must have a DOI, though there are some older ones that have neither, which we should ignore.
To find the ones without an ISSN we should implement both a journals route and a container-doi for articles.
Will require a schema change and reindex.
cayenne.elastic.mappings/create-indexes expects all indexes not to be present and will fail if they already exist. Sometimes we want to manually delete one or more indexes and re-create them by running create-indexes.
Update create-indexes to accept pre-existing indexes, and create only those that don't exist.
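The desired idempotent behaviour can be sketched as follows (the client interface and names are illustrative assumptions, not the actual Elastic client API):

```python
class FakeIndexClient:
    """Stand-in for an index client; just enough for the sketch."""
    def __init__(self, existing=()):
        self.indexes = set(existing)

    def exists(self, name):
        return name in self.indexes

    def create(self, name, spec):
        self.indexes.add(name)

def create_indexes(client, index_specs):
    """Create each index only if it doesn't already exist, so the call
    is safe to re-run after manually deleting some indexes."""
    created = []
    for name, spec in index_specs.items():
        if not client.exists(name):
            client.create(name, spec)
            created.append(name)
    return created
```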
We should publish a specification for what happens to DOIs when they're deleted (i.e. aliased), if / how that is propagated into the REST API, and how that is represented. Currently deleted DOIs don't get included in a clean rebuild.
Expose cursor information via heartbeat for health checks.
This information is available at /_nodes/stats/indices?filter_path=**.open_contexts
Cayenne can load data in a variety of formats. Documentation on ingestion is thin and the code is quite abstract.
Objectives:
This will then allow #59 to proceed and we can account for (or delete) each of the unused test files in the repo.
We will be including the new Grant ID metadata in the REST API output, here's my first take on what that will look like:
Note that each grant has (potentially) several 'projects' which contain the bulk of the grant metadata.
The message-version in API responses is hard-coded to "1.0.0".
https://github.com/CrossRef/cayenne/blob/master/src/cayenne/version.clj#L3
Now that we're going to start versioning Cayenne in earnest, probably with Semantic Versioning, we can use this version number for the message-version. The format of the data is steadily changing as we add new features anyway, and this will make it clearer to consumers.
Some tests were given the same name and ended up not running. This let some regressions slip, such as #75.
Rename the tests and fix whatever needs fixing to make them work.
Make sure pagination works to our satisfaction.
Definition of done:
cayenne.api.v1.schema contains a schema for the response types. New changes to the API may or may not still be compatible. We should check that everything is still compatible after the Participation Reports work, and make sure there's a regression test.
The file available at http://ftp.crossref.org/titlelist/titleFile.csv has changed its headers. These are hard-coded to expect a certain ordering.
The format was most recently changed with Jira CS-3961 (https://jira.crossref.org/jira/browse/CS-3961) and updated in Cayenne master in commit 349e12c452512c8f0906a5c816e7a9c599419b70, but this wasn't merged into the elastic-2019 branch.
To fix this, and prevent future brittleness, make column identification dynamic.
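Dynamic column identification can be as simple as keying rows by header name rather than position; a minimal sketch (the header names used below are assumptions for illustration, not the real titleFile.csv headers):

```python
import csv
import io

def rows_by_header(csv_text):
    """Parse CSV rows keyed by header name, so a reordering of the
    columns in the file doesn't break parsing."""
    return list(csv.DictReader(io.StringIO(csv_text)))
```

The same data with its columns reordered parses to identical rows.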
The Funding Information model seems to work but there are questions about its suitability. Decide on the criteria for success, then make sure it does what we want.
Definition of done:
Currently hard-coded, leading to issues like #45. The order of columns should be based on the headers, which are present in the file. A test case should be provided too.
Current integration and component tests start an HTTP server and make requests via the network. It's more efficient and less error-prone to go straight to the route functions rather than starting an HTTP server. This also allows unit tests on the API.
By taking the network request out of the default test request function in cayenne.api-fixture/api-get, we can make most tests run without the network stack. Those that genuinely need to test the network can still use the old method.
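The idea of calling route handlers directly can be sketched like this (the request and handler shapes are assumptions in a Ring-like style, not the actual fixture code):

```python
def api_get(handler, path, params=None):
    """Call a route handler function directly, bypassing the HTTP server."""
    request = {"request-method": "get", "uri": path, "params": params or {}}
    return handler(request)

def heartbeat_handler(request):
    # Example handler: a trivial route function used directly in a test.
    return {"status": 200, "body": {"uri": request["uri"]}}
```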
Currently feed (index) concurrency is set to the number of processors minus one. Because feed is used in integration tests, in which clarity is more important than speed, move this into the configuration system with the same default, but allow tests to set it differently.
The sharding and replication config is hard-coded in the cayenne.elastic.mappings namespace. We need to decide on sensible values, and whether they should be configurable.
Tests that rely on snapshots have an implicit coupling to the time at which the snapshots were taken and when the tests were run. Coverage is broken down into 'current' and 'backfile', and if the tests are run a year later than when the snapshot was taken, some works will be classified differently.
Update test fixture to freeze time at a known point.
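A frozen-clock fixture can be sketched as follows (the classification rule and two-year window are illustrative assumptions, not the actual coverage definition):

```python
import datetime

class FrozenClock:
    """Test fixture: a clock frozen at a known instant, so the
    current/backfile split doesn't drift with the real date."""
    def __init__(self, frozen_at):
        self.frozen_at = frozen_at

    def now(self):
        return self.frozen_at

def is_backfile(pub_year, clock, window_years=2):
    # Works older than the window are 'backfile'; newer are 'current'.
    return clock.now().year - pub_year > window_years
```

Running the same assertions a year later still passes, because the clock never moves.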
We have 'other' as a type, but it includes both book and book child (usually chapter). Other book / book child types are split out individually; these should be separated into distinct types as well.
For example - this is registered as a book with book_type 'other':
http://api.crossref.org/works/10.5555/suffixtest
This is registered as a content_item (child of 'book') with component_type 'other' and also has type 'other' in the JSON:
Content negotiation for https://doi.org/10.4414/smw.2018.14628 doesn't work. The DOI proxy should redirect to a service that can provide a response (e.g. data.crossref.org)
curl -vLH "Accept: application/rdf+xml" https://doi.org/10.4414/smw.2018.14628
< HTTP/1.1 303 See Other
< Server: Apache-Coyote/1.1
< Location: https://doi.emh.ch/smw.2018.14628
Instead, it redirects to the DOI's landing page. This may be caused by a bug in the DOI proxy, or missing data.
The DOI appears to be a Crossref one:
http://api.crossref.org/works/10.4414/smw.2018.14628/agency
This behaviour happens with other DOIs with the same prefix, e.g.: