
cayenne's People

Contributors

afandian, ckoscher, gbilder, kjw, markwoodhall, mikeyalter, mochajh, soli


cayenne's Issues

Query / endpoint to retrieve sponsoring organisations

We need a method for getting a list of Crossref Member IDs for all Sponsoring Organizations (so this is the CS-assigned number that we use for Prep, for example). Is there another way to get this info other than single API queries for each account name, one at a time? Do we have a list of account names and member IDs somewhere that I could do a lookup on?

References deposited without "keys" appear as empty brackets in API results

A metadata user reported finding "empty reference objects" for a number of DOIs. Based on the example he provided, it looks like those correspond to references that were deposited without keys.

For example, metadata for 10.1186/2008-2231-20-88 was last updated with submission 1355381104 which included these among its references.

<citation key="10.1186/2008-2231-20-88-B20">
<doi>10.1186/1472-698X-5-5</doi>
</citation>
<citation key="-">
<doi>10.1108/09526860510612207</doi>
</citation>
<citation key="-">
<doi>10.3923/ijp.2012.586.589</doi>
</citation>
<citation key="-">
<doi>10.1016/0749-5978(91)90020-T</doi>
</citation>
<citation key="-">
<doi>10.1348/135910705X66043</doi>
</citation>
<citation key="10.1186/2008-2231-20-88-B31">
<doi>10.1111/j.1365-2753.2011.01690.x</doi>
</citation>

The results in the REST API here show all the references with citation key="-" as empty brackets
http://api.crossref.org/works/10.1186/2008-2231-20-88

{
  "key": "10.1186/2008-2231-20-88-B20",
  "DOI": "10.1186/1472-698X-5-5",
  "doi-asserted-by": "publisher"
},
{},
{},
{},
{},
{
  "key": "10.1186/2008-2231-20-88-B31",
  "DOI": "10.1111/j.1365-2753.2011.01690.x",
  "doi-asserted-by": "publisher"
}

Screenshots of the above are also attached.

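For downstream consumers hit by this, a minimal sketch of dropping the empty objects from the response; the body below is abridged from the API output shown above:

```python
import json

# Abridged response from api.crossref.org/works/10.1186/2008-2231-20-88:
# references deposited with key="-" come back as empty objects.
response_body = """
{"message": {"reference": [
  {"key": "10.1186/2008-2231-20-88-B20", "DOI": "10.1186/1472-698X-5-5"},
  {}, {}, {}, {},
  {"key": "10.1186/2008-2231-20-88-B31", "DOI": "10.1111/j.1365-2753.2011.01690.x"}
]}}
"""

refs = json.loads(response_body)["message"]["reference"]
non_empty = [r for r in refs if r]  # {} is falsy, so empty objects drop out
print(len(non_empty))  # 2
```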

Tests should be categorised into unit tests, integration tests, and others.

The tests are currently uncategorised. Some of them make external requests and it isn't clear which ones can be run in CI and which ones are for manual use only.

Label tests into one of:

  • unit - Tests that involve only executing code, no external services.
  • component - Tests that exercise a component or service. These may involve spinning up an API server, but no external network access is required, no dependencies on e.g. Elastic Search.
  • integration - Tests that involve external services, such as Elastic Search. These will all be contained in the Docker environment, however, and require no external network access.
  • manual-live - Tests that compare the behaviour to the extant API. These are only run manually during feature development.

Update README to indicate how to run them using Docker Compose.
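One possible way to express these categories, assuming the project uses Leiningen, is test selectors in project.clj. This is a sketch, not the project's current configuration:

```clojure
;; Sketch: Leiningen test selectors matching the categories above.
;; Tests opt in via metadata, e.g. (deftest ^:unit parse-date-test ...).
:test-selectors {:unit        :unit
                 :component   :component
                 :integration :integration
                 :manual-live :manual-live
                 ;; by default run everything except manual-live tests
                 :default     (complement :manual-live)}
```

A category is then run with e.g. `lein test :integration`, which is straightforward to invoke from a CI step or Docker Compose service.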

Sample Data for ingestion

We have two kinds of test data. The ‘corpus’ data is large but randomly chosen. The regression data is diverse and specific, but out of date. Locate metadata that covers a selection of content types to give a reliable cross-section of our features.

Definition of done:

  • All of the work types enumerated in documentation. e.g. about 15 types.
  • All of the supporting input types.
  • For each, there is a good number (e.g. 100) of works that implement relevant features including any quirks.
  • Tests parse the data, and it has been manually verified that each type is correctly represented in the regression test suite.

Internal documentation

Get documentation in a state where it contains all existing knowledge, is correct, and can support and be fully included in feature development. Remove doubt from the code.

Existing documentation may concern:

  • Development
  • Testing
  • Deployment & Operations

Definition of done:

  • One place for public docs, one place for development docs. Others removed.
  • Data flows upstream and downstream of the REST API are documented, and product managers for those identified. This will include internal APIs and services.
  • There are no outstanding TODOs, all have been removed and / or turned into issues.

Prepare elastic staging instance

DRAFT

Need to work out which commands must be run to bring up the full system with all of its various roles: Docker service commands and configuration, changing the HAProxy endpoint, and updating the pusher. We also need to take down the cayenne staging instance, clear out that Elasticsearch machine, and reuse it for the Elastic service, assuming we get it deployed into Docker.

Disabled auto-index mappings not properly configured

Elastic allows us to store data in a field without indexing it, using the 'enabled' keyword. This was used for storing coverage information. However, the way it was configured didn't work. This caused coverage to be missing and tests to break.

  • The mappings._doc._all.enabled path was used in the mapping spec, when the correct path is mappings._doc.enabled.
  • Without this being applied, the data being stored was too large (too many fields) and Elastic was erroring.
  • There was no error handling on the coverage generation process, meaning the error that Elastic returned was ignored.
  • Integration tests were given conflicting names, meaning the tests that showed this weren't running.

Fix is a simple change in the mapping, plus error handling.
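A sketch of the corrected mapping fragment, following the path named above (Elastic 6.x syntax; the surrounding index settings are omitted):

```json
{
  "mappings": {
    "_doc": {
      "enabled": false
    }
  }
}
```

With enabled set to false at this level, documents are stored but none of their fields are indexed, which is what the coverage data needs and avoids the too-many-fields error.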

Simple way to re-index publisher names

When DOI metadata is indexed, a publisher name is associated with each Solr record. We need a way to quickly re-index a member's metadata after a member name change.

Date filters don't seem to limit much

https://api.crossref.org/prefixes/10.1016/works?filter=from-pub-date:2010-01,until-pub-date:2010-01
results in 55471 records

http://api-es-staging.crossref.org/prefixes/10.1016/works?filter=from-pub-date:2010-01,until-pub-date:2010-01
results in 950380 records

Doing the same queries with 2016 instead of 2010 results in different numbers, and the second record in my result shows that the publication date is 2016-02-01
http://api-es-staging.crossref.org/prefixes/10.1016/works?filter=from-pub-date:2016-01,until-pub-date:2016-01
The second record is
"DOI": "10.1016/j.jmii.2013.01.003"
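A toy model of the intended filter semantics, useful as a basis for an integration test; the inclusive-boundary interpretation of from-pub-date/until-pub-date is an assumption:

```python
from datetime import date

# A work published 2016-02-01 should NOT match
# filter=from-pub-date:2016-01,until-pub-date:2016-01.
def in_range(pub, from_d, until_d):
    return from_d <= pub <= until_d

print(in_range(date(2016, 2, 1), date(2016, 1, 1), date(2016, 1, 31)))   # False
print(in_range(date(2016, 1, 15), date(2016, 1, 1), date(2016, 1, 31)))  # True
```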

funder indexing causes elastic to fail with too many fields

I know the number of fields can be increased, but I'm concerned that there are more than 1000 fields in the funder index, as funders shouldn't have that many.
I suspect they're being generated incorrectly.
This happens with the live RDF data, which is retrieved by the default configuration.

java.lang.IllegalArgumentException: Limit of total fields [1000] in index [funder] has been exceeded
at org.elasticsearch.index.mapper.MapperService.checkTotalFieldsLimit(MapperService.java:626) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:450) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:353) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:285) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:313) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:230) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:643) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:273) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) ~[elasticsearch-6.2.4.jar:6.2.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
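A hypothetical diagnostic for this kind of mapping explosion: count the distinct field paths a batch of documents would create, to see which dynamic keys are generating fields. The sample funder documents below are invented for illustration:

```python
# Collect every field path (including nested ones) that a set of documents
# would add to an Elastic mapping. Dynamic keys such as per-token maps are
# the usual cause of hitting the 1000-field limit.
def field_paths(doc, prefix=""):
    paths = set()
    for key, value in doc.items():
        path = prefix + key
        paths.add(path)
        if isinstance(value, dict):
            paths |= field_paths(value, path + ".")
    return paths

docs = [
    {"primary-name": "NSF", "tokens": {"nsf": 1, "national": 1}},
    {"primary-name": "NIH", "tokens": {"nih": 1, "national": 1}},
]
all_paths = set()
for d in docs:
    all_paths |= field_paths(d)
print(sorted(all_paths))
# ['primary-name', 'tokens', 'tokens.national', 'tokens.nih', 'tokens.nsf']
```

If the path count grows with the number of documents, the mapping (or the document shape) needs fixing rather than the field limit needing raising.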

Tidy up branches

Remove old branches, merge those that need merging, delete merged and unused ones. Remaining should be master, elastic-2018 and any in-progress branches.

Definition of done:

  • All branches merged / deleted.
  • Clear where to start when working on a new feature.

Synchronous indexing for tests

Integration tests need to be able to index a chunk of XML into the Elastic index. Currently this is done by putting into the feed directory and waiting for indexing to complete. It would be better to ingest a given chunk of files synchronously and explicitly in the test. This will pave the way for more diverse and atomic integration tests.

Various problems accessing the CrossRef API

Hi,

I hope this is the right repository to report such issues, if not please guide me appropriately.

We are using CrossRef for a lot of fetching functionalities inside JabRef.
In the last days (and in general) the API lead to a few problems.

  1. In general the API is often unavailable serving 503 and 504 HTTP status codes
  2. The API randomly responds with a 500 HTTP error code

For example:

http://api.crossref.org/works?query.title=A+break+in+the+clouds%3A+towards+a+cloud+definition&rows=20&offset=0

randomly responds with status 500 and this response body.

{"status":"error","message-type":"exception","message-version":"1.0.0","message":{"name":"class java.lang.RuntimeException","description":"java.lang.RuntimeException: Solr returned a partial result set","message":"Solr returned a partial result set","cause":null}}

Maybe this helps tracking down some problems with the API.

Best regards, Stefan

Coverage zeroes should be floats

Coverage is expressed as a float (i.e. a proportion), but the default zero value is an integer zero. For consistency, the zero should probably be 0.0.
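The inconsistency is visible in the serialised output; a quick illustration:

```python
import json

# An integer zero and a float zero serialise differently, so consumers
# parsing coverage values see a mix of types.
print(json.dumps({"coverage": 0}))    # {"coverage": 0}
print(json.dumps({"coverage": 0.0}))  # {"coverage": 0.0}
```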

NPE in /members route

The line:
:deposits-articles (or (> (get-in coverage-doc [:coverage :all :journal-article :_count]) 0) false)}}
in src/cayenne/data/coverage.clj
is causing an NPE, as the > is sometimes given a nil.
As a stopgap on the staging instance, I swapped the or and the > so there is always a value to compare against:
:deposits-articles (> (or (get-in coverage-doc [:coverage :all :journal-article :_count]) 0) 0)}}

Redundant test data - use it or lose it

The test-data directory contains 56 XML files which are not referenced anywhere else in the code. These were clearly used for manual regression testing. Having the files around with no explanation is confusing, could be misleading, and they may become irrelevant (or may already be).

Ideal case is to use each file in a regression test of some kind. Some of these will be suitable for unit tests; some may concern the feed process, where more of an integration test will be useful.

If these can't be put to good use, delete them.

Verification of coverage from CS

The CS currently has a method for manually scanning the SOLR index for missed DOIs. This will need to be updated, and the process reviewed, for the new Elastic Search version.

Lightweight health checks for data in index

There should be a health check, possibly in the /heartbeat route, that can do some lightweight checks against the data. These can be checked during a re-index to sanity-check the deployment.

E.g.

  • version of cayenne running indexer = version of cayenne running api
  • random work of type X has required fields
  • schema tests, per #30
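A sketch of what such checks might look like; the check names, inputs, and return shape are illustrative, not an existing cayenne interface:

```python
# Lightweight data health checks of the kind listed above, suitable for
# surfacing in a /heartbeat-style route during a re-index.
def heartbeat_checks(indexer_version, api_version, sample_work):
    return {
        "versions-match": indexer_version == api_version,
        "sample-has-doi": bool(sample_work.get("DOI")),
        "sample-has-type": bool(sample_work.get("type")),
    }

checks = heartbeat_checks(
    "3.8.0", "3.8.0",
    {"DOI": "10.5555/12345678", "type": "journal-article"})
print(all(checks.values()))  # True
```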

Docker Compose uses commercial Elastic Search docker image

The docker-compose.yml file uses docker.elastic.co/elasticsearch/elasticsearch and then disables the commercial bits, like xpack. This was a hacky workaround to the fact that Elastic Search refused to distribute an open source docker image. Subsequently the OSS version has been made available, and it's used in Event Data: docker.elastic.co/elasticsearch/elasticsearch-oss

Update docker-compose file.

Test infrastructure should be run *by* Docker Compose, not other way round.

The test code (in the Clojure project) in the current Elastic Search branch automates Docker by spinning up containers. The code runs outside Docker. This will make it tricky to include in CI. It also means that the code being tested isn't running in the target environment.

To bring the methodology into line with how we do Docker, all code should be run in the Docker container, managed with Docker Compose. The Clojure code should have no knowledge of Docker.

Test for scroll pagination - are results frozen?

If the document set changes in the duration of a cursor session, are new results included or is the result set effectively frozen for the duration?

The ResourceSync draft specification assumes that the result set is frozen. If we rely on this, we should demonstrate it in an integration test.

Also, this will surely happen in real usage, so we should be able to say for sure what the behaviour is.

Bibtex citation not working for some IOP DOIs

IOP is having some more concerns about DOIs related to that hugely hyped black hole picture that was all over the place recently. The BibTeX citation formatting isn’t working for them.
10.3847/2041-8213/ab0ec7
10.3847/2041-8213/ab0c96
10.3847/2041-8213/ab0c57
10.3847/2041-8213/ab0e85
10.3847/2041-8213/ab0f43
10.3847/2041-8213/ab1141

The problem is that the citation formatting service doesn’t work for them. If you ask for BibTeX from it, you get a blank response

curl -LH "Accept: text/bibliography; style=bibtex" http://dx.doi.org/10.3847/2041-8213/ab0c96

and if you ask for some other formats, you get an error message: curl -LH "Accept: text/bibliography; style=harvard1" http://dx.doi.org/10.3847/2041-8213/ab0c96 returns a Java stack trace.

There’s a ‘BibTeX’ button on the article home page on IOPscience that tries to use this functionality, so it’s broken for these articles.

I can also confirm this by using search.crossref.org and trying Actions>Cite for the above DOIs.

Ability to query by first or additional author

We've had interest from some research institutions/research tracking systems about being able to query the API by first or additional author, rather than just 'author' (we have those fields in the metadata).

This is useful for institutions who want to track their research outputs based on the lead author on a piece of content.

Journals only available via ISSN

When looking up either journals or the articles contained within a journal, only an ISSN can be used; however, there are journals lacking ISSNs. Within the system, those without ISSNs must have a DOI, though there are some older ones that have neither, which we should ignore.
To find the ones without an ISSN, we should support both lookup by journal DOI and a container DOI for articles.
This will require a schema change and reindex.

Elastic index creation fails if indexes already exist.

cayenne.elastic.mappings/create-indexes expects all indexes not to be present and will fail if they already exist. Sometimes we want to manually delete one or more index and re-create it by running create-indexes.

Update create-indexes to accept pre-existing indexes, and create only when they don't exist.
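A sketch of the desired idempotent behaviour; the helper and index names are illustrative, not cayenne's actual API:

```python
# Given the set of indexes already present, return only those that still
# need creating, so re-running creation after a manual delete is safe.
def indexes_to_create(existing, wanted):
    return [name for name in wanted if name not in existing]

print(indexes_to_create({"work", "member"}, ["work", "member", "funder"]))
# ['funder']
```

The actual fix would check existence via the Elastic client before each create call, rather than assuming a clean cluster.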

Document behaviour around deleted DOIs

We should publish a specification for what happens to DOIs when they're deleted (i.e. aliased), if / how that is propagated into the REST API, and how that is represented. Currently deleted DOIs don't get included in a clean rebuild.

Clarity on ingestion in all forms

Cayenne can load data in a variety of formats. Documentation on ingestion is thin and the code is quite abstract.

Objectives:

  • List in the docs of all data formats that can be ingested.
  • Instructions on how to load test data in manually.
  • Component tests that load data from the file system, but don't index into Elastic.

This will then allow #59 to proceed and we can account for (or delete) each of the unused test files in the repo.

Journals tests don't all run

Some tests were given the same name and ended up not running. This let some regressions slip, such as #75.

Rename the tests and fix whatever needs fixing to make them work.

Pagination

Make sure pagination works to our satisfaction.

Definition of done:

  • All resources that include pagination listed and documented, and the different sort orders available.
  • Implementation should allow that in all cases pagination allows paging through a whole, large result set (e.g. 1 million items), with any search ordering.
  • Integration tests include deep paging.
  • Pagination documented in user docs.
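A toy model of the cursor-style deep paging property the implementation should guarantee: each page resumes after the last item of the previous page, so arbitrarily deep pages cost the same as shallow ones:

```python
# Page through a sorted result set using the last-seen item as a cursor,
# rather than an offset.
def next_page(items, after, size):
    start = 0 if after is None else items.index(after) + 1
    return items[start:start + size]

items = list(range(10))
cursor, seen = None, []
while True:
    batch = next_page(items, cursor, 3)
    if not batch:
        break
    seen.extend(batch)
    cursor = batch[-1]
print(seen == items)  # True
```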

Title File loading is out of sync with upstream file

The file available at http://ftp.crossref.org/titlelist/titleFile.csv has changed its headers. The column positions are hard-coded to expect a certain ordering.

The format was most recently changed with [Jira CS-3961](https://jira.crossref.org/jira/browse/CS-3961) and updated in Cayenne master commit 349e12c452512c8f0906a5c816e7a9c599419b70, but this wasn't merged into the elastic-2019 branch.

To fix this, and prevent future brittleness, make column identification dynamic.

Funder information

The Funding Information model seems to work but there are questions about its suitability. Decide on the criteria for success, then make sure it does what we want.

Definition of done:

  • All features regarding funding information, both /funders route and /works route (and any others that come out of the woodwork) are documented.
  • All the features should be implemented and tested, including queries.
  • RDF relationships that are directly used should be documented in developer docs.

CSV column order should be dynamic

Currently hard-coded, leading to issues like #45 . The order of columns should be based on the headers, which are present in the file. A test case should be provided too.
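A sketch of header-based column resolution using Python's csv module; the column names and sample row are assumptions about the title file's shape:

```python
import csv
import io

# Resolve columns by header name instead of position, so upstream header
# reordering can't break parsing.
sample = "JournalTitle,ISSN,DOI\nExample Journal,1234-5678,10.5555/example\n"
for row in csv.DictReader(io.StringIO(sample)):
    print(row["ISSN"], row["DOI"])  # 1234-5678 10.5555/example
```

Reordering the columns in `sample` (and their values) leaves the lookups unchanged, which is the test case the issue asks for.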

Mock API responses in most integration tests.

Current integration and component tests start an HTTP server and make requests via the network. It's more efficient and less error-prone to go straight to the route functions rather than starting an HTTP server. This also allows unit tests on the API.

By taking the network request out of the default test request function in cayenne.api-fixture/api-get we can make most tests run without the network stack. Those that genuinely need to test the network can still use the old method.

Configurable feed concurrency

Currently feed (index) concurrency is set to the number of processors minus one. Because feed is used in integration tests, in which clarity is more important than speed, move this into the configuration system with the same default, but allow tests to set it differently.
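A sketch of the configuration lookup; the variable name FEED_CONCURRENCY is an assumption, not an existing setting:

```python
import os

# Read feed concurrency from configuration, defaulting to processors - 1
# (never below 1); tests can override it to 1 for deterministic ordering.
default = max(1, (os.cpu_count() or 2) - 1)
concurrency = int(os.environ.get("FEED_CONCURRENCY", default))
print(concurrency >= 1)  # True
```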

Production Elastic Configuration

The sharding and replication config is hard-coded in the cayenne.elastic.mappings namespace. We need to decide on sensible values, and whether they should be configurable.

Tests are time-dependent, but time isn't specified

Tests that rely on snapshots have an implicit coupling to the time at which the snapshots were taken and when the tests were run. Coverage is broken down into 'current' and 'backfile', and if the tests are run a year later than when the snapshot was taken, some works will be classified differently.

Update test fixture to freeze time at a known point.
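A sketch of the approach: make "now" an explicit input to the classification, so the fixture can freeze it. The two-year current/backfile boundary below is an assumption for illustration:

```python
from datetime import date

# Classify a work as backfile relative to an injected "now", instead of
# reading the system clock inside the classifier.
def is_backfile(pub_year, now):
    return pub_year <= now.year - 2

frozen_now = date(2019, 3, 28)  # the fixture pins this
print(is_backfile(2016, frozen_now))  # True
print(is_backfile(2018, frozen_now))  # False
```

Run a year later, the frozen `now` keeps both assertions stable, which is exactly the property the current snapshot-based tests lack.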

type 'other' includes both books and chapter / content items

We have 'other' as a type, but it includes both book and book child (usually chapter). Other book / book child types are split out individually; these should likewise be separated into distinct types.

For example - this is registered as a book with book_type 'other':
http://api.crossref.org/works/10.5555/suffixtest

This is registered as a content_item (child of 'book') with component_type 'other' and also has type 'other' in the JSON:

http://api.crossref.org/works/10.4337/9781781001639.00001

Inconsistent Content Negotiation response for 10.4414-prefixed DOI

Content negotiation for https://doi.org/10.4414/smw.2018.14628 doesn't work. The DOI proxy should redirect to a service that can provide a response (e.g. data.crossref.org)

curl -vLH "Accept: application/rdf+xml" https://doi.org/10.4414/smw.2018.14628

< HTTP/1.1 303 See Other
< Server: Apache-Coyote/1.1
< Location: https://doi.emh.ch/smw.2018.14628

Instead, it redirects to the DOI's landing page. This may be caused by a bug in the DOI proxy, or missing data.

The DOI appears to be a Crossref one:

http://api.crossref.org/works/10.4414/smw.2018.14628/agency

This behaviour happens with other DOIs with the same prefix, e.g.:

  • 10.4414/smf.2004.05202
  • 10.4414/smf.2004.05203
  • 10.4414/smf.2004.05201
