
cayenne's People

Contributors

afandian, ckoscher, gbilder, kjw, markwoodhall, mikeyalter, mochajh, soli


cayenne's Issues

Query / endpoint to retrieve sponsoring organisations

We need a method for getting a list of Crossref Member IDs for all Sponsoring Organizations (so this is the CS-assigned number that we use for Prep, for example). Is there another way to get this info other than single API queries for each account name, one at a time? Do we have a list of account names and member IDs somewhere that I could do a lookup on?

References deposited without "keys" appear as empty brackets in API results

A metadata user reported finding "empty reference objects" for a number of DOIs. Based on the example he provided, it looks like those correspond to references that were deposited without keys.

For example, metadata for 10.1186/2008-2231-20-88 was last updated with submission 1355381104 which included these among its references.

<citation key="10.1186/2008-2231-20-88-B20">
<doi>10.1186/1472-698X-5-5</doi>
</citation>
<citation key="-">
<doi>10.1108/09526860510612207</doi>
</citation>
<citation key="-">
<doi>10.3923/ijp.2012.586.589</doi>
</citation>
<citation key="-">
<doi>10.1016/0749-5978(91)90020-T</doi>
</citation>
<citation key="-">
<doi>10.1348/135910705X66043</doi>
</citation>
<citation key="10.1186/2008-2231-20-88-B31">
<doi>10.1111/j.1365-2753.2011.01690.x</doi>
</citation>

The results in the REST API here show all the references with citation key="-" as empty brackets
http://api.crossref.org/works/10.1186/2008-2231-20-88

{
  "key": "10.1186/2008-2231-20-88-B20",
  "DOI": "10.1186/1472-698X-5-5",
  "doi-asserted-by": "publisher"
},
{},
{},
{},
{},
{
  "key": "10.1186/2008-2231-20-88-B31",
  "DOI": "10.1111/j.1365-2753.2011.01690.x",
  "doi-asserted-by": "publisher"
}

Screenshots of the above are also attached.

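For downstream consumers hit by this, a minimal sketch of dropping the empty objects from the response; the body below is abridged from the API output shown above:

```python
import json

# Abridged response from api.crossref.org/works/10.1186/2008-2231-20-88:
# references deposited with key="-" come back as empty objects.
response_body = """
{"message": {"reference": [
  {"key": "10.1186/2008-2231-20-88-B20", "DOI": "10.1186/1472-698X-5-5"},
  {}, {}, {}, {},
  {"key": "10.1186/2008-2231-20-88-B31", "DOI": "10.1111/j.1365-2753.2011.01690.x"}
]}}
"""

refs = json.loads(response_body)["message"]["reference"]
non_empty = [r for r in refs if r]  # {} is falsy, so empty objects drop out
print(len(non_empty))  # 2
```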

Tests should be categorised into unit tests, integration tests, and others.

The tests are currently uncategorised. Some of them make external requests and it isn't clear which ones can be run in CI and which ones are for manual use only.

Label tests into one of:

  • unit - Tests that involve only executing code, no external services.
  • component - Tests that exercise a component or service. These may involve spinning up an API server, but no external network access is required, no dependencies on e.g. Elastic Search.
  • integration - Tests that involve external services, such as Elastic Search. These will all be contained in the Docker environment, however, and require no external network access.
  • manual-live - Tests that compare the behaviour to the extant API. These are only run manually during feature development.

Update README to indicate how to run them using Docker Compose.
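One possible way to express these categories, assuming the project uses Leiningen, is test selectors in project.clj. This is a sketch, not the project's current configuration:

```clojure
;; Sketch: Leiningen test selectors matching the categories above.
;; Tests opt in via metadata, e.g. (deftest ^:unit parse-date-test ...).
:test-selectors {:unit        :unit
                 :component   :component
                 :integration :integration
                 :manual-live :manual-live
                 ;; by default run everything except manual-live tests
                 :default     (complement :manual-live)}
```

A category is then run with e.g. `lein test :integration`, which is straightforward to invoke from a CI step or Docker Compose service.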

Sample Data for ingestion

We have two kinds of test data. The ‘corpus’ data is large but randomly chosen. The regression data is diverse and specific, but out of date. Locate metadata that covers a selection of content types to give a reliable cross-section of our features.

Definition of done:

  • All of the work types enumerated in documentation. e.g. about 15 types.
  • All of the supporting input types.
  • For each, there is a good number (e.g. 100) of works that implement relevant features including any quirks.
  • Tests parse the data, and it has been manually verified that each type is correctly represented in the regression test suite.

Internal documentation

Get documentation in a state where it contains all existing knowledge, is correct, and can support and be fully included in feature development. Remove doubt from the code.

Existing documentation may concern:

  • Development
  • Testing
  • Deployment & Operations

Definition of done:

  • One place for public docs, one place for development docs. Others removed.
  • Data flows upstream and downstream of the REST API are documented, and product managers for those identified. This will include internal APIs and services.
  • There are no outstanding TODOs, all have been removed and / or turned into issues.

Prepare elastic staging instance

DRAFT

Need to work out which commands must be run to bring up the full system with all of its various roles: Docker service commands and configuration, changing the HAProxy endpoint, and updating the pusher. We also need to take down the cayenne staging instance, clear out that Elasticsearch machine, and reuse it for the Elastic service, assuming we get it deployed into Docker.

Disabled auto-index mappings not properly configured

Elastic allows us to store data in a field without indexing it, using the 'enabled' keyword. This was used for storing coverage information. However, the way it was configured didn't work. This caused coverage to be missing and tests to break.

  • The mappings._doc._all.enabled path was used in the mapping spec, when the correct path is mappings._doc.enabled.
  • Without this being applied, the data being stored was too large (too many fields) and Elastic was erroring.
  • There was no error handling on the coverage generation process, meaning the error that Elastic returned was ignored.
  • Integration tests were given conflicting names, meaning the tests that showed this weren't running.

Fix is a simple change in the mapping, plus error handling.
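A sketch of the corrected mapping fragment, following the path named above (Elastic 6.x syntax; the surrounding index settings are omitted):

```json
{
  "mappings": {
    "_doc": {
      "enabled": false
    }
  }
}
```

With enabled set to false at this level, documents are stored but none of their fields are indexed, which is what the coverage data needs and avoids the too-many-fields error.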

Simple way to re-index publisher names

When DOI metadata is indexed, a publisher name is associated with each Solr record. We need a way to quickly re-index a member's metadata after a member name change.

Date filters don't seem to limit much

https://api.crossref.org/prefixes/10.1016/works?filter=from-pub-date:2010-01,until-pub-date:2010-01
results in 55471 records

http://api-es-staging.crossref.org/prefixes/10.1016/works?filter=from-pub-date:2010-01,until-pub-date:2010-01
results in 950380 records

Doing the same queries with 2016 instead of 2010 results in different numbers, and the second record in my result shows that the publication date is 2016-02-01
http://api-es-staging.crossref.org/prefixes/10.1016/works?filter=from-pub-date:2016-01,until-pub-date:2016-01
The second record is
"DOI": "10.1016/j.jmii.2013.01.003"
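A toy model of the intended filter semantics, useful as a basis for an integration test; the inclusive-boundary interpretation of from-pub-date/until-pub-date is an assumption:

```python
from datetime import date

# A work published 2016-02-01 should NOT match
# filter=from-pub-date:2016-01,until-pub-date:2016-01.
def in_range(pub, from_d, until_d):
    return from_d <= pub <= until_d

print(in_range(date(2016, 2, 1), date(2016, 1, 1), date(2016, 1, 31)))   # False
print(in_range(date(2016, 1, 15), date(2016, 1, 1), date(2016, 1, 31)))  # True
```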

funder indexing causes elastic to fail with too many fields

I know the number of fields can be increased, but I'm concerned that there are more than 1000 fields in the funder index, as funders shouldn't have that many.
I suspect they're being generated incorrectly.
This happens with the live RDF data, which is retrieved by the default configuration.

java.lang.IllegalArgumentException: Limit of total fields [1000] in index [funder] has been exceeded
at org.elasticsearch.index.mapper.MapperService.checkTotalFieldsLimit(MapperService.java:626) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:450) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:353) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:285) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:313) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:230) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:643) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:273) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) ~[elasticsearch-6.2.4.jar:6.2.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
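A hypothetical diagnostic for this kind of mapping explosion: count the distinct field paths a batch of documents would create, to see which dynamic keys are generating fields. The sample funder documents below are invented for illustration:

```python
# Collect every field path (including nested ones) that a set of documents
# would add to an Elastic mapping. Dynamic keys such as per-token maps are
# the usual cause of hitting the 1000-field limit.
def field_paths(doc, prefix=""):
    paths = set()
    for key, value in doc.items():
        path = prefix + key
        paths.add(path)
        if isinstance(value, dict):
            paths |= field_paths(value, path + ".")
    return paths

docs = [
    {"primary-name": "NSF", "tokens": {"nsf": 1, "national": 1}},
    {"primary-name": "NIH", "tokens": {"nih": 1, "national": 1}},
]
all_paths = set()
for d in docs:
    all_paths |= field_paths(d)
print(sorted(all_paths))
# ['primary-name', 'tokens', 'tokens.national', 'tokens.nih', 'tokens.nsf']
```

If the path count grows with the number of documents, the mapping (or the document shape) needs fixing rather than the field limit needing raising.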

Tidy up branches

Remove old branches, merge those that need merging, delete merged and unused ones. Remaining should be master, elastic-2018 and any in-progress branches.

Definition of done:

  • All branches merged / deleted.
  • Clear where to start when working on a new feature.

Synchronous indexing for tests

Integration tests need to be able to index a chunk of XML into the Elastic index. Currently this is done by putting into the feed directory and waiting for indexing to complete. It would be better to ingest a given chunk of files synchronously and explicitly in the test. This will pave the way for more diverse and atomic integration tests.

Various problems accessing the CrossRef API

Hi,

I hope this is the right repository to report such issues, if not please guide me appropriately.

We are using CrossRef for a lot of fetching functionalities inside JabRef.
In the last days (and in general) the API lead to a few problems.

  1. In general the API is often unavailable serving 503 and 504 HTTP status codes
  2. The API randomly responds with a 500 HTTP error code

For example:

http://api.crossref.org/works?query.title=A+break+in+the+clouds%3A+towards+a+cloud+definition&rows=20&offset=0

randomly responds with status 500 and this response body.

{"status":"error","message-type":"exception","message-version":"1.0.0","message":{"name":"class java.lang.RuntimeException","description":"java.lang.RuntimeException: Solr returned a partial result set","message":"Solr returned a partial result set","cause":null}}

Maybe this helps tracking down some problems with the API.

Best regards, Stefan

Coverage zeroes should be floats

Coverage is expressed as a float (i.e. a proportion), but the default zero value is an integer zero. For consistency, the zero should probably be 0.0.
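The inconsistency is visible in the serialised output; a quick illustration:

```python
import json

# An integer zero and a float zero serialise differently, so consumers
# parsing coverage values see a mix of types.
print(json.dumps({"coverage": 0}))    # {"coverage": 0}
print(json.dumps({"coverage": 0.0}))  # {"coverage": 0.0}
```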

NPE in /members route

The line:
:deposits-articles (or (> (get-in coverage-doc [:coverage :all :journal-article :_count]) 0) false)}}
in src/cayenne/data/coverage.clj
is causing an NPE, as the > is sometimes given a nil.
As a stopgap on the staging instance, I swapped the or and the > so there is always a value to compare against:
:deposits-articles (> (or (get-in coverage-doc [:coverage :all :journal-article :_count]) 0) 0)}}

Redundant test data - use it or lose it

The test-data directory contains 56 XML files which are not referenced anywhere else in the code. These were clearly used for manual regression testing. Having the files around with no explanation is confusing, could be misleading, and they may become irrelevant (or may already be).

Ideal case is to use each file in a regression test of some kind. Some of these will be suitable for unit tests; some may concern the feed process, where more of an integration test will be useful.

If these can't be put to good use, delete them.

Verification of coverage from CS

The CS currently has a method for manually scanning the SOLR index for missed DOIs. This will need to be updated, and the process reviewed, for the new Elastic Search version.

Lightweight health checks for data in index

There should be a health check, possibly in the /heartbeat route, that can do some lightweight checks against the data. These can be checked during a re-index to sanity-check the deployment.

E.g.

  • version of cayenne running indexer = version of cayenne running api
  • random work of type X has required fields
  • schema tests, per #30
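A sketch of what such checks might look like; the check names, inputs, and return shape are illustrative, not an existing cayenne interface:

```python
# Lightweight data health checks of the kind listed above, suitable for
# surfacing in a /heartbeat-style route during a re-index.
def heartbeat_checks(indexer_version, api_version, sample_work):
    return {
        "versions-match": indexer_version == api_version,
        "sample-has-doi": bool(sample_work.get("DOI")),
        "sample-has-type": bool(sample_work.get("type")),
    }

checks = heartbeat_checks(
    "3.8.0", "3.8.0",
    {"DOI": "10.5555/12345678", "type": "journal-article"})
print(all(checks.values()))  # True
```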

Docker Compose uses commercial Elastic Search docker image

The docker-compose.yml file uses docker.elastic.co/elasticsearch/elasticsearch and then disables the commercial bits, like xpack. This was a hacky workaround to the fact that Elastic Search refused to distribute an open source docker image. Subsequently the OSS version has been made available, and it's used in Event Data: docker.elastic.co/elasticsearch/elasticsearch-oss

Update docker-compose file.

Test infrastructure should be run *by* Docker Compose, not other way round.

The test code (in the Clojure project) in the current Elastic Search branch automates Docker by spinning up containers. The code runs outside Docker. This will make it tricky to include in CI. It also means that the code being tested isn't running in the target environment.

To bring the methodology into line with how we do Docker, all code should be run in the Docker container, managed with Docker Compose. The Clojure code should have no knowledge of Docker.

Test for scroll pagination - are results frozen?

If the document set changes in the duration of a cursor session, are new results included or is the result set effectively frozen for the duration?

The ResourceSync draft specification assumes that the result set is frozen. If we rely on this, we should demonstrate it in an integration test.

Also, this will surely happen in real usage, so we should be able to say for sure what the behaviour is.

Bibtex citation not working for some IOP DOIs

IOP is having some more concerns about DOIs related to that hugely hyped black hole picture that was all over the place recently. The BibTeX citation formatting isn’t working for them.
10.3847/2041-8213/ab0ec7
10.3847/2041-8213/ab0c96
10.3847/2041-8213/ab0c57
10.3847/2041-8213/ab0e85
10.3847/2041-8213/ab0f43
10.3847/2041-8213/ab1141

The problem is that the citation formatting service doesn’t work for them. If you ask for BibTeX from it, you get a blank response

curl -LH "Accept: text/bibliography; style=bibtex" http://dx.doi.org/10.3847/2041-8213/ab0c96

and if you ask for some other formats, you get an error message: curl -LH "Accept: text/bibliography; style=harvard1" http://dx.doi.org/10.3847/2041-8213/ab0c96 returns a Java stack trace.

There’s a ‘BibTeX’ button on the article home page on IOPscience that tries to use this functionality, so it’s broken for these articles.

I can also confirm this by using search.crossref.org and trying Actions>Cite for the above DOIs.

Ability to query by first or additional author

We've had interest from some research institutions/research tracking systems about being able to query the API by first or additional author, rather than just 'author' (we have those fields in the metadata).

This is useful for institutions who want to track their research outputs based on the lead author on a piece of content.

Journals only available via ISSN

When looking up either journals or the articles contained within a journal, only an ISSN can be used; however, there are journals lacking ISSNs. Within the system, those without ISSNs must have a DOI, though there are some older ones that have neither, which we should ignore.
To find the ones without an ISSN, we should support both lookup by journal DOI and a container DOI for articles.
This will require a schema change and reindex.

Elastic index creation fails if indexes already exist.

cayenne.elastic.mappings/create-indexes expects all indexes not to be present and will fail if they already exist. Sometimes we want to manually delete one or more index and re-create it by running create-indexes.

Update create-indexes to accept pre-existing indexes, and create only when they don't exist.
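A sketch of the desired idempotent behaviour; the helper and index names are illustrative, not cayenne's actual API:

```python
# Given the set of indexes already present, return only those that still
# need creating, so re-running creation after a manual delete is safe.
def indexes_to_create(existing, wanted):
    return [name for name in wanted if name not in existing]

print(indexes_to_create({"work", "member"}, ["work", "member", "funder"]))
# ['funder']
```

The actual fix would check existence via the Elastic client before each create call, rather than assuming a clean cluster.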

Document behaviour around deleted DOIs

We should publish a specification for what happens to DOIs when they're deleted (i.e. aliased), if / how that is propagated into the REST API, and how that is represented. Currently deleted DOIs don't get included in a clean rebuild.

Clarity on ingestion in all forms

Cayenne can load data in a variety of formats. Documentation on ingestion is thin and the code is quite abstract.

Objectives:

  • List in the docs of all data formats that can be ingested.
  • Instructions on how to load test data in manually.
  • Component tests that load data from the file system, but don't index into Elastic.

This will then allow #59 to proceed and we can account for (or delete) each of the unused test files in the repo.

Journals tests don't all run

Some tests were given the same name and ended up not running. This let some regressions slip, such as #75.

Rename the tests and fix whatever needs fixing to make them work.

Pagination

Make sure pagination works to our satisfaction.

Definition of done:

  • All resources that include pagination listed and documented, and the different sort orders available.
  • Implementation should allow that in all cases pagination allows paging through a whole, large result set (e.g. 1 million items), with any search ordering.
  • Integration tests include deep paging.
  • Pagination documented in user docs.
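A toy model of the cursor-style deep paging property the implementation should guarantee: each page resumes after the last item of the previous page, so arbitrarily deep pages cost the same as shallow ones:

```python
# Page through a sorted result set using the last-seen item as a cursor,
# rather than an offset.
def next_page(items, after, size):
    start = 0 if after is None else items.index(after) + 1
    return items[start:start + size]

items = list(range(10))
cursor, seen = None, []
while True:
    batch = next_page(items, cursor, 3)
    if not batch:
        break
    seen.extend(batch)
    cursor = batch[-1]
print(seen == items)  # True
```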

Title File loading is out of sync with upstream file

The file available at http://ftp.crossref.org/titlelist/titleFile.csv has changed its headers. The column positions are hard-coded to expect a certain ordering.

The format was most recently changed with [Jira CS-3961](https://jira.crossref.org/jira/browse/CS-3961) and updated in Cayenne master commit 349e12c452512c8f0906a5c816e7a9c599419b70, but this wasn't merged into the elastic-2019 branch.

To fix this, and prevent future brittleness, make column identification dynamic.

Funder information

The Funding Information model seems to work but there are questions about its suitability. Decide on the criteria for success, then make sure it does what we want.

Definition of done:

  • All features regarding funding information, both /funders route and /works route (and any others that come out of the woodwork) are documented.
  • All the features should be implemented and tested, including queries.
  • RDF relationships that are directly used should be documented in developer docs.

CSV column order should be dynamic

Currently hard-coded, leading to issues like #45 . The order of columns should be based on the headers, which are present in the file. A test case should be provided too.
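A sketch of header-based column resolution using Python's csv module; the column names and sample row are assumptions about the title file's shape:

```python
import csv
import io

# Resolve columns by header name instead of position, so upstream header
# reordering can't break parsing.
sample = "JournalTitle,ISSN,DOI\nExample Journal,1234-5678,10.5555/example\n"
for row in csv.DictReader(io.StringIO(sample)):
    print(row["ISSN"], row["DOI"])  # 1234-5678 10.5555/example
```

Reordering the columns in `sample` (and their values) leaves the lookups unchanged, which is the test case the issue asks for.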

Mock API responses in most integration tests.

Current integration and component tests start an HTTP server and make requests via the network. It's more efficient and less error-prone to go straight to the route functions rather than starting an HTTP server. This also allows unit tests on the API.

By taking the network request out of the default test request function in cayenne.api-fixture/api-get we can make most tests run without the network stack. Those that genuinely need to test the network can still use the old method.

Configurable feed concurrency

Currently feed (index) concurrency is set to the number of processors minus one. Because feed is used in integration tests, in which clarity is more important than speed, move this into the configuration system with the same default, but allow tests to set it differently.
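A sketch of the configuration lookup; the variable name FEED_CONCURRENCY is an assumption, not an existing setting:

```python
import os

# Read feed concurrency from configuration, defaulting to processors - 1
# (never below 1); tests can override it to 1 for deterministic ordering.
default = max(1, (os.cpu_count() or 2) - 1)
concurrency = int(os.environ.get("FEED_CONCURRENCY", default))
print(concurrency >= 1)  # True
```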

Production Elastic Configuration

The sharding and replication config is hard-coded in the cayenne.elastic.mappings namespace. We need to decide on sensible values, and whether they should be configurable.

Tests are time-dependent, but time isn't specified

Tests that rely on snapshots have an implicit coupling to the time at which the snapshots were taken and when the tests were run. Coverage is broken down into 'current' and 'backfile', and if the tests are run a year later than when the snapshot was taken, some works will be classified differently.

Update test fixture to freeze time at a known point.
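A sketch of the approach: make "now" an explicit input to the classification, so the fixture can freeze it. The two-year current/backfile boundary below is an assumption for illustration:

```python
from datetime import date

# Classify a work as backfile relative to an injected "now", instead of
# reading the system clock inside the classifier.
def is_backfile(pub_year, now):
    return pub_year <= now.year - 2

frozen_now = date(2019, 3, 28)  # the fixture pins this
print(is_backfile(2016, frozen_now))  # True
print(is_backfile(2018, frozen_now))  # False
```

Run a year later, the frozen `now` keeps both assertions stable, which is exactly the property the current snapshot-based tests lack.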

type 'other' includes both books and chapter / content items

We have 'other' as a type, but it includes both book and book child (usually chapter). Other book / book child types are split out individually; these should likewise be separated into distinct types.

For example - this is registered as a book with book_type 'other':
http://api.crossref.org/works/10.5555/suffixtest

This is registered as a content_item (child of 'book') with component_type 'other' and also has type 'other' in the JSON:

http://api.crossref.org/works/10.4337/9781781001639.00001

Inconsistent Content Negotiation response for 10.4414-prefixed DOI

Content negotiation for https://doi.org/10.4414/smw.2018.14628 doesn't work. The DOI proxy should redirect to a service that can provide a response (e.g. data.crossref.org)

curl -vLH "Accept: application/rdf+xml" https://doi.org/10.4414/smw.2018.14628

< HTTP/1.1 303 See Other
< Server: Apache-Coyote/1.1
< Location: https://doi.emh.ch/smw.2018.14628

Instead, it redirects to the DOI's landing page. This may be caused by a bug in the DOI proxy, or missing data.

The DOI appears to be a Crossref one:

http://api.crossref.org/works/10.4414/smw.2018.14628/agency

This behaviour happens with other DOIs with the same prefix, e.g.:

  • 10.4414/smf.2004.05202
  • 10.4414/smf.2004.05203
  • 10.4414/smf.2004.05201
