License: MIT License

arXiv external links

Clearinghouse for relations between arXiv e-prints and external resources

Background

A wide range of requirements and feature requests that we have received from stakeholders and end users involve attaching metadata about relations between arXiv e-prints and external resources. This includes things like links to datasets, code, and other online content, and better support for information about the published version of record.

Including this kind of relational metadata in the core arXiv metadata record is a poor fit given the way that e-prints are versioned, the requirement that secondary metadata be maintainable outside of the submission process, and the requirement that support for secondary metadata be as evolvable and extensible as possible.

Additionally, we need to bring forward into NG the automated routines that we use to harvest relational metadata (e.g. DOIs, journal citations) from other publishing platforms. A shortcoming of the classic system is that the provenance of these kinds of metadata is not tracked, which makes it hard for our partners to interpret and use those metadata downstream.

Goals

Store information about relationships between e-prints and external resources:
- Published versions, e.g. via DOIs
- Datasets
- Code repositories
- Multimedia
- Methods/protocols
- Related works
- Blogs and other websites
- Etc
Track provenance/history of this information:
- When it was added
- How and by whom it was added
Provide APIs for retrieving this information and for adding new relations.
Provide an intuitive user interface for authors to curate these relations for their e-prints.

Requirements

Author-owners can add, edit, and deactivate relationships via an HTML UI. They can also view aggregated relations and the detailed provenance log.
Authorized API clients can add, edit, and deactivate relationships via a JSON API. They can also read aggregated relations and the detailed provenance log.
Anonymous users and clients can view/read aggregated relations and the provenance of active relations.
Relation data are immutable.
- Add means creating a new assertion about a relation.
- Edit means creating a new assertion that supersedes a previous assertion.
- Deactivate means creating a new assertion that a previous relation is incorrect and should be suppressed.
Relation data model includes
- Type of relation
- E-print id and version
- Type of resource
- Canonical identifier for resource (DOI, URI, etc.)
- Freeform description of relation
- Datetime added
- Client + user who created
- Identifier of relation superseded or suppressed
Emits an event on a Kinesis stream when data are added.
For each resource type, provides a mechanism to verify that the resource exists.

Constraints

Flask app that follows the design approach outlined in https://arxiv.github.io/arxiv-arxitecture/crosscutting/services.html . Can be deployed as a Docker image, e.g. with the uWSGI application server
Separate blueprints for API, user interface
Use arXiv base for base templates, error handling, etc
Use arXiv auth for authn/z
API documented with OpenAPI 3 and JSON schema

Code overview

This project follows the general design approach described at https://arxiv.github.io/arxiv-arxitecture/crosscutting/services.html .

The application source lives in relations/.

The application factory module relations/factory.py defines the construction of two Flask apps: (1) an API application that provides the REST API, and (2) a UI application that provides views for human users.

Note that these apps use the following general tooling from the arXiv project:

arxiv.base.Base, which adds some useful things like exception handlers, an arxiv URL converter, etc.
The arxiv.users library, which adds tooling for authn/z.

In general, it's a good idea to get comfortable with the arxiv namespaced packages, as there are several useful tools there.

HTTP routing is implemented in the routes module. The API and UI each have their own blueprint. Routing functions don't implement much logic; they are there to provide an interface to the controller functions.

Controller functions do the work of handling requests. They are defined in relations/controllers.py. Controllers orchestrate the work: they use domain objects and services (described below) to handle requests.

The service domain is defined in relations/domain.py. The domain consists of classes or other structs that define the main concepts of the application, along with the core domain logic and rules. See https://arxiv.github.io/arxiv-arxitecture/crosscutting/services.html#data-domain for details.

Service modules can be found in relations/services/. This is where (for example) a Kinesis notification producer would be implemented.

Quick-start

We use Pipenv for dependency management.

pipenv install --dev

You can run either the API or the UI using the Flask development server.

FLASK_APP=ui.py FLASK_DEBUG=1 pipenv run flask run

Dockerfiles are also provided in the root of this repository. These use uWSGI and the corresponding wsgi_[xxx].py entrypoints.

Contributing

Please see the arXiv contributor guidelines for tips on getting started.

Code of Conduct

All contributors are expected to adhere to the arXiv Code of Conduct.

Issues

Implement blueprint for API routes

In order to support the requirement that

Authorized API clients can add, edit, deactivate relationships via JSON API. Read aggregated relations, read detailed provenance log.

we will need a Flask blueprint at relations/routes/api.py that exposes the following paths and methods:

POST to /{arxiv id}/relations : create a new relation for an e-print (supports #3)
PUT to /{arxiv id}/relations/{relation id} : create a new relation that supersedes an existing relation (supports #3)
DELETE to /{arxiv id}/relations/{relation id} : create a new relation that suppresses an existing relation (supports #3)
GET to /{arxiv id} : get all of the active (not suppressed or superseded) relations for an e-print (supports #4)
GET to /{arxiv id}/log : get the complete set of relation events (including suppressed and superseded)

See https://arxiv.github.io/arxiv-arxitecture/crosscutting/services.html#routes for how routes are implemented

Write JSON Schema document for assertion about link between e-print and external resource

This should define (in schema/resources/assertion.json) a JSON representation of the domain class from #1.

Use-case: users can see a list of all external relations associated with a particular e-print

As a reader on the arXiv platform, I want to see a list of all of the external relations associated with a particular e-print, so that I can find auxiliary information like datasets, code, and other content. I should be able to see who added the information, and when it was added.

Implement validator for DOI

We require a function is_valid(value: str) -> bool in relations/process/validate/doi.py that checks whether or not a string is a valid DOI.

We can re-use the permissive pattern at https://github.com/arXiv/arxiv-base/blob/cc9eb8f9b5e22e643c3a669420d6f9931acb1bfd/arxiv/base/urls/links.py#L74-L79
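A minimal sketch of such a validator follows. The pattern here is an assumption written for illustration (the prefix "10.", a registrant code of 4 to 9 digits, a slash, and a non-empty suffix without whitespace); it is not copied from arxiv-base, so the linked pattern should be substituted if that is the one to re-use.

```python
import re

# Assumption: a permissive DOI shape, NOT the pattern from arxiv-base.
# "10." + 4-9 digit registrant code + "/" + non-empty suffix without spaces.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def is_valid(value: str) -> bool:
    """Return True if ``value`` looks like a DOI (syntax check only)."""
    return DOI_PATTERN.fullmatch(value.strip()) is not None


print(is_valid("10.1103/PhysRevD.76.013009"))
print(is_valid("doi:10.1103/PhysRevD.76.013009"))
```

This only checks syntax. Whether the DOI actually resolves is a separate question, closer to the "verify that resource exists" requirement.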

Implement domain model for links between arXiv e-prints and external resources

We need a core domain class that defines an external relation assertion, in relations/domain.py.

Relation data are immutable.

Add means creating a new assertion about a relation.
Edit means creating a new assertion that supersedes a previous assertion.
Deactivate means creating a new assertion that a previous relation is incorrect and should be suppressed.

Relation data model includes

Type of relation
E-print id and version
Type of resource
Canonical identifier for resource (DOI, URI, etc.)
Freeform description of relation
Datetime added
Client + user who created
Identifier of relation superseded or suppressed

Consider something like typing.NamedTuple.
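One way the typing.NamedTuple suggestion might look is sketched below. All field names, the "add"/"edit"/"suppress" values, and the `active` helper are hypothetical choices made for this illustration; the issue does not prescribe them.

```python
from datetime import datetime, timezone
from typing import List, NamedTuple, Optional


class Relation(NamedTuple):
    """An immutable assertion linking an e-print to an external resource."""

    identifier: str
    relation_type: str       # "add", "edit", or "suppress" (assumed values)
    arxiv_id: str
    arxiv_ver: int
    resource_type: str       # e.g. "doi" or "url" (assumed values)
    resource_id: str         # canonical identifier for the resource
    description: str
    added_at: datetime
    client: str              # API client that created the assertion
    user: Optional[str]      # user who created it, if any
    previous: Optional[str] = None  # relation superseded or suppressed


def active(log: List[Relation]) -> List[Relation]:
    """Return relations that are neither superseded nor suppressed."""
    replaced = {r.previous for r in log if r.previous is not None}
    return [r for r in log
            if r.relation_type != "suppress" and r.identifier not in replaced]


added = Relation("1", "add", "1901.00123", 2, "doi", "10.1000/wrong",
                 "Version of record",
                 datetime(2019, 1, 15, tzinfo=timezone.utc),
                 "harvester", None)
# An edit never mutates `added`; it creates a new tuple that points back to it.
edited = added._replace(identifier="2", relation_type="edit",
                        resource_id="10.1000/right",
                        previous=added.identifier)
print([r.resource_id for r in active([added, edited])])
```

Because a NamedTuple rejects attribute assignment, the "relation data are immutable" rule is enforced by the type itself, and the full list of tuples doubles as the provenance log.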

Implement validator for URL

We require a function is_valid(value: str) -> bool in relations/process/validate/url.py that checks whether or not a string is a valid HTTP(S) URL.
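A possible shape for this function, using only the standard library, is sketched below. The specific rules (reject whitespace, require an http or https scheme and a non-empty host) are assumptions about what "valid" should mean here.

```python
from urllib.parse import urlsplit


def is_valid(value: str) -> bool:
    """Return True if ``value`` is a syntactically valid HTTP(S) URL."""
    if not value or any(ch.isspace() for ch in value):
        return False
    try:
        parts = urlsplit(value)
        hostname = parts.hostname
    except ValueError:  # raised for e.g. a malformed IPv6 literal
        return False
    # urlsplit lower-cases the scheme, so "HTTP://..." is accepted too.
    return parts.scheme in ("http", "https") and bool(hostname)


print(is_valid("https://github.com/arXiv/arxiv-base"))
print(is_valid("ftp://example.org/data.csv"))
```

Like the DOI validator, this checks form only and makes no network request.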

Use-case: ingest and display data about e-prints from Papers with Code

Papers with Code finds code repositories associated with ML papers, including e-prints on arXiv. They make their data available under CC-BY-SA. We should explore what would be involved in incorporating this dataset into arXiv external links, and displaying links to the code repositories on the arXiv abs page of ML papers.

@rstojnic what do you think?

Implement a storage service for external links

We need to implement a module for storing and retrieving external links for e-prints. It should go here: https://github.com/arXiv/arxiv-external-links/tree/develop/relations/services

We can use a SQL database for this, or something else. Let's discuss what to use before we get too far into the implementation.

We'll want to focus on a storage model that works well for #3 and #4
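To make the discussion concrete, here is a rough sketch of one option: an append-only SQL table, shown with the standard library's sqlite3 module. The table layout, the column names, and the `RelationStore` class are all hypothetical, and the choice of SQLite is only for illustration since the storage backend is still undecided.

```python
import sqlite3
from typing import List, Optional, Tuple

SCHEMA = """
CREATE TABLE IF NOT EXISTS relation (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    arxiv_id TEXT NOT NULL,
    kind TEXT NOT NULL,          -- 'add', 'edit', or 'suppress' (assumed)
    resource TEXT,
    previous INTEGER REFERENCES relation(id)
)
"""


class RelationStore:
    """Append-only store: rows are inserted, never updated or deleted."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.execute(SCHEMA)

    def append(self, arxiv_id: str, kind: str, resource: Optional[str],
               previous: Optional[int] = None) -> int:
        """Record a new assertion and return its id."""
        cur = self.conn.execute(
            "INSERT INTO relation (arxiv_id, kind, resource, previous)"
            " VALUES (?, ?, ?, ?)",
            (arxiv_id, kind, resource, previous),
        )
        self.conn.commit()
        return cur.lastrowid

    def active(self, arxiv_id: str) -> List[Tuple[int, str]]:
        """Rows that no later row supersedes or suppresses."""
        return self.conn.execute(
            "SELECT r.id, r.resource FROM relation r"
            " WHERE r.arxiv_id = ? AND r.kind != 'suppress'"
            " AND NOT EXISTS"
            " (SELECT 1 FROM relation n WHERE n.previous = r.id)"
            " ORDER BY r.id",
            (arxiv_id,),
        ).fetchall()

    def log(self, arxiv_id: str) -> List[tuple]:
        """Every row for the e-print, including replaced ones."""
        return self.conn.execute(
            "SELECT id, kind, resource, previous FROM relation"
            " WHERE arxiv_id = ? ORDER BY id",
            (arxiv_id,),
        ).fetchall()


store = RelationStore(sqlite3.connect(":memory:"))
first = store.append("1901.00123", "add", "10.1000/wrong")
store.append("1901.00123", "edit", "10.1000/right", previous=first)
print(store.active("1901.00123"))
```

The `active` query covers reading active relations, while `log` keeps the full history available for provenance views. Nothing is ever updated in place, which matches the immutability requirement.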

Use case: API client can read all active links between e-print and external resource

As a developer of an information system/platform, I want to be able to develop against a RESTful JSON API that supports retrieving an array of all active external relations for an individual e-print and a summary of who added the information, so that I can fill in incomplete data in my system and make informed choices about what to include.

Variation 2:

... So that I can develop a user interface that presents external links to users.

This will require an endpoint that supports GET requests, with the e-print id (and optional version) as a URL parameter.

We may want a query param to toggle between condensed and detailed views, e.g. for applications that don't require provenance.

Implement blueprint for UI routes

In order to support the requirement that

Author-owners can add, edit, deactivate relationships via an HTML UI. View aggregated relations, view detailed provenance log.

we will need a Flask blueprint at relations/routes/ui.py that exposes the following paths and methods:

POST to /{arxiv id}/relations : create a new relation for an e-print (supports #6)
POST to /{arxiv id}/relations/{relation id} : create a new relation that supersedes an existing relation (supports #6)
POST to /{arxiv id}/relations/{relation id}/delete : create a new relation that suppresses an existing relation (supports #6)
GET to /{arxiv id} : get all of the active (not suppressed or superseded) relations for an e-print (supports #7)
GET to /{arxiv id}/log : get the complete set of relation events (including suppressed and superseded)

See https://arxiv.github.io/arxiv-arxitecture/crosscutting/services.html#routes for how routes are implemented

Parser for legacy bib-feeds

arXiv consumes "bib-feeds" from a handful of publishers and other data providers. These feeds contain DOI and/or journal reference information for arXiv e-prints. Samples can be found in tests/data/legacy/feeds/.

We require a function parse(feed: str) -> List[FeedEntry] at harvesters/bibfeed/parse.py, where FeedEntry is

class FeedEntry(TypedDict):
    arxiv_id: str
    doi: Optional[str]
    journal_ref: Optional[str]

Note that these feeds include relations for non-arXiv preprints as well; parse() should only return entries for arXiv e-prints.
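The feed format is not described here, so the sketch below covers only the filtering step. It assumes that some earlier step has already turned the feed into dictionaries with a hypothetical `preprint_id` key; that key, the `only_arxiv` helper, and the identifier pattern are illustrative assumptions rather than part of the specification.

```python
import re
from typing import List, Optional, TypedDict


class FeedEntry(TypedDict):
    arxiv_id: str
    doi: Optional[str]
    journal_ref: Optional[str]


# New-style ("1901.00123") and old-style ("hep-th/9901001") identifiers,
# each with an optional version suffix such as "v2".
ARXIV_ID = re.compile(r"(\d{4}\.\d{4,5}|[a-z-]+(\.[A-Z]{2})?/\d{7})(v\d+)?")


def only_arxiv(candidates: List[dict]) -> List[FeedEntry]:
    """Keep candidates whose ``preprint_id`` is an arXiv identifier."""
    entries: List[FeedEntry] = []
    for item in candidates:
        preprint_id = str(item.get("preprint_id", "")).strip()
        if ARXIV_ID.fullmatch(preprint_id):
            entries.append(FeedEntry(
                arxiv_id=preprint_id,
                doi=item.get("doi"),
                journal_ref=item.get("journal_ref"),
            ))
    return entries


candidates = [
    {"preprint_id": "1901.00123v2", "doi": "10.1000/example"},
    {"preprint_id": "hep-th/9901001", "journal_ref": "J. Example 1 (1999) 1"},
    {"preprint_id": "2020.01.01.123456", "doi": "10.1101/other"},
]
print([entry["arxiv_id"] for entry in only_arxiv(candidates)])
```

A real parse() would wrap this filter around whatever reader the sample feeds in tests/data/legacy/feeds/ turn out to need.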

Implement service module for checking whether or not a URL exists

We require a service module at relations/services/url.py that defines a class URLService with a method exists(url: str, timeout_seconds: int = 5) -> bool. The method should return True if the provided URL exists, i.e. a status code < 400 is returned in response to a HEAD request in under timeout_seconds. We can use the requests module for this. Exceptions must be handled gracefully, because this method will be called against a very wide range of addresses.
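The sketch below shows the general idea. To stay free of third-party dependencies it uses the standard library's urllib.request in place of the requests module mentioned above, so it illustrates the behaviour rather than the intended implementation. The up-front scheme check is an added assumption.

```python
import http.client
import urllib.request


class URLService:
    """Checks whether a URL responds successfully to a HEAD request."""

    def exists(self, url: str, timeout_seconds: int = 5) -> bool:
        # Assumption: only HTTP(S) addresses are ever worth checking.
        if not url.lower().startswith(("http://", "https://")):
            return False
        try:
            request = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(
                request, timeout=timeout_seconds
            ) as response:
                return response.status < 400
        except (OSError, ValueError, http.client.HTTPException):
            # urllib raises HTTPError for 4xx/5xx responses; it, URLError,
            # and socket timeouts are all OSError subclasses. ValueError
            # covers malformed URLs.
            return False
```

Note that the timeout in urlopen applies to individual blocking socket operations rather than to the request as a whole, so a strict "under timeout_seconds" guarantee would need extra care. Some servers also reject HEAD requests for resources that do exist, which this sketch reports as not existing.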

Write JSON Schema for e-print external relations view

This should describe (as schema/resources/links.json) the JSON view implemented for #4.

Use-case: author-owner can add, edit links to external resources for their e-prints

As an author-owner of an announced e-print (i.e. the submitter, or another user authorized by the submitter) I want to be able to curate external links associated with that e-print, so that I can direct readers to useful information related to my research. I should be able to add new links, and also make revisions/override external links added by other users/clients.

Author-owners should have privileges that other clients of this service don't have, namely the ability to deactivate/suppress links created by other clients. For example, if an automated harvester adds a DOI that is incorrect, the author-owner should have full control over suppressing the incorrect relation or adding a correct DOI that supersedes it.

Use case: API client can add new assertions about links between e-print and external resource

As the developer of a metadata harvester for arXiv e-prints, I want to develop against a RESTful JSON API that supports adding new assertions about relations between e-prints and external resources, so that information that I collect about published versions of a paper (e.g. DOI of Version of Record) can be made available to other users and API clients.

One example application is a feed collector that monitors publisher APIs for DOIs associated with e-prints.

Another application might be a collector that pulls from ORCID.

To support this use case, we will need an HTTP route that supports POST and/or PUT requests that create new assertions.