zzzarchived_arxiv-external-links's Introduction

arXiv external links

Clearinghouse for relations between arXiv e-prints and external resources

Background

A wide range of requirements and feature requests that we have received from stakeholders and end users involve attaching metadata about relations between arXiv e-prints and external resources. This includes things like links to datasets, code, and other online content, and better support for information about the published version of record.

Including this kind of relational metadata in the core arXiv metadata record is a poor fit given the way that e-prints are versioned, the requirement that secondary metadata be maintainable outside of the submission process, and the requirement that support for secondary metadata be as evolvable and extensible as possible.

Additionally, we need to bring forward into NG the automated routines that we use to harvest relational metadata (e.g. DOIs, journal citations) from other publishing platforms. A shortcoming of the classic system is that the provenance of these kinds of metadata is not tracked, which makes it difficult for our partners to interpret and use those metadata downstream.

Goals

  1. Store information about relationships between e-prints and external resources:

    • Published versions, e.g. via DOIs
    • Datasets
    • Code repositories
    • Multimedia
    • Methods/protocols
    • Related works
    • Blogs and other websites
    • Etc
  2. Track provenance/history of this information.

    • When it was added,
    • How/by whom
  3. Provide APIs for retrieving this information and adding new relations.

  4. Provide an intuitive user interface for authors to curate these relations for their e-prints.

Requirements

  1. Author-owners can add, edit, deactivate relationships via an HTML UI. View aggregated relations, view detailed provenance log.

  2. Authorized API clients can add, edit, deactivate relationships via JSON API. Read aggregated relations, read detailed provenance log.

  3. Anonymous users, clients can view/read aggregated relations, provenance of active relations.

  4. Relation data are immutable.

    • Add means create new assertion about relation.
    • Edit means create new assertion that supersedes a previous assertion.
    • Deactivate means create new assertion that a previous relation is incorrect, should be suppressed.
  5. Relation data model includes

    • Type of relation
    • E-print id and version
    • Type of resource
    • Canonical identifier for resource (doi, uri, etc)
    • Freeform description of relation
    • Datetime added
    • Client + user who created
    • Identifier of relation superseded or suppressed
  6. Emits event on Kinesis stream when data is added.

  7. For each resource type, mechanism to verify that resource exists.
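
The immutability rule in requirement 4 can be read as an append-only log in which nothing is ever updated in place. The following is a minimal sketch of that reading, not the project's implementation: the Assertion class, its field names, and the active() helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Assertion:
    """One immutable log entry (hypothetical names, for illustration only)."""
    id: int
    kind: str                        # "add", "edit", or "deactivate"
    resource: Optional[str]          # canonical identifier; None when deactivating
    previous: Optional[int] = None   # id of the assertion superseded or suppressed


def active(log: List[Assertion]) -> List[Assertion]:
    """Return the assertions that are neither superseded nor suppressed."""
    replaced = {a.previous for a in log if a.previous is not None}
    return [a for a in log if a.kind != "deactivate" and a.id not in replaced]


# Example log: an incorrect DOI is corrected, and a code link is withdrawn.
log = [
    Assertion(1, "add", "doi:10.1000/wrong"),
    Assertion(2, "edit", "doi:10.1000/right", previous=1),
    Assertion(3, "add", "https://example.org/code"),
    Assertion(4, "deactivate", None, previous=3),
]
current = active(log)  # only assertion 2 remains active
```

Because every change is a new entry, the full log doubles as the provenance record, and the aggregated view is derived from it on demand.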

Constraints

  1. Flask app that follows the design approach outlined in https://arxiv.github.io/arxiv-arxitecture/crosscutting/services.html . Can be deployed as a Docker image, e.g. with uWSGI application server.
  2. Separate blueprints for API, user interface
  3. Use arXiv base for base templates, error handling, etc
  4. Use arXiv auth for authn/z
  5. API documented with OpenAPI 3 and JSON schema

Code overview

This project follows the general design approach described at https://arxiv.github.io/arxiv-arxitecture/crosscutting/services.html .

The application source lives in relations/.

The application factory module relations/factory.py defines the construction of two Flask apps: (1) an API application that provides the REST API, and (2) a UI application that provides views for human users.

Note that these apps use the following general tooling from the arXiv project:

  • arxiv.base.Base, which adds some useful things like exception handlers, an arxiv URL converter, etc.
  • The arxiv.users library, which adds tooling for authn/z.

In general, it's a good idea to get comfortable with the arxiv namespaced packages, as there are several useful tools there.

HTTP routing is implemented in the routes module. The API and UI each have their own blueprint. Routing functions don't implement much logic; they are there to provide an interface to the controller functions.

Controller functions do the work of handling requests. They are defined in relations/controllers.py. Controllers orchestrate the real work; they use domain objects and services (below) to carry out work to handle requests.

The service domain is defined in relations/domain.py. The domain consists of classes or other structs that define the main concepts of the application and the core domain logic/rules. See https://arxiv.github.io/arxiv-arxitecture/crosscutting/services.html#data-domain for details.

Service modules can be found in relations/services/. This is where (for example) a Kinesis notification producer would be implemented.

Quick-start

We use Pipenv for dependency management.

pipenv install --dev

You can run either the API or the UI using the Flask development server.

FLASK_APP=ui.py FLASK_DEBUG=1 pipenv run flask run

Dockerfiles are also provided in the root of this repository. These use uWSGI and the corresponding wsgi_[xxx].py entrypoints.

Contributing

Please see the arXiv contributor guidelines for tips on getting started.

Code of Conduct

All contributors are expected to adhere to the arXiv Code of Conduct.

zzzarchived_arxiv-external-links's Issues

Implement blueprint for API routes

In order to support the requirement that

Authorized API clients can add, edit, deactivate relationships via JSON API. Read aggregated relations, read detailed provenance log.

we will need a Flask blueprint at relations/routes/api.py that exposes the following paths and methods:

  • POST to /{arxiv id}/relations : create a new relation for an e-print (supports #3)
  • PUT to /{arxiv id}/relations/{relation id} : create a new relation that supersedes an existing relation (supports #3)
  • DELETE to /{arxiv id}/relations/{relation id} : create a new relation that suppresses an existing relation (supports #3)
  • GET to /{arxiv id} : get all of the active (not suppressed or superseded) relations for an e-print (supports #4)
  • GET to /{arxiv id}/log : get the complete set of relation events (including suppressed and superseded)

See https://arxiv.github.io/arxiv-arxitecture/crosscutting/services.html#routes for how routes are implemented

Implement domain model for links between arXiv e-prints and external resources

We need a core domain class that defines an external relation assertion, in relations/domain.py.

Relation data are immutable.

  • Add means create new assertion about relation.
  • Edit means create new assertion that supersedes a previous assertion.
  • Deactivate means create new assertion that a previous relation is incorrect, should be suppressed.

Relation data model includes

  • Type of relation
  • E-print id and version
  • Type of resource
  • Canonical identifier for resource (doi, uri, etc)
  • Freeform description of relation
  • Datetime added
  • Client + user who created
  • Identifier of relation superseded or suppressed

Consider something like typing.NamedTuple.
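
A minimal sketch of what such a class might look like with typing.NamedTuple. The class name, field names, and example values are assumptions chosen to mirror the list above; they are not taken from the project's code.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class Relation(NamedTuple):
    """An immutable assertion linking an e-print to an external resource.

    All names here are hypothetical; they follow the data model list above.
    """
    relation_type: str          # type of relation
    arxiv_id: str               # e-print id
    arxiv_ver: int              # e-print version
    resource_type: str          # type of resource, e.g. "doi" or "uri"
    resource_id: str            # canonical identifier for the resource
    description: str            # freeform description of the relation
    added_at: datetime          # datetime added
    creator: str                # client + user who created the assertion
    supersedes_or_suppresses: Optional[str] = None  # id of an earlier relation


example = Relation(
    relation_type="published_as",
    arxiv_id="1234.56789",
    arxiv_ver=2,
    resource_type="doi",
    resource_id="10.1000/example",
    description="Version of record",
    added_at=datetime(2019, 1, 1, tzinfo=timezone.utc),
    creator="harvester-client",
)

# Tuples cannot be modified in place; _replace() returns a new instance,
# which matches the "edit means create new assertion" rule.
revised = example._replace(description="Corrected description")
```

A NamedTuple rejects attribute assignment, so the immutability requirement is enforced by the type itself rather than by convention.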

Implement validator for URL

We require a function is_valid(value: str) -> bool in relations/process/validate/url.py that checks whether or not a string is a valid HTTP(S) URL.
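
One possible shape for this function, using only the standard library. This is a sketch under assumptions: "valid" is taken to mean an absolute URL with an http or https scheme, a host, and no whitespace. The issue does not spell out stricter rules, so a real implementation may need more checks.

```python
from urllib.parse import urlparse


def is_valid(value: str) -> bool:
    """Return True if ``value`` looks like an absolute HTTP(S) URL."""
    if any(ch.isspace() for ch in value):
        return False
    try:
        parts = urlparse(value)
        host = parts.hostname
        _ = parts.port  # raises ValueError for a malformed or out-of-range port
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. unbalanced IPv6 brackets.
        return False
    return parts.scheme in ("http", "https") and bool(host)
```

This only checks syntax. Whether the URL actually resolves is a separate question, handled by a service that makes a request.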

Use case: API client can read all active links between e-print and external resource

As a developer of an information system/platform, I want to be able to develop against a RESTful JSON API that supports retrieving an array of all active external relations for an individual e-print and a summary of who added the information, so that I can fill in incomplete data in my system and make informed choices about what to include.

Variation 2:

... So that I can develop a user interface that presents external links to users.

This will require an endpoint that supports GET requests, with the e-print id (and optional version) as a URL parameter.

We may want a query parameter to toggle between condensed and detailed views, e.g. for applications that don't require provenance.
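
The toggle could be as simple as dropping provenance keys from each serialized relation. The sketch below assumes relations are serialized as plain dicts; the key names and the serialize() helper are hypothetical.

```python
from typing import Any, Dict

# Hypothetical names for the keys that carry provenance information.
PROVENANCE_FIELDS = ("added_at", "creator", "supersedes_or_suppresses")


def serialize(relation: Dict[str, Any], detailed: bool = False) -> Dict[str, Any]:
    """Return a copy of ``relation``, without provenance unless ``detailed``."""
    if detailed:
        return dict(relation)
    return {k: v for k, v in relation.items() if k not in PROVENANCE_FIELDS}
```

A route could then derive the detailed flag from the query string and pass it through to the controller unchanged.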

Implement blueprint for UI routes

In order to support the requirement that

Author-owners can add, edit, deactivate relationships via an HTML UI. View aggregated relations, view detailed provenance log.

we will need a Flask blueprint at relations/routes/ui.py that exposes the following paths and methods:

  • POST to /{arxiv id}/relations : create a new relation for an e-print (supports #6)
  • POST to /{arxiv id}/relations/{relation id} : create a new relation that supersedes an existing relation (supports #6)
  • POST to /{arxiv id}/relations/{relation id}/delete : create a new relation that suppresses an existing relation (supports #6)
  • GET to /{arxiv id} : get all of the active (not suppressed or superseded) relations for an e-print (supports #7)
  • GET to /{arxiv id}/log : get the complete set of relation events (including suppressed and superseded)

See https://arxiv.github.io/arxiv-arxitecture/crosscutting/services.html#routes for how routes are implemented

Parser for legacy bib-feeds

arXiv consumes "bib-feeds" from a handful of publishers and other data providers. These feeds contain DOI and/or journal reference information for arXiv e-prints. Samples can be found in tests/data/legacy/feeds/.

We require a function parse(feed: str) -> List[FeedEntry] at harvesters/bibfeed/parse.py, where FeedEntry is

from typing import Optional, TypedDict

class FeedEntry(TypedDict):
    arxiv_id: str
    doi: Optional[str]
    journal_ref: Optional[str]

Note that these feeds include relations for non-arXiv preprints as well; parse() should only return entries for arXiv e-prints.
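
The feed format itself is not described here, so the sketch below covers only the filtering step. It assumes candidate entries have already been extracted and keeps those whose identifier matches the new-style (e.g. 2101.01234v2) or old-style (e.g. hep-th/9901001) arXiv identifier patterns. The ARXIV_ID pattern and the only_arxiv() helper are illustrative names, not part of the project.

```python
import re
from typing import List, Optional, TypedDict


class FeedEntry(TypedDict):
    arxiv_id: str
    doi: Optional[str]
    journal_ref: Optional[str]


# New-style ids (YYMM.NNNN or YYMM.NNNNN) and old-style ids
# (archive/YYMMNNN, optionally with a subject class such as math.GT),
# each with an optional version suffix like v2.
ARXIV_ID = re.compile(
    r"^(?:\d{4}\.\d{4,5}|[a-z-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?$"
)


def only_arxiv(entries: List[FeedEntry]) -> List[FeedEntry]:
    """Keep only the entries whose identifier looks like an arXiv id."""
    return [entry for entry in entries if ARXIV_ID.match(entry["arxiv_id"])]
```

parse() could apply a filter like this as its final step, after the format-specific extraction.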

Implement service module for checking whether or not a URL exists

We require a service module at relations/services/url.py that defines a class URLService with a method exists(url: str, timeout_seconds: int = 5) -> bool. The method should return True if the provided URL exists, i.e. a status code < 400 is returned in response to a HEAD request in under timeout_seconds. We can use the requests module for this. Care should be taken to handle any exceptions gracefully, for this method will be used to call a very wide range of addresses.
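
A rough sketch of such a class follows. To stay dependency-free it uses the standard library's urllib rather than the requests module suggested above, so treat it as an illustration of the behaviour rather than the intended implementation. Note also that urllib's timeout applies to individual socket operations, which only approximates "a response in under timeout_seconds".

```python
import urllib.request
from urllib.parse import urlparse


class URLService:
    """Checks whether a URL responds successfully to a HEAD request."""

    def exists(self, url: str, timeout_seconds: int = 5) -> bool:
        try:
            # Refuse anything other than HTTP(S) before touching the network.
            if urlparse(url).scheme not in ("http", "https"):
                return False
            request = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
                return response.status < 400
        except Exception:
            # urlopen raises HTTPError for 4xx/5xx responses; DNS failures,
            # refused connections, timeouts, and malformed URLs raise other
            # exceptions. All of them are treated as "does not exist".
            return False
```

Catching every exception is deliberate here, since the method is meant to be called against arbitrary addresses and should never propagate a network failure to the caller.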

Use-case: author-owner can add, edit links to external resources for their e-prints

As an author-owner of an announced e-print (i.e. the submitter, or another user authorized by the submitter) I want to be able to curate external links associated with that e-print, so that I can direct readers to useful information related to my research. I should be able to add new links, and also make revisions/override external links added by other users/clients.

Author-owners should have privileges that other clients of this service don't have, namely the ability to deactivate/suppress links created by other clients. For example, if an automated harvester adds a DOI that is incorrect, the author-owner should have full control over suppressing the incorrect relation or adding a correct DOI that supersedes it.
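
The privilege difference might reduce to a small check like the one below. It is a sketch under a loudly stated assumption: that clients who are not author-owners may suppress only the relations they created themselves. The text above does not say this explicitly, and the function and parameter names are hypothetical.

```python
from typing import Set


def may_suppress(actor: str, relation_creator: str, owners: Set[str]) -> bool:
    """Decide whether ``actor`` may suppress or supersede a relation.

    Assumption (not stated in the use case): non-owners may act only on
    relations that they created themselves.
    """
    if actor in owners:
        # Author-owners have full control over relations on their e-print.
        return True
    return actor == relation_creator
```

For example, with owners {"author-1"}, the author may suppress a DOI added by a harvester, while the harvester may not suppress a link added by the author.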

Use case: API client can add new assertions about links between e-print and external resource

As the developer of a metadata harvester for arXiv e-prints, I want to develop against a RESTful JSON API that supports adding new assertions about relations between e-prints and external resources, so that information that I collect about published versions of a paper (e.g. DOI of Version of Record) can be made available to other users and API clients.

An example application is the feed collectors that monitor publisher APIs for DOIs associated with e-prints.

Another application might be a collector that pulls from ORCID.

To support this use case, we will need an HTTP route that supports POST and/or PUT requests that creates new assertions.
