inspirehep / inspire-schemas

Inspire JSON schemas and utilities to use them.

License: GNU General Public License v2.0

Python 97.96% Shell 1.12% JavaScript 0.92%

inspirehep python json-schema json

inspire-schemas's Introduction

inspire-schemas

Inspirehep schemas and related tools bundle.

Free software: GPLv2 license
Documentation: https://inspire-schemas.readthedocs.io

inspire-schemas's Issues

Record migrators

After versioning is introduced in #9 we will need to support upgrading records. There should be an API similar to:

def record_needs_upgrade(record):
     ...
     return True

def upgrade_record(record):
     ...
     return upgraded_record

That automatically upgrades the provided record.
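A minimal sketch of how such an API could look. It assumes every record carries a hypothetical `schema_version` integer and that one migration function is registered per version step. The `MIGRATIONS` registry, the `LATEST_VERSION` constant and the two example renames are illustrative assumptions, not part of inspire-schemas.

```python
import copy

# Hypothetical: latest known schema version.
LATEST_VERSION = 2


def _v0_to_v1(record):
    # Example step (assumption): rename ``page_nr`` to ``number_of_pages``.
    if 'page_nr' in record:
        record['number_of_pages'] = record.pop('page_nr')
    return record


def _v1_to_v2(record):
    # Example step (assumption): rename ``external_system_numbers``.
    if 'external_system_numbers' in record:
        record['external_system_identifiers'] = record.pop(
            'external_system_numbers')
    return record


# Hypothetical registry: version N maps to the function upgrading N to N + 1.
MIGRATIONS = {0: _v0_to_v1, 1: _v1_to_v2}


def record_needs_upgrade(record):
    """Return True if the record is older than the latest schema version."""
    return record.get('schema_version', 0) < LATEST_VERSION


def upgrade_record(record):
    """Return an upgraded copy of the record, applying each step in order."""
    upgraded_record = copy.deepcopy(record)
    version = upgraded_record.get('schema_version', 0)
    while version < LATEST_VERSION:
        upgraded_record = MIGRATIONS[version](upgraded_record)
        version += 1
        upgraded_record['schema_version'] = version
    return upgraded_record
```

Working on a deep copy keeps the input record untouched, so a caller can compare the record before and after the upgrade.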

builder: support `publication_info.material` field

Builder isn't populating publication_info.material field.

What to do with 773__t?

We have lots of them (https://inspirehep.net/search?ln=en&p=773__t%3A**&wl=0), but no space in the schema to put them. Shall we just discard them?

Page numbers

To be renamed page_number or number_of_pages?
To be enforced to be a simple int.

Make scripts/generate_example_records.js work out of the box

~~Add a package.json with the dependencies~~ Overkill, it's just one dependency.
Amend .gitignore to exclude node_modules
Make scripts/generate_example_records.js executable
Add missing requires

Orphan affiliations

Some ~20K records:
https://inspirehep.net/search?ln=en&p=902%3A**&of=hb&action_search=Search&sf=earliestdate&so=d&wl=0

have orphan affiliations, i.e. affiliation not attached to a specific author but just available for searching.

These are stored in MARC 902__a for Literature. We should preserve this field because it is currently not possible to recompute the affiliation for many of the affected records, due to missing PDFs.

Use only inspire_field_categories and arxiv_field_categories

There's no longer any need to support categories from other sources. We can keep only the INSPIRE and arXiv ones, which simplifies the schema a lot and gets rid of the nefarious 'anyOf'.

external_system_numbers

external_system_numbers renamed to external_system_identifiers,
obsolete disappears.
TeXKeys are moved to a dedicated field. Any obsolete TeXKey is moved to the end of the list.
institutions renamed to scheme

TeXKeys in their own field

TeXKeys are automatically generated, shouldn't be deleted, but can be declared obsolete by a cataloger.

LiteratureBuilder: avoid adding empty fields

In [3]: x = LiteratureBuilder(source='arxiv')

In [4]: x.add_language('')

In [5]: x
Out[5]: LiteratureBuilder(source="arxiv", record={'languages': ['']})
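One possible way to guard against this, sketched on a hypothetical `MiniBuilder` stand-in rather than on the real `LiteratureBuilder` (whose internals are not shown here). The `is_empty` helper and the `_append_to` method are assumptions made for illustration.

```python
def is_empty(value):
    """Return True for None, blank strings and empty containers."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) == 0
    return False


class MiniBuilder(object):
    """Hypothetical stand-in for LiteratureBuilder, for illustration only."""

    def __init__(self, source=None):
        self.source = source
        self.record = {}

    def _append_to(self, field, element):
        # Skip empty values instead of storing them in the record.
        if is_empty(element):
            return
        self.record.setdefault(field, []).append(element)

    def add_language(self, language):
        self._append_to('languages', language)
```

Routing every `add_*` method through a single helper like `_append_to` would keep the emptiness check in one place.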

Add docs on the integration tests

Specifically how to generate the examples for the backwards-compatibility checks.

Schemas will evolve over time, so we should introduce versioning. Each schema should be versioned following semver. Each modification is made by first copying the latest schema into a new file and then changing the copy.

Add keywords field to the builder

The keywords field is not handled by the builder.

Not depending on node

Currently we depend on node in order to generate fake data from jsonschema as noted in
#13 (comment)
We should port this to use a pythonic solution possibly based e.g. on fake-factory with some jsonschema extension.

Privilege INSPIRE categories

Currently INSPIRE categories are stored at the same level as arXiv categories and other externally provided categories.

However, while catalogers are not supposed to touch externally provided categories, they are expected to curate INSPIRE categories. This puts INSPIRE categories in a privileged place.

It is proposed to move INSPIRE categories to a dedicated field, so that they are easier to edit (e.g. autocomplete can be enforced on the exact INSPIRE categories).

Generate valid json from the schemas when packaging

In order to support multiline strings we are using the YAML format for the schemas (see #97). That means we have to generate valid JSON at packaging time in order to distribute them (we will also have to see how to handle this in the tests and such).

Copyright.material is not really populated

When populating it, it's actually overriding the 'url' field:

In [2]: from inspire_schemas.builders import LiteratureBuilder

In [3]: lb = LiteratureBuilder(source='mama', )

In [4]: lb
Out[4]: LiteratureBuilder(source="mama", record={})

In [5]: lb.add_copyright(material='i\'m not a url')

In [6]: lb
Out[6]: LiteratureBuilder(source="mama", record={'copyright': [{'url': "i'm not a url"}]})
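A sketch of the expected behaviour, where each keyword argument ends up under the key of the same name. `build_copyright` is a hypothetical helper and does not reproduce the real signature of `add_copyright`. It only covers the `material`, `url` and `year` keys mentioned in these issues.

```python
def build_copyright(material=None, url=None, year=None):
    """Build a copyright entry, mapping each argument to its own key."""
    candidates = {
        'material': material,
        'url': url,
        'year': year,
    }
    # Drop the arguments that were not provided.
    return {
        key: value for key, value in candidates.items() if value is not None
    }
```

With this mapping, `build_copyright(material="i'm not a url")` produces a dict with a `material` key and no `url` key.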

Create enum for source field

The source.yml is currently a free text field. This will be a problem when we start using this field value as the way to retrieve records from different sources (arXiv, APS, ...) by the merger, since this value will be in the database and should always be the same.

An enum should be created instead.

NOTE: One of the values in the enum should be used for the user submission forms. The forms currently don't populate the source field, but they should start doing that once we have the merger working.

external_system_numbers -> external_identifiers

We shall rename this so that it is more meaningful.

Also TeXKeys are not external and should thus not be stored in this field.

schema: /acquisition_source/datetime does not have a type

The acquisition_source/datetime part of the schema doesn't declare a type; it should be string.

"acquisition_source": {
      "$schema": "http://json-schema.org/schema#", 
      "additionalProperties": false, 
      "description": "Only the first source is stored: if the record later gets enriched with\nmetadata coming from a second source, the `acquisition_source` is not\nupdated.\n\n:MARC: ``541``", 
      "properties": {
        "datetime": {
          "description": "This does not necessarily coincide with the creation date of the\nrecord, as there might be some delay between the moment the\noriginal information is obtained and a record is finally created in\nthe system.\n\n:MARC: ``541__d``", 
          "format": "date-time", 
          "title": "Date on which the metadata was obtained"
        },...
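To catch similar omissions elsewhere, a small check could walk a schema and report every property without a type. This is only a sketch: `find_untyped_properties` is a hypothetical helper, and treating `$ref`, `anyOf`, `oneOf`, `allOf` and `enum` as acceptable substitutes for `type` is an assumption.

```python
def find_untyped_properties(schema, path=''):
    """Yield the paths of properties that declare no type (or equivalent)."""
    skip_keys = ('type', '$ref', 'anyOf', 'oneOf', 'allOf', 'enum')
    for name, subschema in schema.get('properties', {}).items():
        current = '{}/{}'.format(path, name)
        if not any(key in subschema for key in skip_keys):
            yield current
        # Recurse into nested objects.
        for nested in find_untyped_properties(subschema, current):
            yield nested
        # Recurse into array items described by a single schema.
        items = subschema.get('items')
        if isinstance(items, dict):
            for nested in find_untyped_properties(items, current):
                yield nested
```

Run against a schema shaped like the excerpt above, it would report `/acquisition_source/datetime`.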

version_manager: handle inter-repo bug references.

Right now the changelog/release notes only show the bugs specified with closes #XXX, but not the ones with an external repo reference like addresses anotherorg/anotherrepo#XXX. We should show those too.
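A possible way to pick up both kinds of reference is sketched below. The `BUG_REF` pattern and `extract_bug_references` are hypothetical names, and the keyword list (`closes`, `addresses`) is taken from the two examples in this issue. The actual version_manager code may recognise other keywords.

```python
import re

# Keywords and case-insensitive matching are assumptions based on the
# examples in the issue text.
BUG_REF = re.compile(
    r'\b(?P<verb>closes|addresses)\s+'
    r'(?:(?P<repo>[\w.-]+/[\w.-]+))?#(?P<number>\d+)',
    re.IGNORECASE,
)


def extract_bug_references(message):
    """Return (verb, repo, number) tuples; repo is None for local issues."""
    return [
        (
            match.group('verb').lower(),
            match.group('repo'),
            int(match.group('number')),
        )
        for match in BUG_REF.finditer(message)
    ]
```

Keeping the repository as a separate element lets the changelog render local and external references differently.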

schema: HEP 210__a title variations: RPP

@annetteholtkamp commented on Thu Mar 02 2017

The 210 field mostly contains synonyms, expansions of acronyms etc - which can probably be ignored in the future. But it also contains the acronym RPP which is used for the citesummary option to exclude the RPP. These records we need to tag somehow.

Finalize literature mini-schema

Before deployment of Inspire 3 to labs, we need to finalize the part of the schema that is used in the harvesters that are currently on labs, namely user literature suggestions and non-CORE arXiv harvesting.

The concerned keys are:

arXiv harvesting

{'abstracts', 'preprint_date', 'collections', 'external_system_numbers', 'license', 'report_numbers', 'collaborations', 'titles', 'arxiv_eprints', 'public_notes', 'acquisition_source', 'publication_info', 'copyright', 'authors', 'dois', 'page_nr', 'imprints'}

Literature suggestion

{'external_system_numbers', 'accelerator_experiments', 'arxiv_eprints', 'collaboration', 'publication_info', 'acquisition_source', 'license', 'report_numbers', 'public_notes', 'imprints', 'abstracts', 'thesis', 'titles', 'languages', 'thesis_supervisors', 'field_categories', 'dois', 'urls', 'collections', 'title_translations', 'hidden_notes', 'authors'}

Added by workflow

{'core', 'citeable', 'published'}

Union, to go through

what we talk about when we talk about _files

The contents of the _files field for Literature record is supposed to contain the metadata to retrieve the file by invenio-records-files.

The schema we have for it was copied from Zenodo, so it contains the basic info in the invenio-records-files schema, but also some additional Zenodo-specific stuff (previewer, type) that we probably don't need.

The workflow is using this field in yet another way, writing description and doctype there (for arXiv PDF and extracted plots), which are not currently in the schema. This doesn't cause any error now as the results of _files are discarded anyway and never sent to legacy, but we should decide on what information we really want to have there.

@kaplun and @tsgit know how files are handled on legacy and could share their experience.
Discussing with @jacquerie, we identified the following keys that might be useful:

doctype (or document_type?): to signal what kind of document is attached. This would be an enum with values fulltext, plot, what else?
mime_type: how this document is encoded, which might warrant a different handling (e.g. PDF vs XML for a fulltext).
hidden: a flag to indicate whether this file is publicly visible (would be true for fulltexts used for indexing that we may not serve directly to our users).

Pumping up flags (CORE, Citeable, Refereed)

@michamos and I have identified that the flags CORE, Citeable and Refereed are particular because they can be set by algorithms (that could evolve over time), but can also be overridden by a curator.

Since it's not currently possible to identify who set the flag, we have the issue of:

identifying when a curator should override the flag for a group of records, because it's not clear whether the flag was set by an algorithm or by a curator
ditto for algorithms
we don't know when an algorithm should reprocess the flag.

We would propose that these flags be augmented with source information. E.g.:

"citeable": {
    "flag": true,
    "source": "CURATOR"
}

"core": {
    "flag": false,
    "source": "core-guesser"
}

Alternatively we could have a list of objects:

"citeable": [
   {
       "flag": true,
       "source": "CURATOR"
   },
   {
       "flag": false,
       "source": "citeable-guesser"
   }
]

Possibly sorted chronologically (e.g. latest first), where the final value is computed at runtime (e.g. CURATOR has precedence over an algorithm).
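Computing the final value at runtime could look roughly like this. `resolve_flag` is a hypothetical helper. Falling back to the most recent algorithmic entry when no curator entry exists is an assumption, as is returning `None` for an empty list.

```python
def resolve_flag(entries, curator_source='CURATOR'):
    """Return the effective flag value, or None if there are no entries.

    ``entries`` is assumed to be sorted latest first.
    """
    # A curator decision always wins over any algorithm.
    for entry in entries:
        if entry['source'] == curator_source:
            return entry['flag']
    # No curator decision: use the most recent algorithmic value.
    return entries[0]['flag'] if entries else None
```

An algorithm could use the same list to decide whether to reprocess: if a `CURATOR` entry is present, it leaves the flag alone.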

The downside of this approach is that it adds quite some complexity.

Better ideas? Are we solving the wrong problem?

@annetteholtkamp, @jacquerie, @StellaCh ?

Move from inspire-next repo jsonschema validity test

As pointed out by @jacquerie, the test that ensures that our schemas are jsonschema compliant is sitting on:
https://github.com/inspirehep/inspire-next/blob/a6c641e860a9e7c357e30af5733edef9546afb76/tests/unit/records/test_records_jsonschemas.py

This should be moved to this repo, so that we don't commit invalid schemas.

Add versioning documentation

author schemas missing fields

The following fields are used in the author forms but are not in the schema. One side effect is that those fields are not visible in the record editor, and so are not editable from the Holding Pen.

_private_note
collections
_degree_type
_rank

Merge classification_number into keywords

classification_number is actually very similar to keywords to the point that we can simply merge them together.

Note that at display time:
PACS should be displayed in a human-friendly way. PDG should link to the PDG website.

builder: support `dois.material` field

Builder isn't populating dois.material field.

keywords for harvests

I think there are no keywords at all for arXiv harvest.
All keywords I found are for user-submissions, e.g.
https://labs.inspirehep.net/holdingpen/list/?page=1&size=10&q=metadata.keywords.value:model

FYI:
If we have the fulltext (e.g. arXiv) we run
python bibclassify_cli.py -s -n 35 -k HEPontCore.rdf fulltext.pdf

If we have only metadata (e.g. for journals) we run
python bibclassify_cli.py -s -n 10 -k HEPontCore.rdf title_abstract_keywords.txt

raw_value in authors.affiliations

Besides the current value, record and curated_relation, a raw_value field is needed to hold the value automatically extracted by tools. This field will not be editable, but it is helpful for populating the value field (which contains the canonical name that links to a record in the Institutions database).

https://github.com/inspirehep/inspire-schemas/blob/master/inspire_schemas/records/hep.json#L79

`thesis_supervisor` is going to be superseded by `supervisor` `inspire_role`

after merging #57, required for #58

Reference simplification

Originally we designed references to mimic mini-records. It looks like catalogers will still want to curate them, so we should simplify the structure where possible to make it look nice in the record editor.

PR coming soon.

Add INIS keyword schema

INIS has a vocabulary of keywords that some of our records use, for example: https://inspirehep.net/record/132217/export/xme.

@annetteholtkamp says that we need to add it to

inspire-schemas/inspire_schemas/records/hep.yml

Lines 836 to 877 in efa2996

schema:
    description: |-
        Describes to which vocabulary the keyword in :ref:`value`
        belongs.

        ``INSPIRE``
            :MARC: ``695__2:INSPIRE``

            The keyword has been assigned by Inspire, and
            belongs to its vocabulary.

        ``JACOW``
            :MARC: ``6531_2:JACOW``

            The keyword is part of the `Joint Accelerator
            Conference Website (JACoW) vocabulary
            <http://jacow.org/Tools/Keywords>`_.

        ``PACS``
            :MARC: ``084__2:PACS``

            The keyword is a number from the `Physics and
            Astronomy Classification Scheme (PACS)
            <https://publishing.aip.org/publishing/pacs/pacs-2010-regular-edition>`_.

        ``PDG``
            :MARC: ``084__2:PDG``

            The keyword is a `PDG Indentifier
            <http://pdg.lbl.gov/2016/pdgid/PDGIdentifiers.html>`_.

        .. note::

            If not present, the keyword is a free-form keyword,
            not necessarily part of any vocabulary.
    enum:
    - INSPIRE
    - JACOW
    - PACS
    - PDG
    title: Keyword vocabulary
    type: string

schemas: add noAdditionalProperties everywhere

Continues inspirehep/inspire-next#1504.

Support `copyright.year` field in the builder

The builder should support the copyright.year field.

hack: backport hidden publication_info

Commit b28c499 should be backported to become inspire-schemas==31.1.0.

Generate documentation based on JSONSchema

The documentation should have an explicit chapter that is automatically generated after the JSONSchema.

Please feel free to suggest best practices and what this should look like.

The aim of this project is to allow anybody to discover which fields exist and how to use them, and their structure, without having to open the JSON.

arXiv harvesting uses `collaborations`, schema `collaboration`

This should be harmonized, one way or another, for #58

Use google docstyle docs

Add the sphinx plugin for it (see https://github.com/inveniosoftware-contrib/json-merger/tree/master/json_merger for an example)

Check whether it can validate that the params defined in the docstring match the params in the function; that would be great.

builder: support `license.material`, `license.imposing` fields

Builder isn't populating license.material, license.imposing fields.

Rewrite the documentation of the builder

Depends on #107: rewrite/amend/write the docstrings of the builder methods in Google docstring style and verify the content (with a curator if needed).

Also make sure to generate a nice page for it so that builder users can easily access and consult it.

Unit tests for the builder

We do not have unit tests for the builder.
This is the list of unit tests that we should write:

Are TeXkeys really in 999C5k?

Schema says (inspire-schemas/inspire_schemas/records/elements/reference.yml, line 203 in 1db9777: :MARC: ``999C5k``) that 999C5k contains TeXkeys, but a search on legacy disproves this: https://inspirehep.net/search?p=999C5k%3A**.

Is this actually a new thing that is going to happen with the improvements you made to refextract, @michamos?

schemas: don't restrict names with enum or format

Continues inspirehep/inspire-next#1468.

hack: backport book builder functions

Commit f8133d3 should be backported to become inspire-schemas==31.2.0.

~~Depends on #143.~~

normalize name

In the utils.py module, when normalizing names, we just need to remove the space wherever we have a '. ' pair in the first name (that is, the second element after splitting a string like 'Caro, D. J.' by ',').
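A minimal sketch of that rule. `normalize_name` here is a standalone illustration and not the function in utils.py. Using `str.partition` (so everything after the first comma counts as the first name) and re-joining with ', ' are assumptions.

```python
def normalize_name(name):
    """Remove the space after each '.' in the first-name part of a name."""
    last, separator, first = name.partition(',')
    if not separator:
        # No comma: nothing to normalize under this rule.
        return name
    first = first.strip().replace('. ', '.')
    return '{}, {}'.format(last.strip(), first)
```

For example, `normalize_name('Caro, D. J.')` gives `'Caro, D.J.'`, while a name without initials such as `'Caro, David'` is returned unchanged.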

jsonschema: harmonize document_type Vs. collection

@kaplun commented on Wed Jun 15 2016

Currently the collection field is just a porting of MARC 980. E.g.:

{"collections": [
    {"primary": "CORE"},
    {"primary": "Book"},
    {"primary": "HEP"},
    {"primary": "Citeable"}
]}

On the other hand the concept of document_type is managed by the enhancer facet_inspire_doc_type. E.g.:

{"facet_inspire_doc_type": ["book"]}

This is suboptimal.

Citeable should become a flag and be added at indexing time based on other values
CORE should be declared as a flag and be available in all schemas
HEP is actually redundant since it represents the fact that this is a record from Literature
facet_inspire_doc_type should become document_type and be populated by dojson, rather than enhanced before indexing.

@kaplun commented on Thu Aug 25 2016

I think we should bump priority of this one, since category is really scattered around the code base in a wrong way.

@jacquerie commented on Fri Aug 26 2016

This needs a spec. The thing I refactored in https://github.com/inspirehep/inspire-next/blob/25cba484c652d21c112628c4967e684c02d6fcfd/inspirehep/modules/records/receivers.py#L120-L210 is a 1 to 1 correspondence with the code that was there before, but makes no sense to me.

You need to define precisely:

What should we do with collections
What are the allowable document_types
How are the 980__a values mapped to those allowable values
What is the algorithm that sets Citeable

@kaplun commented on Mon Sep 19 2016

What should we do with collections

Should disappear.

What are the allowable document_types

Exactly the keys that you have defined in the two tables in the docstring of populate_inspire_document_type().

How are the 980__a values mapped to those allowable values

Those that are document types are mapped to document types (possibly with the same value as in 980). Those that are flags, such as citeable and core should be mapped to a corresponding flag. (I think we have it for core already). deleted is also mapped to a deleted field.

What is the algorithm that sets Citeable:

Mmh. I guess it's more the question of what is not citeable. I see by default anything that comes from arXiv is citeable. @annetteholtkamp can you help here?

@jmartinm commented on Wed Oct 05 2016

Now that inspirehep/inspire-next#1589 is merged, and once we get rid of the collections field, note that we will still have a _collections field managed by invenio-collections.

This field gets populated based on a query matching the record (see config) so that config will have to be amended for the queries to match the new document_type field.

@jmartinm commented on Thu Oct 06 2016

Collection fields are:

   1100059 HEP 
    832449 Citeable 
    698265 CORE 
    584001 Published 
    401801 arXiv 
    312449 ConferencePaper 
     57480 Arxiv 
     51674  
     26942 Thesis 
     27168 Review 
     10148 Lectures 
      5637 NOTE 
      7727 Proceedings 
      7507 noncore 
      4643 THESIS 
      3488 Introductory 
      4003 Withdrawn 
      3982 Hep 
      3344 Book 
       172 D0-PRELIMINARY-NOTE 
      2069 BOOK 
      1239 NONCORE 
      1240 PROCEEDINGS 
      1115 citeable 
       891 BookChapter 
       452 Conference 
        33 Core 
        11 REPORT 
         5 Preprint 
         6 published 
         3 core 
         3 Note 
         2 Noncore 
         2 Report 
         1 PUBLISHED 
         1 thesis 
         1 book 
         1 proceedings 
         1 NonCore 
         1 Conferencepaper 
         1 Accelerators 
         1 Proceddings

Our schema says possible document types are:

[
"Published",
 "arXiv",
 "ActivityReport",
 "ConferencePaper",
 "Thesis",
 "Review",
 "Lectures",
 "Note",
 "Proceedings",
 "Introductory",
 "Book",
 "BookChapter",
 "Report"
  ],

And our current document type facet has the following mapping (from 980__a value to facet value):

        'published': 'peer reviewed',
        'thesis': 'thesis',
        'book': 'book',
        'bookchapter': 'book chapter',
        'proceedings': 'proceedings',
        'conferencepaper': 'conference paper',
        'note': 'note',
        'report': 'report',
        'activityreport': 'activity report',
        'lectures': 'lectures',
        'review': 'review',

'preprint' if no journal info

So for this issue to proceed we would need:

For each value in 980__a what document type from our schema to assign.
Should the document type in the schema be human readable ActivityReport vs Activity Report for example.
Complete the enum in our jsonschema to accommodate document types such as the hidden collections: Hal Hidden, notes from different experiments and so on.
Do we still need the receiver to convert the document types into a more 'user facing' facet, with values such as Preprint or Peer Reviewed which are not mentioned as document types in the schema.

@kaplun commented on Wed Oct 05 2016

@jmartinm Thanks. I'd suggest @annetteholtkamp et al. can help us remove all the outliers from 980__a

@kaplun commented on Thu Oct 06 2016

For each value in 980__a what document type from our schema to assign.

1100059 HEP -> Literature Schema
    832449 Citeable -> citeable flag
    698265 CORE -> 'core' flag: True
    584001 Published -> published flag
    401801 arXiv -> ignore (redundant)
    312449 ConferencePaper -> 'conference paper'
     57480 Arxiv ->  ignore (redundant)
     51674  -> ignore (W00t?)
     26942 Thesis -> 'thesis'
     27168 Review -> 'review'
     10148 Lectures -> 'lectures'
      5637 NOTE -> 'note'
      7727 Proceedings -> 'proceedings'
      7507 noncore -> 'core' flag: False
      4643 THESIS -> 'thesis'
      3488 Introductory -> 'introductory'
      4003 Withdrawn -> 'withdrawn' flag
      3982 Hep -> Literature Schema
      3344 Book -> 'book'
       172 D0-PRELIMINARY-NOTE 
       891 BookChapter -> 'book chapter'
       452 Conference -> Wot? In HEP?
        11 REPORT -> 'report'
         5 Preprint -> ignore redundant
         6 published -> ignore redundant
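The mapping above could be expressed roughly as follows. The three tables are transcribed from the list in this comment. Matching values case-insensitively is an assumption, and it means a lowercase 'published' sets the flag instead of being ignored. `map_collections` and the choice to return unmapped values (such as Conference or D0-PRELIMINARY-NOTE) for manual review are also assumptions.

```python
# 980__a values (lowercased) that become document types.
DOCUMENT_TYPES = {
    'conferencepaper': 'conference paper',
    'thesis': 'thesis',
    'review': 'review',
    'lectures': 'lectures',
    'note': 'note',
    'proceedings': 'proceedings',
    'introductory': 'introductory',
    'book': 'book',
    'bookchapter': 'book chapter',
    'report': 'report',
}

# 980__a values (lowercased) that become flags: (flag name, flag value).
FLAGS = {
    'citeable': ('citeable', True),
    'core': ('core', True),
    'noncore': ('core', False),
    'published': ('published', True),
    'withdrawn': ('withdrawn', True),
}

# Values that are redundant or only identify the Literature schema.
IGNORED = {'', 'hep', 'arxiv', 'preprint'}


def map_collections(values):
    """Split 980__a values into document types, flags and leftovers."""
    document_types, flags, unmapped = [], {}, []
    for value in values:
        key = value.strip().lower()
        if key in IGNORED:
            continue
        if key in DOCUMENT_TYPES:
            # Avoid duplicates such as 'Book' and 'BOOK' on one record.
            if DOCUMENT_TYPES[key] not in document_types:
                document_types.append(DOCUMENT_TYPES[key])
        elif key in FLAGS:
            name, flag = FLAGS[key]
            flags[name] = flag
        else:
            unmapped.append(value)
    return document_types, flags, unmapped
```

A flag that never appears in the input is simply absent from the returned dict.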

Should the document type in the schema be human readable ActivityReport vs Activity Report for example.

I believe so: anyway cataloguers will edit records either through scripts or through the editor, which will enforce the accepted values. Therefore there is no need to introduce a simplified spelling to avoid typos.

Complete the enum in our jsonschema to accommodate document types such as the hidden collections: Hal Hidden, notes from different experiments and so on.

I will create a dedicated issue for that.

Do we still need the receiver to convert the document types into a more 'user facing' facet, with values such as Preprint or Peer Reviewed which are not mentioned as document types in the schema.

I believe given the above point on how to spell document types, the answer is nope.

@annetteholtkamp commented on Thu Oct 13 2016

On 05 Oct 2016, at 12:39, Samuele Kaplun wrote:

1100059 HEP -> Literature Schema
832449 Citeable -> citeable flag
698265 CORE -> 'core' flag: True
584001 Published -> published flag
401801 arXiv -> ignore (redundant)
why is that redundant?
312449 ConferencePaper -> 'conference paper'
57480 Arxiv -> ignore (redundant)
51674 -> ignore (W00t?)

what is this?

 26942 Thesis -> 'thesis'
 27168 Review -> 'review'
 10148 Lectures -> 'lectures'
  5637 NOTE -> 'note'
  7727 Proceedings -> 'proceedings'
  7507 noncore -> 'core' flag: False

Is there only true and false, or also undefined ?

  4643 THESIS -> 'thesis'
  3488 Introductory -> 'introductory'
  4003 Withdrawn -> 'withdrawn' flag
  3982 Hep -> Literature Schema
  3344 Book -> 'book'
   172 D0-PRELIMINARY-NOTE 
We should ask Heath whether this tag is still necessary if a record is in HEP
891 BookChapter -> 'book chapter'
452 Conference -> Wot? In HEP?

Yes, we never managed to clean them all up. Most of them are probably conf papers - but needs to be checked.

    11 REPORT -> 'report'
     5 Preprint -> ignore redundant
     6 published -> ignore redundant
Annette

@kaplun commented on Thu Oct 13 2016

401801 arXiv -> ignore (redundant)
why is that redundant?

We don't need to say arXiv. We already know from the arXiv ID.

 51674  -> ignore (W00t?)

what is this?

A collection with empty value 😄

  7507 noncore -> 'core' flag: False
Is there only true and false, or also undefined ?

Yes, all flags have also undefined values.

Rethink where/when/how to declare a record citable

Right now there's a small bit of logic in the builder that decides whether the record is citable or not (it can be overridden if need be), but maybe that's not the place or the way to set that flag. Alternatives could be a periodic bibcheck task, and/or moving the check to a standalone function that's called dynamically.
