
inspire-schemas's Introduction

inspire-schemas's People

Contributors

ammirate, chris-asl, david-caro, drjova, glignos, harunurhan, jacquerie, jmartinm, kaplun, lowks, michamos, miguelgrc, mjedr, monaawi, nooraangelva, pascalegn, pazembrz, rikirenz, spirosdelviniotis, szymonlopaciuk, tsgit, vbalbp, zzacharo


inspire-schemas's Issues

Record migrators

After versioning is introduced in #9 we will need to support upgrading records. There should be an API similar to:

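def record_needs_upgrade(record):
    """Return True if the record was written against an older schema version."""
    ...
    return True

def upgrade_record(record):
    """Return the record upgraded to the latest schema version."""
    upgraded_record = ...  # upgrade logic to be defined
    return upgraded_record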

That automatically upgrades the provided record.
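
A minimal sketch of how such an API could work, assuming each record carries an integer schema version and each migration step upgrades by exactly one version. The `_schema_version` key, `LATEST_VERSION` and the example migration are hypothetical and are not part of inspire-schemas.

```python
import copy

# Hypothetical: records store an integer schema version under this key.
VERSION_KEY = "_schema_version"
LATEST_VERSION = 2


def _v1_to_v2(record):
    """Illustrative step only: rename ``page_nr`` to ``number_of_pages``."""
    if "page_nr" in record:
        record["number_of_pages"] = record.pop("page_nr")
    return record


# Maps a version to the function that upgrades a record from it to the next one.
MIGRATIONS = {1: _v1_to_v2}


def record_needs_upgrade(record):
    """Return True if the record is older than the latest schema version."""
    return record.get(VERSION_KEY, 1) < LATEST_VERSION


def upgrade_record(record):
    """Return an upgraded copy of the record, applying each step in order."""
    upgraded_record = copy.deepcopy(record)
    version = upgraded_record.get(VERSION_KEY, 1)
    while version < LATEST_VERSION:
        upgraded_record = MIGRATIONS[version](upgraded_record)
        version += 1
        upgraded_record[VERSION_KEY] = version
    return upgraded_record
```

Working on a copy keeps the original record intact, so a failed migration step cannot leave a half-upgraded record behind.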

Page numbers

To be renamed page_number or number_of_pages?
To be enforced to be a simple int.

external_system_numbers

  • external_system_numbers is renamed to external_system_identifiers.
  • The obsolete key disappears.
  • TeXKeys are moved to a dedicated field. Any obsolete TeXKey is moved to the end of the list.
  • institutions is renamed to scheme.

TeXKeys in their own field

TeXKeys are automatically generated, shouldn't be deleted, but can be declared obsolete by a cataloger.

Schema versioning

Schemas will evolve over time, so we should introduce versioning. Each schema should be versioned using semantic versioning (semver). Each modification is made by first copying the latest schema into a new file and then changing that copy.
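
The copy-then-modify step could be supported by a small helper that works out the next file name. This is only a sketch: the `<name>-<major>.<minor>.<patch>.yml` naming scheme and both function names are assumptions made for illustration.

```python
import re

# Assumed naming scheme for illustration: "<name>-<major>.<minor>.<patch>.yml".
_VERSIONED = re.compile(
    r"^(?P<name>.+)-(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)\.yml$"
)


def latest_version(filenames, name):
    """Return the highest (major, minor, patch) found for ``name``, or None."""
    versions = []
    for filename in filenames:
        match = _VERSIONED.match(filename)
        if match and match.group("name") == name:
            versions.append(
                tuple(int(match.group(part)) for part in ("major", "minor", "patch"))
            )
    return max(versions) if versions else None


def next_filename(filenames, name, bump="minor"):
    """Return the file name the latest schema should be copied to before editing."""
    major, minor, patch = latest_version(filenames, name) or (0, 0, 0)
    if bump == "major":
        major, minor, patch = major + 1, 0, 0
    elif bump == "minor":
        minor, patch = minor + 1, 0
    elif bump == "patch":
        patch += 1
    else:
        raise ValueError("bump must be 'major', 'minor' or 'patch'")
    return "{}-{}.{}.{}.yml".format(name, major, minor, patch)
```

Versions are compared as integer tuples, so 1.10.0 correctly sorts after 1.2.0.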

Not depending on node

Currently we depend on node to generate fake data from the JSON Schema, as noted in #13 (comment).
We should port this to a pure-Python solution, possibly based on fake-factory with some jsonschema extension.
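
As a rough idea of what a Python replacement could look like, the sketch below generates random data for a small subset of JSON Schema keywords using only the standard library. The function name is hypothetical. It does not use fake-factory, which could replace the plain random-string fallback to produce more realistic values.

```python
import random
import string


def fake_from_schema(schema, rng=None):
    """Return random data for a small subset of JSON Schema keywords."""
    rng = rng or random.Random()
    if "enum" in schema:
        return rng.choice(schema["enum"])
    schema_type = schema.get("type", "string")
    if schema_type == "object":
        return {
            key: fake_from_schema(subschema, rng)
            for key, subschema in schema.get("properties", {}).items()
        }
    if schema_type == "array":
        min_items = schema.get("minItems", 1)
        max_items = schema.get("maxItems", max(min_items, 3))
        length = rng.randint(min_items, max_items)
        return [fake_from_schema(schema.get("items", {}), rng) for _ in range(length)]
    if schema_type == "integer":
        return rng.randint(schema.get("minimum", 0), schema.get("maximum", 100))
    if schema_type == "number":
        return rng.uniform(schema.get("minimum", 0), schema.get("maximum", 100))
    if schema_type == "boolean":
        return rng.choice([True, False])
    if schema_type == "null":
        return None
    # Fall back to a short random lowercase string.
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
```

Keywords such as `$ref`, `pattern`, `format` and `required` are ignored here, so the output is not guaranteed to validate against a full schema.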

Privilege INSPIRE categories

Currently INSPIRE categories are stored at the same level as arXiv categories and other externally provided categories.

However, while catalogers are not supposed to touch externally provided categories, they are expected to curate INSPIRE categories. This puts INSPIRE categories in a privileged position.

It is proposed to move INSPIRE categories to a dedicated field, so that they are easier to edit (e.g. autocomplete can be enforced on the exact INSPIRE categories).

Generate valid json from the schemas when packaging

In order to support multiline strings we are using the YAML format for the schemas (see #97). This means that we have to generate valid JSON at packaging time in order to distribute them (we will also have to see how to handle this in the tests and elsewhere).

Copyright.material is not really populated

When populating it, it's actually overriding the 'url' field:

In [2]: from inspire_schemas.builders import LiteratureBuilder

In [3]: lb = LiteratureBuilder(source='mama', )

In [4]: lb
Out[4]: LiteratureBuilder(source="mama", record={})

In [5]: lb.add_copyright(material='i\'m not a url')

In [6]: lb
Out[6]: LiteratureBuilder(source="mama", record={'copyright': [{'url': "i'm not a url"}]})

Create enum for source field

The source.yml is currently a free text field. This will be a problem when we start using this field value as the way to retrieve records from different sources (arXiv, APS, ...) by the merger, since this value will be in the database and should always be the same.

An enum should be created instead.

NOTE: One of the values in the enum should be used for the user submission forms. The forms currently don't populate the source field, but they should start doing that once we have the merger working.
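
A sketch of what the enum and its check could look like. The three values below are placeholders chosen for illustration, and `validate_source` is a hypothetical helper. The real list of values would have to be agreed on together with the merger.

```python
# Hypothetical values; the real list is still to be decided.
SOURCE_SCHEMA = {
    "type": "string",
    "enum": ["arXiv", "APS", "submitter"],
}


def validate_source(value, schema=SOURCE_SCHEMA):
    """Raise ValueError unless ``value`` is one of the allowed sources."""
    if not isinstance(value, str) or value not in schema["enum"]:
        raise ValueError(
            "{!r} is not a valid source, expected one of {}".format(
                value, schema["enum"]
            )
        )
    return value
```

The comparison is case-sensitive on purpose: the point of the enum is that the stored value is always spelled the same way.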

schema: /acquisition_source/datetime does not have a type

The acquisition_source/datetime part of the schema does not declare a type. It should be string:

"acquisition_source": {
      "$schema": "http://json-schema.org/schema#", 
      "additionalProperties": false, 
      "description": "Only the first source is stored: if the record later gets enriched with\nmetadata coming from a second source, the `acquisition_source` is not\nupdated.\n\n:MARC: ``541``", 
      "properties": {
        "datetime": {
          "description": "This does not necessarily coincide with the creation date of the\nrecord, as there might be some delay between the moment the\noriginal information is obtained and a record is finally created in\nthe system.\n\n:MARC: ``541__d``", 
          "format": "date-time", 
          "title": "Date on which the metadata was obtained"
        },...
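
Omissions like this one could be caught automatically. The sketch below walks a schema and reports every property that declares neither a `type` nor one of the keywords that can stand in for it. The function name and the list of accepted keywords are assumptions.

```python
def properties_without_type(schema, path=""):
    """Yield the path of every property that lacks a ``type``."""
    for name, subschema in schema.get("properties", {}).items():
        subpath = "{}/{}".format(path, name)
        # A ``$ref``, ``enum`` or combinator can legitimately replace ``type``.
        alternatives = ("type", "$ref", "enum", "anyOf", "oneOf", "allOf")
        if not any(key in subschema for key in alternatives):
            yield subpath
        # Recurse into nested objects and into the items of arrays.
        yield from properties_without_type(subschema, subpath)
        if isinstance(subschema.get("items"), dict):
            yield from properties_without_type(subschema["items"], subpath)
```

Run against a schema shaped like the fragment above, it would report `/acquisition_source/datetime`.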

version_manager: handle inter-repo bug references.

Right now the changelog/release notes only show the bugs specified with closes #XXX, but not the ones that carry an external repo reference such as addresses anotherorg/anotherrepo#XXX. We should show those too.
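
One possible way to pick up both forms is a single regular expression with an optional repository prefix. This is a sketch, not the version_manager implementation: the keyword list (`closes`, `fixes`, `addresses`) and the function name are assumptions.

```python
import re

# Matches "closes #12" as well as "addresses anotherorg/anotherrepo#34".
# The keyword list is an assumption; extend it to whatever is accepted.
BUG_REFERENCE = re.compile(
    r"\b(?P<keyword>closes|fixes|addresses)\s+"
    r"(?P<repo>[\w.-]+/[\w.-]+)?#(?P<number>\d+)",
    re.IGNORECASE,
)


def find_bug_references(message):
    """Return (repo, number) pairs; repo is None for same-repository references."""
    return [
        (match.group("repo"), int(match.group("number")))
        for match in BUG_REFERENCE.finditer(message)
    ]
```

For example, a commit message containing both `Closes #12` and `addresses anotherorg/anotherrepo#34` yields one pair without a repository and one with it.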

Finalize literature mini-schema

Before deployment of Inspire 3 to labs, we need to finalize the part of the schema that is used in the harvesters that are currently on labs, namely user literature suggestions and non-CORE arXiv harvesting.

The concerned keys are:

arXiv harvesting

{'abstracts', 'preprint_date', 'collections', 'external_system_numbers', 'license', 'report_numbers', 'collaborations', 'titles', 'arxiv_eprints', 'public_notes', 'acquisition_source', 'publication_info', 'copyright', 'authors', 'dois', 'page_nr', 'imprints'}

Literature suggestion

{'external_system_numbers', 'accelerator_experiments', 'arxiv_eprints', 'collaboration', 'publication_info', 'acquisition_source', 'license', 'report_numbers', 'public_notes', 'imprints', 'abstracts', 'thesis', 'titles', 'languages', 'thesis_supervisors', 'field_categories', 'dois', 'urls', 'collections', 'title_translations', 'hidden_notes', 'authors'}

Added by workflow

{'core', 'citeable', 'published'}

Union, to go through

  • abstracts
  • accelerator_experiments
  • acquisition_source
  • arxiv_eprints
  • authors
  • citeable
  • collaboration
  • collaborations
  • collections
  • copyright
  • core
  • dois
  • external_system_numbers
  • field_categories
  • hidden_notes
  • imprints
  • languages
  • license
  • page_nr
  • preprint_date
  • public_notes
  • publication_info
  • published
  • report_numbers
  • thesis
  • thesis_supervisors
  • title_translations
  • titles
  • urls
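
The union above can be reproduced directly from the three sets, which also makes it easy to spot keys that differ only by a trailing "s". The variable and function names are chosen here for illustration.

```python
ARXIV_HARVESTING = {
    'abstracts', 'preprint_date', 'collections', 'external_system_numbers',
    'license', 'report_numbers', 'collaborations', 'titles', 'arxiv_eprints',
    'public_notes', 'acquisition_source', 'publication_info', 'copyright',
    'authors', 'dois', 'page_nr', 'imprints',
}
LITERATURE_SUGGESTION = {
    'external_system_numbers', 'accelerator_experiments', 'arxiv_eprints',
    'collaboration', 'publication_info', 'acquisition_source', 'license',
    'report_numbers', 'public_notes', 'imprints', 'abstracts', 'thesis',
    'titles', 'languages', 'thesis_supervisors', 'field_categories', 'dois',
    'urls', 'collections', 'title_translations', 'hidden_notes', 'authors',
}
ADDED_BY_WORKFLOW = {'core', 'citeable', 'published'}

# Sorted union of every key used by the harvesters and the workflow.
ALL_KEYS = sorted(ARXIV_HARVESTING | LITERATURE_SUGGESTION | ADDED_BY_WORKFLOW)


def near_duplicates(keys):
    """Return keys whose plural form (key + "s") is also present."""
    keys = set(keys)
    return sorted(key for key in keys if key + "s" in keys)
```

Here `near_duplicates(ALL_KEYS)` flags `collaboration`, since `collaborations` is also in use.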

what we talk about when we talk about _files

The _files field of a Literature record is supposed to contain the metadata that invenio-records-files needs to retrieve the file.

The schema we have for it was copied from Zenodo, so it contains the basic info from the invenio-records-files schema, but also some additional Zenodo-specific stuff (previewer, type) that we probably don't need.

The workflow is using this field in yet another way, writing description and doctype there (for the arXiv PDF and extracted plots), which are not currently in the schema. This doesn't cause any error now, as the contents of _files are discarded anyway and never sent to legacy, but we should decide what information we really want to have there.

@kaplun and @tsgit know how files are handled on legacy and could share their experience.
Discussing with @jacquerie, we identified the following keys that might be useful:

  • doctype (or document_type?): to signal what kind of document is attached. This would be an enum with values fulltext, plot, what else?
  • mime_type: how this document is encoded, which might warrant a different handling (e.g. PDF vs XML for a fulltext).
  • hidden: a flag to indicate whether this file is publicly visible (would be true for fulltexts used for indexing that we may not serve directly to our users).

Pumping up flags (CORE, Citeable, Refereed)

@michamos and I have identified that the flags CORE, Citeable and Refereed are special: they can be set by algorithms (which may evolve over time), but can also be overridden by a curator.

Since it is currently not possible to identify who set a flag, we have the following issues:

  • a curator cannot tell whether to override the flag for a group of records, because it is unclear whether the flag was set by an algorithm or by a curator
  • ditto for algorithms
  • we don't know when an algorithm should reprocess the flag.

We propose that these flags be augmented with source information. E.g.:

"citeable": {
    "flag": true,
    "source": "CURATOR"
}

or

"core": {
    "flag": false,
    "source": "core-guesser"
}

Alternatively we could have a list of objects:

"citeable": [
    {
        "flag": true,
        "source": "CURATOR"
    },
    {
        "flag": false,
        "source": "citeable-guesser"
    }
]

Possibly sorted chronologically (e.g. latest first), where the final value is computed at runtime (e.g. CURATOR has precedence over an algorithm).

The downside of this approach is that it adds quite some complexity.
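
To give a feeling for the runtime computation in the list variant, here is a minimal sketch. It assumes the entries are sorted latest first and that any CURATOR entry wins over algorithmic ones. `resolve_flag` is a hypothetical name.

```python
CURATOR = "CURATOR"


def resolve_flag(entries):
    """Return the effective flag value, or None if no entry sets it.

    ``entries`` is a list of {"flag": bool, "source": str}, sorted latest first.
    """
    # The most recent curator decision takes precedence over any algorithm.
    for entry in entries:
        if entry["source"] == CURATOR:
            return entry["flag"]
    # No curator decision: fall back to the most recent algorithmic one.
    return entries[0]["flag"] if entries else None
```

Returning None for an empty list keeps the "undefined" state distinct from an explicit false.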

Better ideas? Are we solving the wrong problem?

@annetteholtkamp, @jacquerie, @StellaCh ?

author schemas missing fields

The following fields are used in the author forms but are not in the schema. One side effect is that those fields are not visible in the record editor, and so cannot be edited from the Holding Pen.

  • _private_note
  • collections
  • _degree_type
  • _rank

Merge classification_number into keywords

classification_number is so similar to keywords that we can simply merge the two.

Note that at display time:
PACS should be displayed in a human-friendly way, and PDG should link to the PDG website.

Reference simplification

Originally we designed references to mimic mini-records. It looks like catalogers will still want to curate them, so we should simplify the structure where possible to make it look good in the record editor.

PR coming soon.

Add INIS keyword schema

INIS has a vocabulary of keywords that some of our records use, for example: https://inspirehep.net/record/132217/export/xme.

@annetteholtkamp says that we need to add it to

schema:
    description: |-
        Describes to which vocabulary the keyword in :ref:`value`
        belongs.

        ``INSPIRE``
            :MARC: ``695__2:INSPIRE``

            The keyword has been assigned by Inspire, and
            belongs to its vocabulary.

        ``JACOW``
            :MARC: ``6531_2:JACOW``

            The keyword is part of the `Joint Accelerator
            Conference Website (JACoW) vocabulary
            <http://jacow.org/Tools/Keywords>`_.

        ``PACS``
            :MARC: ``084__2:PACS``

            The keyword is a number from the `Physics and
            Astronomy Classification Scheme (PACS)
            <https://publishing.aip.org/publishing/pacs/pacs-2010-regular-edition>`_.

        ``PDG``
            :MARC: ``084__2:PDG``

            The keyword is a `PDG Indentifier
            <http://pdg.lbl.gov/2016/pdgid/PDGIdentifiers.html>`_.

        .. note::

            If not present, the keyword is a free-form keyword,
            not necessarily part of any vocabulary.
    enum:
    - INSPIRE
    - JACOW
    - PACS
    - PDG
    title: Keyword vocabulary
    type: string
.

Generate documentation based on JSONSchema

The documentation should have an explicit chapter that is automatically generated from the JSONSchema.

Please feel free to suggest best practices and what this should look like.

The aim of this project is to allow anybody to discover which fields exist, how they are structured and how to use them, without having to open the JSON.
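
As a starting point for discussion, the sketch below turns the top-level properties of a schema into a reStructuredText definition list. It is a hypothetical helper, not an existing tool, and it ignores nested properties for brevity.

```python
def schema_to_rst(schema, title):
    """Return reStructuredText describing each top-level property of ``schema``."""
    lines = [title, "=" * len(title), ""]
    for name in sorted(schema.get("properties", {})):
        field = schema["properties"][name]
        lines.append("``{}`` ({})".format(name, field.get("type", "any")))
        description = field.get("description") or field.get("title") or "Undocumented."
        # Indent the description so it renders as the body of the definition.
        lines.extend(("    " + line).rstrip() for line in description.splitlines())
        lines.append("")
    return "\n".join(lines)
```

Since the schema descriptions are already written in reStructuredText, they can be passed through unchanged and rendered by Sphinx.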

Rewrite the documentation of the builder

Depends on #107. Rewrite, amend or write the docstrings of the builder methods in Google docstring style and verify the content (with a curator if needed).

Also make sure to generate a nice page for it, so that builder users can easily access and consult it.

Unit tests for the builder

We do not have unit tests for the builder.
This is the list of methods that we should write unit tests for:

  • add_abstract
  • add_arxiv_eprint
  • add_doi
  • add_author
  • make_author
  • add_book
  • add_isbn
  • add_book_series
  • add_book_edition
  • add_inspire_categories
  • add_private_note
  • add_publication_info
  • add_imprint_date
  • add_preprint_date
  • add_thesis
  • add_accelerator_experiments_legacy_name
  • add_language
  • add_license
  • add_public_note
  • add_title
  • add_title_translation
  • add_url
  • add_report_number
  • add_collaboration
  • add_acquisition_source
  • add_document_type
  • add_copyright
  • add_number_of_pages
  • add_special_collection
  • add_publication_type
  • set_core
  • set_refereed
  • set_withdrawn
  • set_citeable

normalize name

In the utils.py module, when normalizing names, we just need to remove the space wherever we have a '. ' pair in the first name (that is, the second element after splitting a string like 'Caro, D. J.' by ',').
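
A minimal sketch of that rule follows. The function name is chosen for illustration and may not match the one in utils.py.

```python
def normalize_name(name):
    """Remove the space after each '. ' pair in the first-name part of ``name``."""
    if "," not in name:
        return name
    # Only the part after the first comma (the first names) is touched.
    last_name, first_names = name.split(",", 1)
    first_names = first_names.strip().replace(". ", ".")
    return "{}, {}".format(last_name.strip(), first_names)
```

With this, 'Caro, D. J.' becomes 'Caro, D.J.', while names without a comma are returned unchanged.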

jsonschema: harmonize document_type Vs. collection

@kaplun commented on Wed Jun 15 2016

Currently the collection field is just a porting of MARC 980. E.g.:

{"collections": [
    {"primary": "CORE"},
    {"primary": "Book"},
    {"primary": "HEP"},
    {"primary": "Citeable"}
]}

On the other hand the concept of document_type is managed by the enhancer facet_inspire_doc_type. E.g.:

{"facet_inspire_doc_type": ["book"]}

This is suboptimal.

  • Citeable should become a flag and be added at indexing time based on other values
  • CORE should be declared as a flag and be available in all schemas
  • HEP is actually redundant since it represents the fact that this is a record from Literature
  • facet_inspire_doc_type should become document_type and be populated by dojson, rather than enhanced before indexing.

@kaplun commented on Thu Aug 25 2016

I think we should bump priority of this one, since category is really scattered around the code base in a wrong way.


@jacquerie commented on Fri Aug 26 2016

This needs a spec. The thing I refactored in https://github.com/inspirehep/inspire-next/blob/25cba484c652d21c112628c4967e684c02d6fcfd/inspirehep/modules/records/receivers.py#L120-L210 is a 1 to 1 correspondence with the code that was there before, but makes no sense to me.

You need to define precisely:

  • What should we do with collections
  • What are the allowable document_types
  • How are the 980__a values mapped to those allowable values
  • What is the algorithm that sets Citeable

@kaplun commented on Mon Sep 19 2016

What should we do with collections

Should disappear.

What are the allowable document_types

Exactly the keys that you have defined in the two tables in the docstring populate_inspire_document_type().

How are the 980__a values mapped to those allowable values

Those that are document types are mapped to document types (possibly with the same value as in 980). Those that are flags, such as citeable and core should be mapped to a corresponding flag. (I think we have it for core already). deleted is also mapped to a deleted field.

What is the algorithm that sets Citeable:

Mmh. I guess it's more the question of what is not citeable. I see by default anything that comes from arXiv is citeable. @annetteholtkamp can you help here?


@jmartinm commented on Wed Oct 05 2016

Now that inspirehep/inspire-next#1589 is merged, and once we get rid of the collections field, note that we will still have a _collections field managed by invenio-collections.

This field gets populated based on a query matching the record (see config) so that config will have to be amended for the queries to match the new document_type field.


@jmartinm commented on Thu Oct 06 2016

Collection fields are:

   1100059 HEP 
    832449 Citeable 
    698265 CORE 
    584001 Published 
    401801 arXiv 
    312449 ConferencePaper 
     57480 Arxiv 
     51674  
     26942 Thesis 
     27168 Review 
     10148 Lectures 
      5637 NOTE 
      7727 Proceedings 
      7507 noncore 
      4643 THESIS 
      3488 Introductory 
      4003 Withdrawn 
      3982 Hep 
      3344 Book 
       172 D0-PRELIMINARY-NOTE 
      2069 BOOK 
      1239 NONCORE 
      1240 PROCEEDINGS 
      1115 citeable 
       891 BookChapter 
       452 Conference 
        33 Core 
        11 REPORT 
         5 Preprint 
         6 published 
         3 core 
         3 Note 
         2 Noncore 
         2 Report 
         1 PUBLISHED 
         1 thesis 
         1 book 
         1 proceedings 
         1 NonCore 
         1 Conferencepaper 
         1 Accelerators 
         1 Proceddings 

Our schema says possible document types are:

[
"Published",
 "arXiv",
 "ActivityReport",
 "ConferencePaper",
 "Thesis",
 "Review",
 "Lectures",
 "Note",
 "Proceedings",
 "Introductory",
 "Book",
 "BookChapter",
 "Report"
  ],

And our current document type facet has the following mapping (from 980__a value to facet value):

        'published': 'peer reviewed',
        'thesis': 'thesis',
        'book': 'book',
        'bookchapter': 'book chapter',
        'proceedings': 'proceedings',
        'conferencepaper': 'conference paper',
        'note': 'note',
        'report': 'report',
        'activityreport': 'activity report',
        'lectures': 'lectures',
        'review': 'review',

'preprint' if no journal info

So for this issue to proceed we would need:

  • For each value in 980__a what document type from our schema to assign.
  • Should the document type in the schema be human readable ActivityReport vs Activity Report for example.
  • Complete the enum in our jsonschema to accommodate document types such as the hidden collections: Hal Hidden, notes from different experiments and so on.
  • Do we still need the receiver to convert the document types into a more 'user facing' facet, with values such as Preprint or Peer Reviewed which are not mentioned as document types in the schema.

@kaplun commented on Wed Oct 05 2016

@jmartinm Thanks. I'd suggest @annetteholtkamp et al. can help us removing all the outliers from 980__a


@kaplun commented on Thu Oct 06 2016

  • For each value in 980__a what document type from our schema to assign.
   1100059 HEP -> Literature Schema
    832449 Citeable -> citeable flag
    698265 CORE -> 'core' flag: True
    584001 Published -> published flag
    401801 arXiv -> ignore (redundant)
    312449 ConferencePaper -> 'conference paper'
     57480 Arxiv ->  ignore (redundant)
     51674  -> ignore (W00t?)
     26942 Thesis -> 'thesis'
     27168 Review -> 'review'
     10148 Lectures -> 'lectures'
      5637 NOTE -> 'note'
      7727 Proceedings -> 'proceedings'
      7507 noncore -> 'core' flag: False
      4643 THESIS -> 'thesis'
      3488 Introductory -> 'introductory'
      4003 Withdrawn -> 'withdrawn' flag
      3982 Hep -> Literature Schema
      3344 Book -> 'book'
       172 D0-PRELIMINARY-NOTE 
       891 BookChapter -> 'book chapter'
       452 Conference -> Wot? In HEP?
        11 REPORT -> 'report'
         5 Preprint -> ignore redundant
         6 published -> ignore redundant
  • Should the document type in the schema be human readable ActivityReport vs Activity Report for example.

I believe so: cataloguers will edit records either through scripts or through the editor, which will enforce the accepted values anyway. Therefore there is no need to introduce a simplified spelling to avoid typos.

  • Complete the enum in our jsonschema to accommodate document types such as the hidden collections: Hal Hidden, notes from different experiments and so on.

I will create a dedicated issue for that.

  • Do we still need the receiver to convert the document types into a more 'user facing' facet, with values such as Preprint or Peer Reviewed which are not mentioned as document types in the schema.

I believe given the above point on how to spell document types, the answer is nope.
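
The mapping proposed above could be expressed roughly as follows. This is a sketch under assumptions: values are matched case-insensitively (so lowercase 'published' also sets the flag instead of being ignored), HEP is dropped because it is implied by the Literature schema, and anything without a mapping in the table, such as D0-PRELIMINARY-NOTE or Conference, is returned as unmapped for a curator to decide. All names are hypothetical.

```python
# 980__a value (lowercased) -> document type, following the table above.
DOCUMENT_TYPES = {
    "conferencepaper": "conference paper",
    "thesis": "thesis",
    "review": "review",
    "lectures": "lectures",
    "note": "note",
    "proceedings": "proceedings",
    "introductory": "introductory",
    "book": "book",
    "bookchapter": "book chapter",
    "report": "report",
}
# 980__a value (lowercased) -> (flag name, flag value).
FLAGS = {
    "citeable": ("citeable", True),
    "core": ("core", True),
    "noncore": ("core", False),
    "published": ("published", True),
    "withdrawn": ("withdrawn", True),
}
# Redundant or empty values that carry no information of their own.
IGNORED = {"", "hep", "arxiv", "preprint"}


def convert_collections(values):
    """Split 980__a values into document types, flags and unmapped leftovers."""
    document_types, flags, unmapped = [], {}, []
    for value in values:
        key = value.strip().lower()
        if key in IGNORED:
            continue
        if key in DOCUMENT_TYPES:
            if DOCUMENT_TYPES[key] not in document_types:
                document_types.append(DOCUMENT_TYPES[key])
        elif key in FLAGS:
            name, flag = FLAGS[key]
            flags[name] = flag
        else:
            unmapped.append(value)
    return document_types, flags, unmapped
```

Flags that are never mentioned stay absent from the result, which preserves the "undefined" state.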


@annetteholtkamp commented on Thu Oct 13 2016

On 05 Oct 2016, at 12:39, Samuele Kaplun [email protected] wrote:

1100059 HEP -> Literature Schema
 832449 Citeable -> citeable flag
 698265 CORE -> 'core' flag: True
 584001 Published -> published flag
 401801 arXiv -> ignore (redundant)

why is that redundant?

 312449 ConferencePaper -> 'conference paper'
  57480 Arxiv -> ignore (redundant)
  51674 -> ignore (W00t?)

what is this?

 26942 Thesis -> 'thesis'
 27168 Review -> 'review'
 10148 Lectures -> 'lectures'
  5637 NOTE -> 'note'
  7727 Proceedings -> 'proceedings'
  7507 noncore -> 'core' flag: False

Is there only true and false, or also undefined ?

  4643 THESIS -> 'thesis'
  3488 Introductory -> 'introductory'
  4003 Withdrawn -> 'withdrawn' flag
  3982 Hep -> Literature Schema
  3344 Book -> 'book'
   172 D0-PRELIMINARY-NOTE

We should ask Heath whether this tag is still necessary if a record is in HEP

   891 BookChapter -> 'book chapter'
   452 Conference -> Wot? In HEP?

Yes, we never managed to clean them all up. Most of them are probably conference papers, but this needs to be checked.

    11 REPORT -> 'report'
     5 Preprint -> ignore redundant
     6 published -> ignore redundant

Annette



@kaplun commented on Thu Oct 13 2016

401801 arXiv -> ignore (redundant)

why is that redundant?

We don't need to say arXiv. We already know from the arXiv ID.

 51674  -> ignore (W00t?)

what is this?

A collection with empty value 😄

  7507 noncore -> 'core' flag: False

Is there only true and false, or also undefined ?

Yes, all flags have also undefined values.

Rethink where/when/how to declare a record citable

Right now there's a small bit of logic in the builder that decides whether the record is citable or not (it can be overridden if need be), but maybe that's not the right place or way to set that flag. Alternatives include a periodic bibcheck task and/or moving the check to a standalone function that is called dynamically.
