mdkeywords

This repository harvests keywords from ScienceBase, GCMD, and USGS.
It also hosts the resulting files, which are served through the jsDelivr CDN.

conf/

This directory contains the main configuration file and lists of vocabularies for ScienceBase and GCMD.

src/

This directory contains the source files for the harvesters.

resources/

This directory contains the manifest file for the list of vocabularies (manifest.json). The manifest is generated automatically. Each entry identifies a vocabulary and gives the URL of its configuration file, which in turn points to the keywords file. The format of a thesaurus entry in the manifest file is as follows:

{
  "name": "",
  "url": ""
}

The name is optional and exists only to make the thesaurus entry readable to humans; the mdEditor does not use it.
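As an illustration of the entry format above, here is a minimal sketch of a check for one manifest entry. The helper name checkManifestEntry, the sample URL, and the idea that the manifest is an array of such entries are assumptions made for this example; none of them come from the repository.

```javascript
// Hypothetical helper: checks one manifest.json entry of the form { name, url }.
// Returns a list of problems; an empty list means the entry looks usable.
function checkManifestEntry(entry) {
  if (entry === null || typeof entry !== 'object' || Array.isArray(entry)) {
    return ['entry must be an object'];
  }
  const problems = [];
  if (typeof entry.url !== 'string' || entry.url.trim() === '') {
    problems.push('url must be a non-empty string');
  }
  // name is optional, but when present it should be a string
  if (entry.name !== undefined && typeof entry.name !== 'string') {
    problems.push('name must be a string when present');
  }
  return problems;
}

// Assumed shape: an array of entries. The URL is a placeholder.
const sampleManifest = [
  { name: 'Example Thesaurus', url: 'https://example.org/thesaurus/example.json' },
  { url: '' },
];
sampleManifest.forEach((entry, index) => {
  console.log(index, checkManifestEntry(entry));
});
```

The authoritative rules for these files are in resources/schema/; this sketch only mirrors the two keys described here.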

resources/thesaurus/

This directory contains all the thesaurus configuration files. The format for a thesaurus configuration is as follows:

{
  "citation": {
    "date": [
      {
        "date": "",
        "dateType": ""
      }
    ],
    "description": "",
    "title": "",
    "edition": "",
    "onlineResource": [
      {
        "uri": ""
      }
    ],
    "identifier": [
      {
        "identifier": ""
      }
    ]
  },
  "keywordType": "",
  "label": "",
  "keywords": null,
  "keywordsUrl": ""
}

Note that the keywords key is set to null. When a keywordsUrl is provided, keywords should be null. You can instead provide the keywords array directly in the configuration file, but that is not recommended. The keywordsUrl should be a jsDelivr URL that points to the keywords file in the resources/keywords/ directory associated with the specified thesaurus. The resources/schema/ directory contains a detailed specification of the configuration file format.
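The relationship between keywords and keywordsUrl can be sketched as a small decision function. This is an assumption-laden illustration, not code from the mdEditor or the harvesters: the name keywordSource, the returned object shape, the placeholder URL, and the choice to prefer an inline array when both keys are set are all hypothetical.

```javascript
// Hypothetical helper: decides where the keywords for a thesaurus
// configuration come from. Not taken from the mdEditor or harvester source.
function keywordSource(config) {
  if (Array.isArray(config.keywords)) {
    // Inline keywords are allowed but not recommended.
    return { kind: 'inline', keywords: config.keywords };
  }
  if (typeof config.keywordsUrl === 'string' && config.keywordsUrl !== '') {
    // Recommended: keywords is null and keywordsUrl points at the keywords file.
    return { kind: 'remote', url: config.keywordsUrl };
  }
  throw new Error('thesaurus configuration has neither keywords nor keywordsUrl');
}

const remoteConfig = {
  label: 'Example',
  keywords: null,
  keywordsUrl: 'https://example.org/resources/keywords/example.json', // placeholder URL
};
console.log(keywordSource(remoteConfig));
```

A consumer would then fetch the remote URL or use the inline array directly, depending on the kind returned.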

resources/keywords/

This directory contains all the keywords files. Depending on the source, the keywords file format is slightly different (this should probably be normalized).

GCMD keyword files are formatted as follows (contains nested children)

[
  {
    "uuid": "",
    "label": "",
    "broader": "",
    "definition": "",
    "children": [
      {
        "uuid": "",
        "label": "",
        "broader": "",
        "definition": "",
        "children": [ ]
      }
    ]
  }
]

ScienceBase keyword files are formatted as follows (children array is always empty)

[
  {
    "uuid": "",
    "parentId": "",
    "label": "",
    "definition": "",
    "children": []
  }
]

USGS Thesaurus keyword files are formatted as follows (contains nested children)

[
  {
    "uuid": "",
    "label": "",
    "definition": "",
    "children": [
      {
        "uuid": "",
        "label": "",
        "definition": "",
        "children": [ ]
      }
    ]
  }
]
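Although the three formats differ, they share the uuid, label, definition, and children keys, so a single traversal can handle all of them. The following is a minimal sketch under that assumption; flattenKeywords and the sample data are hypothetical and are not part of the repository.

```javascript
// Depth-first walk over a keywords array. It relies only on the fields
// shared by the three formats: uuid, label, definition, children.
function flattenKeywords(keywords, path = []) {
  const rows = [];
  for (const keyword of keywords) {
    const labels = [...path, keyword.label];
    rows.push({ uuid: keyword.uuid, label: keyword.label, path: labels.join(' > ') });
    // children may be empty (always the case for ScienceBase files)
    rows.push(...flattenKeywords(keyword.children || [], labels));
  }
  return rows;
}

// Made-up sample data in the nested shape.
const sample = [
  {
    uuid: 'a',
    label: 'Parent',
    definition: '',
    children: [{ uuid: 'b', label: 'Child', definition: '', children: [] }],
  },
];
console.log(flattenKeywords(sample));
```

Source-specific keys such as broader (GCMD) and parentId (ScienceBase) are ignored here, which is what lets one function cover every source.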

Getting Started

This repository is not designed to be cloned and modified; the intent is for it to be used as provided by the maintainers.

If you are a developer and need to modify this repo, continue with the instructions below.

The harvesters can be run in different ways depending on the goal. There is also a GitHub Action that could be used to automate nightly runs of the main control process.

The main control process handles choosing which harvester to use based on the source of the keywords. Each harvester handles generating the configuration objects as well as the keywords files for a single source. You can run this process for only a specified source or for all sources.
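The selection logic described above might look roughly like the sketch below. It is not the repository's control process: the registry, the runHarvesters function, and the stub harvesters that return strings are hypothetical, and only the source names gcmd, usgs, and sciencebase are taken from the project's npm scripts.

```javascript
// Hypothetical harvester registry. The real harvesters live in src/ and
// their names and signatures are not documented in this README.
const harvesters = new Map([
  ['gcmd', () => 'harvested gcmd'],
  ['usgs', () => 'harvested usgs'],
  ['sciencebase', () => 'harvested sciencebase'],
]);

// Runs one harvester when a source is given, otherwise all of them.
function runHarvesters(source) {
  const selected = source ? [source] : [...harvesters.keys()];
  return selected.map((name) => {
    const harvest = harvesters.get(name);
    if (!harvest) {
      throw new Error(`unknown source: ${name}`);
    }
    return harvest();
  });
}

console.log(runHarvesters('usgs'));
console.log(runHarvesters());
```

Keeping the harvesters in a lookup table means adding a new source is a one-line change, and an unknown source fails loudly instead of being silently skipped.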

Install and Run

This is an npm project. Start by cloning the repository and installing the dependencies in the usual way:

npm install

You can inspect package.json to find the following execution options:

npm start or npm run build runs all the harvesters (both commands are aliases for node src/main.js).

mdkeywords's Issues

Set dev as the default branch

If we want mdProfiles and mdKeywords to follow the same branching/merging process then the dev branch in mdKeywords should be set as the default and master should be protected.

Change manifest.json schema

The manifest.json file should be structured with only a name and url for the keywords.

  • Update harvester to use {name: "", url: ""} as the object structure for the manifest file
  • Update manifest file

Add documentation describing configuration and use of GCMD harvester

Create documentation that describes GCMD keyword harvest process and keyword creation process including use of configuration files and expected parameters. For details see:

1. Describe the use and format of the following files:

  • harvesters/src/gcmd-all.js
  • harvesters/src/gcmd.js
  • harvesters/conf/gcmd-vocabularies.json
  • harvesters/conf/gcmd-all.json
  • harvesters/conf/gcmd.json
  • resources/gcmd-vocabularies-dynamic.json.

2. Describe the use of the harvesters/src/index.js script and its command line options. From package.json:

"scripts": {
  "start": "node src/index.js",
  "gcmd": "node src/index.js source=gcmd",
  "usgs": "node src/index.js source=usgs",
  "sb": "node src/index.js source=sciencebase"
}

3. Describe relation to profiles-list.json file from mdProfiles repository.

4. Describe format of output and use of 'resources/json/' directory

Develop Version 4.0.0

v4.0.0 requirements:

  • manifest file to list the vocabularies
  • update harvester(s) to produce individual thesaurus configuration files for each vocabulary
  • remove npm module code #10
  • decouple from mdProfiles
  • Readme

Migrate keywords from npm package to jsDelivr

Keywords are currently obtained from the mdkeywords npm package. Changing keywords requires rebuilding and publishing the keyword package. Switching to a content delivery system such as jsDelivr will allow access to updated keywords without a separate process to rebuild and publish an npm module.

This will also position the mdEditor to more easily support custom keywords from other harvest sources.

See also:

Require Description in ScienceBase Thesaurus Config

The ScienceBase harvester's thesaurus config generator needs to use "No description available." as the default value for the description, which will be used if the harvester cannot find a description for the particular thesaurus.

Update package.json

package.json has some missing or outdated information and needs to be updated to reflect that it is now at the root of the repository.

Refactor GCMD harvester

From PR #12:

The GCMD harvester was rewritten from scratch using a completely different process. It now predominantly uses the concepts/ endpoint to gather the data and build the tree. It is still recursive, but it now makes at least one request to the GCMD API for each keyword. The change modifies both harvesters/src/gcmd.js and harvesters/conf/gcmd.json.

It creates the following structure:

{
  "uuid": "",
  "label": "",
  "broader": "",
  "definition": "",
  "children": [ ]
}

Note: if no definition is found, "No definition available." is used instead.
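The recursive construction and the default definition can be sketched as follows. This is an illustration only, not the code in harvesters/src/gcmd.js: the request to the GCMD API is replaced by an in-memory lookup so the example stays self-contained, and the names buildKeyword, concepts, and the narrower field on the raw concepts are assumptions.

```javascript
const DEFAULT_DEFINITION = 'No definition available.';

// Hypothetical stand-in for the GCMD API: maps a uuid to a raw concept.
// The raw field names (narrower in particular) are assumptions.
const concepts = new Map([
  ['root', { uuid: 'root', label: 'ROOT', broader: '', definition: 'Top term.', narrower: ['leaf'] }],
  ['leaf', { uuid: 'leaf', label: 'LEAF', broader: 'root', narrower: [] }],
]);

// Recursively builds the output structure, with one lookup per keyword.
function buildKeyword(uuid, lookup) {
  const concept = lookup(uuid);
  return {
    uuid: concept.uuid,
    label: concept.label,
    broader: concept.broader,
    // Fall back to the default text when the source has no definition.
    definition: concept.definition || DEFAULT_DEFINITION,
    children: (concept.narrower || []).map((childId) => buildKeyword(childId, lookup)),
  };
}

const tree = buildKeyword('root', (id) => concepts.get(id));
console.log(JSON.stringify(tree, null, 2));
```

In a real harvester the lookup would be an asynchronous HTTP request, so the recursion would need to await each call.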

New harvester control process to build all GCMD vocabularies

New harvesters/src/gcmd-all.js

Alongside the new harvester is a control process that builds all the GCMD vocabularies listed in the config file harvesters/conf/gcmd-vocabularies.json. It also uses the harvesters/conf/gcmd-all.json config file and generates resources/gcmd-vocabularies-dynamic.json.

Deprecate v3.0.0

Remove the files associated with v3.0.0

Required:

Tasks:

  • Remove vocabularies.json
  • Remove resources/json/ directory along with its files

Create user guide for structure of thesaurus configuration file

Create documentation that allows someone to create a valid thesaurus configuration file. As I understand it, the currently proposed process entails:

  • A thesaurus manifest file (manifest.json), which identifies the thesauri and the uri to the associated thesauri configuration files.

  • Each thesaurus configuration file provides a thesaurus citation (e.g., usgs citation) and a link to the associated keyword file or a keyword array (e.g., usgs keywords).

  • Manifest file documentation

  • Thesaurus configuration file documentation

Fix GCMD Online Resource Uri

Fix the GCMD onlineResource uri; there is a trailing '}' that needs to be removed.

  • Fix harvester citation generator
  • Run harvester / replace files

Add dynamically generated vocabularies to default vocabularies.json configuration file

Add dynamically generated and custom NGGDPP vocabularies to the default vocabulary configuration file, vocabularies.json (https://github.com/adiwg/mdKeywords/blob/master/resources/vocabularies.json).

The custom vocabularies for the USGS National Geological and Geophysical Data Preservation Program (NGGDPP) are being included in the default vocabulary configuration to assist USGS with a time-sensitive need.

These, as well as the "LCC" vocabularies, will eventually need to be removed from the default when the custom vocabulary feature becomes fully functional and organizations have the ability to host them at other locations.

GCMD earth science keywords do not have definitions

The keywords harvested from the GCMD do not have associated descriptions. Previously, the descriptions would be displayed as hint text when a user moused over the help icon, adjacent to the term. This is a deviation from the current expected behavior when a vocabulary is displayed in the mdEditor.

For example, the CALIBRATION keyword object should contain a definition key that corresponds to that of the GCMD source:

{
  "label": "CALIBRATION",
  "definition": "Calibration is defined as Data Analysis and Visualization Service that is designed to allow a user of the service to properly adjust Earth science data to achieve optimal accuracy and/or precision. EXAMPLE: An algorithm to correct for the decaying orbit of a satellite (calibration).",
  "children": [],
  "uuid": "ecf29317-bd5e-447b-b911-f8bfb153c83b"
}

Add harvester control process to build all GCMD vocabularies from configuration files

From PR #12:

New harvesters/src/gcmd-all.js

Alongside the new harvester is a control process that builds all the GCMD vocabularies listed in the config file harvesters/conf/gcmd-vocabularies.json. It also uses the harvesters/conf/gcmd-all.json config file and generates resources/gcmd-vocabularies-dynamic.json.

New keywords files

To demonstrate the new GCMD harvester, new files for all the GCMD vocabularies are included.

New npm run scripts

package.json has new scripts

"gcmd": "node src/index.js source=gcmd",
"usgs": "node src/index.js source=usgs",
"sb": "node src/index.js source=sciencebase"

Main control process command line options

harvesters/src/index.js was modified to handle command line options. This allows only the specified source to be compiled and all other vocabularies to be ignored. It still uses the profiles-list.json file from mdProfiles.
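For readers unfamiliar with the key=value style used by these scripts, here is a minimal sketch of how an argument such as source=gcmd could be read from the command line. It is not the code in harvesters/src/index.js; the function name parseOptions and the fallback text are hypothetical.

```javascript
// Hypothetical parser for key=value arguments such as "source=gcmd".
// This is not the code in harvesters/src/index.js.
function parseOptions(argv) {
  const options = {};
  for (const arg of argv) {
    const eq = arg.indexOf('=');
    // Ignore arguments that have no key before the "=" sign.
    if (eq > 0) {
      options[arg.slice(0, eq)] = arg.slice(eq + 1);
    }
  }
  return options;
}

// process.argv holds [node, script, ...args]; skip the first two entries.
const options = parseOptions(process.argv.slice(2));
console.log(options.source || 'all sources');
```

With this sketch, running node example.js source=gcmd would select only the gcmd source, and running it with no arguments would fall back to all sources.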

Require vocabulary description

Vocabulary definitions (e.g., vocabulary.json file) hosted on the ADIwg mdKeywords repository should be required to contain a description element. The text of the description is used by the mdEditor to provide information to users about the content of a vocabulary. The description would allow users to make an informed decision about a vocabulary's content prior to selecting it for use. For example:

[Image: keyword-mouseover-help]
