exchange-metadata-converter's People

Contributors

bdwyer2, ckadner, ptitzler

exchange-metadata-converter's Issues

Determine how to handle subdataset level type fields

Information on how to locate all the files belonging to a given subdataset is important for the DAX API, which needs it to load subdatasets. Note that this is different from a subdataset's format, which is simply the file format of the subdataset.

Examples of subdataset types:

  • a simple file, identified by its path name (e.g. txt, csv)
  • a directory (e.g. a directory of image files, a directory of subdirectories)
  • a list of files (e.g. a txt with paths to all files in the validation set)
  • a regex (e.g. all train files have train_ prepended to the filename)

We need to determine what subdataset types there are and how to include this information. The current proposal is this:

Simple file:

  - file_name: noaa-weather-data-jfk-airport/jfk_weather.csv
    ...
    format: CSV  # rename from type to format
    ...
    type: path_name
       value: noaa-weather-data-jfk-airport/jfk_weather.csv

Regex:

  - file_name: publaynet/train
    ...
    type: regex  
      value: "train/*"

List of files:

  - file_name: tfsc/train_list.txt
    ...
    type: list_of_files  
      value: tfsc/train_list.txt

There's probably a better way of structuring this that avoids the file_name being the same as the value field in some cases, but it's a start.
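
To make the discussion concrete, here is a minimal Python sketch of how a loader might turn a type/value pair into a list of files. Everything in it is an assumption for illustration: the function name `resolve_subdataset_files`, the `kind` and `value` parameters, the type name `directory` (the proposal lists directories as an example but does not name the type), and the choice to interpret the `regex` value as a shell-style glob, since the example value "train/*" reads like a glob rather than a regular expression. It is not part of the DAX API.

```python
from pathlib import Path


def resolve_subdataset_files(root, kind, value):
    """Return the files a subdataset entry refers to.

    Hypothetical helper: `kind` is one of the proposed type names and
    `value` is the accompanying string from the metadata entry.
    """
    root = Path(root)
    if kind == "path_name":
        # A single file, addressed directly by its path.
        return [root / value]
    if kind == "directory":
        # Every regular file below the directory, including subdirectories.
        return sorted(p for p in (root / value).rglob("*") if p.is_file())
    if kind == "list_of_files":
        # A text file with one relative path per line; blank lines are skipped.
        lines = (root / value).read_text().splitlines()
        return [root / line.strip() for line in lines if line.strip()]
    if kind == "regex":
        # Assumption: the value is interpreted as a glob pattern.
        return sorted(p for p in root.glob(value) if p.is_file())
    raise ValueError(f"unknown subdataset type: {kind!r}")
```

Passing the type and value as separate arguments keeps the sketch independent of how the two fields end up being nested in the YAML.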

Support optional properties

Currently, all {{...}} placeholders in the templates at https://github.com/CODAIT/exchange-metadata-converter/tree/main/templates are treated as required, so every placeholder input file must define them. Annotations would solve this issue. Investigate what it takes to support something like the following, where @annotation_key serves as a hint to the processing engine and does not invalidate the YAML file because it is specified inside a comment.

property: '{{value}}'         #  just a comment
another:
  property2: '{{value2}}'     # @optional and a comment
third_property:               #  comment only
fourth_property:
fifth_property: '{{value6}}'  # @annotation_only
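
As a starting point for the investigation, the following Python sketch scans a template line by line and records which annotations accompany each placeholder. It is only an illustration under stated assumptions: the names `scan_template` and `missing_required` are hypothetical, the meaning given to `@optional` (the placeholder may be left undefined) is assumed, and the comment detection is deliberately naive. It does not reflect how the converter currently processes templates.

```python
import re

# {{name}} placeholders and @annotation keys inside comments.
PLACEHOLDER = re.compile(r"\{\{\s*([^{}\s]+)\s*\}\}")
ANNOTATION = re.compile(r"@(\w+)")


def scan_template(text):
    """Map each placeholder name to the set of annotations on its line."""
    result = {}
    for line in text.splitlines():
        # Naive split: the first " #" is taken as the start of the comment,
        # so a quoted value containing " #" would be misread.
        body, _, comment = line.partition(" #")
        annotations = set(ANNOTATION.findall(comment))
        for name in PLACEHOLDER.findall(body):
            result[name] = annotations
    return result


def missing_required(text, provided):
    """Return placeholders that are not @optional and have no provided value."""
    return sorted(
        name
        for name, annotations in scan_template(text).items()
        if "optional" not in annotations and name not in provided
    )
```

For the template above, `scan_template` would associate `value2` with the `optional` annotation, and `missing_required` would report only those placeholders that are neither marked `@optional` nor defined in the input.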

Determine way to flatten ORSD archive metadata & check to see if any other DAX archives have similarly complex nested structures

Current proposal is to release a new version of ORSD which no longer has nested archives. Then we can use a structure like:

content:
  - file_name: data/SPE9-TRIANGLE.Aspect1/test
     ...
  - file_name: data/SPE9-TRIANGLE.Aspect1/train
     ...
  - file_name: data/SPE9-TRIANGLE.Aspect2/json_test
     ...
  - file_name: data/SPE9-TRIANGLE.Aspect2/json_train 
     ...
  - file_name: data/SPE9-TRIANGLE.Aspect3.compressed.h5
     ...
  - file_name: data/SPE9-MAX.Aspect1
     ...
  - file_name: data/SPE9-MAX.Aspect2
     ...
  - file_name: data/SPE9-MAX.Aspect3.compressed.h5
     ...

Keep in mind that the archive-level description field for the dataset will need to describe the content composition, e.g. "...contains two versions of the dataset, SPE9-TRIANGLE which... and SPE9-MAX which..."

Determine how to handle archive level format field

Currently the example uses:

# TBD how to handle compound types (a data set comprises multiple files using different formats)
format:
  type: CSV
  mime_type: text/csv

However, some DAX datasets contain more than one subdataset format. Some potential solutions:

  1. Remove this field and rely on subdataset level format field for this info
  2. Add a compound type to be used when an archive contains more than one type of subdataset
  3. Use a list that contains all the different subdataset formats
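
Option 3 could be derived from the subdataset entries rather than maintained by hand. The Python sketch below shows one way to do that, assuming each entry under `content` is a mapping with a `format` key (the rename proposed in the subdataset type issue); the function name `archive_formats` is hypothetical.

```python
def archive_formats(content):
    """Return the distinct subdataset formats in first-seen order."""
    seen = []
    for entry in content:
        fmt = entry.get("format")
        # Entries without a format are ignored; duplicates are kept once.
        if fmt is not None and fmt not in seen:
            seen.append(fmt)
    return seen
```

Deriving the list this way would keep the archive-level field consistent with the subdataset-level fields, at the cost of the archive-level field no longer carrying independent information.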

Add column type information on a subdataset level

This data would be fed to both the DAX API and to our DAX data previews.

Proposed structure:

content:
  - file_name: noaa-weather-data-jfk-airport/jfk_weather.csv
    description: Raw data file
    records: 114546
    size: 30M
    type: CSV
    mime_type: text/csv
    column_types:
      STATION: str
      STATION_NAME: str
      ELEVATION: float
      LATITUDE: float
      ...
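
To illustrate how a consumer might apply `column_types`, here is a minimal Python sketch that reads CSV text and casts the listed columns. The names `CASTERS` and `read_typed_rows` are hypothetical, the mapping from type names to Python callables is an assumption (including `int`, which does not appear in the proposal), and empty cells are not handled. It is not the DAX API's loading code.

```python
import csv
import io

# Assumed mapping from the type names in the metadata to Python callables.
CASTERS = {"str": str, "float": float, "int": int}


def read_typed_rows(csv_text, column_types):
    """Yield CSV rows as dicts, casting the columns listed in column_types."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        yield {
            # Columns without a declared type are passed through as strings.
            name: CASTERS[column_types[name]](value) if name in column_types else value
            for name, value in row.items()
        }
```

A data preview could use the same mapping to decide how to render or align each column.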

Review Feedback

My review is from the perspective of usage in OpenAIHub and what end users want in general.

Reference:

  • Existing OpenAIHub YAML
  • Dataset Landing Page

Comments:

  • Can we add details about the archive contents of the dataset?
  • It would be good to include dataset coverage as well. Having this will set users' expectations correctly.

I used only the JFK YAML for this review.

@ptitzler

Non-compliant metadata.name in the generated DLF YAML

The metadata.name in the generated DLF YAML does not comply with the Kubernetes spec for DNS-1123 subdomain names.

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Dataset.com.ie.ibm.hpsys \"Finance Proposition Bank\" is invalid: metadata.name: Invalid value: \"Finance Proposition Bank\": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')",
  "reason": "Invalid",
  "details": {
    "name": "Finance Proposition Bank",
    "group": "com.ie.ibm.hpsys",
    "kind": "Dataset",
    "causes": [
      {
        "reason": "FieldValueInvalid",
        "message": "Invalid value: \"Finance Proposition Bank\": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')",
        "field": "metadata.name"
      }
    ]
  },
  "code": 422
}
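
One possible fix is to derive a compliant name from the dataset title before writing the DLF YAML. The Python sketch below is an illustration, not the converter's implementation: the function name `to_dns1123_subdomain` is hypothetical, and the choice to replace disallowed characters with hyphens is an assumption. The validation pattern is the one quoted in the error message, and 253 characters is the Kubernetes length limit for DNS subdomain names.

```python
import re

# Validation pattern quoted in the Kubernetes error message above.
DNS1123_SUBDOMAIN = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)
MAX_LENGTH = 253  # Kubernetes limit for DNS-1123 subdomain names


def to_dns1123_subdomain(name):
    """Derive a DNS-1123-compliant name from a free-form dataset title."""
    candidate = name.lower()
    # Replace every run of disallowed characters with a single hyphen.
    candidate = re.sub(r"[^a-z0-9.-]+", "-", candidate)
    # Each dot-separated label must start and end with an alphanumeric.
    labels = [label.strip("-") for label in candidate.split(".")]
    candidate = ".".join(label for label in labels if label)
    candidate = candidate[:MAX_LENGTH].strip("-.")
    if not DNS1123_SUBDOMAIN.fullmatch(candidate):
        raise ValueError(f"cannot derive a DNS-1123 name from {name!r}")
    return candidate


print(to_dns1123_subdomain("Finance Proposition Bank"))  # finance-proposition-bank
```

Two different titles can map to the same name under this scheme, so a real fix would also need to consider uniqueness.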

@ptitzler
