nlpsandbox-schemas's Introduction

nlpsandbox

Home repository

nlpsandbox-schemas's People

Contributors

dependabot[bot], gkowalski, thomasyu888, tschaffter, yy6linda

nlpsandbox-schemas's Issues

Creating dataset fails

curl -X POST "http://10.23.55.45:8080/api/v1/datasets?datasetId=test" -H  "accept: application/json" -H  "Content-Type: application/json"
{
  "detail": "None is not of type 'object'",
  "status": 400,
  "title": "Bad Request",
  "type": "about:blank"
}

Am I doing something wrong?
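
Judging only from the error detail, one possible cause is that the request carries no body: the command above sets Content-Type: application/json but passes no -d payload, so the server may receive None where it expects a JSON object. Below is a minimal Python sketch that builds the same request with an empty JSON object as the body. The host and datasetId are taken from the command above; the helper name build_create_dataset_request is an illustrative assumption, and the request is only built, not sent.

```python
import json
import urllib.request


def build_create_dataset_request(base_url, dataset_id, body=None):
    """Build (but do not send) a POST request that creates a dataset.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Send an empty JSON object rather than no body at all.
    payload = json.dumps(body if body is not None else {}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/datasets?datasetId={dataset_id}",
        data=payload,
        headers={
            "accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_dataset_request("http://10.23.55.45:8080/api/v1", "test")
print(request.get_method(), request.full_url, request.data)
```

The curl equivalent is to append -d {} to the original command.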

Validation Data Node OpenAPI specification with IBM OpenAPI Validator

The Data Node API specification passes OpenAPI validation, but the python-flask server generated from it throws the following error:

connexion.exceptions.InvalidSpecification: Required list has not defined properties: ['noteId']

I could not find the issue from a quick look at the spec, so I googled and found this comment.

Turns out the swagger.yaml file had a lot of errors which did not show up in a lot of the validators.
https://github.com/IBM/openapi-validator shows them correctly to be fixed.

This is the second time I have come across IBM OpenAPI Validator, so I decided to give it a try in #71.
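
As the message suggests, this error appears to be raised when a schema's required list names a property that is missing from its properties. Below is a minimal Python sketch of that kind of check, run over a spec that has already been loaded into a dict. The function name find_undefined_required and the sample schema are assumptions for illustration; this is not connexion's own implementation, and it does not resolve $ref or allOf, so schemas composed that way may be reported incorrectly.

```python
def find_undefined_required(schema, path="schema"):
    """Return (path, missing names) for each `required` list that names
    a property not declared in the sibling `properties` mapping."""
    problems = []
    if isinstance(schema, dict):
        required = schema.get("required")
        # Only list-valued `required` is checked; parameter objects use a boolean.
        if isinstance(required, list):
            defined = schema.get("properties", {})
            missing = [name for name in required if name not in defined]
            if missing:
                problems.append((path, missing))
        # Walk nested schemas (properties, items, paths, ...).
        for key, value in schema.items():
            problems.extend(find_undefined_required(value, f"{path}.{key}"))
    elif isinstance(schema, list):
        for index, item in enumerate(schema):
            problems.extend(find_undefined_required(item, f"{path}[{index}]"))
    return problems


# Hypothetical schema: `noteId` is required but never defined.
annotation = {
    "type": "object",
    "required": ["noteId", "start"],
    "properties": {"start": {"type": "integer"}},
}
print(find_undefined_required(annotation))
```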

Getting clinical notes/patients returns objects that are inconsistent with the rest of the calls

curl -X GET "http://10.23.55.45:8080/api/v1/datasets/awesome-dataset/fhirStores/awesome-fhir-store/fhir/Note?limit=10" -H  "accept: application/json"
{
  "limit": 10,
  "links": {
    "next": ""
  },
  "notes": [
    {
      "id": "5fb5d438a7c859d8acf9d672",
      "noteType": "",
      "patientId": "5fb5d3eca7c859d8acf9d671",
....

All the other REST calls return a name property, as when getting annotations:

curl -X GET "http://10.23.55.45:8080/api/v1/datasets/awesome-dataset/annotationStore/awesome-annotation-store/annotations?limit=10" -H  "accept: application/json"

{
  "annotations": [
    {
      "annotationSource": {
        "name": "/datasets/awesome-dataset/fhirStores/awesome-fhir-store/fhir/Note/5fb5d438a7c859d8acf9d672"
      },
      "name": "datasets/awesome-dataset/annotationStores/awesome-annotation-store/annotations/5fb5d570a7c859d8acf9d676",
....

This becomes more complex when I'm writing code to get a clinical note in the client. We would have:

# Example of getting an annotation - we can directly use "name" from above
get_annotation(annotation_name)

# But when getting clinical note, it separates into "fhirstore_name" and "clinical note id"

There seem to be different standards within this API.
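
One way to make the two cases look the same to client code is to build and parse a full resource name for notes, following the pattern already visible in annotationSource.name above. This is a minimal sketch under that assumption; note_name and parse_note_name are hypothetical helpers, not part of the existing client.

```python
def note_name(dataset_id, fhir_store_id, note_id):
    """Build a note resource name following the pattern seen in annotationSource.name."""
    return f"datasets/{dataset_id}/fhirStores/{fhir_store_id}/fhir/Note/{note_id}"


def parse_note_name(name):
    """Split a note resource name back into its identifiers."""
    parts = name.strip("/").split("/")
    # None marks a position that holds an identifier rather than a fixed keyword.
    expected = ("datasets", None, "fhirStores", None, "fhir", "Note", None)
    if len(parts) != len(expected) or any(
        keyword is not None and keyword != part
        for keyword, part in zip(expected, parts)
    ):
        raise ValueError(f"Not a note resource name: {name!r}")
    return {
        "dataset_id": parts[1],
        "fhir_store_id": parts[3],
        "note_id": parts[6],
    }


name = note_name("awesome-dataset", "awesome-fhir-store", "5fb5d438a7c859d8acf9d672")
print(name)
print(parse_note_name("/" + name))
```

With helpers like these, a client call such as get_note(name) could mirror get_annotation(annotation_name).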

Migrate content from data2health/nlp-sandbox-data-node-schemas

The repo data2health/nlp-sandbox-data-node-schemas was initially created to host the API specification for the NLP Sandbox Data Nodes. We then created this repository in an effort to centralize all the API specifications, which makes it possible to define Components and Paths that can be reused in more than one API.

Objectives

  • Create a PR to integrate the contributions made to data2health/nlp-sandbox-data-node-schemas (George)
  • Delete the repo data2health/nlp-sandbox-data-node-schemas (Thomas)

Add example to Entity.id

An example value is needed so that an example clinical note can be generated automatically and given as input to a path in the interactive documentation.

Return paginated results when getting FHIR stores

I initially expected that there wouldn't be many FHIR stores in a dataset, and because the object is currently light (only one property), it was more user-friendly to return an array than paginated results. However, this is something of an exception in the Data Node API, so I prefer to move to a paginated response to make the API more coherent.
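
For illustration, here is a minimal Python sketch of a paginated response with the same shape as the other list responses in this API (a list plus limit, offset and links.next). The helper paginate, the base_path argument, the store names and the format of the next link are assumptions; the existing responses shown elsewhere only ever show an empty next link.

```python
def paginate(items, key, limit=10, offset=0, base_path=""):
    """Wrap one page of `items` in the response shape used by other list calls."""
    page = items[offset:offset + limit]
    has_more = offset + limit < len(items)
    # Assumed link format: same path with the offset advanced by one page.
    next_link = f"{base_path}?limit={limit}&offset={offset + limit}" if has_more else ""
    return {
        key: page,
        "limit": limit,
        "links": {"next": next_link},
        "offset": offset,
    }


# Hypothetical FHIR stores in a dataset.
stores = [{"name": f"datasets/awesome-dataset/fhirStores/store-{i}"} for i in range(3)]
first_page = paginate(stores, "fhirStores", limit=2, offset=0,
                      base_path="/datasets/awesome-dataset/fhirStores")
last_page = paginate(stores, "fhirStores", limit=2, offset=2,
                     base_path="/datasets/awesome-dataset/fhirStores")
print(first_page)
print(last_page)
```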

Creating duplicated dataset should return 409, not 500

curl -X POST "http://10.23.55.45:8080/api/v1/datasets?datasetId=test" -H  "accept: application/json" -H  "Content-Type: application/json" -d {}
{
  "detail": "Tried to save duplicate unique keys (E11000 duplicate key error collection: nlpsandbox.dataset index: name_1 dup key: { name: \"datasets/test\" }, full error: {'index': 0, 'code': 11000, 'keyPattern': {'name': 1}, 'keyValue': {'name': 'datasets/test'}, 'errmsg': 'E11000 duplicate key error collection: nlpsandbox.dataset index: name_1 dup key: { name: \"datasets/test\" }'})",
  "status": 500,
  "title": "Internal error"
}
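
A minimal Python sketch of the intended behavior: catch the duplicate-key failure separately and return 409 instead of letting it fall through to the generic 500 handler. DuplicateKeyError here is a stand-in class, since the exact exception raised by the database layer is not shown above; save_dataset, create_dataset, the in-memory store and the 201 success status are likewise assumptions for illustration.

```python
class DuplicateKeyError(Exception):
    """Stand-in for the database layer's duplicate-key exception (assumption)."""


def save_dataset(store, name):
    """Hypothetical persistence call backed by an in-memory set."""
    if name in store:
        raise DuplicateKeyError(name)
    store.add(name)


def create_dataset(dataset_id, store):
    """Return (body, status), mapping a duplicate name to 409 Conflict."""
    name = f"datasets/{dataset_id}"
    try:
        save_dataset(store, name)
    except DuplicateKeyError:
        detail = f"Dataset {name} already exists"
        return {"title": "Conflict", "status": 409, "detail": detail}, 409
    except Exception as error:
        # Anything else is still reported as an internal error.
        return {"title": "Internal error", "status": 500, "detail": str(error)}, 500
    return {"name": name}, 201


store = set()
first = create_dataset("test", store)
second = create_dataset("test", store)
print(first)
print(second)
```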

Add support for returning partial representation

Enable the client to specify which object fields are returned, chosen from a set of approved fields.

A possible solution is to add a query parameter named fields, for example:

/persons/x7y8z9?fields=firstName,lastName
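
A minimal Python sketch of how the fields parameter could be applied on the server side. The approved field set, the sample person and the choice to reject unapproved fields with an error (rather than silently ignoring them) are assumptions for illustration.

```python
# Hypothetical set of fields that clients are allowed to request.
APPROVED_FIELDS = frozenset({"id", "firstName", "lastName"})


def select_fields(resource, fields_param, approved=APPROVED_FIELDS):
    """Return only the requested fields of `resource`.

    `fields_param` is the raw query value, e.g. "firstName,lastName".
    An empty or missing value returns the full representation.
    """
    if not fields_param:
        return dict(resource)
    requested = [field.strip() for field in fields_param.split(",") if field.strip()]
    rejected = [field for field in requested if field not in approved]
    if rejected:
        raise ValueError(f"Fields not allowed: {rejected}")
    return {field: resource[field] for field in requested if field in resource}


person = {"id": "x7y8z9", "firstName": "Ada", "lastName": "Lovelace"}
print(select_fields(person, "firstName,lastName"))
```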

Write description for the Date annotation task

This description will be used on the NLP Sandbox website.

@yy6linda Our date annotation task is based on the content of the 2014 i2b2 dataset. This dataset includes dates (MM/DD/YYYY and similar) but also days of the week, months, etc. Is all of this date/time information considered HIPAA PHI? We should categorize the date/time information that is relevant for HIPAA based on the date format (depends on https://github.com/Sage-Bionetworks/nlp-sandbox-analysis/issues/5).

Fix remote: Permission to Sage-Bionetworks/nlp-sandbox-schemas.git denied to github-actions[bot].

When trying to fix #32, I may have introduced a bug that leads to the following error in a PR:

Run ad-m/github-push-action@master
Started: bash /home/runner/work/_actions/ad-m/github-push-action/master/start.sh
Push to branch gh-pages
remote: Permission to Sage-Bionetworks/nlp-sandbox-schemas.git denied to github-actions[bot].
fatal: unable to access 'https://github.com/Sage-Bionetworks/nlp-sandbox-schemas.git/': The requested URL returned error: 403
Error: Invalid status code: 128
    at ChildProcess.<anonymous> (/home/runner/work/_actions/ad-m/github-push-action/master/start.js:9:19)
    at ChildProcess.emit (events.js:210:5)
    at maybeClose (internal/child_process.js:1021:16)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:283:5) {
  code: 128
}
Error: Invalid status code: 128
    at ChildProcess.<anonymous> (/home/runner/work/_actions/ad-m/github-push-action/master/start.js:9:19)
    at ChildProcess.emit (events.js:210:5)
    at maybeClose (internal/child_process.js:1021:16)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:283:5)

Identify schema of the submission object

Define a JSON object that includes all the information required for a submission.

Submission:

  • docker_repository
  • docker_digest

Submitted by (not part of the submission object):

  • submitter_id
  • submitter token

@thomasyu888 What else?

Fix `major.minor` release to gh-pages

Add confidence property to text annotations

Usually we use a confidence value between 0 and 1 in DREAM Challenges. Looking around, I found that Amazon Rekognition uses a Confidence value between 0 and 100.

Confidence

The confidence that Amazon Rekognition has in the accuracy of the detected text and the accuracy of the geometry points around the detected text.

Type: Float

Valid Range: Minimum value of 0. Maximum value of 100.

Required: No

Source

From a UI design perspective, it is more user-friendly to show a score as a percentage like "63" (%) than as "0.63". The latter takes two additional characters that will almost always be the same (0.). For these reasons, let's define the score as taking values in [0, 100].
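
A minimal Python sketch of the proposed convention, assuming a tool produces a probability in [0, 1] internally and converts it before reporting. The function names to_confidence and format_confidence are illustrative, not part of the schemas.

```python
def to_confidence(probability):
    """Convert a probability in [0, 1] to a confidence in [0, 100]."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"probability must be in [0, 1], got {probability}")
    return probability * 100.0


def format_confidence(confidence):
    """Render a confidence in [0, 100] for display, e.g. '63%'."""
    if not 0.0 <= confidence <= 100.0:
        raise ValueError(f"confidence must be in [0, 100], got {confidence}")
    return f"{confidence:.0f}%"


print(format_confidence(to_confidence(0.63)))
```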

List annotation stores returns name but with a typo

curl -X GET "http://10.23.55.45:8080/api/v1/datasets/awesome-dataset/annotationStore?limit=10" -H  "accept: application/json"
{
  "annotationStores": [
    {
      "name": "datasets/awesome-dataset/annotationStores/awesome-annotation-store"
    }
  ],
  "limit": 10,
  "links": {
    "next": ""
  },
  "offset": 0
}

It should be annotationStore not annotationStores

Automatically build and push API documentation during CI/CD

We can use ga4gh/gh-openapi-docs to generate a single HTML page that contains an entire API that has been generated using Redocly/create-openapi-repo.

Goals

Replace clinical note schema by a standard schema

The current definition of a clinical note is available:
https://github.com/Sage-Bionetworks/nlp-sandbox-schemas/blob/develop/openapi/commons/components/schemas/Note.yaml

This format is largely inspired by the format of the 2014 i2b2 dataset. This was great for getting the 2014 i2b2 Data Node working more quickly. Ultimately we want to adopt a more standard format that hospitals and other data sites are more likely to use. Relying on this more standard definition of a clinical note would make the NLP Tool easier to use on clinical notes that follow this standard. A good place to look for a candidate schema is FHIR.

Task

  • Identify if FHIR has a schema for clinical notes
  • Identify the schemas used by Google de-id and Amazon Comprehend
  • Create a PR that modifies our definition of the note object to match a standard note schema

Identify date formats that are HIPAA PHI

Background

On dates, the HIPAA specification says:

(C) All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older

Source

The representation of date format that we use is defined here.

@tschaffter is extending the 2014 i2b2 dataset to include information about the date format of each Date annotation. The reason is that we want to encourage developers to predict the format of the date string they detect, which in turn makes it possible to convert a date string programmatically into a standard date object.

We are considering reporting the performance of the date annotators and other annotators on their ability to detect PHI (HIPAA or other standard). Currently our Date detection task is relatively generic and is meant to be reused for other, more complex NLP tasks that do not necessarily require knowing whether a date string is PHI (this is mainly relevant for de-identification).

Task

Find a set of regular expressions that we can apply to a date format to identify whether the date string is PHI.
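
As a starting point only, here is a minimal Python sketch of the idea. It assumes date formats are written with Y, M and D tokens (as in MM/DD/YYYY) and applies the rule quoted above that all elements of a date except the year count: a format is flagged when it contains a month or day token. The pattern and the function name format_is_phi are assumptions, and day-of-week or time tokens are deliberately not covered because their status is still an open question.

```python
import re

# Assumed tokens: M = month element, D = day element, Y = year element.
NON_YEAR_ELEMENT = re.compile(r"[MD]")


def format_is_phi(date_format):
    """Return True if the format contains a date element other than the year."""
    return NON_YEAR_ELEMENT.search(date_format) is not None


for date_format in ("MM/DD/YYYY", "MM/YYYY", "YYYY"):
    print(date_format, format_is_phi(date_format))
```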

Reorder Error properties from most to least likely to be specified

In Python, an Error object is currently instantiated with:

Error(None, "Internal error", 500, str(error))

The first property, type, is the least likely to be specified. Proposed property order:

  1. title
  2. status
  3. detail
  4. type

The above Error could then be instantiated with:

Error("Internal error", 500, str(error))
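
A minimal Python sketch of a model with the proposed order. This is a hand-written stand-in for the generated Error model, so the class body, the defaults and the to_dict helper are assumptions; only the property names and their order come from the proposal above.

```python
class Error:
    """Stand-in for the generated Error model with reordered properties."""

    def __init__(self, title, status, detail=None, type=None):
        self.title = title
        self.status = status
        self.detail = detail
        # `type` comes last because it is the least likely to be specified.
        self.type = type

    def to_dict(self):
        return {
            "title": self.title,
            "status": self.status,
            "detail": self.detail,
            "type": self.type,
        }


try:
    raise RuntimeError("database unavailable")
except RuntimeError as error:
    response = Error("Internal error", 500, str(error))

print(response.to_dict())
```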

Replace clinical note and patient objects with FHIR schemas

Motivations

Other services rely on FHIR, such as the Google de-id service and Amazon Comprehend Medical. Hospitals are also more likely to have notes formatted according to FHIR schemas.

The current schema of Patient and Note is derived from the format of the 2014 i2b2 dataset. As we discussed previously with @gkowalski, we wanted to start with a straightforward schema to get the server up and running so we could use it to test other parts of the infrastructure, then update the schema to adopt one of the existing standards like FHIR.
