nlpsandbox / nlpsandbox-schemas

OpenAPI specifications of the NLP Sandbox services

Home Page: https://nlpsandbox.io

License: Apache License 2.0

nlpsandbox-schemas's Introduction

nlpsandbox

Home repository

nlpsandbox-schemas's People

mcw-bmi boyleconnor tschaffter yy6linda

nlpsandbox-schemas's Issues

Creating dataset fails

curl -X POST "http://10.23.55.45:8080/api/v1/datasets?datasetId=test" -H  "accept: application/json" -H  "Content-Type: application/json"
{
  "detail": "None is not of type 'object'",
  "status": 400,
  "title": "Bad Request",
  "type": "about:blank"
}

Am I doing something wrong?
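
The 400 response suggests that the server expects a JSON object as the request body, and the call above sends none. A later report in this list sends "-d {}" to the same path and gets past this check. Below is a minimal Python sketch of the same request with an empty JSON object as body. The host and path are taken from the curl command; build_create_dataset_request is a hypothetical helper name, not part of any NLP Sandbox client.

```python
import json
import urllib.parse
import urllib.request


def build_create_dataset_request(base_url: str, dataset_id: str) -> urllib.request.Request:
    """Build a POST request that carries an empty JSON object as its body."""
    query = urllib.parse.urlencode({"datasetId": dataset_id})
    return urllib.request.Request(
        url=f"{base_url}/datasets?{query}",
        data=json.dumps({}).encode("utf-8"),  # "{}" instead of no body at all
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


request = build_create_dataset_request("http://10.23.55.45:8080/api/v1", "test")
# To actually send it: urllib.request.urlopen(request)
```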

Date annotator: HEAD or GET Request cannot have a body.

The Date Annotator schema still relies on GET to process clinical notes. @gkowalski changed this in the annotator implementation, but the change was never ported back to the schema specification.

@gkowalski's solution using POST:
https://github.com/Sage-Bionetworks/nlp-sandbox-date-annotator-example/blob/develop/server/openapi_server/openapi/openapi.yaml#L26-L51

Validation Data Node OpenAPI specification with IBM OpenAPI Validator

The OpenAPI validator accepts the Data Node API specification; however, the python-flask server generated from it throws the following error:

connexion.exceptions.InvalidSpecification: Required list has not defined properties: ['noteId']

I could not find the issue from a quick look at the spec, so I googled and found this comment.

Turns out the swagger.yaml file had a lot of errors which did not show up in a lot of the validators.
https://github.com/IBM/openapi-validator shows them correctly to be fixed.

This is the second time I have come across the IBM OpenAPI Validator, so I decided to give it a try in #71

PersonName annotator returns incorrect response key

Document 201 code in data node

Update Deidentifier API

Remove "PHI" from title
Use POST for deid path

Getting clinical notes/patients returns objects that are inconsistent with the rest of the calls

curl -X GET "http://10.23.55.45:8080/api/v1/datasets/awesome-dataset/fhirStores/awesome-fhir-store/fhir/Note?limit=10" -H  "accept: application/json"
{
  "limit": 10,
  "links": {
    "next": ""
  },
  "notes": [
    {
      "id": "5fb5d438a7c859d8acf9d672",
      "noteType": "",
      "patientId": "5fb5d3eca7c859d8acf9d671",
....

All the other REST calls return "name", as when getting annotations:

curl -X GET "http://10.23.55.45:8080/api/v1/datasets/awesome-dataset/annotationStore/awesome-annotation-store/annotations?limit=10" -H  "accept: application/json"

{
  "annotations": [
    {
      "annotationSource": {
        "name": "/datasets/awesome-dataset/fhirStores/awesome-fhir-store/fhir/Note/5fb5d438a7c859d8acf9d672"
      },
      "name": "datasets/awesome-dataset/annotationStores/awesome-annotation-store/annotations/5fb5d570a7c859d8acf9d676",
....

This becomes more complex when I'm writing code to get a clinical note in the client. We would have:

# Example of getting an annotation - we can directly use "name" from above
get_annotation(annotation_name)

# But when getting clinical note, it separates into "fhirstore_name" and "clinical note id"

There seem to be different standards within this API.
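
One way a client could hide this difference is to build, and split, a resource name for notes that follows the same pattern as the annotation names above. This is only a sketch: the path layout is copied from the "annotationSource" value in the response shown earlier (without its leading slash), and note_name and parse_note_name are hypothetical helpers, not functions of the NLP Sandbox client.

```python
from typing import Tuple


def note_name(dataset_id: str, fhir_store_id: str, note_id: str) -> str:
    """Build a resource name for a note, mirroring the annotation name pattern."""
    return f"datasets/{dataset_id}/fhirStores/{fhir_store_id}/fhir/Note/{note_id}"


def parse_note_name(name: str) -> Tuple[str, str, str]:
    """Split a note resource name into (dataset_id, fhir_store_id, note_id)."""
    parts = name.strip("/").split("/")
    # Expected layout: datasets/<d>/fhirStores/<s>/fhir/Note/<id>
    if (len(parts) != 7 or parts[0] != "datasets"
            or parts[2] != "fhirStores" or parts[4:6] != ["fhir", "Note"]):
        raise ValueError(f"not a note resource name: {name!r}")
    return parts[1], parts[3], parts[6]


name = note_name("awesome-dataset", "awesome-fhir-store", "5fb5d438a7c859d8acf9d672")
```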

date/location/person annotator doesn't return noteId, how should scoring be done?

Currently the evaluation code relies on "noteId" to be part of the response.

Example (both the expected gold standard and the prediction file look like this):

{
    "date_annotations": [
        {
            "noteId": 0,
            "start": 20,
            "length": 10,
            "text": "11/21/2019",
            "dateFormat": "MM/DD/YYYY"
        },
...
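
One possible workaround, assuming the caller knows which note it sent to the annotator, is to attach the note identifier to each returned annotation before scoring. The sketch below uses plain dictionaries with the field names from the example above; attach_note_id is a hypothetical helper and not part of the evaluation code.

```python
from typing import Any, Dict, List


def attach_note_id(note_id: Any, annotations: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the annotations with "noteId" set to the given note."""
    return [{**annotation, "noteId": note_id} for annotation in annotations]


# Annotations as an annotator might return them, without "noteId".
returned = [{"start": 20, "length": 10, "text": "11/21/2019", "dateFormat": "MM/DD/YYYY"}]
prediction = {"date_annotations": attach_note_id(0, returned)}
```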

Add lint job to CI GH Action

Use openapi-cli

Error schema should be JSON serialized according to Problem Details for HTTP APIs

nlpsandbox/data-node#68 (comment)
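
For reference, Problem Details for HTTP APIs (RFC 7807) defines the members type, title, status, detail and instance, and the media type application/problem+json. A minimal sketch of serializing an error that way follows; problem_details is a hypothetical helper, not code from the data node.

```python
import json
from typing import Optional, Tuple

PROBLEM_JSON = "application/problem+json"


def problem_details(status: int, title: str, detail: Optional[str] = None,
                    type_: str = "about:blank") -> Tuple[str, str]:
    """Serialize an error as an RFC 7807 problem object; return (body, content type)."""
    body = {"type": type_, "title": title, "status": status}
    if detail is not None:
        body["detail"] = detail  # optional, human-readable explanation
    return json.dumps(body), PROBLEM_JSON


body, content_type = problem_details(400, "Bad Request", "None is not of type 'object'")
```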

Migrate content from data2health/nlp-sandbox-data-node-schemas

The repo data2health/nlp-sandbox-data-node-schemas was initially created to host the API specification for the NLP Sandbox Data Nodes. We then created this repository to centralize all the API specifications, which makes it possible to define Components and Paths that can be reused in more than one API.

Objectives

Create a PR to integrate the contributions made to data2health/nlp-sandbox-data-node-schemas (George)
Delete the repo data2health/nlp-sandbox-data-node-schemas (Thomas)

Generate changelogs

Try https://github.com/github-changelog-generator/github-changelog-generator

Add example to Entity.id

This example is required to automatically generate an example clinical note that can be given as input to a path in the interactive documentation.

Release 0.1.4 still has wrong required Error properties

From the data node:

      required:
        - code
        - message

But Error.yaml on develop has:

required:
  - title
  - status

Return paginated results when getting FHIR stores

I initially expected that there wouldn't be many FHIR stores in a dataset and, because the object is currently light (only one property), it was more user-friendly to return an array than paginated results. However, this is something of an exception in the Data Node API, so I prefer to move to a paginated response to make the API more coherent.

Fix schedule CI

https://github.com/Sage-Bionetworks/nlp-sandbox-schemas/runs/1241559918?check_suite_focus=true

Update OpenAPI specs of the data node and date annotator before testing

Update the openapi.yaml file for new Notes results.

This includes the items attribute, result count and pagination

{
"count": 514,
"items": [
{
}

Creating duplicated dataset should return 409, not 500

curl -X POST "http://10.23.55.45:8080/api/v1/datasets?datasetId=test" -H  "accept: application/json" -H  "Content-Type: application/json" -d {}
{
  "detail": "Tried to save duplicate unique keys (E11000 duplicate key error collection: nlpsandbox.dataset index: name_1 dup key: { name: \"datasets/test\" }, full error: {'index': 0, 'code': 11000, 'keyPattern': {'name': 1}, 'keyValue': {'name': 'datasets/test'}, 'errmsg': 'E11000 duplicate key error collection: nlpsandbox.dataset index: name_1 dup key: { name: \"datasets/test\" }'})",
  "status": 500,
  "title": "Internal error"
}
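
A sketch of the requested behaviour, independent of the actual data node code: the duplicate is detected and reported as a 409 Conflict instead of letting the database error surface as a 500. The in-memory store, the exception class and the 201 success status are assumptions made for illustration.

```python
from typing import Any, Dict, Tuple


class DuplicateDatasetError(Exception):
    """Raised when a dataset with the same name already exists (hypothetical)."""


def create_dataset(store: Dict[str, Dict[str, Any]], dataset_id: str) -> Dict[str, Any]:
    name = f"datasets/{dataset_id}"
    if name in store:
        raise DuplicateDatasetError(name)
    store[name] = {"name": name}
    return store[name]


def handle_create_dataset(store: Dict[str, Dict[str, Any]],
                          dataset_id: str) -> Tuple[Dict[str, Any], int]:
    """Return (response body, HTTP status) for a create-dataset request."""
    try:
        return create_dataset(store, dataset_id), 201
    except DuplicateDatasetError as error:
        # Client error: the resource already exists, so answer 409 rather than 500.
        body = {"title": "Conflict", "status": 409,
                "detail": f"Dataset {error} already exists"}
        return body, 409


store: Dict[str, Dict[str, Any]] = {}
first = handle_create_dataset(store, "test")
second = handle_create_dataset(store, "test")
```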

Find Licenses enum or create one

This enum should contain the identifiers from this list:
https://spdx.org/licenses/

Add support for returning partial representation

Enable the client to specify which object fields are returned, chosen from a set of approved fields.

A possible solution is to add a query parameter named fields, for example:

/persons/x7y8z9?fields=firstName,lastName
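
A minimal sketch of how a server could apply such a fields parameter, assuming a comma-separated value and a per-resource set of approved fields. The function name, the rejection of unapproved fields and the sample person object are illustrative choices, not part of the specification.

```python
from typing import Any, Dict, Iterable, Optional


def select_fields(obj: Dict[str, Any], fields_param: Optional[str],
                  approved: Iterable[str]) -> Dict[str, Any]:
    """Return only the requested fields of obj; all fields if none are requested."""
    if not fields_param:
        return dict(obj)
    approved_set = set(approved)
    requested = [field.strip() for field in fields_param.split(",") if field.strip()]
    rejected = [field for field in requested if field not in approved_set]
    if rejected:
        # A server would typically turn this into a 400 response.
        raise ValueError(f"fields not approved: {rejected}")
    return {field: obj[field] for field in requested if field in obj}


person = {"id": "x7y8z9", "firstName": "Ada", "lastName": "Lovelace", "email": "ada@example.org"}
partial = select_fields(person, "firstName,lastName", approved={"id", "firstName", "lastName"})
```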

Identify schemas for evaluations

Work in progress: https://lucid.app/lucidchart/d53fc967-6d6b-4807-83b0-c1480a43012c

Update response of data node to support pagination

Change response for:

get all clinical notes
get all annotations (any types)
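
A minimal sketch of how a paginated list response could be assembled, assuming the limit, offset and links.next fields that appear in the list responses elsewhere on this page. The function name, the collection key parameter and the format of the next link are hypothetical.

```python
from typing import Any, Dict, List


def paginate(items: List[Any], key: str, base_path: str,
             limit: int = 10, offset: int = 0) -> Dict[str, Any]:
    """Return one page of items plus the paging metadata."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")
    page = items[offset:offset + limit]
    next_offset = offset + limit
    # An empty string means there is no further page (assumed convention).
    next_link = ""
    if next_offset < len(items):
        next_link = f"{base_path}?limit={limit}&offset={next_offset}"
    return {key: page, "limit": limit, "offset": offset, "links": {"next": next_link}}


notes = [{"id": str(i)} for i in range(25)]
first_page = paginate(notes, "notes", "/fhir/Note", limit=10, offset=0)
last_page = paginate(notes, "notes", "/fhir/Note", limit=10, offset=20)
```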

Write description for the Date annotation task

This description will be used on the NLP Sandbox website.

@yy6linda Our date annotation task is based on the content of the 2014 i2b2 dataset. This dataset includes dates (MM/DD/YYYY and similar) but also days of the week, months, etc. Is all of this date/time information considered HIPAA PHI? We should categorize the date/time information that is relevant for HIPAA based on the date format (depends on https://github.com/Sage-Bionetworks/nlp-sandbox-analysis/issues/5)

Fix remote: Permission to Sage-Bionetworks/nlp-sandbox-schemas.git denied to github-actions[bot].

When trying to fix #32, I may have introduced a bug that led to the following error being generated in a PR:

Run ad-m/github-push-action@master
Started: bash /home/runner/work/_actions/ad-m/github-push-action/master/start.sh
Push to branch gh-pages
remote: Permission to Sage-Bionetworks/nlp-sandbox-schemas.git denied to github-actions[bot].
fatal: unable to access 'https://github.com/Sage-Bionetworks/nlp-sandbox-schemas.git/': The requested URL returned error: 403
Error: Invalid status code: 128
    at ChildProcess.<anonymous> (/home/runner/work/_actions/ad-m/github-push-action/master/start.js:9:19)
    at ChildProcess.emit (events.js:210:5)
    at maybeClose (internal/child_process.js:1021:16)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:283:5) {
  code: 128
}
Error: Invalid status code: 128
    at ChildProcess.<anonymous> (/home/runner/work/_actions/ad-m/github-push-action/master/start.js:9:19)
    at ChildProcess.emit (events.js:210:5)
    at maybeClose (internal/child_process.js:1021:16)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:283:5)

Validate JSON schemas in ci.yml

Identify schema of the submission object

Define a JSON object that includes all the information required for a submission.

Submission:

docker_repository
docker_digest

Submitted by (not part of the submission object):

submitter_id
submitter token

@thomasyu888 What else?

Add object to get information about an implementation

Example of object properties:

Python setup.py (see this example from the NLP Sandbox Client)

Fix `major.minor` release to gh-pages

The first release is out (0.1.0).

Each API listed in the folder openapi comes with its set of documentation links. For example for the Data Node API:

https://sage-bionetworks.github.io/nlp-sandbox-schemas/data-node/latest/docs/ (last release created)
https://sage-bionetworks.github.io/nlp-sandbox-schemas/data-node/0.1.0/docs/
https://sage-bionetworks.github.io/nlp-sandbox-schemas/data-node/0.1/docs/
https://sage-bionetworks.github.io/nlp-sandbox-schemas/data-node/0/docs/
https://sage-bionetworks.github.io/nlp-sandbox-schemas/data-node/edge/docs/ (default branch)
https://sage-bionetworks.github.io/nlp-sandbox-schemas/data-node/develop/docs/ (develop branch)
https://sage-bionetworks.github.io/nlp-sandbox-schemas/data-node/nightly/docs/ (default branch built nightly)

All the above links work except these two:

https://sage-bionetworks.github.io/nlp-sandbox-schemas/data-node/0.1/docs/ (bug)
https://sage-bionetworks.github.io/nlp-sandbox-schemas/data-node/nightly/docs/ (because the nightly task has not been triggered yet)

Add service user

Add the service user nlp-sandbox-bot to this project.

id not allowed in requestbody for date/location/person annotator, but id is part of the returned clinical notes

Did you want me to remove "id" from each clinical note prior to calling the REST POST?

The REST call works if I remove the "id" from the request body.

Identify best solution to enable local customization of commons schemas

See #54 (comment)

Get Data Node info

Similarly to what we have recently added to get Tool info

Add Tool JSON schema

Try this: https://www.npmjs.com/package/@openapi-contrib/openapi-schema-to-json-schema

Publish JSON schemas to GitHub Pages

Add IBM OpenAPI validation to CI/CD workflow

Add confidence property to text annotations

Usually we use a confidence value between 0 and 1 in DREAM Challenges. Looking around, I found that Amazon Rekognition uses a Confidence value between 0 and 100.

Confidence

The confidence that Amazon Rekognition has in the accuracy of the detected text and the accuracy of the geometry points around the detected text.

Type: Float

Valid Range: Minimum value of 0. Maximum value of 100.

Required: No

Source

From a UI design perspective, it is more user-friendly to show a score in % like "63" (%) than "0.63". The latter approach takes two additional characters that will almost always be the same (0.). For these reasons, let's define the score as taking values in [0,100].
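
For tools that internally produce a probability in [0,1], the conversion to the [0,100] range proposed here is a one-liner. The sketch below adds range checks; both function names are hypothetical.

```python
def to_confidence(probability: float) -> float:
    """Convert a probability in [0, 1] to a confidence in [0, 100]."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"probability out of range: {probability}")
    return probability * 100.0


def is_valid_confidence(confidence: float) -> bool:
    """Check that a confidence value lies in the proposed [0, 100] range."""
    return 0.0 <= confidence <= 100.0


half = to_confidence(0.5)
```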

List annotation stores returns name but with a typo

curl -X GET "http://10.23.55.45:8080/api/v1/datasets/awesome-dataset/annotationStore?limit=10" -H  "accept: application/json"
{
  "annotationStores": [
    {
      "name": "datasets/awesome-dataset/annotationStores/awesome-annotation-store"
    }
  ],
  "limit": 10,
  "links": {
    "next": ""
  },
  "offset": 0
}

It should be annotationStore not annotationStores

Update the openapi.yaml to match the data node output

This includes the items attribute, result count and pagination

{
"count": 514,
"items": [
{
}

Create specification for PHI deidentifier

Version 1

Input: clinical note
Output: processed clinical note

Version 2

Accept a configuration as input to provide more control over how PHI is de-identified

Automatically build and push API documentation during CI/CD

We can use ga4gh/gh-openapi-docs to generate a single HTML page that contains an entire API that has been generated using Redocly/create-openapi-repo.

Goals

Try to apply gh-openapi-docs to generate one HTML page for each API listed in the folder openapi/
Figure out how to organize the HTML pages as GitHub Pages
Automate the building of the HTML pages and their publication as GitHub Pages using GitHub Actions
- master -> https://data2health.github.io/nlp-sandbox-schemas//docs
- develop -> https://data2health.github.io/nlp-sandbox-schemas/develop//docs

Add document on how to contribute to this repo

The following documents could be useful:

https://github.com/ga4gh/gh-openapi-docs/blob/master/docs/ga4gh-cloud-migration.md

Replace clinical note schema by a standard schema

The current definition of a clinical note is available:
https://github.com/Sage-Bionetworks/nlp-sandbox-schemas/blob/develop/openapi/commons/components/schemas/Note.yaml

This format is largely inspired by the format of the 2014 i2b2 dataset. This was great for getting the 2014 i2b2 Data Node working more quickly. Ultimately we want to adopt a more standard format that hospitals and other data sites are more likely to use. Relying on a more standard definition of a clinical note would make the NLP Tool easier to use on clinical notes that follow this standard. A good place to look for a candidate schema is FHIR.

Task

Identify if FHIR has a schema for clinical notes
Identify the schemas used by Google de-id and Amazon Comprehend
Create a PR that modifies our definition of the note object to match a standard note schema

Redirect documentation to our reference LICENSE

Identify date format that are HIPAA PHI

Background

On dates, the HIPAA specification says:

(C) All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older

Source

The representation of date format that we use is defined here.

@tschaffter is extending the 2014 i2b2 dataset to include information about the date format of each Date annotation. We want to encourage developers to predict the format of the date strings they detect, which in turn makes it possible to convert a date string programmatically into a standard date object.

We are considering reporting the performance of the date annotators and other annotators on their ability to detect PHI (HIPAA or another standard). Currently our Date detection task is relatively generic and is meant to be reused for other, more complex NLP tasks that do not necessarily need to know whether a date string is PHI (this is mainly relevant for de-identification).

Task

Find a set of regular expressions that we can apply to a date format to identify whether the date string is PHI or not.
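
As a starting point, and only as a sketch: the HIPAA text quoted above treats every element of a date except the year as PHI, so a format string can be flagged when it contains any month or day token. This assumes format strings built from the letters Y, M and D, as in the "MM/DD/YYYY" example. The actual date format vocabulary, and elements such as the day of the week, would need their own patterns. The function name is hypothetical.

```python
import re

# Any month (M) or day (D) token makes the date more specific than a year.
# Assumption: formats use the letters Y, M and D as in "MM/DD/YYYY".
NON_YEAR_ELEMENT = re.compile(r"[MD]")


def format_is_hipaa_phi(date_format: str) -> bool:
    """Return True if the date format contains elements other than the year."""
    return bool(NON_YEAR_ELEMENT.search(date_format))


full_date_is_phi = format_is_hipaa_phi("MM/DD/YYYY")
year_only_is_phi = format_is_hipaa_phi("YYYY")
```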

Validate remaining API with IBM OpenAPI Validator

Date annotator
Person name annotator
Physical address annotator
Deidentifier

Reorder Error properties from most to less likely to be specified

In Python, an Error object is currently instantiated with:

Error(None, "Internal error", 500, str(error))

The first property, type, is the least likely to be specified. Proposed property order:

title
status
detail
type

The above Error could then be instantiated with:

Error("Internal error", 500, str(error))
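
To illustrate why the order matters, here is a stand-in for the Error model written as a dataclass with the proposed order. The real class is generated from the OpenAPI specification, so this is only a sketch; making detail and type optional with a default of None is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Error:
    """Stand-in for the generated Error model, using the proposed property order."""
    title: str
    status: int
    detail: Optional[str] = None
    type: Optional[str] = None  # least likely to be set, so it comes last


try:
    raise RuntimeError("database unavailable")
except RuntimeError as error:
    # No leading None is needed any more.
    internal_error = Error("Internal error", 500, str(error))
```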

Replace clinical note and patient objects with FHIR schemas

Motivations

Other services, such as the Google de-id service and Amazon Comprehend Medical, rely on FHIR. Hospitals are also more likely to have notes formatted according to FHIR schemas.

The current schema of Patient and Note is derived from the format of the 2014 i2b2 dataset. As previously discussed with @gkowalski, we wanted to start with a straightforward schema to get the server up and running so we could use it to test other parts of the infrastructure, then update the schema to adopt one of the existing standards like FHIR.

Update status returned by POST requests

200: object created and returned
201: object created and link to resource returned (link must be set as Location in the headers)
204: resource created and no payload returned

Source: https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

What is the types of notes in the i2b2 dataset?

@gkowalski Pathology? Phone call?

Then set Note.type accordingly
