
data-node's Introduction

NLP Sandbox Data Node


Repository of FHIR and annotation resources used to benchmark NLP Sandbox tools

Overview

This repository provides a Python-Flask implementation of the NLP Sandbox Data Node. This Data Node relies on a MongoDB instance to store FHIR and annotation resources used to benchmark NLP Sandbox tools.

This Data Node can be used to:

  • Create and manage datasets
  • Create and manage FHIR stores
    • Store and retrieve FHIR patient profiles
    • Store and retrieve clinical notes
  • Create and manage annotation stores
    • Store and retrieve text annotations

The figure below illustrates the organization of the data. A Dataset can have one or more FhirStores and AnnotationStores. An AnnotationStore can include different types of annotations. In NLPSandbox.io, the gold standard of a dataset is stored in one AnnotationStore. We then use N AnnotationStores to store the predictions generated by N tools contributed to NLPSandbox.io.

Specification

Usage

Running with Docker

Create the configuration file.

cp .env.example .env

The command below starts the Data Node locally.

docker-compose up --build

You can stop the running containers with Ctrl+C, followed by docker-compose down.

Running with Python

We recommend using a Conda environment to install and run the Data Node.

conda create --name data-node python=3.9.4
conda activate data-node

Create the configuration file and export its parameters to environment variables.

cp .env.example .env
export $(grep -v '^#' .env | xargs -d '\n')

Start the MongoDB instance.

docker-compose up -d db

Install and start the Data Node.

cd server/
pip install -r requirements.txt
python -m openapi_server

Accessing the UI

The Data Node provides a web interface that you can use to create and manage resources. The address of this interface depends on whether you run the Data Node using Docker (production mode) or the Python development server.

Contributing

Thinking about contributing to this project? Get started by reading our Contributor Guide.

License

Apache License 2.0


data-node's Issues

Required fields break Flask app when the required property is inherited

I had this error before validating the API spec with IBM OpenAPI Validator. Now I'm back to this issue which has not been solved by one of the many improvements I made using IBM OpenAPI Validator.

$ python -m openapi_server
Traceback (most recent call last):
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/c/Users/thoma/Documents/dev/nlp-sandbox-data-node-i2b2-2014/server/openapi_server/__main__.py", line 18, in <module>
    main()
  File "/mnt/c/Users/thoma/Documents/dev/nlp-sandbox-data-node-i2b2-2014/server/openapi_server/__main__.py", line 11, in main
    app.add_api('openapi.yaml',
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/apps/flask_app.py", line 57, in add_api
    api = super(FlaskApp, self).add_api(specification, **kwargs)
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/apps/abstract.py", line 144, in add_api
    api = self.api_cls(specification,
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/apis/abstract.py", line 75, in __init__
    self.specification = Specification.load(specification, arguments=arguments)
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/spec.py", line 153, in load
    return cls.from_file(spec, arguments=arguments)
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/spec.py", line 107, in from_file
    return cls.from_dict(spec)
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/spec.py", line 145, in from_dict
    return OpenAPISpecification(spec)
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/spec.py", line 38, in __init__
    self._validate_spec(raw_spec)
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/spec.py", line 239, in _validate_spec
    raise InvalidSpecification.create_from(e)
connexion.exceptions.InvalidSpecification: Required list has not defined properties: ['noteId']

Implement results pagination

We want to be able to limit how many objects a client can request. If we don't control this, a request asking for too much data will slow down the response for other requests.

Proposed format for the response payload object:

  • count (integer): number of objects included in the page
  • items (array): the list (page) of objects
  • next (string): URL to request the next page of results
  • previous (string): URL to request the previous page of results
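
The proposed page object can be sketched with a small helper. This is a minimal illustration, not the Data Node's implementation: the function name `paginate`, the `max_limit` cap, and the choice to return `None` for `next`/`previous` at the ends of the collection are all assumptions.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode


def paginate(items: List[Any], base_url: str, limit: int = 10,
             offset: int = 0, max_limit: int = 100) -> Dict[str, Any]:
    """Build one page of `items` in the proposed response format.

    `max_limit` is a hypothetical server-side cap on the page size.
    """
    # Clamp the page size so a single request cannot ask for everything.
    limit = max(1, min(limit, max_limit))
    offset = max(0, offset)
    page = items[offset:offset + limit]

    def page_url(new_offset: int) -> str:
        query = urlencode({"limit": limit, "offset": new_offset})
        return f"{base_url}?{query}"

    # Assumption: None marks the absence of a next/previous page.
    has_next = offset + limit < len(items)
    next_url = page_url(offset + limit) if has_next else None
    previous_url = page_url(max(0, offset - limit)) if offset > 0 else None

    return {
        "count": len(page),
        "items": page,
        "next": next_url,
        "previous": previous_url,
    }
```

In a real deployment the slicing would presumably be done by the database query rather than on an in-memory list; the sketch only shows the shape of the payload.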

/notes/{id} returns {"items": ....}

Should this API call return just the item itself, without the {"items": ....} wrapper?

Example:

{"fileName": "111.txt", "id": 11, "text": "asdfasdf"}

Fix annotation start and length type

Example of a date annotation returned by the data node:

    {
      "createdAt": "",
      "createdBy": "",
      "format": "",
      "id": 258430,
      "length": "10",
      "noteId": 11583,
      "start": "16",
      "text": "2077-03-31",
      "updatedAt": "",
      "updatedBy": ""
    },
  • length should be a number
  • start should be a number
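
One possible way to enforce this before the object is serialized is sketched below. The helper name `normalize_annotation` is hypothetical, and the real fix may belong in the code that loads the data into the DB instead.

```python
def normalize_annotation(annotation: dict) -> dict:
    """Return a copy of `annotation` with `start` and `length` as integers."""
    fixed = dict(annotation)  # shallow copy, the input is left untouched
    for field in ("start", "length"):
        if field in fixed:
            # int() accepts both "16" and 16, and raises ValueError on junk.
            fixed[field] = int(fixed[field])
    return fixed
```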

Add configuration options to enable/disable the initialization of the DB

For example, add the environment variable INIT_DB=0 or 1

I remember how I implemented this for another project. I defined the env var APP_INIT_DB_SEED_NAME=default, which has the following behavior based on the value provided:

  • "default": default initialization of the DB where only essential objects are added to the DB
  • "<seed_name>": specific initialization of the DB
  • "": no initialization
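
The behavior listed above could be read from the environment roughly as follows. This is only a sketch: the function name `resolve_db_seed` is hypothetical, and treating an unset variable the same as "default" is an assumption that the issue does not settle.

```python
import os
from typing import Mapping, Optional


def resolve_db_seed(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the seed name to use, or None when the DB must not be initialized.

    Assumption: an unset APP_INIT_DB_SEED_NAME is treated like "default".
    """
    seed = env.get("APP_INIT_DB_SEED_NAME", "default").strip()
    # An empty value explicitly disables the initialization.
    return seed or None
```

The caller would then skip the initialization when the result is `None` and otherwise look up the seed by name.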

Create objects from Models instead of creating a Python dictionary

In particular, make sure that model inheritance works as expected, as well as the JSON serialization and deserialization of these objects. Here is the current inheritance chain:

  • BaseModel < Entity < Annotation < DateAnnotation
  • BaseModel < Entity < Note

The issue is that openapi-generator does not implement inheritance properly, so this should be manually checked and fixed. I am halfway through fixing the inheritance. Note that the class BaseModel does not exist in our schema; it is part of the codebase generated by openapi-generator.
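
The property to check can be illustrated with plain dataclasses: a DateAnnotation built from a dictionary should be an instance of every parent class and should serialize back to the same dictionary. The sketch below does not reproduce the generated models. Which field belongs to which class is a guess based on the date annotation example from another issue, and the attribute names reuse the JSON keys for brevity.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class Entity:
    id: int

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict):
        # Keep only the keys declared on this class or on its parents.
        names = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in names})


@dataclass
class Annotation(Entity):
    noteId: int
    start: int
    length: int
    text: str


@dataclass
class DateAnnotation(Annotation):
    format: str


payload = {"id": 258430, "noteId": 11583, "start": 16, "length": 10,
           "text": "2077-03-31", "format": ""}
annotation = DateAnnotation.from_dict(payload)
# Round trip through JSON to exercise serialization and deserialization.
restored = DateAnnotation.from_dict(json.loads(json.dumps(annotation.to_dict())))
```

A unit test along these lines, written against the generated classes, would catch a model that silently drops inherited properties.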

Run service in production mode

This message provides some tips on how to run the server in production mode.

$ python -m openapi_server
 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)

Schema of error object should be consistent

Our definition of an error object:
https://github.com/Sage-Bionetworks/nlp-sandbox-schemas/blob/develop/openapi/commons/components/schemas/Error.yaml

The connexion package, used by Python implementations such as the data node, automatically returns 500 errors of this type:

image

Task

Ideally we want to have full control over the schema of all the error objects. This means that we should find a way to change the connexion error format. Or is the format used by connexion part of the OpenAPI standard?

Alternatively we could update our API schema for 500 errors IF most/all languages/frameworks that can be used with openapi-generator stick to one error schema for 500 (and maybe other errors).

Update to edge specification (near 0.1.3)

Download edge spec

curl -O https://sage-bionetworks.github.io/nlp-sandbox-schemas/data-node/edge/openapi.yaml

Validate spec

$ npx @openapitools/openapi-generator-cli validate -i openapi.yaml
Download 4.3.1 ...
Downloaded 4.3.1
Validating spec (openapi.yaml)
[main] WARN  o.o.codegen.utils.ModelUtils - [deprecated] inheritance without use of 'discriminator.propertyName' is deprecated and will be removed in a future release. Generating model for composed schema name: null. Title: null
[main] WARN  o.o.codegen.utils.ModelUtils - [deprecated] inheritance without use of 'discriminator.propertyName' is deprecated and will be removed in a future release. Generating model for composed schema name: null. Title: null
[main] WARN  o.o.codegen.utils.ModelUtils - [deprecated] inheritance without use of 'discriminator.propertyName' is deprecated and will be removed in a future release. Generating model for composed schema name: null. Title: null
[main] WARN  o.o.codegen.utils.ModelUtils - [deprecated] inheritance without use of 'discriminator.propertyName' is deprecated and will be removed in a future release. Generating model for composed schema name: null. Title: null
[main] WARN  o.o.codegen.utils.ModelUtils - [deprecated] inheritance without use of 'discriminator.propertyName' is deprecated and will be removed in a future release. Generating model for composed schema name: null. Title: null
[main] WARN  o.o.codegen.utils.ModelUtils - [deprecated] inheritance without use of 'discriminator.propertyName' is deprecated and will be removed in a future release. Generating model for composed schema name: null. Title: null
No validation issues detected.

Fix error UNKNOWN_BASE_TYPE

This error occurs as I'm trying to generate a python-flask server stub:

$ python -m openapi_server
Failed to add operation for GET /api/v1/datasets/{datasetId}/annotationStore/{storeId}/annotations
Traceback (most recent call last):
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/apis/abstract.py", line 209, in add_paths
    self.add_operation(path, method)
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/apis/abstract.py", line 162, in add_operation
    operation = make_operation(
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/operations/__init__.py", line 8, in make_operation
    return spec.operation_cls.from_spec(spec, *args, **kwargs)
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/operations/openapi.py", line 128, in from_spec
    return cls(
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/operations/openapi.py", line 75, in __init__
    super(OpenAPIOperation, self).__init__(
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/operations/abstract.py", line 96, in __init__
    self._resolution = resolver.resolve(self)
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/resolver.py", line 40, in resolve
    return Resolution(self.resolve_function_from_operation_id(operation_id), operation_id)
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/resolver.py", line 64, in resolve_function_from_operation_id
    raise ResolverError(msg, sys.exc_info())
connexion.exceptions.ResolverError: <ResolverError: Cannot resolve operationId "openapi_server.controllers.annotation_controller.list_annotations"! Import error was "No module named 'openapi_server.models.unknownbasetype'">

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/c/Users/thoma/Documents/dev/nlp-sandbox-data-node-i2b2-2014/server/openapi_server/__main__.py", line 18, in <module>
    main()
  File "/mnt/c/Users/thoma/Documents/dev/nlp-sandbox-data-node-i2b2-2014/server/openapi_server/__main__.py", line 11, in main
    app.add_api('openapi.yaml',
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/apps/flask_app.py", line 57, in add_api
    api = super(FlaskApp, self).add_api(specification, **kwargs)
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/apps/abstract.py", line 144, in add_api
    api = self.api_cls(specification,
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/apis/abstract.py", line 111, in __init__
    self.add_paths()
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/apis/abstract.py", line 216, in add_paths
    self._handle_add_operation_error(path, method, err.exc_info)
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/apis/abstract.py", line 231, in _handle_add_operation_error
    raise value.with_traceback(traceback)
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/resolver.py", line 61, in resolve_function_from_operation_id
    return self.function_resolver(operation_id)
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/utils.py", line 123, in get_function_from_name
    raise last_import_error
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/site-packages/connexion/utils.py", line 111, in get_function_from_name
    module = importlib.import_module(module_name)
  File "/home/tschaffter/.conda/envs/data-node-redesign/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/mnt/c/Users/thoma/Documents/dev/nlp-sandbox-data-node-i2b2-2014/server/openapi_server/controllers/annotation_controller.py", line 7, in <module>
    from openapi_server.models.unknownbasetype import UNKNOWN_BASE_TYPE  # noqa: E501
ModuleNotFoundError: No module named 'openapi_server.models.unknownbasetype'

Why is offset not populated in interactive doc?

image

Here are the definitions of offset and limit, which are almost identical:

  /notes:
    get:
      description: Returns the clinical notes
      operationId: notes_read_all
      parameters:
      - description: Maximum number of results returned
        explode: true
        in: query
        name: limit
        required: false
        schema:
          default: 10
          minimum: 10
          type: integer
        style: form
      - description: Index of the first result that must be returned
        explode: true
        in: query
        name: offset
        required: false
        schema:
          default: 0
          minimum: 0
          type: integer
        style: form

It does not seem to be a caching issue...

Fix ci.yml

Remove references to the now-removed branch master

Run the docker image as non-root users

By default, Docker containers run processes as the root user. For security reasons, a container should run its processes as a non-root user.

Docker images:

  • DB
  • REST API

Fix log output

It looks like some logs are not printed to stdout but are written to the log file (which is preserved across multiple runs of the same container):

$ docker exec data-node-api cat /var/log/app/current
2020-09-27 03:07:27.257467000  Starting data node server
2020-09-27 03:07:27.849866800   * Serving Flask app "__main__" (lazy loading)
2020-09-27 03:07:27.849869000   * Environment: production
2020-09-27 03:07:27.849870000     WARNING: This is a development server. Do not use it in a production deployment.
2020-09-27 03:07:27.849891100     Use a production WSGI server instead.
2020-09-27 03:07:27.849891900   * Debug mode: off
2020-09-27 03:09:46.104603600  Starting data node server
2020-09-27 03:09:46.687441400   * Serving Flask app "__main__" (lazy loading)
2020-09-27 03:09:46.687443600   * Environment: production
2020-09-27 03:09:46.687444600     WARNING: This is a development server. Do not use it in a production deployment.
2020-09-27 03:09:46.687467600     Use a production WSGI server instead.
2020-09-27 03:09:46.687468300   * Debug mode: off

Other logs are printed to stdout but are not printed to the log file:

data-node-api | [s6-init] making user provided files available at /var/run/s6/etc...exited 0.
data-node-api | [s6-init] ensuring user provided files have correct perms...exited 0.
data-node-api | [fix-attrs.d] applying ownership & permissions fixes...
data-node-api | [fix-attrs.d] done.
data-node-api | [cont-init.d] executing container initialization scripts...
data-node-api | [cont-init.d] 20-create-app-logfolder: executing...
data-node-api | [cont-init.d] 20-create-app-logfolder: exited 0.
data-node-api | [cont-init.d] 30-get-data: executing...
data-node-api | [cont-init.d] 30-get-data: exited 0.
data-node-api | [cont-init.d] 40-populate-db: executing...
data-node-api | [cont-init.d] 40-populate-db: exited 0.
data-node-api | [cont-init.d] done.
data-node-api | [services.d] starting services
data-node-api | [services.d] done.
data-node-api |  * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)
data-node-api | 172.22.0.1 - - [27/Sep/2020 03:10:44] "GET /api/v1/ui/ HTTP/1.1" 200 -
data-node-api | 172.22.0.1 - - [27/Sep/2020 03:10:44] "GET /api/v1/ui/swagger-ui-standalone-preset.js HTTP/1.1" 304 -
data-node-api | 172.22.0.1 - - [27/Sep/2020 03:10:44] "GET /api/v1/ui/swagger-ui-bundle.js HTTP/1.1" 304 -
data-node-api | 172.22.0.1 - - [27/Sep/2020 03:10:44] "GET /api/v1/openapi.json HTTP/1.1" 200 -

Remove home page or upgrade the content

The current home page is a non-standard component of a web service. Either remove it or replace it with a page that provides useful information (and looks better), for example help information.

Reset the DB table indexes as long as we reset the DB on restart

DB indexes keep increasing after each restart of the server, which makes the DB results not reproducible after a restart of the server.

In the future we may stop resetting the DB at each restart, typically if we start accepting requests that change the DB. Meanwhile, we should implement the reset properly.

The request body in API doc to create Patient must not include the property id

The property id is marked as readOnly in the spec. In the API doc of the data node server, the request body automatically includes the property id. This should not be the case. When trying to create a patient object that has an id, the API doc correctly says that id is readOnly and the request fails.

Try to find if there is a way to remove the property id from the request body. This is probably a bug that Flask should solve (try to find existing bug reports).

Read server port value from configuration

Currently we do

app.run(port=8080)

Instead we should read the port from an environment variable and use a default value (e.g. 8080) if the env var is not set. Consider creating a config object to store the configuration.
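
A minimal sketch of such a config object is given below. The variable name SERVER_PORT and the class name `Config` are assumptions; the issue does not name either.

```python
import os
from dataclasses import dataclass
from typing import Mapping

DEFAULT_PORT = 8080


@dataclass(frozen=True)
class Config:
    """Server configuration read from environment variables."""
    server_port: int = DEFAULT_PORT

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Config":
        # SERVER_PORT is a hypothetical variable name.
        raw = env.get("SERVER_PORT", "")
        return cls(server_port=int(raw) if raw else DEFAULT_PORT)
```

The server would then be started with something like app.run(port=Config.from_env().server_port) instead of the hard-coded value.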

Refactor codebase

Following our discussion:

  • rename folder flask-app to server
  • move the init code to the server implementation (we no longer need the init cli)
  • move the server Dockerfile to the folder server
