Comments (6)

tschaffter commented on August 27, 2024

We store Annotation objects in the AnnotationStores. Here is an example of one Annotation object:

{
  "annotationSource": {
    "name": "/datasets/awesome-dataset/fhirStores/awesome-fhir-store/fhir/Note/5fb5d438a7c859d8acf9d672"
  },
  "textDateAnnotations": [
    {
      "dateFormat": "MM/DD/YYYY",
      "length": 10,
      "start": 42,
      "text": "10/26/2020"
    },
    {
      "dateFormat": "MM/DD/YYYY",
      "length": 10,
      "start": 42,
      "text": "10/26/2020"
    }
  ],
  "textPersonNameAnnotations": [
    {
      "length": 11,
      "start": 42,
      "text": "Chloe Price"
    },
    {
      "length": 11,
      "start": 42,
      "text": "Chloe Price"
    }
  ],
  "textPhysicalAddressAnnotations": [
    {
      "addressType": "city",
      "length": 7,
      "start": 42,
      "text": "Seattle"
    },
    {
      "addressType": "city",
      "length": 7,
      "start": 42,
      "text": "Seattle"
    }
  ]
}

In this representation, the reference to the "object" annotated is stored in annotationSource. For now we annotate only clinical notes (Note schema), but in the future we may annotate different objects. The format of annotationSource is currently being finalized and is related to best practices in handling IDs and linking.

The above design follows the Google Healthcare API. One reason I adopted it is that I already wanted to remove noteId from the specific annotation objects, both to simplify the task for NLP developers and to make the system more robust. Because an annotation request takes as input a single note (and not a collection of notes), it does not make sense for the developer to have to deal with a note ID. Instead, it is the responsibility of the user/client to link the annotation predictions received to the note given as input to the annotator. Thus, the developer of an NLP tool cannot be blamed for incorrectly linking a note and the annotations extracted from it. Even if we were to ask the NLP developer to do the linking, we would have to write code that checks that the linking is correct.
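To illustrate the single-note request, here is a minimal sketch of a client calling a Date Annotator; the base URL, endpoint path, and payload shape below are hypothetical stand-ins, not the finalized API:

import requests

# Hypothetical annotator endpoint; the actual paths and payload shape are
# defined by the nlpsandbox-schemas, not by this sketch.
ANNOTATOR_URL = "http://example.com/api/v1/textDateAnnotations"

note = {"text": "Ms Chloe Price met with Dr Prescott on 12/26/2020."}
response = requests.post(ANNOTATOR_URL, json={"note": note})
response.raise_for_status()
annotator_output = response.json()  # e.g. {"textDateAnnotations": [...]}
# Note: the response carries no noteId; linking it back to `note`
# is the client's responsibility.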

In the current schemas, a Date Annotator returns an array of TextDateAnnotation objects. For example:

{
  "textDateAnnotations": [
    {
      "format": "MM/DD/YYYY",
      "length": 10,
      "start": 3,
      "text": "12/26/2020"
    },
    {
      "format": "YYYY",
      "length": 4,
      "start": 9,
      "text": "2020"
    }
  ]
}

After receiving this output, the client should create an Annotation object that effectively links the source/reference of the note and the annotator output.

{
  "annotationSource": {
    "name": "/datasets/awesome-dataset/fhirStores/awesome-fhir-store/fhir/Note/5fb5d438a7c859d8acf9d672"
  },
  "textDateAnnotations": [
    {
      "format": "MM/DD/YYYY",
      "length": 10,
      "start": 3,
      "text": "12/26/2020"
    },
    {
      "format": "YYYY",
      "length": 4,
      "start": 9,
      "text": "2020"
    }
  ]
}
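As a sketch of this client-side linking step (the helper below is hypothetical, assuming Annotation objects are plain dicts parsed from JSON):

def build_annotation(note_name: str, annotator_output: dict) -> dict:
    # Link an annotator's output back to the note it was computed from.
    # note_name is the resource name of the annotated note, e.g.
    # "/datasets/awesome-dataset/fhirStores/awesome-fhir-store/fhir/Note/5fb5d438a7c859d8acf9d672"
    annotation = {"annotationSource": {"name": note_name}}
    annotation.update(annotator_output)  # copies textDateAnnotations, etc.
    return annotation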

This Annotation object can then be stored in an AnnotationStore.

The evaluation script should take as input two arrays of Annotation objects: one array containing the gold standard annotations and the other the predictions. From my perspective, we can move on with updating the evaluation script to support this format if we all agree with it and once the format of the property annotationSource is finalized.

@thomasyu888 @yy6linda What are your thoughts?

yy6linda commented on August 27, 2024

Hi @tschaffter,

I think noteId is important. I am in favor of keeping it. Treating each object as a note and assigning a unique index helps us keep track of the notes. Besides, each "start" index in an annotation is associated with a specific noteId and only makes sense when we know which note the annotation refers to.

tschaffter commented on August 27, 2024

@thomasyu888 @yy6linda The purpose of the Annotation object is to group specific annotations like Annotation.textDateAnnotations and Annotation.textPersonNameAnnotations. The first reason is that it's convenient to have "all" the specific annotations for a resource grouped in a single object (the "all" part is not fully true, see below). The second reason is that this leads to fewer objects stored in the DB collection "annotations" than if we were to store specific annotation objects individually. This in turn leads to faster queries, as the DB has to search through a smaller collection and can therefore return the result faster.

If we make it a contract that the specific annotations listed in an Annotation object are all linked to the same source, we can specify this source once in the property Annotation.annotationSource instead of repeating it in each specific annotation object, as in:

    {
      "format": "MM/DD/YYYY",
      "length": 10,
      "start": 3,
      "text": "12/26/2020",
      "annotationSource": ".../my-awesome-note"
    },
    {
      "format": "YYYY",
      "length": 4,
      "start": 9,
      "text": "2020",
      "annotationSource": ".../my-awesome-note"
    }

which would break the DRY (Don't Repeat Yourself) principle and make the object bigger, which in turn would slow down the response time.

For the above reasons and a few others, let's go ahead with the design proposed initially. The format of the property Annotation.annotationSource is being reviewed in #99, along with its implementation (#101).

Back to the scoring approach referenced in the title of this ticket:

The evaluation module of the client should expect to receive two arrays of Annotation objects. Such objects could come from a Data Node or from JSON files specified by the user. It is then up to the evaluation script to transform the Annotation objects if needed.

The first operation is probably to collapse each array of Annotation objects into an internal representation that is easier for the evaluation script to process.

Using the Annotation object described above as input also provides information on the type of each list of specific annotations. For example, the property Annotation.textDateAnnotations is an array of TextDateAnnotation objects. The evaluation script can then check the Annotation object to see whether Annotation.textDateAnnotations is not None and has at least one item. If so, the evaluation script should compute and return the performance for the TextDateAnnotation objects.

The same check should be performed for the other types of specific annotations like Annotation.textPersonNameAnnotations and Annotation.textPhysicalAddressAnnotations.

=> If the gold standard includes at least one Annotation object that includes at least one TextDateAnnotation object, then the script should return performance for date annotation. If in addition the gold standard includes at least one Annotation object that includes at least one TextPersonNameAnnotation object, then the evaluation script should also return the performance for person name annotation.
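A minimal sketch of this collapse-and-dispatch logic, assuming Annotation objects are plain dicts parsed from JSON; the function names and the exact-span metric below are hypothetical placeholders, not the actual scoring of the evaluation script:

from typing import Dict, List

ANNOTATION_TYPES = [
    "textDateAnnotations",
    "textPersonNameAnnotations",
    "textPhysicalAddressAnnotations",
]

def score_spans(gold: List[dict], pred: List[dict]) -> dict:
    # Placeholder metric: exact match on (source, start, length). Including the
    # source keeps offsets from different notes apart, since a "start" index
    # only makes sense relative to the note it refers to.
    gold_spans = {(a["source"], a["start"], a["length"]) for a in gold}
    pred_spans = {(a["source"], a["start"], a["length"]) for a in pred}
    tp = len(gold_spans & pred_spans)
    return {
        "precision": tp / len(pred_spans) if pred_spans else 0.0,
        "recall": tp / len(gold_spans) if gold_spans else 0.0,
    }

def collapse(annotations: List[dict], annotation_type: str) -> List[dict]:
    # Flatten the per-Annotation lists into one list, tagging each specific
    # annotation with the source it came from (Annotation.annotationSource).
    return [
        dict(item, source=a["annotationSource"]["name"])
        for a in annotations
        for item in a.get(annotation_type) or []
    ]

def evaluate(gold_annotations: List[dict], pred_annotations: List[dict]) -> Dict[str, dict]:
    # Report performance only for the annotation types that appear at least
    # once in the gold standard, per the rule stated above.
    results = {}
    for annotation_type in ANNOTATION_TYPES:
        gold = collapse(gold_annotations, annotation_type)
        if not gold:
            continue  # type absent from the gold standard: nothing to report
        pred = collapse(pred_annotations, annotation_type)
        results[annotation_type] = score_spans(gold, pred)
    return results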

@thomasyu888 You are more familiar than @yy6linda with the data node design. Could you please write the methods required to transform the input described above into the format required by the performance evaluation script?

tschaffter commented on August 27, 2024

@thomasyu888 The property Annotation.annotationSource has been updated in the schemas and implemented in the data node.

tschaffter commented on August 27, 2024

@thomasyu888 I closed this ticket because the format of the Annotation object should now be final. Further discussion on how to implement the evaluation of annotations should take place in the client repository.

github-actions commented on August 27, 2024

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
