metamaplite's Introduction

MetaMapLite: A lighter named-entity recognizer

The primary goal of MetaMapLite is to provide a near real-time named-entity recognizer which is not as rigorous as MetaMap but much faster while allowing users to customize and augment its behavior for specific purposes.

It uses some of the tables used by MetaMap, but all lexical variants used in the tables are pre-processed. Named entities are found using longest match. Restriction by UMLS source and semantic type is optional. Part-of-speech tagging, which improves precision by a small amount (at the cost of speed), is also optional. Negation detection is available using either Wendy Chapman's ConText or a native negation detection algorithm based on Wendy Chapman's NegEx, which is somewhat less effective but faster.

It has:

  • longest match based entity detection
  • Negation detection (either ConText or a negation function based on Wendy Chapman's NegEx)
  • Restriction by UMLS source and semantic type
  • Part of Speech tagging (optional)
  • Abbreviation detection using Lynette Hirschman's algorithm.
  • Scoring approximating the original MetaMap's scoring
  • MMI Ranking similar to the original MetaMap

What is missing:

  • No detection of disjoint entities
  • No derivational variants
  • No word sense disambiguation (to be added later)
  • No overmatching
  • No term processing
  • No dynamic variant generation

Prerequisites

For running

  • Java 1.8 JRE

For Development

  • Java 1.8 JDK
  • Maven primarily. Ant and Gradle build scripts are provided but are not supported.

Command Line Usage

Example of invocation on Linux or MINGW using script:

./metamaplite.sh [options] [<input file>|--]

Example of invocation on Windows using batch file:

metamaplite.bat [options] [<input file>|--]

Example of invocation using Java VM directly when running from the public_mm_lite directory:

$ java -cp target/metamaplite-3.6.2rc5-standalone.jar \
      gov.nih.nlm.nls.ner.MetaMapLite \
      --indexdir=data/ivf/strict \
      --modelsdir=data/models \
      --specialtermsfile=data/specialterms.txt  [options] [<input file>|--]

Reading from standard input

echo "asymptomatic patient populations" | ./metamaplite.sh --pipe

or

cat file | ./metamaplite.sh --pipe

Output will be sent to standard output.

Restricting to a set of semantic types

The list of concepts returned can be restricted to only those that refer to a subset of the UMLS semantic types, given as a comma-separated list of semantic type abbreviations:

./metamaplite.sh --restrict_to_sts=abbrev,abbrev ...

The following option restricts to concepts that have the semantic types disease or syndrome (dsyn) and hazardous or poisonous substance (hops):

./metamaplite.sh --restrict_to_sts=dsyn,hops

A full list of semantic types and their abbreviations is at:

https://metamap.nlm.nih.gov/Docs/SemanticTypes_2018AB.txt

Restricting to a set of source vocabularies

The following option specifies that only concepts that appear in the MeSH (MSH) and NCBI Taxonomy (NCBI) vocabularies will be returned:

./metamaplite.sh --restrict_to_sources=MSH,NCBI ...

A full list of the current source vocabularies and their abbreviations can be found at:

https://www.nlm.nih.gov/research/umls/sourcereleasedocs/index.html

Annotating a brat directory

To create annotation files (.ann) in a directory from the associated text files (.txt):

./metamaplite.sh --brat directory/*.txt

User defined acronyms

The option --uda allows a user to supply a list of user defined acronyms or abbreviations with their associated long forms. When MetaMapLite encounters a user defined acronym, it will attach the information associated with the acronym's long form.

./metamaplite.sh --uda=acronymfile ...

The acronym file has the form "acronym|long form", one entry per line, for example:

 LAD|Left anterior descending coronary artery
 SVG|Saphenous Vein Graft
 PLB|Posterior lateral branch
 PDA|Patent Ductus Arteriosus
 IM|Intramuscular
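
The file layout above can be checked before handing it to --uda. The following is a minimal sketch, not MetaMapLite's own loader: the class name UdaFileSketch and its parse method are hypothetical, and skipping blank lines and trimming surrounding whitespace are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative reader for the "acronym|long form" layout; not MetaMapLite's loader. */
final class UdaFileSketch {

  /** Parses lines of the form "acronym|long form" into an ordered map. */
  static Map<String, String> parse(List<String> lines) {
    Map<String, String> acronyms = new LinkedHashMap<>();
    for (String line : lines) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        continue; // assumption: blank lines are ignored
      }
      int bar = trimmed.indexOf('|');
      if (bar <= 0 || bar == trimmed.length() - 1) {
        // either side of the first '|' is empty, or there is no '|' at all
        throw new IllegalArgumentException("expected acronym|long form: " + line);
      }
      acronyms.put(trimmed.substring(0, bar).trim(), trimmed.substring(bar + 1).trim());
    }
    return acronyms;
  }

  public static void main(String[] args) {
    List<String> sample = Arrays.asList(
        " LAD|Left anterior descending coronary artery",
        " SVG|Saphenous Vein Graft");
    parse(sample).forEach((shortForm, longForm) ->
        System.out.println(shortForm + " -> " + longForm));
  }
}
```

To check a real file, the lines could be read with java.nio.file.Files.readAllLines and passed to parse.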

User defined concepts

The option --cuitermlistfile allows a user to add, at invocation, a list of concepts not present in MetaMapLite's dataset:

./metamaplite.sh --cuitermlistfile=conceptfile ...

The concepts file has the form "cui|term", one entry per line, for example:

 C5203670|COVID-19
 C5203671|Suspected COVID-19
 C5203672|SARS-CoV-2 vaccination
 C5203673|Antigen of SARS-CoV-2
 C5203674|Antibody to SARS-CoV-2
 C5203675|Exposure to SARS-CoV-2
 C5203676|severe acute respiratory syndrome coronavirus 2

Current options

input options:

--                             Read from standard input
--pipe                         Read from standard input

Configuration Options:

--configfile=<filename>        Use configuration file
--set_property=name=value      set property "name" to value

--filelistfn=<filename>        file containing a list of files to be processed, one line per file
--filelist=<file0,file1,...>   list of files to be processed, separated by commas

--uda=<filename>               user defined acronyms file.
--cuitermlistfile=<filename>   user defined concepts file.

Options that can be used to override configuration file or when configuration file is not present:

--indexdir=<directory>         location of program's index directory
--modelsdir=<directory>        location of models for sentence breaker and part-of-speech tagger
--specialtermsfile=<filename>  location of file of terms to be excluded

document processing options:

--freetext                  Text with no markup. (default)
--inputformat=<loadername>	Use input format specified by loader name.
--inputformat=pubmed        PubMed XML format
--inputformat=medline       Medline format
--inputformat=ncbicorpus    NCBI Disease Corpus: tab separated fields: id \t title \t abstract
--inputformat=chemdner      CHEMDNER document: tab separated fields: id \t title \t abstract
--inputformat=chemdnersldi  CHEMDNER document: id with pipe followed by tab separated fields: id | title \t abstract

output options:

--mmilike|mmi               similar to MetaMap Fielded MMI output (default)
--bioc|cdi|bc|bc-evaluate   output compatible with evaluation program bc-evaluate
--brat                      BRAT annotation format
--outputformat=<format>       
--outputformat=json         JSON output format

processing options:

--restrict_to_sts=<semtype>[,<semtype>,<semtype>...]

      Restrict output to concepts that have at least one member
      in the list of user-specified semantic-types. The list of
      supported semantic type short forms used by this program is
      available at https://metamap.nlm.nih.gov/Docs/SemanticTypes_2018AB.txt

--restrict_to_sources=<source>[,<source>...]
      Restrict output to concepts that belong to the list of specified vocabularies.
      A full list of the current source vocabularies and their abbreviations
      can be found at https://www.nlm.nih.gov/research/umls/sourcereleasedocs/index.html

--segmentation_method=SENTENCES|BLANKLINES|LINES
                       Set method for text segmentation
--segment_sentences    Segment text by sentence
--segment_blanklines   Segment text by blankline
--segment_lines        Segment text by line
--usecontext           Use ConText Negation Detector instead of NLM's implementation of NegEx
--negationDetectorClass=className
                       Use a user-defined class for negation detection; the class must implement the
                       gov.nih.nlm.nls.metamap.lite.NegationDetector interface.
--postaglist=tag,tag,...
                       List of part-of-speech tags to use for term lookup (each Penn Treebank
                       part-of-speech tag is separated by commas.)

alternate output options:

--list_sentences          list sentences in input
--list_acronyms           list acronyms in input if present.
--list_sentences_postags  list sentences in input with part-of-speech tags

Properties

Command line and System properties for metamaplite

These properties can be set using a System property (-D{propertyname}={value}).

| metamaplite.property.file             | load configuration from file (default: ./config/metamaplite.properties)

These properties can be set using a System property (-D{propertyname}={value}) or in configuration file.

| metamaplite.document.inputtype        | document input type (default: freetext)
| metamaplite.outputextension           | result output file extension (default: .mmi)
| metamaplite.outputformat              | result output format (default: mmi)

Processing properties

| metamaplite.segmentation.method       | Set method for text segmentation (values: SENTENCES, BLANKLINES, LINES; default: SENTENCES)
| metamaplite.sourceset                 | use only concepts from listed sources (default: all)
| metamaplite.semanticgroup             | use only concepts belonging to listed semantic types (default: all)
| metamaplite.negation.detector         | negation detector class: default: gov.nih.nlm.nls.metamap.lite.NegEx
                                                                   Alternate: 
| metamaplite.normalized.string.cache.size | set maximum size of string -> normalized string cache
| metamaplite.normalized.string.cache.enable | if true enable string -> normalized string cache
| metamaplite.entitylookup4.term.concept.cache.enable | if true enable term -> concept info cache
| metamaplite.entitylookup4.term.concept.cache.size | set maximum size of term -> concept info cache
| metamaplite.entitylookup4.cui.preferredname.cache.enable |  if true enable cui -> preferred name cache
| metamaplite.entitylookup4.cui.preferredname.cache.size | set maximum size cui -> preferred name cache

Configuration properties

| metamaplite.excluded.termsfile        | cui/term pairs that are excluded from results (default: data/specialterms.txt)
| metamaplite.index.directory           | the directory the indexes reside
| opennlp.models.directory              | the directory the models reside (sets the following properties. default: data/models)
| opennlp.en-pos.bin.path               | (default: data/models/en-pos-maxent.bin)
| opennlp.en-token.bin.path             | (default: data/models/en-token.bin)
| opennlp.en-sent.bin.path              | (default: data/models/en-sent.bin)
| metamaplite.enable.postagging         | Enable part of speech tagging (default: "true" [on])
| metamaplite.postaglist                | List of part-of-speech tags to use for term lookup
                                        | (each Penn Treebank part-of-speech tag is separated by commas.)
| metamaplite.enable.scoring            | score concepts (I.E.: turn on chunker [currently OpenNLP]).

| metamaplite.uda.filename              | user defined acronyms file.
| metamaplite.cuitermlistfile.filename  | user defined concepts file.

Environment Variables

Currently there is one:

MML_INDEXDIR

Using MetaMapLite from Java

Creating properties for configuring MetaMapLite Instance:

Properties myProperties = new Properties();
MetaMapLite.expandModelsDir(myProperties,
    "/home/piro/public_mm_lite/data/models");
MetaMapLite.expandIndexDir(myProperties,
    "/home/piro/Projects/public_mm_lite/data/ivf/strict");
myProperties.setProperty("metamaplite.excluded.termsfile",
    "/home/piro/Projects/public_mm_lite/data/specialterms.txt");

Loading properties file in "config":

FileReader fr = new FileReader("config/metamaplite.properties");
myProperties.load(fr);
fr.close();

Creating a metamap lite instance:

MetaMapLite metaMapLiteInst = new MetaMapLite(myProperties);

Creating a document list with one or more documents:

BioCDocument document = FreeText.instantiateBioCDocument("diabetes");
document.setID("1");
List<BioCDocument> documentList = new ArrayList<BioCDocument>();
documentList.add(document);

Getting a list of entities for the document list:

List<Entity> entityList = metaMapLiteInst.processDocumentList(documentList);

Traversing the entity list displaying cui and matching text:

List<Entity> entityList = metaMapLiteInst.processDocumentList(documentList);
for (Entity entity: entityList) {
  for (Ev ev: entity.getEvSet()) {
    System.out.println(ev.getConceptInfo().getCUI() + "|" + entity.getMatchedText());
  }
}

Processing Single Terms (without periods)

Disable the Part of Speech Tagger using the following property: "metamaplite.enable.postagging=false". Add the following line right before instantiating the MetaMapLite instance.

myProperties.setProperty("metamaplite.enable.postagging", "false");
MetaMapLite metaMapLiteInst = new MetaMapLite(myProperties);

Add each term as a single document:

BioCDocument document = FreeText.instantiateBioCDocument(term);

Using Maven

Installing metamaplite and dependencies into local Maven repository

From the public_mm_lite directory, install the ConText, BioC, NLS NLP, and LVG libraries:

$ mvn install:install-file \
     -Dfile=lib/context-2012.jar \
     -DgroupId=context \
     -DartifactId=context \
     -Dversion=2012 \
     -Dpackaging=jar

$ mvn install:install-file \
     -Dfile=lib/bioc-1.0.1.jar \
     -DgroupId=bioc \
     -DartifactId=bioc \
     -Dversion=1.0.1 \
     -Dpackaging=jar

$ mvn install:install-file \
     -Dfile=lib/nlp-2.4.C.jar \
     -DgroupId=gov.nih.nlm.nls \
     -DartifactId=nlp \
     -Dversion=2.4.C \
     -Dpackaging=jar

$ mvn install:install-file  \
     -Dfile=lib/lvgdist-2020.0.jar \
     -DgroupId=gov.nih.nlm.nls.lvg \
     -DartifactId=lvgdist \
     -Dversion=2020.0 \
     -Dpackaging=jar

Then install metamaplite into your local Maven repository:

$ mvn install

Add metamaplite dependency to POM file

Add the following dependency to your webapp's pom.xml:

<dependency>
  <groupId>gov.nih.nlm.nls</groupId>
  <artifactId>metamaplite</artifactId>
  <version>3.0-SNAPSHOT</version>
</dependency>

Irutils Indexes

Tables and Indexes

Currently, five tables are used:

  • cuisourceinfo
  • cuisemantictype (cuist)
  • cuiconcept
  • meshtcrelaxed (MeSH treecodes)
  • vars (lexical variants)

New indexes used for MetaMap-like scoring and MMI ranked output

Two new indexes have been introduced to support scoring similar to the original MetaMap and MMI ranking of which MetaMap scoring is a component.

  • treecodes - an indexing of MeSH Terms and their associated positions in MeSH hierarchy.
  • vars - an index of terms and their lexical variants.

NOTE: Currently, the only mechanism for generating the treecodes and vars (variants) tables from a UMLS subset (generated by MetamorphoSys) is to install the original MetaMap and the Data File Builder, then use the Data File Builder to generate the necessary treecodes and vars table files. See the next section for information on adding these indexes to a custom dataset.

Generating indexes from UMLS tables

The CreateIndexes class generates the tables needed by MetaMapLite for a particular UMLS release. The tables cuiconcept, cuisourceinfo, cuist, meshtcrelaxed, and vars are derived from the UMLS tables MRCONSO.RRF, MRSAT.RRF, and MRSTY.RRF; CreateIndexes then produces the associated indexes for the derived tables. To produce the vars table you will need to install the Lexical Variant Generator (LVG), which is available from the Lexical Tools page: https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/web/index.html . LVG is used when generating the vars table from MRCONSO.RRF.

Usage:

java -Xmx15g -cp target/metamaplite-<version>-standalone.jar \
   gov.nih.nlm.nls.metamap.dfbuilder.CreateIndexes \
   <mrconsofile> <mrstyfile> <mrsatfile> <ivfdir>

When running, CreateIndexes must be given the location of the LVG configuration file by one of two mechanisms: setting an environment variable (LVG_DIR or LVG_CONFIG) or setting a system property when invoking java. The simplest way is to set LVG_DIR to the location of LVG; the CreateIndexes program will then infer the location of the properties file:

On Windows:

 set LVG_DIR=<location of lvg2020>

In bash on macOS or Linux:

 export LVG_DIR=<location of lvg2020>

If you have a modified lvg.properties file in a custom location, you can set the variable LVG_CONFIG to the location of your custom LVG properties file.

Alternatively, you can set the system property gv.lvg.dirname to the location of LVG, or set the property gv.lvg.config.file to the location of lvg.properties (usually lvg2020/data/config/lvg.properties):

 -Dgv.lvg.dirname={location of lvg}

or:

 -Dgv.lvg.config.file={location of lvg.properties}

The resulting indexes are in <ivfdir>/indices. The tables the indexes are generated from are in <ivfdir>/tables.

Checking newly generated indexes

You can use the class irutils.MappedMultiKeyIndexLookup to check the new indexes:

 java -Xmx20g -cp target/metamaplite-<version>-standalone.jar \
  irutils.MappedMultiKeyIndexLookup lookup workingdir indexname column query

For example:

 java -Xmx20g -cp target/metamaplite-<version>-standalone.jar \
  irutils.MappedMultiKeyIndexLookup lookup data/ivf/2016AB/USAbase/strict cuisourceinfo 3 heart

Using newly generated indexes with MetaMapLite

To use the new indexes do one of the following:

Use the --indexdir= option:

java -cp target/metamaplite-<version>-standalone.jar \
 gov.nih.nlm.nls.ner.MetaMapLite --indexdir=<ivfdir> <other-options> <other-args>

Or modify the configuration file config/metamaplite.properties:

metamaplite.index.directory: <ivfdir>
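
The "name: value" form shown above is ordinary java.util.Properties syntax, where either ':' or '=' may separate a key from its value. A small sketch, assuming the configuration file is read with Properties.load as in the Java usage section; the class name and the directory /opt/mml/data/ivf/custom are hypothetical:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Shows how java.util.Properties reads "key: value" and "key=value" lines. */
final class PropertiesSyntaxSketch {

  /** Parses properties-file text held in a string. */
  static Properties parse(String text) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(text));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return props;
  }

  public static void main(String[] args) {
    // The index directory below is a made-up path used only for illustration.
    Properties props = parse(
        "metamaplite.index.directory: /opt/mml/data/ivf/custom\n"
        + "metamaplite.enable.postagging=false\n");
    System.out.println(props.getProperty("metamaplite.index.directory"));
    System.out.println(props.getProperty("metamaplite.enable.postagging"));
  }
}
```

Note that a backslash is an escape character in properties files, so a Windows path needs either doubled backslashes or forward slashes.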

Adding custom input document formats

A new document loader class must implement the BioCDocumentLoader interface.

Example implementations of BioCDocumentLoader are available in public_mm_lite/src/main/java/gov/nih/nlm/nls/metamap/document and on Github: https://github.com/lhncbc/metamaplite/tree/master/src/main/java/gov/nih/nlm/nls/metamap/document.

One can add a document loader class in MetaMapLite's classpath to MetaMapLite's list of document loaders by adding it to the properties using System properties or modifying MetaMapLite's configuration file:

Set as system property:

-Dbioc.document.loader.<name>=<fully-specified class name>

For example, to create a loader with the name "qadocument":

-Dbioc.document.loader.qadocument=gov.nih.nlm.nls.metamap.document.QAKeyValueDocument

Or add it to config/metamaplite.properties:

bioc.document.loader.qadocument: gov.nih.nlm.nls.metamap.document.QAKeyValueDocument

To use the new custom document format through a property:

-Dmetamaplite.document.inputtype=<name>

For example:

-Dmetamaplite.document.inputtype=qadocument

On the command line, use the --inputformat= option:

--inputformat=qadocument

Adding custom result output formats

A new result formatter class must implement the ResultFormatter interface. One can add the result formatter to MetaMapLite by adding its class file to MetaMapLite's classpath and then adding a reference to it as a property:

Set as system property:

-Dmetamaplite.result.formatter.<name>=<fully-specified class name>

For example, to register the examples.BratSemType formatter under the name "brat":

-Dmetamaplite.result.formatter.brat=examples.BratSemType

Or add it to config/metamaplite.properties:

 metamaplite.result.formatter.brat: examples.BratSemType

Source code for the BratSemType result formatter is provided in public_mm_lite/src/main/java/examples/BratSemType.java.

Adding MetaMapLite to a webapp (servlet).

WebApp Local Configuration

An extensive example of a servlet, complete with data and configuration files in the war (web archive) file, is available on the MetaMapLite page of the MetaMap website (https://metamap.nlm.nih.gov/MetaMapLite.shtml).

Alternate Configuration

Below is an alternate configuration for users who don't want to place the configuration and data in the webapp deployment archive (war) file.

Place the "metamaplite.properties" file in the Tomcat "conf/" directory and specify its location in the servlet:

public class SampleWebApp extends HttpServlet {
  /** location of metamaplite.properties configuration file */
  static String configPropertyFilename =
    System.getProperty("metamaplite.property.file", "conf/metamaplite.properties");
  Properties properties;
  MetaMapLite metaMapLiteInst;

  public SampleWebApp() {
    try {
      this.properties = new Properties();
      // default properties that can be overridden
      this.properties.setProperty("metamaplite.index.directory","data/ivf/strict");
      ...
      // load user properties into the same Properties object passed to MetaMapLite
      FileReader fr = new FileReader(configPropertyFilename);
      this.properties.load(fr);
      fr.close();
      this.metaMapLiteInst = new MetaMapLite(this.properties);
      ...
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
  ...
}

The absolute locations of indexes and model files can be specified in "metamaplite.properties".

Command Line Usage:

The file biocprocess.sh is a wrapper for the BioC to BioC pipeline.

./biocprocess.sh <bioc-xml-input-file> <bioc-xml-output-file>

Note that the class lacks the command line options handler that is present in the gov.nih.nlm.nls.ner.MetaMapLite class. Use the config/metamaplite.properties file to set any custom properties, or place the properties in another custom file and use the system property "metamaplite.propertyfile" to refer to that file.

Files in which sources are not included

The following java archive files have sources available from NLM but the sources are not provided by this distribution.

The sources are available at the following locations:

Future Work

  • Add support for composite phrases from chunked phrases
  • Create a pipeline using a full parser.
  • Add a mechanism to use custom user-supplied segmenters.

metamaplite's People

Contributors

amadanmath, dcronkite, dependabot[bot], stevenbedrick, willjrogers

metamaplite's Issues

Improper concept index in MMI output

For the input file: 00000086.txt

In MetaMap, the output used to look like this:

'00000086-0'|MMI|17.80|Mediastinum|C0025066|[blor]|["MEDIASTINUM"-tx-2-"mediastinum"-noun-0]|TX|50/11|A01.923.761.800.500
'00000086-73'|MMI|8.34|Lung|C0024109|[bpoc]|["LUNGS"-tx-1-"lungs"-noun-0]|TX|1/5|A04.411

But the output in MetaMapLite looks like this:
00000000.tx|MMI|2.37|Mediastinum|C0025066|[blor]|"Mediastinum"-text-0-"mediastinum"-NN-0|63/11|A01.923.761.800.500
00000000.tx|MMI|0.98|Lung|C0024109|[bpoc]|"Lungs"-text-0-"lungs"-NNS-0|103/5|A04.411

The first field in MetaMap output represents the filename and the character start position of the sentence. In MetaMapLite that information is missing, which makes programs that rely on it fail.

I tested using release MetaMapLite 3.6.1p1.

Semantic Type/Source filtering happens after longest match, excluding matches

If the string 'Metastatic prostate cancer' is processed with the semantic type of bpoc (Body Part, Organ or Organ Component) then no concepts are found.

This is because 'Metastatic prostate cancer' matches concept C1282496 and C0936223 but these are both neop (Neoplastic Process) concepts and are discarded when the semantic type filtering is applied.

I would have expected the longest match to be applied after the semantic type and source filtering.

i.e. the concept C0033572: prostate should have been returned.
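
The ordering question raised here can be reproduced with a toy model. The sketch below is not MetaMapLite's lookup code: the Candidate class, the greedy left-to-right longest-match rule, and the token spans are assumptions for illustration; only the CUIs and semantic types are taken from the report above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Toy model of filter-then-match versus match-then-filter; not MetaMapLite's code. */
final class FilterOrderSketch {

  /** A dictionary hit: token span plus concept identifier and semantic type. */
  static final class Candidate {
    final int start;      // index of the first token covered
    final int length;     // number of tokens covered
    final String cui;
    final String semType;

    Candidate(int start, int length, String cui, String semType) {
      this.start = start;
      this.length = length;
      this.cui = cui;
      this.semType = semType;
    }
  }

  /** Greedy left-to-right longest match; candidates sharing the winning span are all kept. */
  static List<Candidate> longestMatch(List<Candidate> candidates) {
    List<Candidate> sorted = new ArrayList<>(candidates);
    // earliest start first; for equal starts, longest span first
    sorted.sort((a, b) -> a.start != b.start
        ? Integer.compare(a.start, b.start)
        : Integer.compare(b.length, a.length));
    List<Candidate> kept = new ArrayList<>();
    int nextFree = 0;
    int lastStart = -1;
    int lastLength = -1;
    for (Candidate c : sorted) {
      boolean sameSpanAsLast = c.start == lastStart && c.length == lastLength;
      if (c.start >= nextFree || sameSpanAsLast) {
        kept.add(c);
        lastStart = c.start;
        lastLength = c.length;
        nextFree = c.start + c.length;
      }
    }
    return kept;
  }

  /** Keeps only candidates with the given semantic type abbreviation. */
  static List<Candidate> restrictTo(List<Candidate> candidates, String semType) {
    return candidates.stream()
        .filter(c -> c.semType.equals(semType))
        .collect(Collectors.toList());
  }

  static List<String> cuis(List<Candidate> candidates) {
    return candidates.stream().map(c -> c.cui).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Tokens: 0 = Metastatic, 1 = prostate, 2 = cancer
    List<Candidate> hits = Arrays.asList(
        new Candidate(0, 3, "C1282496", "neop"),
        new Candidate(0, 3, "C0936223", "neop"),
        new Candidate(1, 1, "C0033572", "bpoc"));
    // longest match first, then filter: the neop concepts win the span and are then discarded
    System.out.println(cuis(restrictTo(longestMatch(hits), "bpoc"))); // prints []
    // filter first, then longest match: the bpoc concept survives
    System.out.println(cuis(longestMatch(restrictTo(hits, "bpoc")))); // prints [C0033572]
  }
}
```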

Indexing of UMLS subset

Hi,
I am following the Generating Tables section in Design.md to create the index files for a custom subset of the UMLS. The CreateIndexes program runs without any errors (just the warnings about missing meshtcrelaxed.txt and vars.txt).

However, the irutils.MappedMultiKeyIndexLookup lookup command does not return any results when applied on the generated indexes. When applying the same command on the indices distributed with MetaMapLite, results are returned. When browsing the subset for which the indices are generated with Metamorphosys, the searched term is also found.

If there is any debug information I can provide, I'm more than happy to provide it. Thanks for your help!

Negation Missing in JSON Output

The json output in version MetaMapLite 3.6.2rc6 (and previous versions) does not contain negation output from the NegEx or Context algorithms.

E.g., running echo "no covid-19" | ./metamaplite.sh --pipe --outputformat=json

[
  {
    "matchedtext": "covid-19",
    "evlist": [
      {
        "score": 0,
        "matchedtext": "covid-19",
        "start": 3,
        "length": 8,
        "id": "ev0",
        "conceptinfo": {
          "conceptstring": "COVID-19",
          "sources": [
            "MTH",
            "SNOMEDCT_US"
          ],
          "cui": "C5203670",
          "preferredname": "COVID-19",
          "semantictypes": [
            "dsyn"
          ]
        }
      }
    ],
    "docid": "00000000.tx",
    "start": 3,
    "length": 8,
    "id": "en0",
    "fieldid": "text"
  }
]

Whereas mmi output contains the information:

00000000.tx|MMI|0.46|COVID-19|C5203670|[dsyn]|"COVID-19"-text-0-"covid-19"-NN-1|text|3/8||

Can this be added?
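
Until the JSON output carries this information, one possible stopgap is to read the negation flag from the MMI line. The sketch below assumes that the trigger field (the seventh pipe-separated field) ends in "-1" for a negated mention and "-0" otherwise, as in the fielded MMI format of the original MetaMap, and that the field holds a single trigger. The class and method names are hypothetical.

```java
/** Reads the trailing negation flag from an MMI line; field layout is an assumption. */
final class MmiNegationSketch {

  /** Returns true when the trigger field of the given MMI line ends with the flag 1. */
  static boolean isNegated(String mmiLine) {
    String[] fields = mmiLine.split("\\|", -1); // -1 keeps trailing empty fields
    if (fields.length < 7 || !"MMI".equals(fields[1])) {
      throw new IllegalArgumentException("not an MMI line: " + mmiLine);
    }
    String trigger = fields[6]; // e.g. "COVID-19"-text-0-"covid-19"-NN-1
    int lastDash = trigger.lastIndexOf('-');
    if (lastDash < 0) {
      throw new IllegalArgumentException("unexpected trigger field: " + trigger);
    }
    // only the flag after the final dash is inspected (single-trigger assumption)
    return "1".equals(trigger.substring(lastDash + 1));
  }

  public static void main(String[] args) {
    String line = "00000000.tx|MMI|0.46|COVID-19|C5203670|[dsyn]|"
        + "\"COVID-19\"-text-0-\"covid-19\"-NN-1|text|3/8||";
    System.out.println(isNegated(line));
  }
}
```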

`getBasename` selects higher-level directory if it contains dots/periods

When processing files without extensions, getBasename will split on a directory name if it contains periods in it:

public String getBasename(String filename) {
  String basename = "sentences";
  if (filename.lastIndexOf(".") >= 0) {
    basename = filename.substring(0,filename.lastIndexOf("."));
  } else {
    basename = filename;
  }
  return basename;
}

E.g., if I supply the path /mapr/r.ds/mml/file_0, it will attempt to write the output json file to /mapr/r.json rather than /mapr/r.ds/mml/file_0.json:

[main] INFO gov.nih.nlm.nls.ner.MetaMapLite - Loading and processing /mapr/r.ds/mml/file_0
[main] INFO gov.nih.nlm.nls.ner.MetaMapLite - outputing results to /mapr/r.json
Exception in thread "main" java.io.FileNotFoundException: /mapr/r.json (Operation not permitted)

Workaround
I've used the workaround of requiring all input files to have a .txt extension (e.g., /mapr/r.ds/mml/file_0.txt in the above example). This will put the output in the correct directory (e.g., /mapr/r.ds/mml/file_0.json).
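
One possible fix is to strip an extension only when the last dot falls inside the final path component. This is a sketch of that idea rather than a patch against the project's source; the class name is hypothetical, and treating both '/' and '\' as separators is an assumption made so the same code covers Windows paths.

```java
/** Sketch of a basename helper that ignores dots in directory names. */
final class BasenameSketch {

  /** Removes the extension of the final path component, if it has one. */
  static String getBasename(String filename) {
    int lastSeparator = Math.max(filename.lastIndexOf('/'), filename.lastIndexOf('\\'));
    int lastDot = filename.lastIndexOf('.');
    // The dot must come after the first character of the final component,
    // so directory dots and leading dots (e.g. ".hidden") are left alone.
    if (lastDot > lastSeparator + 1) {
      return filename.substring(0, lastDot);
    }
    return filename;
  }

  public static void main(String[] args) {
    System.out.println(getBasename("/mapr/r.ds/mml/file_0"));     // prints /mapr/r.ds/mml/file_0
    System.out.println(getBasename("/mapr/r.ds/mml/file_0.txt")); // prints /mapr/r.ds/mml/file_0
  }
}
```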

FileNotFoundException when building Indices

I built a custom UMLS and then attempted to index the output using bin/create_indexes.bat, but ran into an issue:

Exception in thread "main" java.io.FileNotFoundException: C:\path\to\indices\meshtcrelaxed\postings (The system cannot find the path specified)
        at java.io.RandomAccessFile.open0(Native Method)
        at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:124)
        at irutils.MappedMultiKeyIndexDiskBasedGeneration.writeFinalIndex(MappedMultiKeyIndexDiskBasedGeneration.java:265)
        at gov.nih.nlm.nls.metamap.dfbuilder.BuildIndex.createIndex(BuildIndex.java:120)
        at gov.nih.nlm.nls.metamap.dfbuilder.BuildIndex.main(BuildIndex.java:155)

This seems to be caused by the containing directory (in my case, 'meshtcrelaxed') not existing when instantiating a new RandomAccessFile class here:

RandomAccessFile postingsRaf = new RandomAccessFile

If I manually create the containing directory C:\path\to\indices\meshtcrelaxed and then re-run, I don't encounter this error.
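
The manual step could be done in code by creating the parent directory before opening the file. A minimal sketch of that approach, assuming nothing about the project's index writer beyond the RandomAccessFile call quoted above; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sketch: make sure the containing directory exists before opening a postings file. */
final class PostingsFileSketch {

  /** Writes data to the given path, creating missing parent directories first. */
  static long writePostings(Path postingsPath, byte[] data) {
    try {
      Path parent = postingsPath.toAbsolutePath().getParent();
      if (parent != null) {
        Files.createDirectories(parent); // does nothing if the directory already exists
      }
      try (RandomAccessFile raf = new RandomAccessFile(postingsPath.toFile(), "rw")) {
        raf.write(data);
        return raf.length();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Path target = Paths.get(System.getProperty("java.io.tmpdir"),
        "mml-sketch-" + System.nanoTime(), "meshtcrelaxed", "postings");
    System.out.println(writePostings(target, new byte[] {1, 2, 3}) + " bytes written");
  }
}
```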

MetaMapLite cannot be used outside of the dir 'public_mm_lite'?

When I run the code outside of the directory 'public_mm_lite', it always shows a configuration error for the index dataset saying "data/ivf/2020AA/USAbase does not exist, aborting", although I have set indexdir with an absolute path like "./metamaplite.sh --indexdir=ABSOLUTE_PATH/data/ivf/2020AA/USAbase".

Is this a normal case?

Not getting response in near real time

As your documentation says, the primary goal of MetaMapLite is to provide a near real-time named-entity recognizer which is not as rigorous as MetaMap but much faster while allowing users to customize and augment its behavior for specific purposes.

But when I tried running metamaplite.sh as below, I did not get a response in real time. On average it takes 5-6 seconds.

./metamaplite.sh example.txt

According to the screenshot below, there should be a significant improvement in processing time between MetaMap and MetaMapLite. But when I run MetaMapLite, it does not respond in near real time.

Please let me know if I need to change any configuration.

image

How to change logging level?

How can I alter the logging level when running Metamaplite? I'd like to set it to at least WARN in order to reduce the amount of console output when running the application, but I can't seem to change the defaults.

I tried (on 3.6.2rc8):

  • Running metamaplite.bat but changing 'debug' to 'warn' in config/log4j2.xml
  • Specifying the changed log4j2.xml file with both the parameters log4j.configurationFile and metamaplite.log4jconfig in metamaplite.bat
  • Passing in the --log4jconfig option to point to config/log4j2.xml (this results in an error: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1 at gov.nih.nlm.nls.ner.MetaMapLite.main(MetaMapLite.java:1323)).

EntityLookup4 not checking for phrase type as well as PoS

While investigating a difference in the behavior of EntityLookup4 vs EntityLookup5, I ultimately traced it back to the part of findLongestMatch() that checks to see whether the part of speech of the first token of tokenSubList is in allowedPartOfSpeechSet: https://github.com/lhncbc/metamaplite/blob/d3171e5d1deb2ceeeeeca9b757a85b8617e5c01b/src/main/java/gov/nih/nlm/nls/metamap/lite/EntityLookup5.java#LL393C12-L393C12

In EntityLookup5, the check will allow tokens that are not of an allowed PoS if the phrase under consideration is of a type listed in allowedPhraseTypeSet. The corresponding place in EntityLookup4 doesn't do this check, so certain things are getting bounced out from EntityLookup4 that EntityLookup5 allows:

this.allowedPartOfSpeechSet.contains(firstToken.getPartOfSpeech())) {

Is there a reason for this? If not, I will add in the corresponding check to EntityLookup4 and do a PR.

JVM Version Compatibility

Hello! I see from the README that MetaMapLite is tested/designed with Java 1.8; a quick experiment with Java 18 showed that it would compile just fine with a more modern dialect of the language, except for gov.nih.nlm.nls.tools.MedlineDomReader, which relies on a deprecated set of J2EE classes, specifically javax.xml.bind.annotation.W3CDomHandler. Has there been any thought to re-working this part of the program to use a different (and non-deprecated) method?

Files not found lib/bioc-1.0.1.jar lib/context-2012.jar

Hello, I can build with Java 1.8 and run all the Maven steps except:

$ mvn install:install-file \
     -Dfile=lib/context-2012.jar \
     -DgroupId=context \
     -DartifactId=context \
     -Dversion=2012 \
     -Dpackaging=jar

$ mvn install:install-file \
     -Dfile=lib/bioc-1.0.1.jar \
     -DgroupId=bioc \
     -DartifactId=bioc \
     -Dversion=1.0.1 \
     -Dpackaging=jar

Should I try to recreate context-2012.jar from https://github.com/chapmanbe/negex/tree/master/GeneralNegEx.Java.v.1.2.05092009 and get BioC from somewhere else? Thank you for your help and for maintaining this repo.

FileNotFoundException when opening default gv.words.temp.filename on Windows

The default bin/create_indexes.bat runs into a FileNotFoundException when running GenerateVariants:

java -Xmx4g -cp %projectdir%\target\metamaplite-%MML_VERSION%-standalone.jar ^
     gov.nih.nlm.nls.metamap.dfbuilder.GenerateVariants ^
     %MRCONSO% %IVFDIR%\tables\vars.txt

This is because the default value for wordsFilename is /tmp/words.txt.tmp:

wordsFilename = System.getProperty("gv.words.temp.filename",
                                   "/tmp/words.txt.tmp");

It appears Java on Windows is unable to resolve the /tmp directory.

I was able to work around the issue by setting the system property gv.words.temp.filename on the command line to just words.txt.tmp:

java -Xmx4g "-Dgv.words.temp.filename=words.txt.tmp" -cp %projectdir%\target\metamaplite-%MML_VERSION%-standalone.jar ^
     gov.nih.nlm.nls.metamap.dfbuilder.GenerateVariants ^
     %MRCONSO% %IVFDIR%\tables\vars.txt

Can this be added to the create_indexes.bat script? Or could /tmp be replaced by something like System.getProperty("java.io.tmpdir")?
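
As a rough illustration of the second suggestion, the sketch below builds the fallback path from java.io.tmpdir instead of hard-coding /tmp. The class and method names are hypothetical and this is not the actual GenerateVariants code. Only the property name gv.words.temp.filename and the file name words.txt.tmp come from the snippet above.

```java
import java.io.File;

public class TempWordsFile {
    // Resolve the temporary words file: an explicit
    // -Dgv.words.temp.filename=... still wins; otherwise fall back to a
    // file inside the platform's temp directory rather than /tmp.
    static String resolveWordsFilename() {
        String fallback =
            new File(System.getProperty("java.io.tmpdir"), "words.txt.tmp").getPath();
        return System.getProperty("gv.words.temp.filename", fallback);
    }

    public static void main(String[] args) {
        System.out.println(resolveWordsFilename());
    }
}
```

With this approach the existing override keeps working, and the default should resolve on both Windows and Unix-like systems.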

w/system: a resource failed to call end(close).

Although this may not happen on a desktop or laptop machine, using metamaplite directly on Android produces w/system: a resource failed to call end(close). This error is caused by an unreleased FileInputStream or FileOutputStream. On Android, a newly opened FileInputStream or FileOutputStream needs to be closed explicitly. When using metamaplite on Android, the error comes from two places:

  1. from metamaplite:
    For example, in Example.java and Example2.java, the newly created FileReader is not closed after use, so it needs to be closed explicitly as shown below:
            FileReader fr = new FileReader("./config/metamaplite.properties");
            myProperties.load(fr);
            ...
            fr.close();
  2. from 3rd party library:
    For example, in org.apache.opennlp, opennlp/tools/util/model/BaseModel.java(loadModel) and opennlp/tools/util/model/BaseModel.java(finishLoadingArtifacts), ZipInputStream zip is not explicitly closed. This can be fixed as follows:
  • In loadModel method:
            final ZipInputStream zip = new ZipInputStream(in);
            ...
            zip.close();
  • In finishLoadingArtifacts
            ZipInputStream zip = new ZipInputStream((InputStream)in);
            ...
            zip.close();

Looking closely, note that I replaced final ZipInputStream zip = new ZipInputStream(in); with ZipInputStream zip = new ZipInputStream((InputStream)in);. This was the only way I could recompile the opennlp source code into a jar file with Maven. If there is an alternative, please let me know.
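
One possible alternative to calling close() by hand is try-with-resources (available since Java 7), which closes the reader even if load() throws. The sketch below is an illustration rather than a patch to Example.java: the class and method names are hypothetical, and only the properties path is taken from the snippet above.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

public class LoadPropertiesSketch {
    // Load a properties file; the reader is closed automatically when
    // the try block exits, whether normally or via an exception.
    static Properties load(String path) throws IOException {
        Properties props = new Properties();
        try (Reader reader = new FileReader(path)) {
            props.load(reader);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties myProperties = load("./config/metamaplite.properties");
        System.out.println(myProperties.size() + " properties loaded");
    }
}
```

The same pattern applies to any AutoCloseable, including ZipInputStream, although whether it can be applied inside the opennlp methods depends on how the stream is used after loading.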

--chemdnersldi option not available

./metamaplite.sh --chemdnersldi input_file.txt

This returns

unknown option: --chemdnersldi

Content of input_file.txt

1|'Heart Attack'
2|'John had a huge heart attack'

Update:
After going through the source code: https://github.com/lhncbc/metamaplite/blob/master/src/main/java/gov/nih/nlm/nls/ner/MetaMapLite.java

found that I should be using the command:
./metamaplite.sh --inputformat=sldiwi input_file.txt

I was wrong in assuming that I should be using chemdnersldi.
SingleLineDelimitedInputWithID should be the document input format.
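
For readers unfamiliar with the layout of input_file.txt, the sketch below shows one way to split such a line into an ID and a text field at the first | character. It is only an illustration of the file layout shown above: the class and method names are hypothetical, and it does not reproduce how MetaMapLite's own loader handles the format (for example, it leaves the quotes untouched).

```java
public class SldiwiLineSketch {
    // Split "id|text" at the first '|'; returns {id, text}.
    // Any further '|' characters remain part of the text.
    static String[] parseLine(String line) {
        int bar = line.indexOf('|');
        if (bar < 0) {
            throw new IllegalArgumentException("no '|' delimiter in: " + line);
        }
        return new String[] { line.substring(0, bar), line.substring(bar + 1) };
    }

    public static void main(String[] args) {
        String[] fields = parseLine("2|'John had a huge heart attack'");
        System.out.println("id=" + fields[0] + " text=" + fields[1]);
    }
}
```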

public_mm_lite directory not found

I am trying to use the package as a standalone command line tool thus I may be missing some Java project magic.
How is it supposed to be created?

Metamaplite in other language.

First let me thank you for this great work.

I'm doing research on NER in Spanish text and want to use MetaMap.
What pipeline would you suggest to achieve this?

Class Mrconso missing from package dfbuilder

Class Mrconso is missing from package gov.nih.nlm.nls.metamap.dfbuilder, and so the class ExtractTreecodes does not compile. I had to remove the class so I could use the latest version of the code.
