lhncbc / metamaplite
A near real-time named-entity recognizer
Home Page: https://metamap.nlm.nih.gov/MetaMapLite.shtml
License: Other
I built a custom UMLS and then attempted to index the output using bin/create_indexes.bat, but ran into an issue:
Exception in thread "main" java.io.FileNotFoundException: C:\path\to\indices\meshtcrelaxed\postings (The system cannot find the path specified)
at java.io.RandomAccessFile.open0(Native Method)
at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:124)
at irutils.MappedMultiKeyIndexDiskBasedGeneration.writeFinalIndex(MappedMultiKeyIndexDiskBasedGeneration.java:265)
at gov.nih.nlm.nls.metamap.dfbuilder.BuildIndex.createIndex(BuildIndex.java:120)
at gov.nih.nlm.nls.metamap.dfbuilder.BuildIndex.main(BuildIndex.java:155)
This seems to be caused by the containing directory (in my case, 'meshtcrelaxed') not existing when instantiating a new RandomAccessFile class here: https://github.com/lhncbc/metamaplite/blob/8aae39319a4a4b40a013180bf6cde09b172c78a8/src/main/java/irutils/MappedMultiKeyIndexDiskBasedGeneration.java#L265
If I manually create the containing directory C:\path\to\indices\meshtcrelaxed and then re-run, I don't encounter this error.
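A minimal sketch of one possible fix, assuming the index writer is free to create the containing directory itself before opening the postings file. The class and method names here (PostingsFileOpener, openForWriting) are hypothetical and are not part of MetaMapLite.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not MetaMapLite code.
final class PostingsFileOpener {
    /** Opens the file for read/write, creating missing parent directories first. */
    static RandomAccessFile openForWriting(Path file) throws IOException {
        Path parent = file.toAbsolutePath().getParent();
        if (parent != null) {
            // Does nothing if the directory already exists.
            Files.createDirectories(parent);
        }
        return new RandomAccessFile(file.toFile(), "rw");
    }
}
```

With something like this in place, a missing directory such as meshtcrelaxed would be created on demand rather than needing to be created by hand.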
./metamaplite.sh --chemdnersldi input_file.txt
This returns
unknown option: --chemdnersldi
Content of input_file.txt
1|'Heart Attack'
2|'John had a huge heart attack'
Update:
After going through the source code (https://github.com/lhncbc/metamaplite/blob/master/src/main/java/gov/nih/nlm/nls/ner/MetaMapLite.java), I found that I should be using the command:
./metamaplite.sh --inputformat=sldiwi input_file.txt
I was wrong in assuming that I should be using chemdnersldi.
SingleLineDelimitedInputWithID should be the document input format.
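For anyone preparing input for this format, here is a rough sketch of how lines like the ones in input_file.txt could be split into an ID and a text part. It assumes the first '|' separates the ID from the text, which is inferred only from the sample above. SldiwiLines is a hypothetical name, not the MetaMapLite loader.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not the MetaMapLite document loader.
final class SldiwiLines {
    /** Splits each "id|text" line at the first '|'; lines without a '|' are skipped. */
    static List<Map.Entry<String, String>> parse(List<String> lines) {
        List<Map.Entry<String, String>> documents = new ArrayList<>();
        for (String line : lines) {
            int bar = line.indexOf('|');
            if (bar < 0) {
                continue; // no ID delimiter on this line
            }
            documents.add(new SimpleEntry<>(line.substring(0, bar), line.substring(bar + 1)));
        }
        return documents;
    }
}
```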
"Location" field was provided in MetaMap (mmi output).
But this seems to be missing in MetaMapLite.
Any plans to introduce that to make it consistent with MetaMap?
I ran into this issue while extending the MetaMap Python wrapper to work with MetaMapLite.
Is there any documentation available on the output fields for MetaMapLite?
While investigating a difference in the behavior of EntityLookup4 vs EntityLookup5, I ultimately traced it back to the part of findLongestMatch() that checks whether the part of speech of the first token of tokenSubList is in allowedPartOfSpeechSet: https://github.com/lhncbc/metamaplite/blob/d3171e5d1deb2ceeeeeca9b757a85b8617e5c01b/src/main/java/gov/nih/nlm/nls/metamap/lite/EntityLookup5.java#LL393C12-L393C12
In EntityLookup5, the check will allow tokens that are not of an allowed PoS if the phrase under consideration is of a type listed in allowedPhraseTypeSet. The corresponding place in EntityLookup4 doesn't do this check, so certain things are getting bounced out from EntityLookup4 that EntityLookup5 allows: https://github.com/lhncbc/metamaplite/blob/d3171e5d1deb2ceeeeeca9b757a85b8617e5c01b/src/main/java/gov/nih/nlm/nls/metamap/lite/EntityLookup4.java#L352
Is there a reason for this? If not, I will add in the corresponding check to EntityLookup4 and do a PR.
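To make the difference concrete, here is a simplified sketch of the two checks as I understand them from the description above. It is not the actual EntityLookup4 or EntityLookup5 code: the class, the method names, and the string-based tag representation are all assumptions for illustration.

```java
import java.util.Set;

// Hypothetical paraphrase of the two checks, not MetaMapLite code.
final class FirstTokenCheck {
    /** EntityLookup4-style check as described: only the part of speech is consulted. */
    static boolean strict(String partOfSpeech, Set<String> allowedPartOfSpeechSet) {
        return allowedPartOfSpeechSet.contains(partOfSpeech);
    }

    /** EntityLookup5-style check as described: an allowed phrase type also lets the token through. */
    static boolean relaxed(String partOfSpeech, String phraseType,
                           Set<String> allowedPartOfSpeechSet, Set<String> allowedPhraseTypeSet) {
        return allowedPartOfSpeechSet.contains(partOfSpeech)
                || allowedPhraseTypeSet.contains(phraseType);
    }
}
```

Under this reading, a first token with a disallowed PoS inside an allowed phrase type passes relaxed() but fails strict(), which matches the behavior difference described.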
In README.md:
For MetaMapLite 3.6.1p1 (with Category 0+4+9 (USAbase) 2017AA UMLS dataset),
it should be --indexdir=public_mm_lite/data/ivf/2017AA/USAbase/strict
Class Mrconso is missing from package gov.nih.nlm.nls.metamap.dfbuilder, and so the class ExtractTreecodes does not compile. I had to remove the class so I could use the latest version of the code.
The default bin/create_indexes.bat runs into a FileNotFoundException when running GenerateVariants (see https://github.com/lhncbc/metamaplite/blob/827a5c1f7a0174247ea499117d82745af827c628/bin/create_indexes.bat#L58-L60) because the default value for wordsFilename is /tmp/words.txt.tmp: https://github.com/lhncbc/metamaplite/blob/8aae39319a4a4b40a013180bf6cde09b172c78a8/src/main/java/gov/nih/nlm/nls/metamap/dfbuilder/GenerateVariants.java#L226-L227. It appears Java on Windows is unable to resolve the /tmp directory.
I was able to work around the issue by adding the system property gv.words.temp.filename to the command line, set to just words.txt.tmp:
java -Xmx4g "-Dgv.words.temp.filename=words.txt.tmp" -cp %projectdir%\target\metamaplite-%MML_VERSION%-standalone.jar ^
gov.nih.nlm.nls.metamap.dfbuilder.GenerateVariants ^
%MRCONSO% %IVFDIR%\tables\vars.txt
Can this be added to the create_indexes.bat script? Or could /tmp be replaced by something like System.getProperty("java.io.tmpdir")?
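A sketch of what the second suggestion might look like, assuming the existing gv.words.temp.filename property should still take precedence when it is set. WordsTempFile and resolve() are hypothetical names and do not reflect how GenerateVariants is actually structured.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not MetaMapLite code.
final class WordsTempFile {
    /** Uses gv.words.temp.filename when set, else words.txt.tmp in the platform temp directory. */
    static Path resolve() {
        String override = System.getProperty("gv.words.temp.filename");
        if (override != null && !override.isEmpty()) {
            return Paths.get(override);
        }
        // java.io.tmpdir is a standard system property on every platform.
        return Paths.get(System.getProperty("java.io.tmpdir"), "words.txt.tmp");
    }
}
```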
As your documentation says, the primary goal of MetaMapLite is to provide a near real-time named-entity recognizer which is not as rigorous as MetaMap but is much faster, while allowing users to customize and augment its behavior for specific purposes.
But when I tried running metamaplite.sh as below, I did not get a response in real time. On average it takes 5-6 seconds.
./metamaplite.sh example.txt
According to the screenshot below, there should be a significant improvement in processing time between MetaMap and MetaMapLite. But when I run MetaMapLite, it does not respond in near real time.
Please let me know if I need to change any configuration.
First let me thank you for this great work.
I'm working on NER in Spanish text as part of a research project, and I want to use MetaMap.
What pipeline would you suggest to achieve this goal?
When I run the code outside of the directory 'public_mm_lite', it always reports a configuration failure for the index dataset, saying "data/ivf/2020AA/USAbase does not exist, aborting", although I have set indexdir to an absolute path like "./metamaplite.sh --indexdir=ABSOLUTE_PATH/data/ivf/2020AA/USAbase".
Is this expected?
For the input file: 00000086.txt
In MetaMap, the outputs used to come like these:
'00000086-0'|MMI|17.80|Mediastinum|C0025066|[blor]|["MEDIASTINUM"-tx-2-"mediastinum"-noun-0]|TX|50/11|A01.923.761.800.500
'00000086-73'|MMI|8.34|Lung|C0024109|[bpoc]|["LUNGS"-tx-1-"lungs"-noun-0]|TX|1/5|A04.411
But the output in MetaMapLite comes like these:
00000000.tx|MMI|2.37|Mediastinum|C0025066|[blor]|"Mediastinum"-text-0-"mediastinum"-NN-0|63/11|A01.923.761.800.500
00000000.tx|MMI|0.98|Lung|C0024109|[bpoc]|"Lungs"-text-0-"lungs"-NNS-0|103/5|A04.411
The first field in MetaMap output represents filename and the char start pos of the sentence. But in MetaMapLite that information is missing. This makes programs using that information fail.
I had tested using the release: MetaMapLite 3.6.1p1
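For context, here is a rough sketch of how a downstream program might read such lines. The field layout is inferred only from the sample lines above: document ID first, score third, CUI fifth, and the first start/length pair found from the eighth field onward. The MmiLine class is hypothetical and not part of either tool.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical reader for pipe-delimited MMI lines; layout inferred from the samples only.
final class MmiLine {
    private static final Pattern POSITION = Pattern.compile("(\\d+)/(\\d+)");

    final String docId;
    final double score;
    final String preferredName;
    final String cui;
    final int start;
    final int length;

    private MmiLine(String docId, double score, String preferredName, String cui,
                    int start, int length) {
        this.docId = docId;
        this.score = score;
        this.preferredName = preferredName;
        this.cui = cui;
        this.start = start;
        this.length = length;
    }

    /** Parses one line; the position is the first "start/length" pair at or after field 8. */
    static MmiLine parse(String line) {
        String[] f = line.split("\\|", -1);
        if (f.length < 8 || !"MMI".equals(f[1])) {
            throw new IllegalArgumentException("not an MMI line: " + line);
        }
        for (int i = 7; i < f.length; i++) {
            Matcher m = POSITION.matcher(f[i]);
            if (m.lookingAt()) {
                return new MmiLine(f[0], Double.parseDouble(f[2]), f[3], f[4],
                        Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)));
            }
        }
        throw new IllegalArgumentException("no start/length field in: " + line);
    }
}
```

Note that docId is returned exactly as it appears in the first field, so a caller relying on the MetaMap-style sentence offset in that field would not find it in the MetaMapLite lines.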
When processing files without extensions, getBasename will split on a directory name if that name contains a period. E.g., if I supply the path /mapr/r.ds/mml/file_0, it will attempt to write the output json file to /mapr/r.json rather than /mapr/r.ds/mml/file_0.json:
[main] INFO gov.nih.nlm.nls.ner.MetaMapLite - Loading and processing /mapr/r.ds/mml/file_0
[main] INFO gov.nih.nlm.nls.ner.MetaMapLite - outputing results to /mapr/r.json
Exception in thread "main" java.io.FileNotFoundException: /mapr/r.json (Operation not permitted)
Workaround
I've used the workaround of requiring all input files to have a .txt extension (e.g., /mapr/r.ds/mml/file_0.txt in the above example). This will put the output in the correct directory (e.g., /mapr/r.ds/mml/file_0.json).
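A possible fix, sketched under the assumption that only the final path component should ever be searched for an extension. OutputPaths and withExtension are hypothetical names. This is not the existing getBasename implementation.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not MetaMapLite code.
final class OutputPaths {
    /** Replaces (or appends) the extension on the file name only, leaving directories untouched. */
    static Path withExtension(String inputPath, String extension) {
        Path input = Paths.get(inputPath);
        String name = input.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // dot > 0 keeps names such as ".hidden" intact.
        String stem = dot > 0 ? name.substring(0, dot) : name;
        return input.resolveSibling(stem + "." + extension);
    }
}
```

With this approach, both /mapr/r.ds/mml/file_0 and /mapr/r.ds/mml/file_0.txt map to /mapr/r.ds/mml/file_0.json.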
How can I alter the logging level when running Metamaplite? I'd like to set it to at least WARN in order to reduce the amount of console output when running the application, but I can't seem to change the defaults.
I tried the following (on 3.6.2rc8):
running metamaplite.bat after changing 'debug' to 'warn' in config/log4j2.xml;
setting log4j.configurationFile and metamaplite.log4jconfig in metamaplite.bat;
using the --log4jconfig option to point to config/log4j2.xml (this results in an error: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1 at gov.nih.nlm.nls.ner.MetaMapLite.main(MetaMapLite.java:1323)).
Hello! I see from the README that MetaMapLite is tested/designed with Java 1.8; a quick experiment with Java 18 showed that it would compile just fine with a more modern dialect of the language, except for gov.nih.nlm.nls.tools.MedlineDomReader, which relies on a deprecated set of J2EE classes, specifically javax.xml.bind.annotation.W3CDomHandler. Has there been any thought to re-working this part of the program to use a different (and non-deprecated) method?
The json output in version MetaMapLite 3.6.2rc6 (and previous versions) does not contain negation output from the NegEx or Context algorithms.
E.g., running echo "no covid-19" | ./metamaplite.sh --pipe --outputformat=json produces:
[
  {
    "matchedtext": "covid-19",
    "evlist": [
      {
        "score": 0,
        "matchedtext": "covid-19",
        "start": 3,
        "length": 8,
        "id": "ev0",
        "conceptinfo": {
          "conceptstring": "COVID-19",
          "sources": [
            "MTH",
            "SNOMEDCT_US"
          ],
          "cui": "C5203670",
          "preferredname": "COVID-19",
          "semantictypes": [
            "dsyn"
          ]
        }
      }
    ],
    "docid": "00000000.tx",
    "start": 3,
    "length": 8,
    "id": "en0",
    "fieldid": "text"
  }
]
Whereas mmi output contains the information:
00000000.tx|MMI|0.46|COVID-19|C5203670|[dsyn]|"COVID-19"-text-0-"covid-19"-NN-1|text|3/8||
Can this be added?
I am trying to use the package as a standalone command line tool thus I may be missing some Java project magic.
How is it supposed to be created?
Hi,
I am following the Generating Tables section in Design.md to create the index files for a custom subset of the UMLS. The CreateIndexes program runs without any errors (just the warnings about missing meshtcrelaxed.txt and vars.txt).
However, the irutils.MappedMultiKeyIndexLookup lookup command does not return any results when applied on the generated indexes. When applying the same command on the indices distributed with MetaMapLite, results are returned. When browsing the subset for which the indices are generated with Metamorphosys, the searched term is also found.
If there is any debug information I can provide, I'm more than happy to provide it. Thanks for your help!
If the string 'Metastatic prostate cancer' is processed with the semantic type of bpoc (Body Part, Organ or Organ Component) then no concepts are found.
This is because 'Metastatic prostate cancer' matches concept C1282496 and C0936223 but these are both neop (Neoplastic Process) concepts and are discarded when the semantic type filtering is applied.
I would have expected the longest match to be applied after the semantic type and source filtering.
i.e. the concept C0033572: prostate should have been returned.
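The ordering I expected can be sketched as follows: filter candidates by semantic type first, then pick the longest remaining match. The Candidate class and the way candidates are represented are purely illustrative assumptions and do not reflect MetaMapLite's internal data structures.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical illustration of "filter, then longest match"; not MetaMapLite code.
final class FilteredLongestMatch {
    static final class Candidate {
        final String cui;
        final String matchedText;
        final Set<String> semanticTypes;

        Candidate(String cui, String matchedText, Set<String> semanticTypes) {
            this.cui = cui;
            this.matchedText = matchedText;
            this.semanticTypes = semanticTypes;
        }
    }

    /** Drops candidates with no allowed semantic type, then returns the longest survivor. */
    static Optional<Candidate> longest(List<Candidate> candidates, Set<String> allowedTypes) {
        return candidates.stream()
                .filter(c -> c.semanticTypes.stream().anyMatch(allowedTypes::contains))
                .max(Comparator.comparingInt((Candidate c) -> c.matchedText.length()));
    }
}
```

With the three candidates from this report and a bpoc-only filter, the two neop concepts are removed before the length comparison, so C0033572 is the one returned.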
I downloaded "MetaMapLite 3.6.2rc8 and UMLS 2022", and also the dataset [2022AA UMLS Level 0+4+9 DataSet].
When I run ./metamaplite.sh, it fails to run.
Hello, I can run all the Maven build steps with Java 1.8 except:
$ mvn install:install-file \
    -Dfile=lib/context-2012.jar \
    -DgroupId=context \
    -DartifactId=context \
    -Dversion=2012 \
    -Dpackaging=jar
$ mvn install:install-file \
    -Dfile=lib/bioc-1.0.1.jar \
    -DgroupId=bioc \
    -DartifactId=bioc \
    -Dversion=1.0.1 \
    -Dpackaging=jar
Should I try to recreate context-2012.jar from https://github.com/chapmanbe/negex/tree/master/GeneralNegEx.Java.v.1.2.05092009
and get BioC from somewhere else? Thank you for your help and for maintaining this repo.
The Interactive MetaMapLite page does not seem to work.
Although this may not happen on a desktop or laptop machine, using MetaMapLite directly on Android produces the warning w/system: a resource failed to call end(close). This is caused by a FileInputStream or FileOutputStream that is never released. On Android, a newly opened FileInputStream or FileOutputStream needs to be closed explicitly. In particular, the warning comes from two places when using MetaMapLite on Android:
1. A FileReader is not closed after use, so it needs to be closed explicitly as shown below:
FileReader fr = new FileReader("./config/metamaplite.properties");
myProperties.load(fr);
...
fr.close();
2. ZipInputStream zip is not explicitly closed. This can be fixed as follows.
In the loadModel method:
final ZipInputStream zip = new ZipInputStream(in);
...
zip.close();
In finishLoadingArtifacts:
ZipInputStream zip = new ZipInputStream((InputStream)in);
...
zip.close();
Note that I replaced final ZipInputStream zip = new ZipInputStream(in); with ZipInputStream zip = new ZipInputStream((InputStream)in);. Only in this way could I recompile the OpenNLP source code into a jar file with Maven. If there is an alternative, please let me know.
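One alternative to calling close() by hand is try-with-resources, which also closes the reader if load() throws. This is a minimal sketch for the properties case only, assuming Java 7 or later language features are available in the build. PropertiesLoader is a hypothetical name, not an existing MetaMapLite class.

```java
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

// Hypothetical helper, not MetaMapLite code.
final class PropertiesLoader {
    /** Loads a properties file; the reader is closed automatically, even on failure. */
    static Properties load(File file) throws IOException {
        Properties properties = new Properties();
        try (FileReader reader = new FileReader(file)) {
            properties.load(reader);
        }
        return properties;
    }
}
```

The same pattern should apply to the ZipInputStream cases, since ZipInputStream also implements Closeable.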