intersystems / iknow Goto Github PK

View Code? Open in Web Editor NEW

50.0 50.0 18.0 187.86 MB

Community development repository for iKnow

License: MIT License

C++ 99.87% C 0.01% Makefile 0.03% Python 0.07% Shell 0.01% Batchfile 0.01% XSLT 0.02% Cython 0.01%

intersystems-iknow nlp

iknow's People

Contributors

Stargazers

Watchers

Forkers

bdeboe-test josdenysgithub laranea whitten jslevinson-cloud makorin0315 sylvainguilbaud mgoldenisc bdeboe bdotgradb isc-adang rsi7700 isc-sde ookura jwijffels vishalbelsare james-iwanicki g9020

iknow's Issues

Python interface: 'level' for Certainty is missing

The Certainty attribute in the English language model has a marker, span and level. When using the m_index property in the Python interface, the marker can be found through ['sent_attributes'], the span through ['path_attributes']. The level, currently either 0 (uncertain) or 9 (certain), should be in ['sent_attributes'] too, but it is missing.
Example:
Input = "This might be a problem."
['sent_attributes'] = [{'type': 'Certainty', 'offset_start': 7, 'offset_stop': 12, 'marker': 'might', 'value': '', 'unit': '', 'value2': '', 'unit2': '', 'entity_ref': 1}]

Permissions issue with autoupdate workflows

Hi Aohan,

you've probably seen these workflow permission issues with the autoupdate script, but wrapping them in an issue so it's tracked:

Pushing pull request branch to 'origin/autoupdate-buildcache'
  /usr/bin/git push --force-with-lease origin HEAD:refs/heads/autoupdate-buildcache
  To https://github.com/intersystems/iknow
   ! [remote rejected] HEAD -> autoupdate-buildcache (refusing to allow a GitHub App to create or update workflow `.github/workflows/dependencies.sh` without `workflows` permission)
  error: failed to push some refs to 'https://github.com/intersystems/iknow'
  Error: The process '/usr/bin/git' failed with exit code 1

Example failure: https://github.com/intersystems/iknow/runs/1471344935?check_suite_focus=true

iKnow indexing in genRAW: Time/Frequency/Duration attribute not shown if Measurement attribute in same Concept

The Frequency attribute for 'daily' is missing in the RAW output for the following example:
input:
60 mg daily
-> attributes: attr type="measurement" literal="60 mg daily" token="60 mg" value="60" unit="mg"

The Frequency attribute is however present in the genTrace output:
input:
60 mg daily
-> index="daily" labels="ENCon;ENFrequency(a:Entity,Frequency,);ENInMeasspan;

When the token with Measurement attribute is removed, 'daily' does get the Frequency attribute in the RAW output:
input:
daily
-> attr type="frequency" literal="daily." token="daily."

Segmentation fault when unsupported language is passed to iKnowEngine::index

When an unsupported language is passed as the second argument to iKnowEngine::index, a segmentation fault occurs. It would be better to throw an exception, provide an error code, or to document a precondition.

Build error on XCode 12

The iKnow engine does not build with XCode 12, which emits a warning that earlier XCode versions do not. The problem is that (value_int > 0 || value_int <= 9) is always true. Is there a mistake in this if-statement?

clang++ -std=c++14 -D_DOUBLEBYTE -DCACHE_COM_DISABLE -c -arch x86_64 -mmacosx-version-min=10.9 -stdlib=libc++ -DMY_BIG_ENDIAN=__BIG_ENDIAN__ -D_ISC_BIGENDIAN=__BIG_ENDIAN__ -DBIT64PLAT=__LP64__ -DSIZEOF_LONG=8 -DMACOSX  -fPIC -DUNIX -stdlib=libc++ -g -O3 -Wno-long-long -Werror -Wall -Wextra -pedantic-errors -fdiagnostics-show-option -Wno-parentheses -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-local-typedef -Wno-unknown-warning-option  -I/Users/travis/build/adang1345/iknow/modules/shell/src -I/Users/travis/build/adang1345/iknow/modules/shell/src/SDK/headers -I/Users/travis/build/adang1345/iknow/modules/base/src/headers -I/Users/travis/build/adang1345/iknow/modules/ali -I/Users/travis/build/adang1345/iknow/modules/core/src/headers -I/Users/travis/build/adang1345/iknow/modules/aho -I/Users/travis/build/adang1345/iknow/shared/System/unix -I/Users/travis/build/adang1345/iknow/shared/System -I/Users/travis/build/adang1345/iknow/shared/Utility -I/Users/travis/build/adang1345/iknow/kernel/common/h -I/Users/travis/build/adang1345/iknow/thirdparty/icu/include -o /Users/travis/build/adang1345/iknow/built/macx64/release/libiknowshell/Process.o /Users/travis/build/adang1345/iknow/modules/shell/src/Process.cpp
In file included from /Users/travis/build/adang1345/iknow/modules/shell/src/CompiledKnowledgebase.cpp:1:
In file included from /Users/travis/build/adang1345/iknow/modules/shell/src/CompiledKnowledgebase.h:3:
In file included from /Users/travis/build/adang1345/iknow/modules/shell/src/SharedMemoryKnowledgebase.h:17:
/Users/travis/build/adang1345/iknow/modules/shell/src/KbRule.h:146:24: error: overlapping comparisons always evaluate to true [-Werror,-Wtautological-overlap-compare]
                                        if (value_int > 0 || value_int <= 9) lexrep_length_ = static_cast<short> (value_int);
                                            ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
1 error generated.

Japanese model consumes inordinate memory when compiled by Clang

Compiling an optimized version of the Japanese model on Mac with clang consumes just about all the memory my system can give it (32GB).

I've tried to reduce this down to a test case I could report to the clang team but without success to date.

I think we need to adjust the Makefile for the Japanese model to not attempt to optimize on any platform (it causes less severe but notable problems on Linux with gcc as well, IIRC). -O0

Handle dependency on Visual C++ Redistributable for Visual Studio 2015

Importing iknowpy fails if the machine does not have Visual C++ Redistributable for Visual Studio 2015 installed.

>>> import iknowpy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python36\lib\site-packages\iknowpy\__init__.py", line 23, in <module>
    from .engine import iKnowEngine, UserDictionary
ImportError: DLL load failed: The specified module could not be found.

Specifically, we need the files msvcp140.dll, vcruntime140.dll, vcruntime140_1.dll, and concrt140.dll. I can think of 3 possible solutions:

Bundle these files into the wheel.
Document that the user must install the appropriate Visual C++ Redistributable package prior to using iknowpy.
Declare a dependency on the msvc-runtime package (https://pypi.org/project/msvc-runtime/), which installs the necessary DLLs into the Python instance.

iknowpy: Certain characters cause a shift in word and sentence boundaries

The input below contains some Greek characters which seem to mess up the detection of word boundaries. Boundaries (spaces) appear at the wrong positions, causing splitting and incomplete words. This is especially clear at the end of the sentence: the first 3 characters of the second sentence become part of the first sentence. The shift continues until the end of the input file.
The input file is UTF-8 encoded, as required.

input:
Syloïde blijkt een vergelijkbaar of zelfs groter effect te hebben op sommige parameters (bijv. 𝑎2, 𝑎3, 1 𝑡1 en 1 𝑡3) van de compressievergelijking. Dit vergelijkbare effect wordt echter vaak alleen bereikt bij een hogere concentratie Syloid in vergelijking met magnesiumstearaat.

output:
S1: Syloïde blijkt een vergelijkbaar of zelfs groter effect te hebben op sommige parameters (bijv. 𝑎2, 3, 1 𝑡1 en 1 3) van de om ressievergelijking. Dit
S2: ver elijkbare effect wor t ech er vaa < all> en ber ikt bij een hog re concentratie Syloid in ergelijking met mag esiumstearaat.

Request: support label UDCertainty for Certainty markers

Additional markers for negation, sentiment, etc. can be defined through a user dictionary. Please add this functionality for certainty markers too.

Re-enable manylinux2010_x86_64 builds

With 363be5d, I have temporarily disabled manylinux2010_x86_64 builds due to pypa/manylinux#836. When that issue is resolved, this change should be reverted.

lexrep identification discrepancy in Japanese

After adding some PathRelevant entities and simple path expansion, I’ve compared the outputs between IRIS NLP and iknowpy. I’ve found one difference which seems to result from different ways that IRIS & iknowpy identiy lexreps.

Sentence: また、大川小のある釜谷地区では住民と在勤者、来訪者計232人のうち、181人が犠牲となったとの調査結果を報告。

Lexrep identification for the part "232人のうち、181人が" in IRIS:
Lexrep("232")=Numeric
Lexrep("人")=JPCon+JPCount+JPRule3437+Lit_人
Lexrep("のうち")=JPParticlePREPO
Lexrep("、")=JPComma+Lit_、
Lexrep("181")=Numeric
Lexrep("人")=JPCon+JPCount+JPRule3437+Lit_人
Lexrep("が")=JPga+Lit_が

Lexrep identification for the same part in iknowpy:
LexrepIdentified:232:Numeric;
LexrepIdentified:人:JPCon;JPRule3437;JPCount;Lit_人;
LexrepIdentified:のうち:JPParticlePREPO;
LexrepIdentified:、:JPComma;Lit_、;
LexrepIdentified:181人:JPCon;JPNumber;Lit_1人;
LexrepIdentified:が:JPga;Lit_が;

As can be seen, “181人” is identified differently: IRIS identifies the whole chunk of numbers “181” first, whereas iknowpy identifies the lexrep “1人” first. This difference results in different indexing results for the character "が", now that it can sometimes be PathRelevant rather than NonRelevant. With the general left-to-right principle, the IRIS way should be kept.

Japanese: request ability to switch off (or to modify) Furigana detection

NOTE: this is a request that came from Dr. Torikai & Dr. Noguchi @ Gunma University Hospital.

BACKGROUND

For Japanese text, we've implemented automatic detection of Furigana. The specifications are as follows:

Consider the content of a set of parentheses （） or () NonRelevant, if the text consist of All Hiragana, All Katakana or All Numbers.
This setting (NonRelevant) is to be true, even if the word itself or part of it is labeled Concept or Relation in the lexreps.
This only applies to （） or () - i.e., it is NOT applicable to other types of brackets.

For example:

将棋の高校生プロ、藤井聡太棋聖（18）がまたしても金字塔を打ち立てた。 => All Numbers, in this example to indicate the person's age.
黎智英（ジミー・ライ）氏や**活動家の周庭（アグネス・チョウ）氏が逮捕された。 => All Katakana, to indicate pronunciation of the previously mentioned proper nouns (often of Chinese/Korean origin) in Kanji.
北海道・阿寒（あかん）湖温泉で自然体験ツアーに出かけた。 => All Hiragana, to indicate pronunciation of the previously mentioned proper noun in Kanji.

This implementation works well in most cases, as such text is just another way of describing (or supplementary information for) the Concept that immediately precedes the set of parentheses. If the text inside the parentheses contains multiple types of characters or consists of all alphabetic characters, the implementation does not apply, as likely the information is more than just a repeat of the preceding Concept.

When the specification was originally designed back in 2013, there was a request to make this feature a switch that could be turned off, but we have no such switch as of date.

WHAT WE WOULD LIKE TO EXPERIMENT

In the machine learning experiment Dr. Noguchi is conducting, he often comes across names of medications in the form: GENERIC_NAME (PRODUCT_NAME), i.g.,グリメピリド（アマリール）
Since most medication names are written in Katakana, the product name is almost always indexed as NonRelevant.

In iKnow sense, making アマリール in above example NonRelevant may not be a problem, since it is essentially repeating グリメピリド. In fact, considering アマリール to be a separate Concept may give more weight to the medication names than we need.

However, Dr. Noguchi is wondering if his model can give better results if the Furigana text is tweaked. There are a couple of different ways he wants to experiment:

Consider the content of the parentheses to be a separate entity, i.e., グリメピリド（アマリール）would yield 2 Concepts.
Consider the whole text グリメピリド（アマリール）to be one single Concept.

Either way would have impact on the Entity Vector, proximity and dominance, but his experiment may not use them.

Currently, the Furigana implementation is outside of the language model CSV files, i.e., the iKnow engine does the work. We need Jos's help to enable ability to 1) switch off the Furigana implementation if chosen by the user; and to 2) make changes to the Furigana implementation (if chosen by the user) so that it still applies to certain types of characters but not all of the default ones.

Rules with SEnd and optional elements don't fire if the input is more than 1 element shorter than the pattern

This rule has 10 elements (9 + SEnd), 3 of which are optional.
The rule fires for
"of sneezing, a sore throat and fatigue." -> 9 elements (8 + SEnd)
but not for:
"of sneezing, a headache and fatigue." -> 8 elements (7 + SEnd)

Example 2: rule 2377 in the English language model
2377;65;ENCertainty|.ENNegation|ENPBegin+ENCertStop+^ENConj|.^ENPBegin+^SEnd|ENPBegin:SEnd;||-ENCertStop|*|+ENCertStop;;

This rules has 5 elements, 2 of which are optional.
The rule fires for
"perhaps what else" -> 4 elements (3 + SEnd)
but not for:
"perhaps what" -> 3 elements (2 + SEnd)

For more concerned rules and examples, please contact me directly.

Python interface: enable use of literal labels

In IRIS, it is possible to use literal labels, which get collected automatically from the rules and added to lexreps. It would be very helpful to have that functionality in Python interface as well.

"No knowledgebases with rules loaded" exception on 32-bit Linux

I was doing some testing with 32-bit builds on Linux (where IKNOWPLAT is set to lnxrhx86), and I get an exception when I run iknowenginetest.

$ ./iknowenginetest 
*** Unit Test Failure ***
No knowledgebases with rules loaded.

GDB gives the following information for the point the exception is thrown.

(gdb) bt
#0  0xf741f16a in __cxa_throw () from /lib/i386-linux-gnu/libstdc++.so.6
#1  0xf7050c9c in iknow::shell::CProcess::CProcess (this=0xffffccbc, languageKbMap=std::map with 1 element = {...}) at /iknow/iknow/modules/shell/src/Process.cpp:57
#2  0xf760b9f6 in iKnowEngine::index (this=0xffffce44, 
    text_input=u"こんな台本でプロットされては困る、と先生言った。志望学部の決定時期につい経営関し表（）済示すだ外国人入試スポーツ推薦標大きが小くミリディングを避けめ除あ概観分かど区おも高校年最普通点一方般セタ利用合格後やう群率み非常達ら解釈注意要数値以上受験段階併願より発者ち中創価ば良考え「ま来」多存在ル勉対問題抱可能性十力レベ動機面見倣ろ", 
    utf8language="ja", b_trace=false) at /iknow/iknow/modules/engine/src/engine.cpp:326
#3  0x080543e4 in testing::iKnowUnitTests::test1 (this=0xffffcf37, pMessage=0x8058b58 "Japanese output must generate entity vectors")
    at /iknow/iknow/modules/enginetest/iKnowUnitTests.cpp:105
#4  0x0805363e in testing::iKnowUnitTests::runUnitTests () at /iknow/iknow/modules/enginetest/iKnowUnitTests.cpp:22
#5  0x0804e815 in main (argc=1, argv=0xffffd1a4) at /iknow/iknow/modules/enginetest/enginetest.cpp:109
(gdb) frame 1
#1  0xf7050c9c in iknow::shell::CProcess::CProcess (this=0xffffccbc, languageKbMap=std::map with 1 element = {...}) at /iknow/iknow/modules/shell/src/Process.cpp:57
57	in /iknow/iknow/modules/shell/src/Process.cpp
(gdb) p languageKbMap
$1 = std::map with 1 element = {[u"ja"] = 0xffffccf0}
(gdb) set cit = languageKbMap.begin()
(gdb) p cit->second
$2 = (iknow::core::IkKnowledgebase *) 0xffffccf0
(gdb) p *(cit->second)
$3 = {_vptr.IkKnowledgebase = 0xf707da5c <vtable for iknow::shell::CompiledKnowledgebase+8>, cache_ = 0x0, m_strIdentifier = ""}
(gdb) p cit->second->RuleCount()
$4 = 0

Enhancement request: move and rename compiler_report.log

Current situation: compiler_report.log is generated in the release/bin directory when lang_update.ban is run. It overwrites the existing compiler_report.log, even if it contains data from another language model.

Request: rename the file to xx_compiler_report.log where xx represents the language code of the concerned model + put it in the language_development folder, where it is needed for genTrace.py.

Python interface: collect markers of the same type per entity

When an entity contains more than one marker of the same type, e.g. two Negation markers or two DateTime markers, the m_index property in the Python interface outputs them as two separate items. It would be better to collect them into one item.

Example 1: Il n'y avaient jamais des chiens.
concerned entity: n'y avaient pas
attribute output:
[{'type': 'Negation', 'offset_start': 5, 'offset_stop': 8, 'marker': "n'y", 'value': '', 'unit': '', 'value2': '', 'unit2': '', 'entity_ref': 1},
{'type': 'Negation', 'offset_start': 17, 'offset_stop': 23, 'marker': 'jamais', 'value': '', 'unit': '', 'value2': '', 'unit2': '', 'entity_ref': 1}]

Example 2: These reports are for the 1997-1998 academic year.
concerned entity: 1997-1998 academic year
attribute output:
[{'type': 'DateTime', 'offset_start': 28, 'offset_stop': 37, 'marker': '1997-1998', 'value': '', 'unit': '', 'value2': '', 'unit2': '', 'entity_ref': 4}, {'type': 'DateTime', 'offset_start': 47, 'offset_stop': 52, 'marker': 'year.', 'value': '', 'unit': '', 'value2': '', 'unit2': '', 'entity_ref': 4}]

Compiler auto-detection required

Currently, the Makefiles require an IKNOWPLAT environment variable set to one of a few InterSystems-specific platform identifiers. This should be easy to replace with some generic definitions with good defaults for e.g. CXX.

Create a generic attribute

The iknow engine supports several attributes: negation, time, certainty, measurements, time, sentiment. However, customers may benefit from a 'generic' attribute that they can use for specific patterns in their data.

Requirements:

The generic label has markers and spans.
The markers can be defined through a user dictionary or directly in a language model.
Rules to define the position of the span (Begin, End) in a sentence can be created in a language model.

build issue with Japanese system - warning C4819

This issue happens when building iKnowEngineTest or iKnowALI on Japanese Windows and Japanese Visual Studio that gets installed, basically the setup that would be used by most Japanese users.

The warning is "The file contains a character that cannot be represented in the current code page (932). Save the file in Unicode format to prevent data loss." The warning is for the line 3153 in uchar.h: u_isWhitespace(UChar32 c);

The build can continue if "Treat warnings as errors" is set to No, but why this is happening needs to be investigated.

Support lexrep certainty level manipulation in rules processing.

This is the iKnow standalone implementation of a request to manipulate certainty levels in rules processing, as described in ISC Confluence page : https://usconfluence.iscinternal.com/pages/viewpage.action?spaceKey=ILT&title=Certainty+Levels

Part 1 : select lexreps based on certainty level conditions (rule matching).
Part 2 : manipulate lexrep certainty levels (rule output actions).
Part 3 : the generic "Certainty" label, and how it relates to certainty levels.
Part 4 : joining lexreps: how to handle certainty levels.

iKnow indexing: .^SEnd|SEnd is not processed correctly

Rules containing the pattern ".^SEnd|SEnd" do not fire on that pattern.

Example:
Rule
2363;65;SBegin|ENCertBegin|","|ENPBegin+ENCertStop|.^ENPBegin+^SEnd|ENPBegin:SEnd;|||-ENCertStop||+ENCertStop;;

Input
Clearly, regional measurements can provide more detailed information about morphologic changes.
-> Actual pattern = ".^ENPBegin+^SEnd|SEnd" -> rule doesn't fire

Input
Clearly, regional measurements can provide more detailed information about morphologic changes that cannot be gained by globally averaged evaluations alone.
-> Actual pattern = ".^ENPBegin+^SEnd|ENPBegin" -> rule does fire

The second example shows that the rule is applicable for this input.

UIMA support?

The UIMA standard enables interoperability between different NLP and more general unstructured data processing tools. The iKnow technology embedded in the InterSystems IRIS Data Platform where this repo has its roots has supported calling the iKnow engine through the UIMA interface and conceptually, that would be a very reasonable entry point to publish to this open source project as well.

This placeholder is more of an open question rather than an outright commitment or project plan, as UIMA has a Java-based implementation and the UIMACPP bridge we'd have to lean on is not seeing as much dev activity, so eager to see +1s or potential users chime in on their needs for this before we plan any dev work.

Japanese kb_data.inl appears 'modified' because of random bytes

Running lang_update.bat modifies modules\aho\inl\ja\kb_data.inl even if the language model has not changed.

Japanese: 'MissingEntityVector' warning should be included in Trace

When iKnow processes Japanese text, it's possible that a Concept doesn't get any EntityVector attribute such as Subject, Object, Topic, OtherEntity, and DateTime (which all include EVValue). In such case, iKnow assigns the lowest priority EV value so that the entity gets included in the Entity Vector.

For example:

吊（つ）り戸棚からはだし類が次々と15袋以上出てくる。

Since the Furigana（つ）interferes with grammatical interpretation of the sentence, the character 吊 cannot receive any EV attribute, thus receiving the lowest EV priority.

This happens mostly when the sentence's format is unconventional, like in this example, but it's possible that a new rule that doesn't take EVs into consideration causes a similar situation.

In the IRIS version of genTrace, the generated trace includes a line like below to indicate which entity of which sentence within the input file was a Concept without EV attribute:

*** MissingEntityVector !Lexrep("吊")=JPVerbOther

However, when genTrace.py script gets run against a file that includes such sentence, "MissingEntityVector" appears after the input file name in the Command Prompt but there is no further information. The generated trace log file doesn't include any additional information about "MissingEntityVector" either. This is a problem, since it will not be possible to identify which entity/sentence is generating such warning.

Request for language model development: emit a warning if phase of rule and label don't match

Rules are organized in "phases". The label definitions contain a list of phases in which the label can be used.
If a label is used in a rule in a phase for which it is not defined, the rule won't fire, because the input pattern is not found. Can you let the engine emit a warning about the mismatch in phases?

iknowpy's setup.py uses clang instead of clang++ on Mac

I'm trying to follow the instructions (currently ISC-local but should be posted here) to build the iknowpy integration. But it looks like the tool is invoking clang rather than clang++, which is very unhappy with all the C++ constructs in the iKnow source.

ambassador:iknowpy woodfin$ python3 setup.py build_ext --inplace
Compiling iknowpy.pyx because it changed.
[1/1] Cythonizing iknowpy.pyx
running build_ext
building 'iknowpy' extension
creating build
creating build/temp.macosx-10.14-x86_64-3.7
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk -I/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/include -I/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -I. -I../engine/src -I../core/src/headers -I../base/src/headers -I/usr/local/Cellar/icu4c/64.2/include -I/usr/local/include -I/usr/local/opt/[email protected]/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/include/python3.7m -c iknowpy.cpp -o build/temp.macosx-10.14-x86_64-3.7/iknowpy.o
In file included from iknowpy.cpp:651:
In file included from ./../core/src/headers/IkConceptProximity.h:10:
../base/src/headers/PoolAllocator.h:60:29: error: 'T' does not refer to a value
        size_t alignment = alignof(T);
... <many more errors> ...

Missing characters in output for 'Sentence_found' (part of engine.m_traces in genTrace.py)

Although this issue is about a Python script, it actually concerns something that needs to be changed in the engine.

Where PreprocessToken introduces a space within a token (thus creating 2 tokens), the last character of the first part is replaced by space in Sentence_found, e.g. PreprocessToken "It's" -> "It 's", Sentence_found "I 's".

build issue with Japanese system - compiler errors C2001 & C2143

This issue happens when building iKnowEngineTest or iKnowALI on Japanese Windows and Japanese Visual Studio that gets installed, basically the setup that would be used by most Japanese users.

Error C2001 - newline in constant
Error C2143 - missing ')' before 'if'
Error C2143 - missing ';' before 'if'

These errors occur in the file enginetest.cpp for the line that starts with if (language_code == "ja") return. It fails because of the Japanese text. I've found that some very short text may work, but they are certainly not long enough to be called sentences:

お金 - works
お金をなくした - doesn't work
コーヒー - works
コーヒー代 - works
テキスト - doesn't work
ＡＢＣＺ - works (all double-width characters)
ＡＢＣＺ。 - doesn't work (same as above)
雅子 - doesn't work
まさこ - works

iKnow indexing in genRAW: only one Measurement value and unit shown when more than one value and unit in same Concept

If more than one value and unit are present in the same Concept only one value and one unit are shown in the RAW output. In the following example only '7 ounces' gets an attribute, and '7 pounds' doesn't.

input:
The baby weighs 7 pounds 7 ounces.

current output:
<attr type="measurement" literal="7 pounds 7 ounces." token="7 pounds 7 ounces." value="7" unit="ounces">

desired output:
<attr type="measurement" literal="7 pounds 7 ounces." token="7 pounds 7 ounces." value="7" unit="pounds" value2="7" unit2="ounces">

C API for the iKnow engine

The IRIS / iKnow engine interface has traditionally been specified as C++ APIs. This was fine for internal use, as we controlled the entire build & deployment infrastructure.

But a pure C API would have advantages for cross-compiler compatibility & deployment. For integration with spaCy and other tools, we should consider supporting such an API, either exclusively or supplementing the "richer" C++ APIs.

Python Interface: supplying attribute properties through the User Dictionary

With #37 now out of the way (thanks @JosDenysGitHub !), we'll want to have a means to seed (or even override?) this kind of attribute properties through the User Dictionary, on top of the ones in the language models. The level property for the certainty attribute is the first and foremost example of these.

We'll need this first on the C++ side and then expose it in the iknowpy.UserDictionary interface on the Python end

Stop lang_update in case of non-matching number of labels in rule input and output

During lang_update processing a rule with non-matching number of labels in the input and output goes unnoticed. The language model compilation should stop, so the conflict can be resolved.

Example:
;39;CSPrep+^"před":"po":"za"|CSNum+^CSPartTime|CSYear|^CSEndTime;||-CSTimeConcept-CSTime;
-> 4 input labels and 3 output labels

Result:
Successfully installed iknowpy

Entity Vector values need to be availble from Python interface for Japanese text

When I run the test Python interface program, I get the following with Japanese text, i.e., Proximity & Path are empty, which is not the case with English text. Entity Vector (Path for Japnese) is probably the most important part of output for Japanese, so please make it available.

(base) C:\Users\mohira\Documents>python test.py
Languages Set:
{'cs', 'sv', 'ru', 'ja', 'nl', 'de', 'fr', 'pt', 'uk', 'es', 'en'}

Input text:
これはiKnowエンジンへのPythonインターフェースです。

Index:
{'proximity': [],
'sentences': [{'entities': [{'dominance_value': 0.0,
'entity_id': 0,
'index': 'これ',
'offset_start': 0,
'offset_stop': 2,
'type': 'NonRelevant'},
{'dominance_value': 0.0,
'entity_id': 0,
'index': 'は',
'offset_start': 2,
'offset_stop': 3,
'type': 'NonRelevant'},
{'dominance_value': 9.223372036854776e+18,
'entity_id': 1,
'index': 'iknowエンジン',
'offset_start': 3,
'offset_stop': 12,
'type': 'Concept'},
{'dominance_value': 0.0,
'entity_id': 2,
'index': 'へ',
'offset_start': 12,
'offset_stop': 13,
'type': 'Relation'},
{'dominance_value': 0.0,
'entity_id': 0,
'index': 'の',
'offset_start': 13,
'offset_stop': 14,
'type': 'NonRelevant'},
{'dominance_value': 9.223372036854776e+18,
'entity_id': 3,
'index': 'pythonインターフェース',
'offset_start': 14,
'offset_stop': 28,
'type': 'Concept'},
{'dominance_value': 0.0,
'entity_id': 4,
'index': 'です',
'offset_start': 28,
'offset_stop': 30,
'type': 'Relation'},
{'dominance_value': 0.0,
'entity_id': 0,
'index': '。',
'offset_start': 30,
'offset_stop': 31,
'type': 'NonRelevant'}],
'path': [],
'path_attributes': [],
'sent_attributes': []}]}

Stop lang_update in case of duplicate rule numbers

During lang_update processing a duplicate rule number goes unnoticed. The language model compilation should stop, so the conflict can be resolved.

Example:
852;39;^SBegin:CSPunctuation|"no"+CSCapitalInitial;*|CSCon+CSBeforeNumber;
852;39;"no"+CSCapitalAll;CSCon+CSBeforeNumber;
-> By first commenting out rule 852 and later reactivating the rule, rule number '852' existed twice in the rules file

Result:
Successfully installed iknowpy

genRAW script: normalized lowercase alphabet characters are emitted for Entity Vectors

For alphabet characters within the Entity Vector, the genRAW python script currently uses normalized lowercase single-width characters, even if the original text uses uppercase or double-width alphabet. The expectation is to get the literal values, i.e., identical to the original text.

Request for language model development: emit a warning if unknown labels are used in lexreps.csv

If an undefined label is used in lexreps.csv, the build of the language compiler fails with this error:
57>C:\iKnow_GH\modules\aho\inl\de\lexrep\MatchObjs.inl(1,22): error C2466: cannot allocate an array of constant siz
e 0 [C:\iKnow_GH\modules\aho\model0_de.vcxproj]

It would be more convenient to get a warning like "undefined label DEVerbadj in lexreps.csv, line 32424".

Stop lang_update in case of duplicate regular expressions

During lang_update processing a duplicate regular expression goes unnoticed. The language model compilation should stop, so the conflict can be resolved.

Example:
footnote;[\d{1,3}]
footnote;[\d{3}]
-> the same regular expression name was used twice

Result:
Successfully installed iknowpy

ref_testing.py crashes with SIGSEGV on Mac OS X

I am trying to add a run of ref_testing.py to the CI workflow. It runs smoothly on Windows and Linux but crashes the Python interpreter on Mac OS X.

Here is the script output.

GENERATING RAW OUTPUT
Processing cs_core.txt
language: cs
Processing ja_core.txt
language: ja
Processing de_core.txt
language: de
Processing ru_core.txt
language: ru
Processing en_core.txt
language: en
Processing uk_core.txt
language: uk
zsh: segmentation fault  python3 ref_testing.py

And here are the crash details.

Process:               Python [821]
Path:                  /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/Resources/Python.app/Contents/MacOS/Python
Identifier:            Python
Version:               3.8.2 (3.8.2)
Build Info:            python3-73040006000000~117
Code Type:             X86-64 (Native)
Parent Process:        zsh [676]
Responsible:           Terminal [673]
User ID:               501

Date/Time:             2021-01-12 09:12:35.686 -0500
OS Version:            macOS 11.1 (20C69)
Report Version:        12
Anonymous UUID:        4A4343A2-84B4-48A9-9F69-AD3834F1FA06


Time Awake Since Boot: 970 seconds

System Integrity Protection: disabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x00007f99a78ffffe
Exception Note:        EXC_CORPSE_NOTIFY

Termination Signal:    Segmentation fault: 11
Termination Reason:    Namespace SIGNAL, Code 0xb
Terminating Process:   exc handler [821]

VM Regions Near 0x7f99a78ffffe:
    MALLOC_TINY              7f99a7700000-7f99a7800000 [ 1024K] rw-/rwx SM=PRV  
--> 
    MALLOC_LARGE_REUSABLE    7f99a7900000-7f99a791f000 [  124K] rw-/rwx SM=PRV  

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libiknowcore-36353c21.dylib   	0x0000000108afce43 iknow::core::IkIndexProcess::FindNextSentence(iknow::core::IkIndexInput*, std::__1::vector<iknow::core::IkLexrep, iknow::base::PoolAllocator<iknow::core::IkLexrep> >&, int&, unsigned long, bool, std::__1::basic_string<char16_t, std::__1::char_traits<char16_t>, std::__1::allocator<char16_t> >&, double&, iknow::core::IkKnowledgebase*, double, int) + 1907
1   libiknowcore-36353c21.dylib   	0x0000000108af8b81 iknow::core::IkIndexProcess::Start(iknow::core::IkIndexInput*, iknow::core::IkIndexOutput*, iknow::core::IkIndexDebug<std::__1::list<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > >*, bool, bool, bool, unsigned long, iknow::core::IkKnowledgebase*) + 1345
2   libiknowshell-1bceda57.dylib  	0x0000000108aae95d iknow::shell::CProcess::IndexFunc(iknow::core::IkIndexInput&, void (*)(iknow::core::IkIndexOutput*, iknow::core::IkIndexDebug<std::__1::list<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > >*, void*, iknow::core::IkStemmer<std::__1::basic_string<char16_t, std::__1::char_traits<char16_t>, std::__1::allocator<char16_t> >, std::__1::basic_string<char16_t, std::__1::char_traits<char16_t>, std::__1::allocator<char16_t> > >*), void*, bool, bool) + 653
3   libiknowengine-2c57ddd0.dylib 	0x0000000108a3c472 iKnowEngine::index(std::__1::basic_string<char16_t, std::__1::char_traits<char16_t>, std::__1::allocator<char16_t> >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool) + 498
4   libiknowengine-2c57ddd0.dylib 	0x0000000108a40dd5 iKnowEngine::index(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool) + 53
5   engine.cpython-38-darwin.so   	0x00000001089ee6f4 __pyx_pw_7iknowpy_6engine_11iKnowEngine_7index(_object*, _object*, _object*) + 3604
6   engine.cpython-38-darwin.so   	0x00000001089df3ac __Pyx_CyFunction_CallAsMethod(_object*, _object*, _object*) + 92
7   com.apple.python3             	0x000000010839bdd6 _PyObject_MakeTpCall + 374
8   com.apple.python3             	0x000000010839f185 method_vectorcall + 229
9   com.apple.python3             	0x000000010847c012 call_function + 354
10  com.apple.python3             	0x00000001084786d6 _PyEval_EvalFrameDefault + 29782
11  com.apple.python3             	0x000000010839c78d function_code_fastcall + 237
12  com.apple.python3             	0x000000010847c012 call_function + 354
13  com.apple.python3             	0x000000010847878a _PyEval_EvalFrameDefault + 29962
14  com.apple.python3             	0x000000010847d097 _PyEval_EvalCodeWithName + 3287
15  com.apple.python3             	0x00000001084711e0 PyEval_EvalCode + 48
16  com.apple.python3             	0x00000001084c2933 PyRun_FileExFlags + 291
17  com.apple.python3             	0x00000001084c1d9f PyRun_SimpleFileExFlags + 271
18  com.apple.python3             	0x00000001084e1267 Py_RunMain + 2103
19  com.apple.python3             	0x00000001084e1793 pymain_main + 403
20  com.apple.python3             	0x00000001084e17eb Py_BytesMain + 43
21  libdyld.dylib                 	0x00007fff20445621 start + 1

Thread 0 crashed with X86 Thread State (64-bit):
  rax: 0x00007f99a7900000  rbx: 0x0000000000000000  rcx: 0x00007f99a7900000  rdx: 0x0000000000000000
  rdi: 0x00007ffee789bc20  rsi: 0x0000000000000004  rbp: 0x00007ffee789af20  rsp: 0x00007ffee789ada0
   r8: 0x00007f99a6aa53b0   r9: 0x00007f99a6aa4f30  r10: 0x00007f99a6921340  r11: 0x0000000000000001
  r12: 0x0000000108b32120  r13: 0x0000000000000000  r14: 0x00007f99a7900000  r15: 0x00007f99a7900000
  rip: 0x0000000108afce43  rfl: 0x0000000000010293  cr2: 0x00007f99a78ffffe
  
Logical CPU:     1
Error Code:      0x00000004 (no mapping for user data read)
Trap Number:     14

Thread 0 instruction stream:
  18 ff ff ff 4c 8b 7d a0-0f 83 47 01 00 00 66 2e  ....L.}...G...f.
  0f 1f 84 00 00 00 00 00-0f 1f 44 00 00 41 8b 07  ..........D..A..
  ff c0 41 89 07 f6 45 cc-01 0f 85 26 01 00 00 48  ..A...E....&...H
  63 c8 48 39 8d 10 ff ff-ff 0f 87 07 fb ff ff e9  c.H9............
  11 01 00 00 49 63 c4 48-8b 4d a8 4c 8d 34 41 49  ....Ic.H.M.L.4AI
  63 07 48 8d 04 41 4c 8d-25 e0 52 03 00 49 89 c7  c.H..AL.%.R..I..
 [0f]b7 40 fe 66 83 f8 7f-77 43 8d 48 d0 41 bd 03  [email protected]..	<==
  00 00 00 66 83 f9 0a 72-5c 89 c1 83 e1 df 83 c1  ...f...r\.......
  bf 66 83 f9 1a 72 4e 41-bd 03 00 00 00 66 83 f8  .f...rNA.....f..
  0d 77 42 b9 00 34 00 00-0f a3 c1 73 38 4d 39 f7  .wB..4.....s8M9.
  77 1b eb 23 66 0f 1f 84-00 00 00 00 00 0f b7 f8  w..#f...........
  e8 48 a0 02 00 41 89 c5-4d 39 f7 76 0a 49 8d 47  .H...A..M9.v.I.G
  
Thread 0 last branch register state not available.


Binary Images:
       0x108363000 -        0x108366fff  com.apple.python3 (3.8.2 - 3.8.2) <6BA1329E-13BF-3FAB-ABA2-E57C991180AC> /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/Resources/Python.app/Contents/MacOS/Python
       0x108378000 -        0x1085c7fff  com.apple.python3 (3.8.2 - 3.8.2) <EC3F4640-FA3E-3557-88A7-A8402EDEFFFB> /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/Python3
       0x108895000 -        0x108898fff +_heapq.cpython-38-darwin.so (73.40.6) <A1BA08C1-54F6-3C26-8634-0FDA853D57DC> /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/lib-dynload/_heapq.cpython-38-darwin.so
       0x1088a5000 -        0x1088a8fff +_opcode.cpython-38-darwin.so (73.40.6) <FB35708A-9CB9-3C35-9EF8-C8392E91A29A> /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/lib-dynload/_opcode.cpython-38-darwin.so
       0x1088f5000 -        0x108900fff +libicuio-dc9f50b5.68.2.dylib (0) <9496F364-D088-3C24-88BF-E51D492FD765> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libicuio-dc9f50b5.68.2.dylib
       0x108909000 -        0x10890cfff +libiknowmodelcom-abcf95e0.dylib (0) <662B9ADA-022F-30FE-92DE-7F027019C8BA> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelcom-abcf95e0.dylib
       0x1089d5000 -        0x1089f8fff +engine.cpython-38-darwin.so (0) <D0C9DFA3-2BE6-36D0-9B94-AF3FF31F59C0> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/engine.cpython-38-darwin.so
       0x108a1d000 -        0x108a48fff +libiknowengine-2c57ddd0.dylib (0) <581F1DDC-842C-3CB7-95F8-A35B77672BD0> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowengine-2c57ddd0.dylib
       0x108a71000 -        0x108a80fff +libiknowbase-e6b6bc45.dylib (0) <F03732C4-8DAB-3F12-BA50-A6602B6F1673> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowbase-e6b6bc45.dylib
       0x108a99000 -        0x108ab8fff +libiknowshell-1bceda57.dylib (0) <8FD2F6B9-CFB1-383B-AC6D-CBA73A556551> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowshell-1bceda57.dylib
       0x108ad5000 -        0x108b30fff +libiknowcore-36353c21.dylib (0) <D8BB5AD4-45E5-3844-8F2D-A621E1EDF0E7> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowcore-36353c21.dylib
       0x108b75000 -        0x108d64fff +libicui18n-66033737.68.2.dylib (0) <6E6161F5-D061-3DD5-AABD-6CFDF9C0A5DE> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libicui18n-66033737.68.2.dylib
       0x108e75000 -        0x108fd0fff +libicuuc-38a45a2b.68.2.dylib (0) <869DF4C9-1805-3B5F-9CC2-A762118A470D> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libicuuc-38a45a2b.68.2.dylib
       0x109045000 -        0x10ab84fff +libicudata-fc5cc17d.68.2.dylib (0) <9B328C2D-4277-36F3-87B5-B4842F4B95BC> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libicudata-fc5cc17d.68.2.dylib
       0x10ab89000 -        0x10ab90fff +libiknowali-86fc8c44.dylib (0) <3A48AC15-501D-3EE7-9A24-9A2C67FD1DAA> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowali-86fc8c44.dylib
       0x10ab99000 -        0x10af10fff +libiknowmodelde-0c73cc5e.dylib (0) <E9DF6E9A-D6CF-39B7-A5F3-B34DE53DC1C3> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelde-0c73cc5e.dylib
       0x10af21000 -        0x10af28fff +libiknowmodeldex-f9dbd60a.dylib (0) <F4A9D38B-2FC2-3E39-9FB7-2EEFBF92C53F> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodeldex-f9dbd60a.dylib
       0x10af35000 -        0x10b5a4fff +libiknowmodelen-1f3801c6.dylib (0) <4211658B-7B0A-387F-8298-70D8F7752DFB> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelen-1f3801c6.dylib
       0x10b5e9000 -        0x10b610fff +libiknowmodelenx-24cb2a74.dylib (0) <5F55A0CC-4BA8-3D20-A68F-9A310CBE2C06> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelenx-24cb2a74.dylib
       0x10b61d000 -        0x10bd08fff +libiknowmodeles-abff9de2.dylib (0) <92768581-E483-334E-A94D-86E0896C4E78> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodeles-abff9de2.dylib
       0x10bd25000 -        0x10bd2cfff +libiknowmodelesx-72f179bf.dylib (0) <BC07BD7A-3098-32B0-9957-32A2187404E0> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelesx-72f179bf.dylib
       0x10bd39000 -        0x10c448fff +libiknowmodelfr-3dd7ab3d.dylib (0) <C34FE180-294B-31D6-9031-B8ADB2E0E8BE> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelfr-3dd7ab3d.dylib
       0x10c469000 -        0x10c470fff +libiknowmodelfrx-bfcfa387.dylib (0) <A829EF38-EA2F-3B99-9FBE-57C1F08B475B> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelfrx-bfcfa387.dylib
       0x10c47d000 -        0x10ca3cfff +libiknowmodelja-ca57a02b.dylib (0) <49E10BB0-5196-3FD0-B851-950978572A89> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelja-ca57a02b.dylib
       0x10cb5d000 -        0x10cb64fff +libiknowmodeljax-f84c9c91.dylib (0) <6BC0C101-BC09-3EA3-9EEF-4A574512B303> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodeljax-f84c9c91.dylib
       0x10cb71000 -        0x10cf84fff +libiknowmodelnl-40411dbf.dylib (0) <02FB0098-9EF0-3497-A8EA-424279782E59> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelnl-40411dbf.dylib
       0x10cfa5000 -        0x10cfacfff +libiknowmodelnlx-1c7a6666.dylib (0) <C3BDAB87-0179-324A-AA21-1E36521F1016> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelnlx-1c7a6666.dylib
       0x10cfb9000 -        0x10da4cfff +libiknowmodelpt-5c2f4caf.dylib (0) <7734CFDB-B28D-3582-8531-08205C03B255> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelpt-5c2f4caf.dylib
       0x10da69000 -        0x10da70fff +libiknowmodelptx-85d4654a.dylib (0) <04E02350-C643-3EE3-8D33-4405AD74C493> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelptx-85d4654a.dylib
       0x10da7d000 -        0x10dce4fff +libiknowmodelru-8ea2598f.dylib (0) <045F1501-416B-3513-AD34-18A8F0207F3C> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelru-8ea2598f.dylib
       0x10dd01000 -        0x10dd28fff +libiknowmodelrux-c5df6bc8.dylib (0) <DF613DD8-47B6-37C1-AF9B-EC27913DE18A> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelrux-c5df6bc8.dylib
       0x10dd39000 -        0x10df10fff +libiknowmodeluk-f8322ec4.dylib (0) <E6F6CF0E-A7AE-3EE1-8C9A-5958277CB6BB> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodeluk-f8322ec4.dylib
       0x10df31000 -        0x10df54fff +libiknowmodelukx-fef454ad.dylib (0) <7CF0F9AA-7FF9-3708-8B6C-2053C1C4CA09> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelukx-fef454ad.dylib
       0x10df65000 -        0x10e554fff +libiknowmodelsv-e28a4052.dylib (0) <BB9E995F-9A80-34EA-A9E2-9278ED856D2E> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelsv-e28a4052.dylib
       0x10e571000 -        0x10e57cfff +libiknowmodelsvx-11383aed.dylib (0) <BAED8D7D-3280-3ED0-B6FB-36F8E5E953C9> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelsvx-11383aed.dylib
       0x10e589000 -        0x10f388fff +libiknowmodelcs-8de28af0.dylib (0) <80B5985B-F1FF-3845-8442-D6E96CE8AEE4> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelcs-8de28af0.dylib
       0x10f3b9000 -        0x10f3d0fff +libiknowmodelcsx-82298ec8.dylib (0) <95DB5E6D-DFC3-3ADA-B18B-6F1587F0570A> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelcsx-82298ec8.dylib
       0x112dc6000 -        0x112e61fff  dyld (832.7.1) <DEA51514-B4E8-3368-979B-89D0F8397ABC> /usr/lib/dyld
    0x7fff2015f000 -     0x7fff20160fff  libsystem_blocks.dylib (78) <9CF131C6-16FB-3DD0-B046-9E0B6AB99935> /usr/lib/system/libsystem_blocks.dylib
    0x7fff20161000 -     0x7fff20196fff  libxpc.dylib (2038.40.38) <003A027D-9CE3-3794-A319-88495844662D> /usr/lib/system/libxpc.dylib
    0x7fff20197000 -     0x7fff201aefff  libsystem_trace.dylib (1277.50.1) <48C14376-626E-3C81-B0F5-7416E64580C7> /usr/lib/system/libsystem_trace.dylib
    0x7fff201af000 -     0x7fff2024dfff  libcorecrypto.dylib (1000.60.19) <92F0211E-506E-3760-A3C2-808BF3905C07> /usr/lib/system/libcorecrypto.dylib
    0x7fff2024e000 -     0x7fff2027afff  libsystem_malloc.dylib (317.40.8) <2EF43B96-90FB-3C50-B73E-035238504E33> /usr/lib/system/libsystem_malloc.dylib
    0x7fff2027b000 -     0x7fff202bffff  libdispatch.dylib (1271.40.12) <CEF1460B-1362-381A-AE69-6BCE2D8C215B> /usr/lib/system/libdispatch.dylib
    0x7fff202c0000 -     0x7fff202f9fff  libobjc.A.dylib (818.2) <339EDCD0-5ABF-362A-B9E5-8B9236C8D36B> /usr/lib/libobjc.A.dylib
    0x7fff202fa000 -     0x7fff202fcfff  libsystem_featureflags.dylib (28.60.1) <7B4EBDDB-244E-3F78-8895-566FE22288F3> /usr/lib/system/libsystem_featureflags.dylib
    0x7fff202fd000 -     0x7fff20385fff  libsystem_c.dylib (1439.40.11) <06D9F593-C815-385D-957F-2B5BCC223A8A> /usr/lib/system/libsystem_c.dylib
    0x7fff20386000 -     0x7fff203dbfff  libc++.1.dylib (904.4) <AE3A940A-7A9C-3F99-B175-3511528D8DFE> /usr/lib/libc++.1.dylib
    0x7fff203dc000 -     0x7fff203f4fff  libc++abi.dylib (904.4) <DDFCBF9C-432D-3B8A-8641-578D2EDDCAD8> /usr/lib/libc++abi.dylib
    0x7fff203f5000 -     0x7fff20423fff  libsystem_kernel.dylib (7195.60.75) <4BD61365-29AF-3234-8002-D989D295FDBB> /usr/lib/system/libsystem_kernel.dylib
    0x7fff20424000 -     0x7fff2042ffff  libsystem_pthread.dylib (454.60.1) <8DD3A0BC-2C92-31E3-BBAB-CE923A4342E4> /usr/lib/system/libsystem_pthread.dylib
    0x7fff20430000 -     0x7fff2046afff  libdyld.dylib (832.7.1) <2F8A14F5-7CB8-3EDD-85EA-7FA960BBC04E> /usr/lib/system/libdyld.dylib
    0x7fff2046b000 -     0x7fff20474fff  libsystem_platform.dylib (254.60.1) <3F7F6461-7B5C-3197-ACD7-C8A0CFCC6F55> /usr/lib/system/libsystem_platform.dylib
    0x7fff20475000 -     0x7fff204a0fff  libsystem_info.dylib (542.40.3) <0979757C-5F0D-3F5A-9E0E-EBF234B310AF> /usr/lib/system/libsystem_info.dylib
    0x7fff204a1000 -     0x7fff2093cfff  com.apple.CoreFoundation (6.9 - 1770.300) <7AADB19E-8EA2-3C9B-8699-F206DB47C6BE> /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation
    0x7fff22560000 -     0x7fff227c1fff  libicucore.A.dylib (66109) <6C0A0196-2778-3035-81CE-7CA48D6C0628> /usr/lib/libicucore.A.dylib
    0x7fff227c2000 -     0x7fff227cbfff  libsystem_darwin.dylib (1439.40.11) <BD269412-C9D0-32EE-B42B-B09A187A9B95> /usr/lib/system/libsystem_darwin.dylib
    0x7fff22bdc000 -     0x7fff22be7fff  libsystem_notify.dylib (279.40.4) <98D74EEF-60D9-3665-B877-7BE1558BA83E> /usr/lib/system/libsystem_notify.dylib
    0x7fff24b37000 -     0x7fff24b45fff  libsystem_networkextension.dylib (1295.60.5) <F476B1CB-3561-30C5-A78E-44E99B1720A3> /usr/lib/system/libsystem_networkextension.dylib
    0x7fff24ba3000 -     0x7fff24bb9fff  libsystem_asl.dylib (385) <940C5BB9-4928-3A63-97F2-132797C8B7E5> /usr/lib/system/libsystem_asl.dylib
    0x7fff262d0000 -     0x7fff262d7fff  libsystem_symptoms.dylib (1431.60.1) <88F35AAC-746F-3176-81DF-49CE3D285636> /usr/lib/system/libsystem_symptoms.dylib
    0x7fff28604000 -     0x7fff28614fff  libsystem_containermanager.dylib (318.60.1) <4ED09A19-04CC-3464-9EFB-F674932020B5> /usr/lib/system/libsystem_containermanager.dylib
    0x7fff29314000 -     0x7fff29317fff  libsystem_configuration.dylib (1109.60.2) <C57B346B-0A03-3F87-BCAC-87B702FA0719> /usr/lib/system/libsystem_configuration.dylib
    0x7fff29318000 -     0x7fff2931cfff  libsystem_sandbox.dylib (1441.60.4) <8CE27199-D633-31D2-AB08-56380A1DA9FB> /usr/lib/system/libsystem_sandbox.dylib
    0x7fff29f27000 -     0x7fff29f29fff  libquarantine.dylib (119.40.2) <19D42B9D-3336-3543-AF75-6E605EA31599> /usr/lib/system/libquarantine.dylib
    0x7fff2a4a9000 -     0x7fff2a4adfff  libsystem_coreservices.dylib (127) <A2D875B9-8BA8-33AD-BE92-ADAB915A8D5B> /usr/lib/system/libsystem_coreservices.dylib
    0x7fff2a6c4000 -     0x7fff2a70ffff  libsystem_m.dylib (3186.40.2) <0F98499E-662F-36EC-AB58-91A8D5A0FB74> /usr/lib/system/libsystem_m.dylib
    0x7fff2a711000 -     0x7fff2a716fff  libmacho.dylib (973.4) <28AE1649-22ED-3C4D-A232-29D37F821C39> /usr/lib/system/libmacho.dylib
    0x7fff2a733000 -     0x7fff2a73efff  libcommonCrypto.dylib (60178.40.2) <1D0A75A5-DEC5-39C6-AB3D-E789B8866712> /usr/lib/system/libcommonCrypto.dylib
    0x7fff2a73f000 -     0x7fff2a749fff  libunwind.dylib (200.10) <C5792A9C-DF0F-3821-BC14-238A78462E8A> /usr/lib/system/libunwind.dylib
    0x7fff2a74a000 -     0x7fff2a751fff  liboah.dylib (203.13.2) <FF72E19B-3B02-34D4-A821-3397BB28AC02> /usr/lib/liboah.dylib
    0x7fff2a752000 -     0x7fff2a75cfff  libcopyfile.dylib (173.40.2) <89483CD4-DA46-3AF2-AE78-FC37CED05ACC> /usr/lib/system/libcopyfile.dylib
    0x7fff2a75d000 -     0x7fff2a764fff  libcompiler_rt.dylib (102.2) <0DB26EC8-B4CD-3268-B865-C2FC07E4D2AA> /usr/lib/system/libcompiler_rt.dylib
    0x7fff2a765000 -     0x7fff2a767fff  libsystem_collections.dylib (1439.40.11) <D40D8097-0ABF-3645-B065-168F43ACFF4C> /usr/lib/system/libsystem_collections.dylib
    0x7fff2a768000 -     0x7fff2a76afff  libsystem_secinit.dylib (87.60.1) <99B5FD99-1A8B-37C1-BD70-04990FA33B1C> /usr/lib/system/libsystem_secinit.dylib
    0x7fff2a76b000 -     0x7fff2a76dfff  libremovefile.dylib (49.40.3) <750012C2-7097-33C3-B796-2766E6CDE8C1> /usr/lib/system/libremovefile.dylib
    0x7fff2a76e000 -     0x7fff2a76efff  libkeymgr.dylib (31) <2C7B58B0-BE54-3A50-B399-AA49C19083A9> /usr/lib/system/libkeymgr.dylib
    0x7fff2a76f000 -     0x7fff2a776fff  libsystem_dnssd.dylib (1310.60.4) <81EFC44D-450E-3AA3-AC8F-D7EF68F464B4> /usr/lib/system/libsystem_dnssd.dylib
    0x7fff2a777000 -     0x7fff2a77cfff  libcache.dylib (83) <2F7F7303-DB23-359E-85CD-8B2F93223E2A> /usr/lib/system/libcache.dylib
    0x7fff2a77d000 -     0x7fff2a77efff  libSystem.B.dylib (1292.60.1) <A7FB4899-9E04-37ED-9DD8-8FFF0400879C> /usr/lib/libSystem.B.dylib
    0x7fff2a77f000 -     0x7fff2a782fff  libfakelink.dylib (3) <34B6DC95-E19A-37C0-B9D0-558F692D85F5> /usr/lib/libfakelink.dylib
    0x7fff2a783000 -     0x7fff2a783fff  com.apple.SoftLinking (1.0 - 1) <90D679B3-DFFD-3604-B89F-1BCF70B3EBA4> /System/Library/PrivateFrameworks/SoftLinking.framework/Versions/A/SoftLinking
    0x7fff2dd0c000 -     0x7fff2dd0cfff  liblaunch.dylib (2038.40.38) <05A7EFDD-4111-3E4D-B668-239B69DE3D0F> /usr/lib/system/liblaunch.dylib
    0x7fff301b9000 -     0x7fff301b9fff  libsystem_product_info_filter.dylib (8.40.1) <7CCAF1A8-F570-341E-B275-0C80B092F8E0> /usr/lib/system/libsystem_product_info_filter.dylib

External Modification Summary:
  Calls made by other processes targeting this process:
    task_for_pid: 0
    thread_create: 0
    thread_set_state: 0
  Calls made by this process:
    task_for_pid: 0
    thread_create: 0
    thread_set_state: 0
  Calls made by all processes on this machine:
    task_for_pid: 475
    thread_create: 0
    thread_set_state: 0

VM Region Summary:
ReadOnly portion of Libraries: Total=608.8M resident=0K(0%) swapped_out_or_unallocated=608.8M(100%)
Writable regions: Total=96.1M written=0K(0%) resident=0K(0%) swapped_out=0K(0%) unallocated=96.1M(100%)
 
                                VIRTUAL   REGION 
REGION TYPE                        SIZE    COUNT (non-coalesced) 
===========                     =======  ======= 
Kernel Alloc Once                    8K        1 
MALLOC                            73.6M       55 
MALLOC guard page                   16K        4 
MALLOC_LARGE (reserved)            512K        2         reserved VM address space (unallocated)
STACK GUARD                          4K        1 
Stack                             16.0M        1 
VM_ALLOCATE                       4872K       21 
__DATA                            2547K       87 
__DATA_CONST                      2986K       38 
__DATA_DIRTY                        95K       23 
__LINKEDIT                       494.0M       67 
__OBJC_RO                         60.5M        1 
__OBJC_RW                         2451K        2 
__TEXT                           115.0M       84 
__UNICODE                          588K        1 
shared memory                        8K        2 
===========                     =======  ======= 
TOTAL                            772.9M      390 
TOTAL, minus reserved VM space   772.4M      390 

Model: iMac19,1, BootROM VirtualBox, 2 processors, Unknown, 2.9 GHz, 8 GB, SMC 2.3f35
Graphics: spdisplays_display, 5 MB
Memory Module: Bank 0/DIMM 0, 8 GB, DRAM, 1600 MHz, innotek GmbH, -
Network Service: Ethernet, Ethernet, en0
Network Service: Ethernet Adaptor (en1), Ethernet, en1
Serial ATA Device: VBOX HARDDISK, 137.44 GB
Serial ATA Device: VBOX CD-ROM
USB Device: USB Bus
USB Device: USB Tablet
USB Device: USB Keyboard
USB Device: USB 2.0 Bus
Thunderbolt Bus:

Enhancement request: enable user dictionary

In IRIS, it is possible to influence sentence detection and attribute detection (a.o. negation and sentiment) through a user dictionary. It would be helpful to have that functionality in the Python interface too.

ICUDIR not necessary on some platforms

On some platforms (e.g. a typical Linux distribution), the default package manager will install ICU into the default include/library paths, which makes checking for ICUDIR redundant and indeed impossible to satisfy, as there's no "root" of ICU.

The root Makefile needs some basic logic to construct the ICU lib/include flags (which may both be null in the typical Linux case)

iKnow indexing in genRAW: space is added in preprocessor items

Preprocessor items such as "he's", 'we're", etc. are split in the preprocessor to be able to process "he" and "we" differently from "'s" and "'re". This space between "he" and "'s" and "we" and "'re" shouldn't be visible in the RAW output, but it currently is.
input:
despite what we're hearing

current output:
despite what 're hearing

desired output:
despite what 're hearing

Engine C/C++ external APIs should convert to UTF-8

iknow/modules/iknowpy/engine.pxd

Line 30 in 60b1c74

ctypedef unsigned short Entity_Ref

This line suggests the engine API is exposing our internal UTF-16 string representation. Though cognizant of the performance hit, I think we need an API where all string content is ordinary char*'s of UTF-8 (maybe not the only one, but certainly the one that code like @adang1345's would use)

Stop lang_update in case of lexrep double with conflicting labels

During lang_update processing a lexrep double with conflicting labels currently only results in a notification. Instead it should stop the language compilation, so the conflict can be resolved.

Example:
;;letošními;;CSAdj;CSAdjInstrPl;CSBeginTime;
;;letošními;;CSVerb;

Message:
conflicting double: leto┼ín├¡mi Labels= CSVerb; conflicts with CSAdj;CSAdjInstrPl;CSBeginTime;

Result:
Successfully installed iknowpy

Trace output:
LexrepIdentified:letošními:CSVerb;

Certainty spans from XML output are not visualised

It seems that the style sheet iKnowXML.xsl does not pick up the spans for Certainty attributes, although they are present in the XML file.
Both genXML.py and iKnowXML.xsl can be found under "language_development".

The attached file is actually an XML file. It contains an example.

en_trace.txt

Python API uses opaque integer values for attribute types

'sent_attributes': [{'entity_ref': 10,
'marker_': 'not',
'offset_start_': 64,
'offset_stop_': 67,
'type_': 1,
'unit2_': '',
'unit_': '',
'value2_': '',
'value_': ''}]}]}

These should be human-readable strings.

handleCSV.py script removes certain multi-lexreps previously expanded by DELVE

I came across a situation where the handleCSV.py script does not restore multi-lexreps expanded by the DELVE code.

Original line in JP_jp_lexreps.csv:

;;(を手に取|を手にと);;JPParticleWO;-;JPVerbOther;Join;Join;

lexreps.csv generated by DELVE for the same line reads as follows:

/**** Rewritten by DELVE, ;;を手に取;;JPParticleWO;-;JPVerbOther;Join;Join;
;;を手に取;;JPParticleWO;Lit_を;-;JPVerbOther;Join;Join
/**** Rewritten by DELVE, ;;を手にと;;JPParticleWO;-;JPVerbOther;Join;Join;
;;を手にと;;JPParticleWO;Lit_を;-;JPVerbOther;Join;Join
/* Expanded previously by DELVE....;;(を手に取|を手にと);;JPParticleWO;-;JPVerbOther;Join;Join;

lexreps.csv after running handleCSV.py:
=> lexreps associated with the line no longer exists

I've checked several other expanded multi-lexrep entries, but so far this seems to be the only instance with this issue.

Japanese: Request for some improvements of entity extraction algorithm in terms of more accurate analysis of medical colloquial text

I’m Rei Noguchi from Gunma University Hospital, and I really appreciate the prompt implementation of “negation expansion “ in Japanese (#33). I’m now trying to analyze daily progress notes in electronic medical records, and unlike discharge summaries described as stylized documents, the progress notes are often written in a colloquial or narrative style and includes incomplete sentences, resulting in some problems.
To analyze these casual text in the medical field more accurately, I would like to propose the following three improvements.

1. Extract a word followed by +/- without parentheses as a single entity
2. Resolve the different entity extraction results depending on punctuation mark (Japanese period ”。” or just a space)
3. Detect time expression

The details are as follows.

1. Extract a word followed by +/- without parentheses as a single entity

The previous improvement (#31) enabled Katakana or numbers enclosed in parentheses to be concatenated with the preceding Concept as a single entity. This works in many cases, especially in stylized documents, and is useful for identifying the relation of negation. (e.g. heart murmur(-) → no heart murmur)
However, in informal text such as daily progress notes, there is a problem. Some entities are followed by +/- without parentheses. Even in these cases, +/- symbol should be concatenated with the preceding Concept as a single entity because doctors describe the text with the same intention, and this enables us to clarify the relation of negation. Is this improvement technically possible?
Importantly, in many cases of these, there is often no space between an entity and +/-, whereas there is often half-width or full-width space after +/- to separate from the next entity.

2. Resolve the different entity extraction results depending on punctuation mark (Japanese period ”。” or just a space)

“熱はなし”（no fever）is extracted as a single entity at this time, probably because this phrase includes all hiragana homonym “はなし”. In contrast, if there is a punctuation mark (i.e. Japanese period “。”) in the end of the phrase, like “熱はなし。”, the phrase is divided into multiple entities. The latter case seems like a good option in terms of identifying negation relation.
However, because doctors often end a sentence with just a “space” in place of Japanese period “。”, I think that a phrase ending with a space should be divided into multiple entities in the same manner as “。”.

3. Detect time expression

In medical progress notes, there are many time expressions, so that it’s very useful that they could be identified by something like markers.
Some examples:

2015-06-16 12:47:42 -> 2015-06-16 (Date) + 12:47:42 (Time) or 2015-06-16 12:47:42 (Datetime)
「12月ごろ花粉症の内服処方」 (extracted as a single entity) -> 12月ごろ花粉症の内服処方 (Month)
(in English: Around December prescription of medication for hay fever)

iKnow is an indispensable tool especially in a medical field, where there are many unknown words.
I realize the great value of iKnow and expect further improvement.
Thank you for your help.

Japanese: suggestion for simple Negation expansion (for Adjectival Verbs ない & its conjugated forms)

NOTE: this is a suggestion/request that came from Dr. Rei Noguchi @ Gunma University Hospital.

BACKGROUND

In iKnow, Negation expansion is normally done using the Path, which for non-Japanese language is the word order in the Sentence. Since we developed Entity Vector as a special-case Path for Japanese, the order of entities within the Path is mostly different from how they appear within the Sentence. For this reason, we have not yet implemented Negation expansion beyond the boundaries of the entity that includes the Negation marker.

For example:

今週はレッスンはない。- There is no lesson this week.
Entity Vector - レッスンない今週
The two particles は are NonRelevant.

Because of the sentence structure, the word ない, which is present form of the Adjectival Verb meaning "doesn't exist" and a Negation marker, does not expand beyond itself. This is a problem, since it's no possible to know "what" is being negated without reading the entire sentence.

SIMPLE EXPANSION EXPERIMENT

Dr. Noguchi used the current iKnow Python interface to experiment with his medical data, which often uses simple sentence structures that almost resembles the format: XXX は (or が) ない (or なかった - past form of the same Adjectival Verb meaning "didn't exist").

XXXはない。
XXXがない。
XXXはなかった。
XXXがなかった。

EXPERIMENT:
In cases like above, expand Negation to the left to the Concept before the particle は or が, i.e., in above examples would be "XXX".

In addition, there are some sentences where XXX are replaced by "XXX1やXXX2”, meaning "XXX1 and/or XXX2". In such case, expand Negation to the left, all the way to the Concept before the particle や, i.e., "XXX1" (the first Concept).

His experiment suggested that, at least for his data, such expansion implementation is normally semantically correct and would give more meaningful result to his machine learning work, since it is clearer what exists and what doesn't exist. (For example: There was no fever vs. Patient had fever.)

INITIAL DISCUSSION

This approach only works when the sentence structure is as simple as above (in clinical or medical text). In more complex sentences, it's possible that XXX is part of a subordinate clause, in which case it would be more desirable to expand even further to the left.
However, we have heard from various customers through the years that, it would be desirable to see the "link" between the Adjectival Verb and what is being modified. This is one of such examples. One idea was to enable Path (i.e., CRC-like Path) instead of Entity Vector and then make は and が PathRelevant, but it's not clear how much language model work is involved after such code change.
Better Negation expansion has been a longstanding task for Japanese. It may be a good idea to start small (such as in this suggestion), and improve further as we go.

TECHNICAL APPROACHES

There are two different ways Negation expansion can be implemented.

No change in Path mechanism, i.e., use Entity Vector
- No technical work involved
- In the language model, add Negation marker to XXX and particle, since NegStop/NegBegin will not do anything.
- This approach is not really creating a span but rather 3 separate entities (Concept, NonRelevant, Concept) with Negation Marker. => Is this acceptable for Dr. Noguchi? If so, is it a good approach in the long run? If not, we may need to make the entire thing a Concept. Is that acceptable...?
Add ability to select EV vs. CRC Path
- technical work is involved
- In the language model, NegStop/NegBegin can be used, thus creating "real" span.
- This was initially suggested back in December, when we observed that certain types of medical/clinical notes use more straightforward (CRC-like) sentence structure.
- It may be a problem if user wants Negation expansion but also want to use EV...

The first approach is quicker, but may not be as useful longer-term. Any comment or additional consideration that I'm missing? @ISC-SDE @bdeboe @JosDenysGitHub @woodfinisc

IkIndexDebugList illegal references

iknow/modules/core/src/IkIndexDebugList.cpp

Line 102 in 9b12ef9

Utf8List &trace = ToList(idx, j, kb);

This won't compile for me on Mac (nor does it look like it should):

/Users/woodfin/git/iknow/modules/core/src/IkIndexDebugList.cpp:102:27: error: non-const lvalue reference to type 'list<...>' cannot bind to a temporary of type 'list<...>' Utf8List &trace = ToList(idx, j, kb);

intersystems / iknow Goto Github PK

iknow's People

Contributors

Stargazers

Watchers

Forkers

iknow's Issues

BACKGROUND

WHAT WE WOULD LIKE TO EXPERIMENT

1. Extract a word followed by +/- without parentheses as a single entity

2. Resolve the different entity extraction results depending on punctuation mark (Japanese period ”。” or just a space)

3. Detect time expression

BACKGROUND

SIMPLE EXPANSION EXPERIMENT

INITIAL DISCUSSION

TECHNICAL APPROACHES

Recommend Projects

Recommend Topics

Recommend Org