intersystems / iknow
Community development repository for iKnow
License: MIT License
The Certainty attribute in the English language model has a marker, a span, and a level. In the m_index property of the Python interface, the marker can be found in ['sent_attributes'] and the span in ['path_attributes']. The level, currently either 0 (uncertain) or 9 (certain), should also appear in ['sent_attributes'], but it is missing.
Example:
Input = "This might be a problem."
['sent_attributes'] = [{'type': 'Certainty', 'offset_start': 7, 'offset_stop': 12, 'marker': 'might', 'value': '', 'unit': '', 'value2': '', 'unit2': '', 'entity_ref': 1}]
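The gap can be demonstrated directly on the attribute item from the example above. Note the 'level' key name is an assumption based on this issue's description, not a documented field name:

```python
# The attribute item from the example above; per this issue, a 'level' key
# (an assumed name) should be present for Certainty items but is not.
sent_attributes = [{'type': 'Certainty', 'offset_start': 7, 'offset_stop': 12,
                    'marker': 'might', 'value': '', 'unit': '', 'value2': '',
                    'unit2': '', 'entity_ref': 1}]

# Collect Certainty items that lack the level field.
missing_level = [a for a in sent_attributes
                 if a['type'] == 'Certainty' and 'level' not in a]
print(len(missing_level))  # 1
```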
Hi Aohan,
You've probably seen these workflow permission issues with the autoupdate script, but I'm wrapping them in an issue so they're tracked:
Pushing pull request branch to 'origin/autoupdate-buildcache'
/usr/bin/git push --force-with-lease origin HEAD:refs/heads/autoupdate-buildcache
To https://github.com/intersystems/iknow
! [remote rejected] HEAD -> autoupdate-buildcache (refusing to allow a GitHub App to create or update workflow `.github/workflows/dependencies.sh` without `workflows` permission)
error: failed to push some refs to 'https://github.com/intersystems/iknow'
Error: The process '/usr/bin/git' failed with exit code 1
Example failure: https://github.com/intersystems/iknow/runs/1471344935?check_suite_focus=true
The Frequency attribute for 'daily' is missing in the RAW output for the following example:
input:
60 mg daily
-> attributes: attr type="measurement" literal="60 mg daily" token="60 mg" value="60" unit="mg"
The Frequency attribute is however present in the genTrace output:
input:
60 mg daily
-> index="daily" labels="ENCon;ENFrequency(a:Entity,Frequency,);ENInMeasspan;
When the token with Measurement attribute is removed, 'daily' does get the Frequency attribute in the RAW output:
input:
daily
-> attr type="frequency" literal="daily." token="daily."
When an unsupported language is passed as the second argument to iKnowEngine::index, a segmentation fault occurs. It would be better to throw an exception, provide an error code, or document a precondition.
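Until the engine handles this itself, callers can validate the language up front, much like checking membership in the "Languages Set" shown elsewhere in this document. This is a minimal sketch using a stand-in stub, not the real iknowpy class; the class body and supported set are assumptions:

```python
# Stand-in stub for illustration only; the real iknowpy engine exposes a
# similar language set, but this class is not its actual implementation.
class StubEngine:
    supported = {'en', 'ja', 'nl'}  # assumed subset of supported languages

    def index(self, text, language):
        return {'sentences': []}

def safe_index(engine, text, language):
    # Guard against the segfault path by rejecting unknown languages early.
    if language not in engine.supported:
        raise ValueError(f"unsupported language: {language!r}")
    return engine.index(text, language)

engine = StubEngine()
print(safe_index(engine, "hello", "en"))  # {'sentences': []}
```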
The iKnow engine does not build with Xcode 12, which emits a warning that earlier Xcode versions do not. The problem is that (value_int > 0 || value_int <= 9)
is always true. Is there a mistake in this if-statement?
clang++ -std=c++14 -D_DOUBLEBYTE -DCACHE_COM_DISABLE -c -arch x86_64 -mmacosx-version-min=10.9 -stdlib=libc++ -DMY_BIG_ENDIAN=__BIG_ENDIAN__ -D_ISC_BIGENDIAN=__BIG_ENDIAN__ -DBIT64PLAT=__LP64__ -DSIZEOF_LONG=8 -DMACOSX -fPIC -DUNIX -stdlib=libc++ -g -O3 -Wno-long-long -Werror -Wall -Wextra -pedantic-errors -fdiagnostics-show-option -Wno-parentheses -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-local-typedef -Wno-unknown-warning-option -I/Users/travis/build/adang1345/iknow/modules/shell/src -I/Users/travis/build/adang1345/iknow/modules/shell/src/SDK/headers -I/Users/travis/build/adang1345/iknow/modules/base/src/headers -I/Users/travis/build/adang1345/iknow/modules/ali -I/Users/travis/build/adang1345/iknow/modules/core/src/headers -I/Users/travis/build/adang1345/iknow/modules/aho -I/Users/travis/build/adang1345/iknow/shared/System/unix -I/Users/travis/build/adang1345/iknow/shared/System -I/Users/travis/build/adang1345/iknow/shared/Utility -I/Users/travis/build/adang1345/iknow/kernel/common/h -I/Users/travis/build/adang1345/iknow/thirdparty/icu/include -o /Users/travis/build/adang1345/iknow/built/macx64/release/libiknowshell/Process.o /Users/travis/build/adang1345/iknow/modules/shell/src/Process.cpp
In file included from /Users/travis/build/adang1345/iknow/modules/shell/src/CompiledKnowledgebase.cpp:1:
In file included from /Users/travis/build/adang1345/iknow/modules/shell/src/CompiledKnowledgebase.h:3:
In file included from /Users/travis/build/adang1345/iknow/modules/shell/src/SharedMemoryKnowledgebase.h:17:
/Users/travis/build/adang1345/iknow/modules/shell/src/KbRule.h:146:24: error: overlapping comparisons always evaluate to true [-Werror,-Wtautological-overlap-compare]
if (value_int > 0 || value_int <= 9) lexrep_length_ = static_cast<short> (value_int);
~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
1 error generated.
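The warning itself points at the likely fix: with ||, every integer satisfies at least one of the two comparisons, so the guard filters nothing. Presumably a range check with && was intended. A quick Python illustration (the "intended" form is an assumption about the author's intent):

```python
def buggy(value_int):
    # Mirrors the flagged C++ condition: true for every integer.
    return value_int > 0 or value_int <= 9

def intended(value_int):
    # Presumed intent: a bounded range check (&& instead of ||).
    return 0 < value_int <= 9

print(all(buggy(v) for v in range(-1000, 1000)))      # True
print([v for v in range(-3, 13) if intended(v)])      # [1, 2, ..., 9]
```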
Compiling an optimized version of the Japanese model on Mac with clang consumes just about all the memory my system can give it (32 GB).
I've tried to reduce this to a test case I could report to the clang team, but without success to date.
I think we need to adjust the Makefile for the Japanese model so it does not attempt to optimize (-O0) on any platform; it causes less severe but notable problems on Linux with gcc as well, IIRC.
Importing iknowpy fails if the machine does not have the Visual C++ Redistributable for Visual Studio 2015 installed.
>>> import iknowpy
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Program Files\Python36\lib\site-packages\iknowpy\__init__.py", line 23, in <module>
from .engine import iKnowEngine, UserDictionary
ImportError: DLL load failed: The specified module could not be found.
Specifically, we need the files msvcp140.dll, vcruntime140.dll, vcruntime140_1.dll, and concrt140.dll. I can think of 3 possible solutions:
One of them is to have iknowpy depend on the msvc-runtime package (https://pypi.org/project/msvc-runtime/), which installs the necessary DLLs into the Python instance.
The input below contains some Greek characters which seem to mess up the detection of word boundaries. Boundaries (spaces) appear at the wrong positions, causing splitting and incomplete words. This is especially clear at the end of the sentence: the first 3 characters of the second sentence become part of the first sentence. The shift continues until the end of the input file.
The input file is UTF-8 encoded, as required.
input:
Syloïde blijkt een vergelijkbaar of zelfs groter effect te hebben op sommige parameters (bijv. 𝑎2, 𝑎3, 1 𝑡1 en 1 𝑡3) van de compressievergelijking. Dit vergelijkbare effect wordt echter vaak alleen bereikt bij een hogere concentratie Syloid in vergelijking met magnesiumstearaat.
output:
S1: Syloïde blijkt een vergelijkbaar of zelfs groter effect te hebben op sommige parameters (bijv. 𝑎2, 3, 1 𝑡1 en 1 3) van de om ressievergelijking. Dit
S2: ver elijkbare effect wor t ech er vaa < all> en ber ikt bij een hog re concentratie Syloid in ergelijking met mag esiumstearaat.
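One plausible explanation (an assumption, not a confirmed diagnosis): characters such as 𝑎 (U+1D44E MATHEMATICAL ITALIC SMALL A) lie outside the Basic Multilingual Plane and occupy two UTF-16 code units each. If offsets are computed in UTF-16 code units but then applied as character positions, every such character shifts all following boundaries by one, which would produce exactly the cumulative drift seen above:

```python
# One code point, but two UTF-16 code units: a classic source of
# off-by-one offset drift in text pipelines.
ch = '\U0001d44e'  # the 𝑎 character from the input
utf16_units = len(ch.encode('utf-16-le')) // 2
print(len(ch), utf16_units)  # 1 2
```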
Additional markers for negation, sentiment, etc. can be defined through a user dictionary. Please add this functionality for certainty markers too.
With 363be5d, I have temporarily disabled manylinux2010_x86_64 builds due to pypa/manylinux#836. When that issue is resolved, this change should be reverted.
After adding some PathRelevant entities and simple path expansion, I've compared the outputs between IRIS NLP and iknowpy. I've found one difference which seems to result from the different ways that IRIS and iknowpy identify lexreps.
Sentence: また、大川小のある釜谷地区では住民と在勤者、来訪者計232人のうち、181人が犠牲となったとの調査結果を報告。
Lexrep identification for the part "232人のうち、181人が" in IRIS:
Lexrep("232")=Numeric
Lexrep("人")=JPCon+JPCount+JPRule3437+Lit_人
Lexrep("のうち")=JPParticlePREPO
Lexrep("、")=JPComma+Lit_、
Lexrep("181")=Numeric
Lexrep("人")=JPCon+JPCount+JPRule3437+Lit_人
Lexrep("が")=JPga+Lit_が
Lexrep identification for the same part in iknowpy:
LexrepIdentified:232:Numeric;
LexrepIdentified:人:JPCon;JPRule3437;JPCount;Lit_人;
LexrepIdentified:のうち:JPParticlePREPO;
LexrepIdentified:、:JPComma;Lit_、;
LexrepIdentified:181人:JPCon;JPNumber;Lit_1人;
LexrepIdentified:が:JPga;Lit_が;
As can be seen, "181人" is identified differently: IRIS identifies the whole chunk of numbers, "181", first, whereas iknowpy identifies the lexrep "1人" first. This difference results in different indexing results for the character "が", which can now sometimes be PathRelevant rather than NonRelevant. Following the general left-to-right principle, the IRIS behavior should be kept.
NOTE: this is a request that came from Dr. Torikai & Dr. Noguchi @ Gunma University Hospital.
For Japanese text, we've implemented automatic detection of Furigana. The specifications are as follows:
For example:
This implementation works well in most cases, as such text is just another way of describing (or supplementary information for) the Concept that immediately precedes the set of parentheses. If the text inside the parentheses contains multiple types of characters or consists of all alphabetic characters, the implementation does not apply, as likely the information is more than just a repeat of the preceding Concept.
When the specification was originally designed back in 2013, there was a request to make this feature a switch that could be turned off, but we have no such switch to date.
In the machine learning experiment Dr. Noguchi is conducting, he often comes across names of medications in the form GENERIC_NAME (PRODUCT_NAME), e.g., グリメピリド(アマリール)
Since most medication names are written in Katakana, the product name is almost always indexed as NonRelevant.
In the iKnow sense, making アマリール in the above example NonRelevant may not be a problem, since it is essentially repeating グリメピリド. In fact, considering アマリール to be a separate Concept may give more weight to the medication names than we need.
However, Dr. Noguchi is wondering if his model can give better results if the Furigana text is tweaked. There are a couple of different ways he wants to experiment:
Either way would have impact on the Entity Vector, proximity and dominance, but his experiment may not use them.
Currently, the Furigana implementation lives outside the language model CSV files, i.e., the iKnow engine itself does the work. We need Jos's help to 1) allow the user to switch off the Furigana implementation; and 2) let the user modify the Furigana implementation so that it still applies to certain types of characters but not all of the default ones.
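The requested switch could look like the sketch below. This is a hypothetical detector, not the engine's actual implementation: the "single character type" test from the specification is approximated here by a Katakana-only check, and all function names are invented:

```python
import re
import unicodedata

def katakana_only(s):
    # Stand-in for the "single character type" test described above:
    # every character must be Katakana (incl. the prolonged sound mark ー).
    return bool(s) and all('KATAKANA' in unicodedata.name(c, '') for c in s)

def furigana_spans(text, enabled=True):
    # Hypothetical toggleable detector: return (preceding text, inner text)
    # pairs for parenthesized runs that look like Furigana.
    # 'enabled' models the requested on/off switch.
    if not enabled:
        return []
    return [(m.group(1), m.group(2))
            for m in re.finditer(r'([^（(]+)[（(]([^）)]+)[）)]', text)
            if katakana_only(m.group(2))]

print(furigana_spans('グリメピリド（アマリール）'))
# [('グリメピリド', 'アマリール')]
```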
Example 1: rule 2158 in the English language model
2158;50;typeRelation|.ENArtPosspron|*typeConcept+^ENList|ENComma|.ENArtPosspron|typeConcept+^ENNegation|ENAndOrBut+^"but"|.ENArtPosspron|typeConcept+^ENNegation|ENColon:SEnd;|+ENList|+ENList|+ENList|+ENList|+ENList|-ENNegStop+ENList|+ENList|+ENList|;;
This rule has 10 elements (9 + SEnd), 3 of which are optional.
The rule fires for
"of sneezing, a sore throat and fatigue." -> 9 elements (8 + SEnd)
but not for:
"of sneezing, a headache and fatigue." -> 8 elements (7 + SEnd)
Example 2: rule 2377 in the English language model
2377;65;ENCertainty|.ENNegation|ENPBegin+ENCertStop+^ENConj|.^ENPBegin+^SEnd|ENPBegin:SEnd;||-ENCertStop|*|+ENCertStop;;
This rule has 5 elements, 2 of which are optional.
The rule fires for
"perhaps what else" -> 4 elements (3 + SEnd)
but not for:
"perhaps what" -> 3 elements (2 + SEnd)
For more affected rules and examples, please contact me directly.
In IRIS, it is possible to use literal labels, which get collected automatically from the rules and added to lexreps. It would be very helpful to have that functionality in the Python interface as well.
I was doing some testing with 32-bit builds on Linux (where IKNOWPLAT is set to lnxrhx86), and I get an exception when I run iknowenginetest.
$ ./iknowenginetest
*** Unit Test Failure ***
No knowledgebases with rules loaded.
GDB gives the following information for the point the exception is thrown.
(gdb) bt
#0 0xf741f16a in __cxa_throw () from /lib/i386-linux-gnu/libstdc++.so.6
#1 0xf7050c9c in iknow::shell::CProcess::CProcess (this=0xffffccbc, languageKbMap=std::map with 1 element = {...}) at /iknow/iknow/modules/shell/src/Process.cpp:57
#2 0xf760b9f6 in iKnowEngine::index (this=0xffffce44,
text_input=u"こんな台本でプロットされては困る、と先生言った。志望学部の決定時期につい経営関し表()済示すだ外国人入試スポーツ推薦標大きが小くミリディングを避けめ除あ概観分かど区おも高校年最普通点一方般セタ利用合格後やう群率み非常達ら解釈注意要数値以上受験段階併願より発者ち中創価ば良考え「ま来」多存在ル勉対問題抱可能性十力レベ動機面見倣ろ",
utf8language="ja", b_trace=false) at /iknow/iknow/modules/engine/src/engine.cpp:326
#3 0x080543e4 in testing::iKnowUnitTests::test1 (this=0xffffcf37, pMessage=0x8058b58 "Japanese output must generate entity vectors")
at /iknow/iknow/modules/enginetest/iKnowUnitTests.cpp:105
#4 0x0805363e in testing::iKnowUnitTests::runUnitTests () at /iknow/iknow/modules/enginetest/iKnowUnitTests.cpp:22
#5 0x0804e815 in main (argc=1, argv=0xffffd1a4) at /iknow/iknow/modules/enginetest/enginetest.cpp:109
(gdb) frame 1
#1 0xf7050c9c in iknow::shell::CProcess::CProcess (this=0xffffccbc, languageKbMap=std::map with 1 element = {...}) at /iknow/iknow/modules/shell/src/Process.cpp:57
57 in /iknow/iknow/modules/shell/src/Process.cpp
(gdb) p languageKbMap
$1 = std::map with 1 element = {[u"ja"] = 0xffffccf0}
(gdb) set cit = languageKbMap.begin()
(gdb) p cit->second
$2 = (iknow::core::IkKnowledgebase *) 0xffffccf0
(gdb) p *(cit->second)
$3 = {_vptr.IkKnowledgebase = 0xf707da5c <vtable for iknow::shell::CompiledKnowledgebase+8>, cache_ = 0x0, m_strIdentifier = ""}
(gdb) p cit->second->RuleCount()
$4 = 0
Current situation: compiler_report.log is generated in the release/bin directory when lang_update.bat is run. It overwrites the existing compiler_report.log, even if that contains data from another language model.
Request: rename the file to xx_compiler_report.log, where xx is the language code of the concerned model, and put it in the language_development folder, where it is needed for genTrace.py.
When an entity contains more than one marker of the same type, e.g. two Negation markers or two DateTime markers, the m_index property in the Python interface outputs them as two separate items. It would be better to collect them into one item.
Example 1: Il n'y avaient jamais des chiens.
concerned entity: n'y avaient jamais
attribute output:
[{'type': 'Negation', 'offset_start': 5, 'offset_stop': 8, 'marker': "n'y", 'value': '', 'unit': '', 'value2': '', 'unit2': '', 'entity_ref': 1},
{'type': 'Negation', 'offset_start': 17, 'offset_stop': 23, 'marker': 'jamais', 'value': '', 'unit': '', 'value2': '', 'unit2': '', 'entity_ref': 1}]
Example 2: These reports are for the 1997-1998 academic year.
concerned entity: 1997-1998 academic year
attribute output:
[{'type': 'DateTime', 'offset_start': 28, 'offset_stop': 37, 'marker': '1997-1998', 'value': '', 'unit': '', 'value2': '', 'unit2': '', 'entity_ref': 4}, {'type': 'DateTime', 'offset_start': 47, 'offset_stop': 52, 'marker': 'year.', 'value': '', 'unit': '', 'value2': '', 'unit2': '', 'entity_ref': 4}]
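A sketch of the requested merging behavior, operating on the attribute dicts shown in these examples: items with the same type and entity_ref are collapsed into one, combining the markers and widening the span. The exact merge policy (joined marker, min/max offsets) is an assumption about what the combined item should contain:

```python
from collections import defaultdict

def merge_attributes(attrs):
    # Group items by (type, entity_ref), then collapse each group into one
    # item with a combined marker and the widest offset span.
    grouped = defaultdict(list)
    for a in attrs:
        grouped[(a['type'], a['entity_ref'])].append(a)
    merged = []
    for items in grouped.values():
        item = dict(items[0])
        item['marker'] = ' '.join(i['marker'] for i in items)
        item['offset_start'] = min(i['offset_start'] for i in items)
        item['offset_stop'] = max(i['offset_stop'] for i in items)
        merged.append(item)
    return merged

attrs = [
    {'type': 'Negation', 'offset_start': 5, 'offset_stop': 8, 'marker': "n'y",
     'value': '', 'unit': '', 'value2': '', 'unit2': '', 'entity_ref': 1},
    {'type': 'Negation', 'offset_start': 17, 'offset_stop': 23, 'marker': 'jamais',
     'value': '', 'unit': '', 'value2': '', 'unit2': '', 'entity_ref': 1},
]
merged = merge_attributes(attrs)
print(merged[0]['marker'], merged[0]['offset_start'], merged[0]['offset_stop'])
# n'y jamais 5 23
```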
Currently, the Makefiles require an IKNOWPLAT environment variable set to one of a few InterSystems-specific platform identifiers. This should be easy to replace with some generic definitions with good defaults for e.g. CXX.
The iknow engine supports several attributes: negation, time, certainty, measurement, and sentiment. However, customers may benefit from a 'generic' attribute that they can use for specific patterns in their data.
Requirements:
This issue happens when building iKnowEngineTest or iKnowALI on Japanese Windows with a Japanese Visual Studio installation, basically the setup that most Japanese users would have.
The warning is "The file contains a character that cannot be represented in the current code page (932). Save the file in Unicode format to prevent data loss." It is reported for line 3153 in uchar.h: u_isWhitespace(UChar32 c);
The build can continue if "Treat warnings as errors" is set to No, but why this happens needs to be investigated.
This is the iKnow standalone implementation of a request to manipulate certainty levels in rules processing, as described on the ISC Confluence page: https://usconfluence.iscinternal.com/pages/viewpage.action?spaceKey=ILT&title=Certainty+Levels
Part 1 : select lexreps based on certainty level conditions (rule matching).
Part 2 : manipulate lexrep certainty levels (rule output actions).
Part 3 : the generic "Certainty" label, and how it relates to certainty levels.
Part 4 : joining lexreps: how to handle certainty levels.
Rules containing the pattern ".^SEnd|SEnd" do not fire on that pattern.
Example:
Rule
2363;65;SBegin|ENCertBegin|","|ENPBegin+ENCertStop|.^ENPBegin+^SEnd|ENPBegin:SEnd;|||-ENCertStop||+ENCertStop;;
Input
Clearly, regional measurements can provide more detailed information about morphologic changes.
-> Actual pattern = ".^ENPBegin+^SEnd|SEnd" -> rule doesn't fire
Input
Clearly, regional measurements can provide more detailed information about morphologic changes that cannot be gained by globally averaged evaluations alone.
-> Actual pattern = ".^ENPBegin+^SEnd|ENPBegin" -> rule does fire
The second example shows that the rule is applicable for this input.
The UIMA standard enables interoperability between different NLP and more general unstructured-data processing tools. The InterSystems IRIS Data Platform, where this repo has its roots, has supported calling the iKnow engine through the UIMA interface, and conceptually that would be a very reasonable entry point to publish to this open source project as well.
This placeholder is more of an open question than an outright commitment or project plan: UIMA has a Java-based implementation, and the UIMACPP bridge we'd have to lean on is not seeing much dev activity. So we're eager to see +1s, or to have potential users chime in on their needs, before we plan any dev work.
Running lang_update.bat modifies modules\aho\inl\ja\kb_data.inl even if the language model has not changed.
When iKnow processes Japanese text, it's possible that a Concept doesn't get any EntityVector attribute, such as Subject, Object, Topic, OtherEntity, or DateTime (which all include EVValue). In such a case, iKnow assigns the lowest-priority EV value so that the entity still gets included in the Entity Vector.
For example:
吊(つ)り戸棚からはだし類が次々と15袋以上出てくる。
Since the Furigana (つ) interferes with the grammatical interpretation of the sentence, the character 吊 cannot receive any EV attribute and therefore receives the lowest EV priority.
This happens mostly when the sentence's format is unconventional, like in this example, but it's possible that a new rule that doesn't take EVs into consideration causes a similar situation.
In the IRIS version of genTrace, the generated trace includes a line like below to indicate which entity of which sentence within the input file was a Concept without EV attribute:
*** MissingEntityVector !Lexrep("吊")=JPVerbOther
However, when the genTrace.py script is run against a file that includes such a sentence, "MissingEntityVector" appears after the input file name in the Command Prompt, but there is no further information. The generated trace log file doesn't include any additional information about "MissingEntityVector" either. This is a problem, since it will not be possible to identify which entity/sentence generates the warning.
Rules are organized in "phases". The label definitions contain a list of phases in which the label can be used.
If a label is used in a rule in a phase for which it is not defined, the rule won't fire, because the input pattern is not found. Could the engine emit a warning about the mismatch in phases?
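Such a warning could come from a simple cross-check of each rule's phase against the phases declared for its labels. A toy sketch; the label names, phase numbers, and registry shape are invented for illustration:

```python
# Toy phase registry: which phases each label is declared for.
label_phases = {'ENCertainty': {60, 65}, 'ENNegation': {65}}

def labels_out_of_phase(phase, labels):
    # Return labels that are defined, but not for this rule's phase;
    # these are the cases that should trigger a warning.
    return [lab for lab in labels
            if lab in label_phases and phase not in label_phases[lab]]

print(labels_out_of_phase(60, ['ENCertainty', 'ENNegation']))  # ['ENNegation']
```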
I'm trying to follow the instructions (currently ISC-local but should be posted here) to build the iknowpy integration. But it looks like the tool is invoking clang rather than clang++, which is very unhappy with all the C++ constructs in the iKnow source.
ambassador:iknowpy woodfin$ python3 setup.py build_ext --inplace
Compiling iknowpy.pyx because it changed.
[1/1] Cythonizing iknowpy.pyx
running build_ext
building 'iknowpy' extension
creating build
creating build/temp.macosx-10.14-x86_64-3.7
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk -I/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/include -I/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -I. -I../engine/src -I../core/src/headers -I../base/src/headers -I/usr/local/Cellar/icu4c/64.2/include -I/usr/local/include -I/usr/local/opt/[email protected]/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/include/python3.7m -c iknowpy.cpp -o build/temp.macosx-10.14-x86_64-3.7/iknowpy.o
In file included from iknowpy.cpp:651:
In file included from ./../core/src/headers/IkConceptProximity.h:10:
../base/src/headers/PoolAllocator.h:60:29: error: 'T' does not refer to a value
size_t alignment = alignof(T);
... <many more errors> ...
Although this issue is about a Python script, it actually concerns something that needs to be changed in the engine.
Where PreprocessToken introduces a space within a token (thus creating 2 tokens), the last character of the first part is replaced by a space in Sentence_found, e.g. PreprocessToken "It's" -> "It 's", Sentence_found "I 's".
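The symptom looks like a classic off-by-one in the splice that inserts the space: overwriting the character at the split position instead of inserting before it. A Python illustration of the reported behavior versus the expected one (the splice itself is an assumption about the cause, not the engine's actual code):

```python
token, pos = "It's", 2  # insert a space before the apostrophe

buggy = token[:pos - 1] + ' ' + token[pos:]  # overwrites the 't'
fixed = token[:pos] + ' ' + token[pos:]      # inserts without overwriting

print(buggy)  # I 's   <- the reported Sentence_found
print(fixed)  # It 's  <- the expected result
```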
This issue happens when building iKnowEngineTest or iKnowALI on Japanese Windows with a Japanese Visual Studio installation, basically the setup that most Japanese users would have.
Error C2001 - newline in constant
Error C2143 - missing ')' before 'if'
Error C2143 - missing ';' before 'if'
These errors occur in the file enginetest.cpp on the line that starts with if (language_code == "ja") return. It fails because of the Japanese text. I've found that some very short texts may work, but they are certainly not long enough to be called sentences:
お金 - works
お金をなくした - doesn't work
コーヒー - works
コーヒー代 - works
テキスト - doesn't work
ABCZ - works (all double-width characters)
ABCZ。 - doesn't work (same as above)
雅子 - doesn't work
まさこ - works
If more than one value and unit are present in the same Concept, only one value and one unit are shown in the RAW output. In the following example, only '7 ounces' gets an attribute and '7 pounds' doesn't.
input:
The baby weighs 7 pounds 7 ounces.
current output:
<attr type="measurement" literal="7 pounds 7 ounces." token="7 pounds 7 ounces." value="7" unit="ounces">
desired output:
<attr type="measurement" literal="7 pounds 7 ounces." token="7 pounds 7 ounces." value="7" unit="pounds" value2="7" unit2="ounces">
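The desired value/value2 split amounts to extracting every (value, unit) pair from the token rather than stopping at the first. A rough sketch of that extraction, not the engine's actual measurement parser (the regex is an assumption that handles simple "number unit" sequences only):

```python
import re

def measurement_pairs(token):
    # Extract every (value, unit) pair from a measurement token; a sketch
    # of the desired value2/unit2 behavior for simple tokens like this one.
    return re.findall(r'(\d+(?:\.\d+)?)\s*([A-Za-z]+)', token)

print(measurement_pairs('7 pounds 7 ounces.'))
# [('7', 'pounds'), ('7', 'ounces')]
```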
The IRIS / iKnow engine interface has traditionally been specified as C++ APIs. This was fine for internal use, as we controlled the entire build & deployment infrastructure.
But a pure C API would have advantages for cross-compiler compatibility & deployment. For integration with spaCy and other tools, we should consider supporting such an API, either exclusively or supplementing the "richer" C++ APIs.
With #37 now out of the way (thanks @JosDenysGitHub!), we'll want to have a means to seed (or even override?) this kind of attribute property through the User Dictionary, on top of the ones in the language models. The level property of the certainty attribute is the first and foremost example.
We'll need this first on the C++ side and can then expose it in the iknowpy.UserDictionary interface on the Python end.
During lang_update processing, a rule with a non-matching number of labels in the input and output goes unnoticed. The language model compilation should stop, so the conflict can be resolved.
Example:
;39;CSPrep+^"před":"po":"za"|CSNum+^CSPartTime|CSYear|^CSEndTime;||-CSTimeConcept-CSTime;
-> 4 input labels and 3 output labels
Result:
Successfully installed iknowpy
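A checker along these lines could catch the mismatch during lang_update. This is a rough sketch that assumes the ';'-delimited layout of the example above (id;phase;input;output;...) and simply counts '|'-separated slots; the real rule grammar (optional elements, ':'-alternatives) is more involved:

```python
def label_counts(rule_line):
    # Count '|'-separated slots in the input (parts[2]) and output
    # (parts[3]) sections, assuming the id;phase;input;output;... layout.
    parts = rule_line.split(';')
    return len(parts[2].split('|')), len(parts[3].split('|'))

rule = (';39;CSPrep+^"před":"po":"za"|CSNum+^CSPartTime|CSYear|^CSEndTime;'
        '||-CSTimeConcept-CSTime;')
n_in, n_out = label_counts(rule)
print(n_in, n_out)  # 4 3 -> mismatch, compilation should stop
```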
When I run the test Python interface program with Japanese text, I get the following, i.e., Proximity & Path are empty, which is not the case with English text. The Entity Vector (Path for Japanese) is probably the most important part of the output for Japanese, so please make it available.
(base) C:\Users\mohira\Documents>python test.py
Languages Set:
{'cs', 'sv', 'ru', 'ja', 'nl', 'de', 'fr', 'pt', 'uk', 'es', 'en'}
Input text:
これはiKnowエンジンへのPythonインターフェースです。
Index:
{'proximity': [],
'sentences': [{'entities': [{'dominance_value': 0.0,
'entity_id': 0,
'index': 'これ',
'offset_start': 0,
'offset_stop': 2,
'type': 'NonRelevant'},
{'dominance_value': 0.0,
'entity_id': 0,
'index': 'は',
'offset_start': 2,
'offset_stop': 3,
'type': 'NonRelevant'},
{'dominance_value': 9.223372036854776e+18,
'entity_id': 1,
'index': 'iknowエンジン',
'offset_start': 3,
'offset_stop': 12,
'type': 'Concept'},
{'dominance_value': 0.0,
'entity_id': 2,
'index': 'へ',
'offset_start': 12,
'offset_stop': 13,
'type': 'Relation'},
{'dominance_value': 0.0,
'entity_id': 0,
'index': 'の',
'offset_start': 13,
'offset_stop': 14,
'type': 'NonRelevant'},
{'dominance_value': 9.223372036854776e+18,
'entity_id': 3,
'index': 'pythonインターフェース',
'offset_start': 14,
'offset_stop': 28,
'type': 'Concept'},
{'dominance_value': 0.0,
'entity_id': 4,
'index': 'です',
'offset_start': 28,
'offset_stop': 30,
'type': 'Relation'},
{'dominance_value': 0.0,
'entity_id': 0,
'index': '。',
'offset_start': 30,
'offset_stop': 31,
'type': 'NonRelevant'}],
'path': [],
'path_attributes': [],
'sent_attributes': []}]}
During lang_update processing, a duplicate rule number goes unnoticed. The language model compilation should stop, so the conflict can be resolved.
Example:
852;39;^SBegin:CSPunctuation|"no"+CSCapitalInitial;*|CSCon+CSBeforeNumber;
852;39;"no"+CSCapitalAll;CSCon+CSBeforeNumber;
-> By first commenting out rule 852 and later reactivating it, rule number '852' ended up twice in the rules file
Result:
Successfully installed iknowpy
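Detecting this during lang_update only requires tracking the leading rule number of each line. A minimal sketch, assuming the rule number is the first ';'-separated field as in the example above:

```python
def duplicate_rule_numbers(lines):
    # Flag rule numbers (first field before ';') that occur more than once.
    seen, dups = set(), []
    for line in lines:
        num = line.split(';', 1)[0]
        if num and num in seen:
            dups.append(num)
        seen.add(num)
    return dups

rules = [
    '852;39;^SBegin:CSPunctuation|"no"+CSCapitalInitial;*|CSCon+CSBeforeNumber;',
    '852;39;"no"+CSCapitalAll;CSCon+CSBeforeNumber;',
]
print(duplicate_rule_numbers(rules))  # ['852']
```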
For alphabetic characters within the Entity Vector, the genRAW Python script currently uses normalized lowercase single-width characters, even if the original text uses uppercase or double-width alphabet. The expectation is to get the literal values, i.e., identical to the original text.
If an undefined label is used in lexreps.csv, the build of the language compiler fails with this error:
57>C:\iKnow_GH\modules\aho\inl\de\lexrep\MatchObjs.inl(1,22): error C2466: cannot allocate an array of constant size 0 [C:\iKnow_GH\modules\aho\model0_de.vcxproj]
It would be more convenient to get a warning like "undefined label DEVerbadj in lexreps.csv, line 32424".
During lang_update processing, a duplicate regular expression goes unnoticed. The language model compilation should stop, so the conflict can be resolved.
Example:
footnote;[\d{1,3}]
footnote;[\d{3}]
-> the same regular expression name was used twice
Result:
Successfully installed iknowpy
I am trying to add a run of ref_testing.py to the CI workflow. It runs smoothly on Windows and Linux but crashes the Python interpreter on Mac OS X.
Here is the script output.
GENERATING RAW OUTPUT
Processing cs_core.txt
language: cs
Processing ja_core.txt
language: ja
Processing de_core.txt
language: de
Processing ru_core.txt
language: ru
Processing en_core.txt
language: en
Processing uk_core.txt
language: uk
zsh: segmentation fault python3 ref_testing.py
And here are the crash details.
Process: Python [821]
Path: /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/Resources/Python.app/Contents/MacOS/Python
Identifier: Python
Version: 3.8.2 (3.8.2)
Build Info: python3-73040006000000~117
Code Type: X86-64 (Native)
Parent Process: zsh [676]
Responsible: Terminal [673]
User ID: 501
Date/Time: 2021-01-12 09:12:35.686 -0500
OS Version: macOS 11.1 (20C69)
Report Version: 12
Anonymous UUID: 4A4343A2-84B4-48A9-9F69-AD3834F1FA06
Time Awake Since Boot: 970 seconds
System Integrity Protection: disabled
Crashed Thread: 0 Dispatch queue: com.apple.main-thread
Exception Type: EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x00007f99a78ffffe
Exception Note: EXC_CORPSE_NOTIFY
Termination Signal: Segmentation fault: 11
Termination Reason: Namespace SIGNAL, Code 0xb
Terminating Process: exc handler [821]
VM Regions Near 0x7f99a78ffffe:
MALLOC_TINY 7f99a7700000-7f99a7800000 [ 1024K] rw-/rwx SM=PRV
-->
MALLOC_LARGE_REUSABLE 7f99a7900000-7f99a791f000 [ 124K] rw-/rwx SM=PRV
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 libiknowcore-36353c21.dylib 0x0000000108afce43 iknow::core::IkIndexProcess::FindNextSentence(iknow::core::IkIndexInput*, std::__1::vector<iknow::core::IkLexrep, iknow::base::PoolAllocator<iknow::core::IkLexrep> >&, int&, unsigned long, bool, std::__1::basic_string<char16_t, std::__1::char_traits<char16_t>, std::__1::allocator<char16_t> >&, double&, iknow::core::IkKnowledgebase*, double, int) + 1907
1 libiknowcore-36353c21.dylib 0x0000000108af8b81 iknow::core::IkIndexProcess::Start(iknow::core::IkIndexInput*, iknow::core::IkIndexOutput*, iknow::core::IkIndexDebug<std::__1::list<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > >*, bool, bool, bool, unsigned long, iknow::core::IkKnowledgebase*) + 1345
2 libiknowshell-1bceda57.dylib 0x0000000108aae95d iknow::shell::CProcess::IndexFunc(iknow::core::IkIndexInput&, void (*)(iknow::core::IkIndexOutput*, iknow::core::IkIndexDebug<std::__1::list<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > >*, void*, iknow::core::IkStemmer<std::__1::basic_string<char16_t, std::__1::char_traits<char16_t>, std::__1::allocator<char16_t> >, std::__1::basic_string<char16_t, std::__1::char_traits<char16_t>, std::__1::allocator<char16_t> > >*), void*, bool, bool) + 653
3 libiknowengine-2c57ddd0.dylib 0x0000000108a3c472 iKnowEngine::index(std::__1::basic_string<char16_t, std::__1::char_traits<char16_t>, std::__1::allocator<char16_t> >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool) + 498
4 libiknowengine-2c57ddd0.dylib 0x0000000108a40dd5 iKnowEngine::index(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool) + 53
5 engine.cpython-38-darwin.so 0x00000001089ee6f4 __pyx_pw_7iknowpy_6engine_11iKnowEngine_7index(_object*, _object*, _object*) + 3604
6 engine.cpython-38-darwin.so 0x00000001089df3ac __Pyx_CyFunction_CallAsMethod(_object*, _object*, _object*) + 92
7 com.apple.python3 0x000000010839bdd6 _PyObject_MakeTpCall + 374
8 com.apple.python3 0x000000010839f185 method_vectorcall + 229
9 com.apple.python3 0x000000010847c012 call_function + 354
10 com.apple.python3 0x00000001084786d6 _PyEval_EvalFrameDefault + 29782
11 com.apple.python3 0x000000010839c78d function_code_fastcall + 237
12 com.apple.python3 0x000000010847c012 call_function + 354
13 com.apple.python3 0x000000010847878a _PyEval_EvalFrameDefault + 29962
14 com.apple.python3 0x000000010847d097 _PyEval_EvalCodeWithName + 3287
15 com.apple.python3 0x00000001084711e0 PyEval_EvalCode + 48
16 com.apple.python3 0x00000001084c2933 PyRun_FileExFlags + 291
17 com.apple.python3 0x00000001084c1d9f PyRun_SimpleFileExFlags + 271
18 com.apple.python3 0x00000001084e1267 Py_RunMain + 2103
19 com.apple.python3 0x00000001084e1793 pymain_main + 403
20 com.apple.python3 0x00000001084e17eb Py_BytesMain + 43
21 libdyld.dylib 0x00007fff20445621 start + 1
Thread 0 crashed with X86 Thread State (64-bit):
rax: 0x00007f99a7900000 rbx: 0x0000000000000000 rcx: 0x00007f99a7900000 rdx: 0x0000000000000000
rdi: 0x00007ffee789bc20 rsi: 0x0000000000000004 rbp: 0x00007ffee789af20 rsp: 0x00007ffee789ada0
r8: 0x00007f99a6aa53b0 r9: 0x00007f99a6aa4f30 r10: 0x00007f99a6921340 r11: 0x0000000000000001
r12: 0x0000000108b32120 r13: 0x0000000000000000 r14: 0x00007f99a7900000 r15: 0x00007f99a7900000
rip: 0x0000000108afce43 rfl: 0x0000000000010293 cr2: 0x00007f99a78ffffe
Logical CPU: 1
Error Code: 0x00000004 (no mapping for user data read)
Trap Number: 14
Thread 0 instruction stream:
18 ff ff ff 4c 8b 7d a0-0f 83 47 01 00 00 66 2e ....L.}...G...f.
0f 1f 84 00 00 00 00 00-0f 1f 44 00 00 41 8b 07 ..........D..A..
ff c0 41 89 07 f6 45 cc-01 0f 85 26 01 00 00 48 ..A...E....&...H
63 c8 48 39 8d 10 ff ff-ff 0f 87 07 fb ff ff e9 c.H9............
11 01 00 00 49 63 c4 48-8b 4d a8 4c 8d 34 41 49 ....Ic.H.M.L.4AI
63 07 48 8d 04 41 4c 8d-25 e0 52 03 00 49 89 c7 c.H..AL.%.R..I..
[0f]b7 40 fe 66 83 f8 7f-77 43 8d 48 d0 41 bd 03 [email protected].. <==
00 00 00 66 83 f9 0a 72-5c 89 c1 83 e1 df 83 c1 ...f...r\.......
bf 66 83 f9 1a 72 4e 41-bd 03 00 00 00 66 83 f8 .f...rNA.....f..
0d 77 42 b9 00 34 00 00-0f a3 c1 73 38 4d 39 f7 .wB..4.....s8M9.
77 1b eb 23 66 0f 1f 84-00 00 00 00 00 0f b7 f8 w..#f...........
e8 48 a0 02 00 41 89 c5-4d 39 f7 76 0a 49 8d 47 .H...A..M9.v.I.G
Thread 0 last branch register state not available.
Binary Images:
0x108363000 - 0x108366fff com.apple.python3 (3.8.2 - 3.8.2) <6BA1329E-13BF-3FAB-ABA2-E57C991180AC> /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/Resources/Python.app/Contents/MacOS/Python
0x108378000 - 0x1085c7fff com.apple.python3 (3.8.2 - 3.8.2) <EC3F4640-FA3E-3557-88A7-A8402EDEFFFB> /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/Python3
0x108895000 - 0x108898fff +_heapq.cpython-38-darwin.so (73.40.6) <A1BA08C1-54F6-3C26-8634-0FDA853D57DC> /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/lib-dynload/_heapq.cpython-38-darwin.so
0x1088a5000 - 0x1088a8fff +_opcode.cpython-38-darwin.so (73.40.6) <FB35708A-9CB9-3C35-9EF8-C8392E91A29A> /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/lib-dynload/_opcode.cpython-38-darwin.so
0x1088f5000 - 0x108900fff +libicuio-dc9f50b5.68.2.dylib (0) <9496F364-D088-3C24-88BF-E51D492FD765> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libicuio-dc9f50b5.68.2.dylib
0x108909000 - 0x10890cfff +libiknowmodelcom-abcf95e0.dylib (0) <662B9ADA-022F-30FE-92DE-7F027019C8BA> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelcom-abcf95e0.dylib
0x1089d5000 - 0x1089f8fff +engine.cpython-38-darwin.so (0) <D0C9DFA3-2BE6-36D0-9B94-AF3FF31F59C0> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/engine.cpython-38-darwin.so
0x108a1d000 - 0x108a48fff +libiknowengine-2c57ddd0.dylib (0) <581F1DDC-842C-3CB7-95F8-A35B77672BD0> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowengine-2c57ddd0.dylib
0x108a71000 - 0x108a80fff +libiknowbase-e6b6bc45.dylib (0) <F03732C4-8DAB-3F12-BA50-A6602B6F1673> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowbase-e6b6bc45.dylib
0x108a99000 - 0x108ab8fff +libiknowshell-1bceda57.dylib (0) <8FD2F6B9-CFB1-383B-AC6D-CBA73A556551> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowshell-1bceda57.dylib
0x108ad5000 - 0x108b30fff +libiknowcore-36353c21.dylib (0) <D8BB5AD4-45E5-3844-8F2D-A621E1EDF0E7> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowcore-36353c21.dylib
0x108b75000 - 0x108d64fff +libicui18n-66033737.68.2.dylib (0) <6E6161F5-D061-3DD5-AABD-6CFDF9C0A5DE> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libicui18n-66033737.68.2.dylib
0x108e75000 - 0x108fd0fff +libicuuc-38a45a2b.68.2.dylib (0) <869DF4C9-1805-3B5F-9CC2-A762118A470D> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libicuuc-38a45a2b.68.2.dylib
0x109045000 - 0x10ab84fff +libicudata-fc5cc17d.68.2.dylib (0) <9B328C2D-4277-36F3-87B5-B4842F4B95BC> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libicudata-fc5cc17d.68.2.dylib
0x10ab89000 - 0x10ab90fff +libiknowali-86fc8c44.dylib (0) <3A48AC15-501D-3EE7-9A24-9A2C67FD1DAA> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowali-86fc8c44.dylib
0x10ab99000 - 0x10af10fff +libiknowmodelde-0c73cc5e.dylib (0) <E9DF6E9A-D6CF-39B7-A5F3-B34DE53DC1C3> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelde-0c73cc5e.dylib
0x10af21000 - 0x10af28fff +libiknowmodeldex-f9dbd60a.dylib (0) <F4A9D38B-2FC2-3E39-9FB7-2EEFBF92C53F> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodeldex-f9dbd60a.dylib
0x10af35000 - 0x10b5a4fff +libiknowmodelen-1f3801c6.dylib (0) <4211658B-7B0A-387F-8298-70D8F7752DFB> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelen-1f3801c6.dylib
0x10b5e9000 - 0x10b610fff +libiknowmodelenx-24cb2a74.dylib (0) <5F55A0CC-4BA8-3D20-A68F-9A310CBE2C06> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelenx-24cb2a74.dylib
0x10b61d000 - 0x10bd08fff +libiknowmodeles-abff9de2.dylib (0) <92768581-E483-334E-A94D-86E0896C4E78> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodeles-abff9de2.dylib
0x10bd25000 - 0x10bd2cfff +libiknowmodelesx-72f179bf.dylib (0) <BC07BD7A-3098-32B0-9957-32A2187404E0> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelesx-72f179bf.dylib
0x10bd39000 - 0x10c448fff +libiknowmodelfr-3dd7ab3d.dylib (0) <C34FE180-294B-31D6-9031-B8ADB2E0E8BE> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelfr-3dd7ab3d.dylib
0x10c469000 - 0x10c470fff +libiknowmodelfrx-bfcfa387.dylib (0) <A829EF38-EA2F-3B99-9FBE-57C1F08B475B> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelfrx-bfcfa387.dylib
0x10c47d000 - 0x10ca3cfff +libiknowmodelja-ca57a02b.dylib (0) <49E10BB0-5196-3FD0-B851-950978572A89> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelja-ca57a02b.dylib
0x10cb5d000 - 0x10cb64fff +libiknowmodeljax-f84c9c91.dylib (0) <6BC0C101-BC09-3EA3-9EEF-4A574512B303> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodeljax-f84c9c91.dylib
0x10cb71000 - 0x10cf84fff +libiknowmodelnl-40411dbf.dylib (0) <02FB0098-9EF0-3497-A8EA-424279782E59> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelnl-40411dbf.dylib
0x10cfa5000 - 0x10cfacfff +libiknowmodelnlx-1c7a6666.dylib (0) <C3BDAB87-0179-324A-AA21-1E36521F1016> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelnlx-1c7a6666.dylib
0x10cfb9000 - 0x10da4cfff +libiknowmodelpt-5c2f4caf.dylib (0) <7734CFDB-B28D-3582-8531-08205C03B255> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelpt-5c2f4caf.dylib
0x10da69000 - 0x10da70fff +libiknowmodelptx-85d4654a.dylib (0) <04E02350-C643-3EE3-8D33-4405AD74C493> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelptx-85d4654a.dylib
0x10da7d000 - 0x10dce4fff +libiknowmodelru-8ea2598f.dylib (0) <045F1501-416B-3513-AD34-18A8F0207F3C> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelru-8ea2598f.dylib
0x10dd01000 - 0x10dd28fff +libiknowmodelrux-c5df6bc8.dylib (0) <DF613DD8-47B6-37C1-AF9B-EC27913DE18A> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelrux-c5df6bc8.dylib
0x10dd39000 - 0x10df10fff +libiknowmodeluk-f8322ec4.dylib (0) <E6F6CF0E-A7AE-3EE1-8C9A-5958277CB6BB> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodeluk-f8322ec4.dylib
0x10df31000 - 0x10df54fff +libiknowmodelukx-fef454ad.dylib (0) <7CF0F9AA-7FF9-3708-8B6C-2053C1C4CA09> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelukx-fef454ad.dylib
0x10df65000 - 0x10e554fff +libiknowmodelsv-e28a4052.dylib (0) <BB9E995F-9A80-34EA-A9E2-9278ED856D2E> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelsv-e28a4052.dylib
0x10e571000 - 0x10e57cfff +libiknowmodelsvx-11383aed.dylib (0) <BAED8D7D-3280-3ED0-B6FB-36F8E5E953C9> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelsvx-11383aed.dylib
0x10e589000 - 0x10f388fff +libiknowmodelcs-8de28af0.dylib (0) <80B5985B-F1FF-3845-8442-D6E96CE8AEE4> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelcs-8de28af0.dylib
0x10f3b9000 - 0x10f3d0fff +libiknowmodelcsx-82298ec8.dylib (0) <95DB5E6D-DFC3-3ADA-B18B-6F1587F0570A> /Users/USER/Library/Python/3.8/lib/python/site-packages/iknowpy/libiknowmodelcsx-82298ec8.dylib
0x112dc6000 - 0x112e61fff dyld (832.7.1) <DEA51514-B4E8-3368-979B-89D0F8397ABC> /usr/lib/dyld
0x7fff2015f000 - 0x7fff20160fff libsystem_blocks.dylib (78) <9CF131C6-16FB-3DD0-B046-9E0B6AB99935> /usr/lib/system/libsystem_blocks.dylib
0x7fff20161000 - 0x7fff20196fff libxpc.dylib (2038.40.38) <003A027D-9CE3-3794-A319-88495844662D> /usr/lib/system/libxpc.dylib
0x7fff20197000 - 0x7fff201aefff libsystem_trace.dylib (1277.50.1) <48C14376-626E-3C81-B0F5-7416E64580C7> /usr/lib/system/libsystem_trace.dylib
0x7fff201af000 - 0x7fff2024dfff libcorecrypto.dylib (1000.60.19) <92F0211E-506E-3760-A3C2-808BF3905C07> /usr/lib/system/libcorecrypto.dylib
0x7fff2024e000 - 0x7fff2027afff libsystem_malloc.dylib (317.40.8) <2EF43B96-90FB-3C50-B73E-035238504E33> /usr/lib/system/libsystem_malloc.dylib
0x7fff2027b000 - 0x7fff202bffff libdispatch.dylib (1271.40.12) <CEF1460B-1362-381A-AE69-6BCE2D8C215B> /usr/lib/system/libdispatch.dylib
0x7fff202c0000 - 0x7fff202f9fff libobjc.A.dylib (818.2) <339EDCD0-5ABF-362A-B9E5-8B9236C8D36B> /usr/lib/libobjc.A.dylib
0x7fff202fa000 - 0x7fff202fcfff libsystem_featureflags.dylib (28.60.1) <7B4EBDDB-244E-3F78-8895-566FE22288F3> /usr/lib/system/libsystem_featureflags.dylib
0x7fff202fd000 - 0x7fff20385fff libsystem_c.dylib (1439.40.11) <06D9F593-C815-385D-957F-2B5BCC223A8A> /usr/lib/system/libsystem_c.dylib
0x7fff20386000 - 0x7fff203dbfff libc++.1.dylib (904.4) <AE3A940A-7A9C-3F99-B175-3511528D8DFE> /usr/lib/libc++.1.dylib
0x7fff203dc000 - 0x7fff203f4fff libc++abi.dylib (904.4) <DDFCBF9C-432D-3B8A-8641-578D2EDDCAD8> /usr/lib/libc++abi.dylib
0x7fff203f5000 - 0x7fff20423fff libsystem_kernel.dylib (7195.60.75) <4BD61365-29AF-3234-8002-D989D295FDBB> /usr/lib/system/libsystem_kernel.dylib
0x7fff20424000 - 0x7fff2042ffff libsystem_pthread.dylib (454.60.1) <8DD3A0BC-2C92-31E3-BBAB-CE923A4342E4> /usr/lib/system/libsystem_pthread.dylib
0x7fff20430000 - 0x7fff2046afff libdyld.dylib (832.7.1) <2F8A14F5-7CB8-3EDD-85EA-7FA960BBC04E> /usr/lib/system/libdyld.dylib
0x7fff2046b000 - 0x7fff20474fff libsystem_platform.dylib (254.60.1) <3F7F6461-7B5C-3197-ACD7-C8A0CFCC6F55> /usr/lib/system/libsystem_platform.dylib
0x7fff20475000 - 0x7fff204a0fff libsystem_info.dylib (542.40.3) <0979757C-5F0D-3F5A-9E0E-EBF234B310AF> /usr/lib/system/libsystem_info.dylib
0x7fff204a1000 - 0x7fff2093cfff com.apple.CoreFoundation (6.9 - 1770.300) <7AADB19E-8EA2-3C9B-8699-F206DB47C6BE> /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation
0x7fff22560000 - 0x7fff227c1fff libicucore.A.dylib (66109) <6C0A0196-2778-3035-81CE-7CA48D6C0628> /usr/lib/libicucore.A.dylib
0x7fff227c2000 - 0x7fff227cbfff libsystem_darwin.dylib (1439.40.11) <BD269412-C9D0-32EE-B42B-B09A187A9B95> /usr/lib/system/libsystem_darwin.dylib
0x7fff22bdc000 - 0x7fff22be7fff libsystem_notify.dylib (279.40.4) <98D74EEF-60D9-3665-B877-7BE1558BA83E> /usr/lib/system/libsystem_notify.dylib
0x7fff24b37000 - 0x7fff24b45fff libsystem_networkextension.dylib (1295.60.5) <F476B1CB-3561-30C5-A78E-44E99B1720A3> /usr/lib/system/libsystem_networkextension.dylib
0x7fff24ba3000 - 0x7fff24bb9fff libsystem_asl.dylib (385) <940C5BB9-4928-3A63-97F2-132797C8B7E5> /usr/lib/system/libsystem_asl.dylib
0x7fff262d0000 - 0x7fff262d7fff libsystem_symptoms.dylib (1431.60.1) <88F35AAC-746F-3176-81DF-49CE3D285636> /usr/lib/system/libsystem_symptoms.dylib
0x7fff28604000 - 0x7fff28614fff libsystem_containermanager.dylib (318.60.1) <4ED09A19-04CC-3464-9EFB-F674932020B5> /usr/lib/system/libsystem_containermanager.dylib
0x7fff29314000 - 0x7fff29317fff libsystem_configuration.dylib (1109.60.2) <C57B346B-0A03-3F87-BCAC-87B702FA0719> /usr/lib/system/libsystem_configuration.dylib
0x7fff29318000 - 0x7fff2931cfff libsystem_sandbox.dylib (1441.60.4) <8CE27199-D633-31D2-AB08-56380A1DA9FB> /usr/lib/system/libsystem_sandbox.dylib
0x7fff29f27000 - 0x7fff29f29fff libquarantine.dylib (119.40.2) <19D42B9D-3336-3543-AF75-6E605EA31599> /usr/lib/system/libquarantine.dylib
0x7fff2a4a9000 - 0x7fff2a4adfff libsystem_coreservices.dylib (127) <A2D875B9-8BA8-33AD-BE92-ADAB915A8D5B> /usr/lib/system/libsystem_coreservices.dylib
0x7fff2a6c4000 - 0x7fff2a70ffff libsystem_m.dylib (3186.40.2) <0F98499E-662F-36EC-AB58-91A8D5A0FB74> /usr/lib/system/libsystem_m.dylib
0x7fff2a711000 - 0x7fff2a716fff libmacho.dylib (973.4) <28AE1649-22ED-3C4D-A232-29D37F821C39> /usr/lib/system/libmacho.dylib
0x7fff2a733000 - 0x7fff2a73efff libcommonCrypto.dylib (60178.40.2) <1D0A75A5-DEC5-39C6-AB3D-E789B8866712> /usr/lib/system/libcommonCrypto.dylib
0x7fff2a73f000 - 0x7fff2a749fff libunwind.dylib (200.10) <C5792A9C-DF0F-3821-BC14-238A78462E8A> /usr/lib/system/libunwind.dylib
0x7fff2a74a000 - 0x7fff2a751fff liboah.dylib (203.13.2) <FF72E19B-3B02-34D4-A821-3397BB28AC02> /usr/lib/liboah.dylib
0x7fff2a752000 - 0x7fff2a75cfff libcopyfile.dylib (173.40.2) <89483CD4-DA46-3AF2-AE78-FC37CED05ACC> /usr/lib/system/libcopyfile.dylib
0x7fff2a75d000 - 0x7fff2a764fff libcompiler_rt.dylib (102.2) <0DB26EC8-B4CD-3268-B865-C2FC07E4D2AA> /usr/lib/system/libcompiler_rt.dylib
0x7fff2a765000 - 0x7fff2a767fff libsystem_collections.dylib (1439.40.11) <D40D8097-0ABF-3645-B065-168F43ACFF4C> /usr/lib/system/libsystem_collections.dylib
0x7fff2a768000 - 0x7fff2a76afff libsystem_secinit.dylib (87.60.1) <99B5FD99-1A8B-37C1-BD70-04990FA33B1C> /usr/lib/system/libsystem_secinit.dylib
0x7fff2a76b000 - 0x7fff2a76dfff libremovefile.dylib (49.40.3) <750012C2-7097-33C3-B796-2766E6CDE8C1> /usr/lib/system/libremovefile.dylib
0x7fff2a76e000 - 0x7fff2a76efff libkeymgr.dylib (31) <2C7B58B0-BE54-3A50-B399-AA49C19083A9> /usr/lib/system/libkeymgr.dylib
0x7fff2a76f000 - 0x7fff2a776fff libsystem_dnssd.dylib (1310.60.4) <81EFC44D-450E-3AA3-AC8F-D7EF68F464B4> /usr/lib/system/libsystem_dnssd.dylib
0x7fff2a777000 - 0x7fff2a77cfff libcache.dylib (83) <2F7F7303-DB23-359E-85CD-8B2F93223E2A> /usr/lib/system/libcache.dylib
0x7fff2a77d000 - 0x7fff2a77efff libSystem.B.dylib (1292.60.1) <A7FB4899-9E04-37ED-9DD8-8FFF0400879C> /usr/lib/libSystem.B.dylib
0x7fff2a77f000 - 0x7fff2a782fff libfakelink.dylib (3) <34B6DC95-E19A-37C0-B9D0-558F692D85F5> /usr/lib/libfakelink.dylib
0x7fff2a783000 - 0x7fff2a783fff com.apple.SoftLinking (1.0 - 1) <90D679B3-DFFD-3604-B89F-1BCF70B3EBA4> /System/Library/PrivateFrameworks/SoftLinking.framework/Versions/A/SoftLinking
0x7fff2dd0c000 - 0x7fff2dd0cfff liblaunch.dylib (2038.40.38) <05A7EFDD-4111-3E4D-B668-239B69DE3D0F> /usr/lib/system/liblaunch.dylib
0x7fff301b9000 - 0x7fff301b9fff libsystem_product_info_filter.dylib (8.40.1) <7CCAF1A8-F570-341E-B275-0C80B092F8E0> /usr/lib/system/libsystem_product_info_filter.dylib
External Modification Summary:
Calls made by other processes targeting this process:
task_for_pid: 0
thread_create: 0
thread_set_state: 0
Calls made by this process:
task_for_pid: 0
thread_create: 0
thread_set_state: 0
Calls made by all processes on this machine:
task_for_pid: 475
thread_create: 0
thread_set_state: 0
VM Region Summary:
ReadOnly portion of Libraries: Total=608.8M resident=0K(0%) swapped_out_or_unallocated=608.8M(100%)
Writable regions: Total=96.1M written=0K(0%) resident=0K(0%) swapped_out=0K(0%) unallocated=96.1M(100%)
VIRTUAL REGION
REGION TYPE SIZE COUNT (non-coalesced)
=========== ======= =======
Kernel Alloc Once 8K 1
MALLOC 73.6M 55
MALLOC guard page 16K 4
MALLOC_LARGE (reserved) 512K 2 reserved VM address space (unallocated)
STACK GUARD 4K 1
Stack 16.0M 1
VM_ALLOCATE 4872K 21
__DATA 2547K 87
__DATA_CONST 2986K 38
__DATA_DIRTY 95K 23
__LINKEDIT 494.0M 67
__OBJC_RO 60.5M 1
__OBJC_RW 2451K 2
__TEXT 115.0M 84
__UNICODE 588K 1
shared memory 8K 2
=========== ======= =======
TOTAL 772.9M 390
TOTAL, minus reserved VM space 772.4M 390
Model: iMac19,1, BootROM VirtualBox, 2 processors, Unknown, 2.9 GHz, 8 GB, SMC 2.3f35
Graphics: spdisplays_display, 5 MB
Memory Module: Bank 0/DIMM 0, 8 GB, DRAM, 1600 MHz, innotek GmbH, -
Network Service: Ethernet, Ethernet, en0
Network Service: Ethernet Adaptor (en1), Ethernet, en1
Serial ATA Device: VBOX HARDDISK, 137.44 GB
Serial ATA Device: VBOX CD-ROM
USB Device: USB Bus
USB Device: USB Tablet
USB Device: USB Keyboard
USB Device: USB 2.0 Bus
Thunderbolt Bus:
In IRIS, it is possible to influence sentence detection and attribute detection (among others, negation and sentiment) through a user dictionary. It would be helpful to have that functionality in the Python interface too.
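In the meantime, something similar can be emulated client-side. Below is a minimal pure-Python sketch of what applying a user dictionary could look like as a post-processing pass; the label names (`UDNegation`, `UDSentenceNoEnd`) and the shape of the entity dicts are assumptions for illustration, not the iknowpy API.

```python
# Hypothetical sketch only -- not the iknowpy API.
USER_DICT = {
    "w/o": "UDNegation",       # treat "w/o" as a negation marker
    "Dr.": "UDSentenceNoEnd",  # "Dr." should not end a sentence
}

def apply_user_dictionary(entities, user_dict=USER_DICT):
    """Attach user-defined labels to matching entity literals."""
    for ent in entities:
        label = user_dict.get(ent["literal"])
        if label:
            ent.setdefault("labels", []).append(label)
    return entities
```

A real implementation would of course hook into the engine before indexing, as IRIS does, rather than patch up its output afterwards.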
On some platforms (e.g. a typical Linux distribution), the default package manager will install ICU into the default include/library paths, which makes checking for ICUDIR redundant and indeed impossible to satisfy, as there's no "root" of ICU.
The root Makefile needs some basic logic to construct the ICU lib/include flags (which may both be empty in the typical Linux case).
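The flag-construction logic is simple enough to sketch; here it is in Python for illustration (the actual change would live in make syntax). The library names and the `include`/`lib` layout below are the conventional ICU ones and are assumptions here.

```python
import os

def icu_flags(icudir=None):
    """Return (include_flags, lib_flags) for building against ICU."""
    if not icudir:
        # System-wide install: headers and libs are already on the
        # default search paths, so no -I/-L flags are needed.
        return [], ["-licuuc", "-licui18n"]
    return ([f"-I{os.path.join(icudir, 'include')}"],
            [f"-L{os.path.join(icudir, 'lib')}", "-licuuc", "-licui18n"])
```

When `ICUDIR` is unset, both flag lists degenerate gracefully instead of failing a mandatory existence check.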
Preprocessor items such as "he's", "we're", etc. are split in the preprocessor so that "he" and "we" can be processed differently from "'s" and "'re". The space inserted between "he" and "'s", or "we" and "'re", shouldn't be visible in the RAW output, but it currently is.
input:
despite what we're hearing
current output:
despite what 're hearing
desired output:
despite what're hearing
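A possible post-processing sketch for the RAW output: re-join clitics that the preprocessor split off, so the inserted space no longer leaks through. The clitic list below is a small assumption; the real preprocessor knows exactly which tokens it split.

```python
import re

# Clitics the English preprocessor is known to split off (list is a guess).
CLITICS = ("'s", "'re", "'ve", "'ll", "'d", "n't")
_SPLIT = re.compile(r" (?=(?:%s)\b)" % "|".join(re.escape(c) for c in CLITICS))

def rejoin_clitics(text):
    """Remove the space the preprocessor inserted before a clitic."""
    return _SPLIT.sub("", text)

rejoin_clitics("despite what 're hearing")  # -> "despite what're hearing"
```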
iknow/modules/iknowpy/engine.pxd
Line 30 in 60b1c74
This line suggests the engine API is exposing our internal UTF-16 string representation. Though cognizant of the performance hit, I think we need an API where all string content consists of ordinary UTF-8 char* strings (maybe not the only one, but certainly the one that code like @adang1345's would use)
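To see why leaking the internal UTF-16 representation is awkward for consumers, note that offsets counted in UTF-16 code units line up with neither UTF-8 bytes nor code points once non-BMP characters appear:

```python
s = "a\U0001F600b"  # 'a', a non-BMP emoji, 'b'
print(len(s))                           # 3 code points
print(len(s.encode("utf-16-le")) // 2)  # 4 UTF-16 code units (surrogate pair)
print(len(s.encode("utf-8")))           # 6 UTF-8 bytes
```

Whatever unit the API settles on, it should be one the caller can reason about without knowing the engine's internals.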
During lang_update processing, a lexrep double with conflicting labels currently only results in a notification. Instead, it should stop the language compilation so the conflict can be resolved.
Example:
;;letošními;;CSAdj;CSAdjInstrPl;CSBeginTime;
;;letošními;;CSVerb;
Message:
conflicting double: letošními Labels= CSVerb; conflicts with CSAdj;CSAdjInstrPl;CSBeginTime;
Result:
Successfully installed iknowpy
Trace output:
LexrepIdentified:letošními:CSVerb;
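The stricter check could look roughly like this sketch, which loads lexrep lines and raises on a conflicting double instead of merely notifying. The CSV field positions are inferred from the examples above.

```python
def load_lexreps(lines):
    """Load lexrep CSV lines; raise on a double with conflicting labels."""
    seen = {}
    for line in lines:
        fields = line.split(";")
        lexrep = fields[2]
        labels = tuple(f for f in fields[4:] if f)
        if lexrep in seen and seen[lexrep] != labels:
            # fatal instead of a notification: abort language compilation
            raise ValueError(
                f"conflicting double: {lexrep} Labels= {';'.join(labels)}; "
                f"conflicts with {';'.join(seen[lexrep])};")
        seen[lexrep] = labels
    return seen
```

An exact duplicate (same labels) would still pass; only genuinely conflicting doubles abort the build.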
It seems that the style sheet iKnowXML.xsl does not pick up the spans for Certainty attributes, although they are present in the XML file.
Both genXML.py and iKnowXML.xsl can be found under "language_development".
The attached file is actually an XML file. It contains an example.
'sent_attributes': [{'entity_ref': 10,
'marker_': 'not',
'offset_start_': 64,
'offset_stop_': 67,
'type_': 1,
'unit2_': '',
'unit_': '',
'value2_': '',
'value_': ''}]}]}
These should be human-readable strings.
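As an interim client-side workaround, the numeric codes can be mapped back to names. The mapping below is an assumption pieced together from the observed output (the 'not' marker above carries type_ 1, suggesting Negation); it is not an official enum.

```python
# 1 -> "Negation" is inferred from the 'not' marker above; extend the
# table only once the other codes are confirmed against the engine.
ATTR_TYPE_NAMES = {1: "Negation"}

def readable(attr):
    """Return a copy of the attribute dict with type_ as a string."""
    out = dict(attr)
    out["type_"] = ATTR_TYPE_NAMES.get(attr["type_"], str(attr["type_"]))
    return out
```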
I came across a situation where the handleCSV.py script does not restore multi-lexreps expanded by the DELVE code.
Original line in JP_jp_lexreps.csv:
;;(を手に取|を手にと);;JPParticleWO;-;JPVerbOther;Join;Join;
lexreps.csv generated by DELVE for the same line reads as follows:
/**** Rewritten by DELVE, ;;を手に取;;JPParticleWO;-;JPVerbOther;Join;Join;
;;を手に取;;JPParticleWO;Lit_を;-;JPVerbOther;Join;Join
/**** Rewritten by DELVE, ;;を手にと;;JPParticleWO;-;JPVerbOther;Join;Join;
;;を手にと;;JPParticleWO;Lit_を;-;JPVerbOther;Join;Join
/* Expanded previously by DELVE....;;(を手に取|を手にと);;JPParticleWO;-;JPVerbOther;Join;Join;
lexreps.csv after running handleCSV.py:
=> the lexreps associated with the line no longer exist
I've checked several other expanded multi-lexrep entries, but so far this seems to be the only instance with this issue.
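A sketch of the restore pass that handleCSV.py appears to be missing: drop the DELVE-generated expansions and re-emit the original multi-lexrep line from the "Expanded previously" marker. The marker strings are copied from the example above; the real script may track these differently.

```python
def restore_multi_lexreps(lines):
    """Collapse DELVE-expanded lexrep lines back to the original entry."""
    out, skip_next = [], False
    for line in lines:
        if skip_next:                      # generated expansion line
            skip_next = False
            continue
        if line.startswith("/**** Rewritten by DELVE"):
            skip_next = True               # drop marker + generated line
            continue
        if line.startswith("/* Expanded previously by DELVE"):
            # recover the original multi-lexrep line after the marker
            out.append(line.split("....", 1)[1])
            continue
        out.append(line)
    return out
```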
I’m Rei Noguchi from Gunma University Hospital, and I really appreciate the prompt implementation of “negation expansion” in Japanese (#33). I’m now trying to analyze daily progress notes in electronic medical records; unlike discharge summaries, which are stylized documents, the progress notes are often written in a colloquial or narrative style and include incomplete sentences, resulting in some problems.
To analyze these casual texts in the medical field more accurately, I would like to propose the following three improvements.
1. Extract a word followed by +/- without parentheses as a single entity
2. Resolve the different entity extraction results depending on the punctuation mark (Japanese period ”。” or just a space)
3. Detect time expressions
The details are as follows.
The previous improvement (#31) enabled Katakana or numbers enclosed in parentheses to be concatenated with the preceding Concept as a single entity. This works in many cases, especially in stylized documents, and is useful for identifying the relation of negation. (e.g. heart murmur(-) → no heart murmur)
However, in informal text such as daily progress notes, there is a problem: some entities are followed by +/- without parentheses. Even in these cases, the +/- symbol should be concatenated with the preceding Concept as a single entity, because doctors write the text with the same intention, and this enables us to clarify the negation relation. Is this improvement technically possible?
Importantly, in many of these cases there is no space between an entity and +/-, whereas there is often a half-width or full-width space after +/- to separate it from the next entity.
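One conceivable preprocessing workaround, pending engine support: rewrite an unparenthesized trailing +/- into the parenthesized form that improvement #31 already concatenates with the preceding Concept. The sign and space character classes below are assumptions.

```python
import re

# Half- and full-width sign characters (the exact set is an assumption).
_SIGN = re.compile(r"([+\-±＋－])(?=[ \u3000]|$)")

def parenthesize_signs(text):
    """Rewrite '熱+' into '熱(+)' so existing parenthesis handling applies."""
    return _SIGN.sub(r"(\1)", text)

parenthesize_signs("熱+ 嘔気-")  # -> "熱(+) 嘔気(-)"
```

Signs already inside parentheses are untouched, since they are not followed by a space or end of text.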
“熱はなし” (no fever) is currently extracted as a single entity, probably because the phrase contains the all-hiragana homonym “はなし”. In contrast, if there is a punctuation mark (i.e. the Japanese period “。”) at the end of the phrase, as in “熱はなし。”, the phrase is divided into multiple entities. The latter behavior seems preferable for identifying the negation relation.
However, because doctors often end a sentence with just a space in place of the Japanese period “。”, I think a phrase ending with a space should be divided into multiple entities in the same manner as with “。”.
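A crude preprocessing heuristic for this convention might insert “。” wherever Japanese text is followed by a space and more Japanese text. The character ranges below are a rough assumption and would need tuning:

```python
import re

_JP = r"[ぁ-んァ-ヶ一-龥]"  # rough Japanese character ranges (assumption)
_SPACE_END = re.compile(rf"({_JP})[ \u3000]+(?={_JP})")

def space_to_period(text):
    """Treat a space between Japanese phrases as a sentence boundary."""
    return _SPACE_END.sub(r"\1。", text)

space_to_period("熱はなし 咳はあり")  # -> "熱はなし。咳はあり"
```

Spaces inside mixed Latin/numeric runs (lab values, drug doses) are left alone, which is usually the desired behavior in clinical notes.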
In medical progress notes, there are many time expressions, so it would be very useful if they could be identified by something like markers.
Some examples:
iKnow is an indispensable tool, especially in the medical field, where there are many unknown words.
I recognize the great value of iKnow and look forward to further improvements.
Thank you for your help.
NOTE: this is a suggestion/request that came from Dr. Rei Noguchi @ Gunma University Hospital.
In iKnow, Negation expansion is normally done using the Path, which for non-Japanese languages is the word order in the Sentence. Since we developed the Entity Vector as a special-case Path for Japanese, the order of entities within the Path is mostly different from how they appear within the Sentence. For this reason, we have not yet implemented Negation expansion beyond the boundaries of the entity that includes the Negation marker.
For example:
今週はレッスンはない。- There is no lesson this week.
Entity Vector - レッスン ない 今週
The two particles は are NonRelevant.
Because of the sentence structure, the word ない, which is the present form of the Adjectival Verb meaning "doesn't exist" and a Negation marker, does not expand beyond itself. This is a problem, since it's not possible to know "what" is being negated without reading the entire sentence.
Dr. Noguchi used the current iKnow Python interface to experiment with his medical data, which often uses simple sentence structures that closely resemble the format: XXX は (or が) ない (or なかった, the past form of the same Adjectival Verb, meaning "didn't exist").
EXPERIMENT:
In cases like the above, expand Negation to the left to the Concept before the particle は or が; in the above example, that would be "XXX".
In addition, there are some sentences where XXX is replaced by "XXX1やXXX2", meaning "XXX1 and/or XXX2". In such cases, expand Negation to the left all the way to the Concept before the particle や, i.e., "XXX1" (the first Concept).
His experiment suggested that, at least for his data, such an expansion is normally semantically correct and would give more meaningful results for his machine-learning work, since it makes clearer what exists and what doesn't. (For example: "There was no fever" vs. "Patient had fever.")
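The left-expansion rule from the EXPERIMENT can be sketched over a simplified token list (real Entity Vectors carry more structure; the particle and marker sets below are just the ones mentioned above):

```python
PARTICLES = {"は", "が"}
COORDINATORS = {"や"}
NEGATION_MARKERS = {"ない", "なかった"}

def negation_scope(tokens):
    """Concepts covered by a sentence-final negation marker, expanding
    left past は/が and across や (coordination)."""
    if not tokens or tokens[-1] not in NEGATION_MARKERS:
        return []
    scope = []
    i = len(tokens) - 2
    if i >= 0 and tokens[i] in PARTICLES:
        i -= 1                      # skip the particle before ない
    while i >= 0:
        scope.insert(0, tokens[i])  # the negated Concept
        i -= 1
        if i >= 0 and tokens[i] in COORDINATORS:
            i -= 1                  # "XXX1やXXX2": keep expanding left
        else:
            break
    return scope
```

With the example sentence, `negation_scope(["今週", "は", "レッスン", "は", "ない"])` yields `["レッスン"]`, matching the proposed expansion to the Concept before は.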
There are two different ways Negation expansion can be implemented.
The first approach is quicker, but may not be as useful longer-term. Any comments or additional considerations that I'm missing? @ISC-SDE @bdeboe @JosDenysGitHub @woodfinisc
iknow/modules/core/src/IkIndexDebugList.cpp
Line 102 in 9b12ef9
This won't compile for me on Mac (nor does it look like it should):
/Users/woodfin/git/iknow/modules/core/src/IkIndexDebugList.cpp:102:27: error: non-const lvalue reference to type 'list<...>' cannot bind to a temporary of type 'list<...>' Utf8List &trace = ToList(idx, j, kb);