iKnow

iKnow is a library for Natural Language Processing that identifies entities (phrases) and their semantic context in natural language text in English, German, Dutch, French, Spanish, Portuguese, Swedish, Russian, Ukrainian, Czech and Japanese. It was originally developed by i.Know in Belgium and acquired by InterSystems in 2010 to be embedded in its Caché and IRIS Data Platform products. InterSystems published the iKnow engine as open source in 2020.

This readme file has the basic pointers to get started, but make sure you click through to the wiki for more details on any of these subjects.

Understanding iKnow

Entities

iKnow identifies phrase boundaries that define Entities, entirely based on the syntactic structure of the sentences, rather than relying on an upfront dictionary or pretrained model. This makes iKnow well-suited for initial exploration of a new corpus. iKnow Entities are not Named Entities in the NER sense, but rather the word groups that need to be considered together, representing a concept or relationship as coined by the text author in its entirety. The following examples clearly show the importance of this phrase level to fully capture what the author meant:

| iKnow Entity                    | Meaning                  |
|---------------------------------|--------------------------|
| Dopamine                        | small molecule           |
| Dopamine receptor               | drug target              |
| Dopamine receptor antagonist    | chemical drug            |
| Dopamine receptor gene          | gene, molecular sequence |
| Dopamine receptor gene mutation | physiological process    |

iKnow will label every entity with a simple role that is either concept (usually corresponding to Noun Phrases in POS lingo) or relation (verbs, prepositions, ...). Typical stop words that have little meaning of their own get categorized as PathRelevant (e.g. pronouns) or NonRelevant parts, depending on whether they play a role in the sentence structure or are just linguistic fodder.

In the following sample sentence, we've highlighted concepts, relations and PathRelevants separately.

Belgian geuze is well-known across the continent for its delicate balance.
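
To make the four roles concrete, here is a minimal C++ sketch that labels the sample sentence by hand. The `Role` enum, the `LabeledEntity` struct and the `texts_with_role` helper are hypothetical names chosen for illustration and are not part of the iKnow API. The role assignments are our own reading of the sentence, not actual engine output.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; not the iKnow engine's own types.
enum class Role { Concept, Relation, PathRelevant, NonRelevant };

struct LabeledEntity {
    std::string text;
    Role role;
};

// Hand-labeled version of the sample sentence (assumed labels, not engine output).
inline std::vector<LabeledEntity> sample_sentence() {
    return {
        {"Belgian geuze", Role::Concept},
        {"is well-known across", Role::Relation},
        {"the", Role::NonRelevant},
        {"continent", Role::Concept},
        {"for", Role::Relation},
        {"its", Role::PathRelevant},
        {"delicate balance", Role::Concept},
    };
}

// Collect the text of every entity that carries the given role.
inline std::vector<std::string> texts_with_role(const std::vector<LabeledEntity>& sentence,
                                                Role role) {
    std::vector<std::string> out;
    for (const auto& entity : sentence) {
        if (entity.role == role) out.push_back(entity.text);
    }
    return out;
}
```

Calling `texts_with_role(sample_sentence(), Role::Concept)` on this hand-labeled data returns the three concept phrases, which is the kind of phrase-level view the entity table above relies on.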

Read more...

Attributes

Beyond this simple phrase recognition, iKnow also captures the context of these entities through semantic attributes. Attributes label spans (of entities) within a sentence that share a semantic context. Most attributes start from a marker term and are then, through linguistic rules, expanded left and right as appropriate per the syntactic structure of the sentence. iKnow's main contribution is in this fine-grained expansion, which has been shown to be more accurate than many ML-based techniques.

iKnow supports the following attribute types:

  • Negation: iKnow tags all entities participating in a negation, as opposed to an (implied) affirmative context.

    After discussing his nausea, the [patient didn't report suffering from chest pain, shortness of breath or tickling].

  • Sentiment: based on a user-supplied list of marker terms, iKnow will identify spans with either a positive or negative sentiment (through separate attributes). Overlapping negation attributes will reverse the sentiment in some language models.

    [I liked the striped pyjamas], but the [slippers didn't really fit with it].

  • Measurements, Time, Frequency and Duration: all entities "participating" in an expression of something measurable or time-related will be tagged, enabling efficient recognition of facts in long stretches of natural language text.

    Upon exam [two weeks ago] the [patient's weight was 146.5 pounds].

  • Certainty: this attribute is a work in progress. See the corresponding wiki section for more details.

Some attributes are not available for all languages yet. See the wiki section for more details.
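
One way to picture an attribute is as a typed range over the entity positions of a sentence. The sketch below assumes exactly that and nothing more: `AttributeType`, `AttributeSpan` and `has_attribute` are hypothetical names, the real structures live in the iknowdata namespace, and the linguistic expansion rules themselves are not modeled here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only; not the engine's own definitions.
enum class AttributeType {
    Negation, PositiveSentiment, NegativeSentiment,
    Measurement, Time, Frequency, Duration, Certainty
};

// A span covers the entities at positions [first, last] (inclusive) in one sentence.
struct AttributeSpan {
    AttributeType type;
    std::size_t first;
    std::size_t last;
};

// True if the entity at `position` lies inside any span of the given type.
inline bool has_attribute(const std::vector<AttributeSpan>& spans,
                          std::size_t position,
                          AttributeType type) {
    for (const auto& span : spans) {
        if (span.type == type && position >= span.first && position <= span.last) {
            return true;
        }
    }
    return false;
}
```

With a span such as `{AttributeType::Negation, 3, 9}`, positions 3 through 9 count as negated while position 2 stays in the implied affirmative context, which mirrors how the bracketed part of the negation example excludes "After discussing his nausea".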

How it works

Some InterSystems-era resources on how iKnow works:

Read more...

Using iKnow

Read more on the APIs here.

Directly

The C++ API is declared in "engine.h" (modules\engine\src). It defines the class "iKnowEngine" and its main method, index(TextSource, language). After indexing, all data is stored in iknowdata::Text_Source m_index. "iknowdata" is the namespace used for all classes that contain meaningful data:

  • iknowdata::struct Entity : represents a text entity after indexing.
  • iknowdata::struct Sent_Attribute : represents an attribute sentence marker.
  • iknowdata::struct Path_Attribute_Span : represents a span in the sentence's path after attribute expansion.
  • iknowdata::struct Sentence : represents a sentence in the text source after indexing.
  • iknowdata::Sentence::Path : represents a path in a sentence.
  • iknowdata::struct Text_Source : represents the whole text after indexing.

enginetest.cpp (modules\enginetest\enginetest.cpp) has a demo function (void a_short_demo(void)) that explains every step from indexing to retrieving the results.
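
The nesting described above (a text source holds sentences, a sentence holds entities) can be sketched without the library. Everything in the `sketch` namespace below is a stand-in: the struct names echo the list above, but the field names, the character-offset representation and the use of std::string are assumptions made for this example. Consult engine.h and a_short_demo() for the real definitions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

namespace sketch {  // stand-in namespace; the real types are in iknowdata (engine.h)

struct Entity {
    std::size_t offset_start;  // assumed: index of the first character in the text
    std::size_t offset_stop;   // assumed: index one past the last character
};

struct Sentence {
    std::vector<Entity> entities;
};

struct Text_Source {
    std::vector<Sentence> sentences;
};

// Walk text -> sentences -> entities and cut each entity's text out of the source.
inline std::vector<std::string> entity_texts(const std::string& text,
                                             const Text_Source& index) {
    std::vector<std::string> out;
    for (const auto& sentence : index.sentences) {
        for (const auto& entity : sentence.entities) {
            out.push_back(text.substr(entity.offset_start,
                                      entity.offset_stop - entity.offset_start));
        }
    }
    return out;
}

}  // namespace sketch
```

Storing offsets rather than copies of the text keeps the indexing result compact and lets a caller map every entity back to its exact place in the original input.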

The main iKnowEngine::index() method currently has two limitations: it only works synchronously and it is single-threaded. A mutex is used to synchronize threads internally, so no protection is needed on the client side.
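
The following sketch illustrates what internal synchronization means for client code. `MockEngine` is a hypothetical stand-in, not the iKnowEngine class: it assumes a per-instance mutex and replaces the real indexing work with simple counters. Whether the real engine locks per instance or globally is not stated here.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for an engine whose index() serializes callers internally.
class MockEngine {
public:
    // Blocks until the text is "indexed"; concurrent callers wait their turn.
    void index(const std::string& text) {
        std::lock_guard<std::mutex> guard(mutex_);
        total_chars_ += text.size();  // placeholder for the real indexing work
        ++documents_;
    }
    std::size_t documents() const { return documents_; }
    std::size_t total_chars() const { return total_chars_; }

private:
    std::mutex mutex_;
    std::size_t documents_ = 0;
    std::size_t total_chars_ = 0;
};

// Call index() from several client threads without any locking on the caller's side.
inline std::size_t index_concurrently(MockEngine& engine,
                                      const std::vector<std::string>& texts) {
    std::vector<std::thread> workers;
    for (const auto& text : texts) {
        workers.emplace_back([&engine, &text] { engine.index(text); });
    }
    for (auto& worker : workers) worker.join();
    return engine.documents();  // safe to read: all workers have finished
}
```

Because the lock lives inside index(), the calling threads stay free of synchronization code, but they also gain no parallel speed-up: each call still runs one at a time.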

From Python

WIP

From SpaCy

WIP

From InterSystems IRIS

For many years, the iKnow engine has been available as an embedded service on the InterSystems IRIS Data Platform. The obvious advantage of packaging it with a database is that indexing results from many documents can be stored in a single repository, enabling corpus-wide analytics through practical APIs. See the iKnow documentation for IRIS or browse the InterSystems Developer Community's articles on setting up an iKnow domain, browsing it and using iFind (iKnow-powered text search).

The InterSystems IRIS Community Edition is available from Docker Hub free of charge.

From UIMA

This part of the kit has not yet been added to the open source repository, but relevant documentation can be found here.

Building iKnow

The source code for the iKnow engine is written in C++ and includes .sln files for building with Microsoft Visual Studio 2019 Community Edition and Makefiles for building in Linux/Unix. See also this wiki page for more on the overall build process.

Dependencies

  • ICU : Header files and libraries

On Windows

Step 1: Setting up dependencies

  1. Download the Win64 binaries for a recent release of the ICU library (e.g. version 65.1) and unzip to <repo_root>/thirdparty/icu (or a local folder of your choice).

  2. If you chose a different folder for your ICU libraries, update <repo_root>\modules\Dependencies.props to represent your local configuration. This is how it looks after download, which should be OK if you used the suggested directory paths:

  <PropertyGroup Label="UserMacros">
    <ICUDIR>$(SolutionDir)..\thirdparty\icu\</ICUDIR>
    <ICU_INCLUDE>$(ICUDIR)\include</ICU_INCLUDE>
    <ICU_LIB>$(ICUDIR)\lib64</ICU_LIB>
  </PropertyGroup>

Step 2: Building iKnow

  1. Open the Solution file <repo_root>\modules\iKnowEngineTest.sln in Visual Studio. We used Visual Studio Community 2019.

  2. In the Solution Explorer, right-click "iKnowEngineTest" and choose "Set as Startup Project".

  3. In Solution Configurations, choose either "Debug|x64" or "Release|x64", depending on the kind of executable you prefer.

  4. Build the solution. This builds all 29 projects.

Step 3: Testing the indexer

Once building has succeeded, you can run the test program, depending on which build config you chose:

  • <repo_root>\kit\x64\Debug\bin\iKnowEngineTest.exe
  • <repo_root>\kit\x64\Release\bin\iKnowEngineTest.exe

⚠️ Note that you'll have to add the $(ICUDIR)/bin64 directory to your PATH or copy its .dll files to this test folder in order to run the test executable.

Alternatively, you can also start a debugging session in Visual Studio and walk through the code to inspect it.

The iKnow indexing demo program will index one sentence for each of the 11 languages, and write out the sentence boundaries. That's of course not very spectacular by itself, but future iterations of this demo program will expose more of the entity and context information iKnow detects.

On Linux / Unix

Step 1: Setting up dependencies

  1. Download the proper binaries for a recent release of the ICU library (e.g. version 65.1) and untar to <repo_root>/thirdparty/icu (or a local folder of your choice).

  2. Save the path you untarred the archive to in an ICUDIR environment variable. Note that your ICU download may have a relative path inside the tar archive, so you may need to use --strip-components=4 or manually reorganise the files to make sure ${ICUDIR}/include leads where you'd expect it to lead.

Step 2: Building iKnow

  1. Set the IKNOWPLAT environment variable to the target platform of your choice: e.g. "lnxubuntux64", "lnxrhx64" or "macx64"

  2. In the <repo_root> folder, run make all

On Docker

While primarily useful for build-testing convenience, we're also providing a Dockerfile that stuffs the code in a clean container with the required ICU libraries. If your Linux / Unix build doesn't seem to work, perhaps a quick look at this Dockerfile will help nail down where trouble starts.

Step 1: Building the container

  1. Optionally open the Dockerfile to change the ICU library version to use

  2. Use the docker build command to package things up:

docker build --tag iknow .

This will automatically download the ICU library of your choice and register its path for onward building.

Step 2: Building iKnow

  1. Start and step into the container using docker run:

docker run --rm -it iknow

The --rm flag will make sure the container gets dropped after you're done exploring.

  2. Inside the container, use make all to kick off the build:

cd /usr/src/iknow
make all

Contributing to iKnow

You are welcome to contribute to iKnow's engine code and language models. Check out the Wiki for more details on how they work and the Issues and Projects sections for any particular work on the horizon.
