r0ller / alice

A Language Interpreter as semantiC Experiment in natural language processing

Shell 0.74% C++ 41.12% C 10.16% PLpgSQL 6.82% Makefile 0.40% Java 3.03% HTML 0.22% CMake 0.62% JavaScript 31.94% Perl 0.18% Yacc 4.77%
nlp human-interface jni-android-library c-plus-plus-library javascript-library classification tagging functors morphological-analysis syntactic-analysis

alice's People

Contributors: r0ller

alice's Issues

Change dependency-semantics-based interpretation to rewrite incomplete/incorrect sentences into syntactically correct ones?

It may be better if dependency-semantics-based interpretation goes in a direction where an incomplete/incorrect sentence is first reformulated to be syntactically correct and only then gets interpreted, mainly because functionality like tagging is actually prepared during syntactic analysis. Answering questions depends on tagging, so that needs to be investigated.

If a sentence is syntactically incorrect but can be interpreted based on dependency semantics, then, based on the structured syntactic error reports, we could loop over each error message of the syntactic interpretations and check which symbols the syntactic parser expects. The wrong symbol could then be replaced with an expected one and the syntactic interpretation retried with an NLG-generated sentence. The question is how many times this should be done for each error message, as after the symbol replacement the interpretation may fail again many times until a syntactically correct sentence is generated (if at all).
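
A minimal sketch of such a replace-and-retry loop, assuming hypothetical helpers (parse_with_errors(), expected_symbols(), replace_symbol(), nlg_generate() are illustrative names, not the actual framework API) and a retry cap, since nothing guarantees termination otherwise:

#include <cstddef>
#include <string>
#include <vector>

//All names below are illustrative assumptions, not the actual framework API.
struct syntax_error{std::size_t position;std::string wrong_symbol;};
std::vector<syntax_error> parse_with_errors(const std::string& sentence);
std::vector<std::string> expected_symbols(const syntax_error& error);
std::string replace_symbol(const std::string& sentence,const syntax_error& error,const std::string& expected);
std::string nlg_generate(const std::string& rewritten);

bool rewrite_and_retry(std::string sentence,unsigned max_retries){
    for(unsigned attempt=0;attempt<max_retries;++attempt){
        std::vector<syntax_error> errors=parse_with_errors(sentence);
        if(errors.empty()) return true;//syntactically correct now
        std::vector<std::string> expected=expected_symbols(errors[0]);
        if(expected.empty()) return false;
        //Replace the wrong symbol with an expected one and let nlg
        //generate a new sentence before retrying the interpretation.
        sentence=nlg_generate(replace_symbol(sentence,errors[0],expected[0]));
    }
    return false;//no syntactically correct sentence found within the cap
}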

replace google speech recognition with pocketsphinx (cmusphinx)

Currently no Hungarian language model for pocketsphinx could be found, but there is a hint according to which one can create a phonetic dictionary from a word list with espeak as a basis for training pocketsphinx:

cat word.list | espeak -v hu -x --sep=" " > phonetic.dictionary

Then train a language model with that dictionary using g2p-seq2seq.

"could not determine your language preference" error

This issue popped up earlier already in the form of a dump when dereferencing a null pointer, which I simply avoided by introducing a non-null check for when the system cannot determine the default language (how on earth?!). This happens on Android 8.0 and above. It turned out that the problem is this:

https://issuetracker.google.com/issues/73044965

So a new workaround needs to be introduced based on this:

https://stackoverflow.com/questions/48653654/sendorderedbroadcast-setpackage-requirement-in-oreo

Improve error reporting

Check whether the syntax error reporting function (https://www.gnu.org/software/bison/manual/html_node/Syntax-Error-Reporting-Function.html) can be used for structured error reporting.
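
For reference, a sketch of what that could look like (assuming Bison 3.6+ with %define parse.error custom; the context accessors are the ones documented on the linked manual page, while the fprintf calls merely stand in for building the structured report):

/* Sketch for the grammar file: extract the unexpected token and the
   tokens the parser would have accepted from the parser context. */
static int yyreport_syntax_error(const yypcontext_t *ctx)
{
    yysymbol_kind_t unexpected=yypcontext_token(ctx);
    if(unexpected!=YYSYMBOL_YYEMPTY)
        fprintf(stderr,"unexpected: %s\n",yysymbol_name(unexpected));
    /* Up to 10 expected tokens at this point of the parse. */
    yysymbol_kind_t expected[10];
    int n=yypcontext_expected_tokens(ctx,expected,10);
    if(n<0) return n; /* forward errors (e.g. memory exhaustion) */
    for(int i=0;i<n;++i)
        fprintf(stderr,"expected: %s\n",yysymbol_name(expected[i]));
    return 0;
}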

Semantic error reporting (especially for missing mandatory dependencies) needs to be developed as well.

These two would pave the way for a basic use of a natural language generation algorithm, so clients wouldn't have to hard-code error messages but could call an NLG API function to generate them based on the structured error report passed to it.

functor implementation bug in shell script

The sentences below compile fine, but there's a bug in the shell script implementation of one of the functors in them:

list empty or executable directories in directory abc that are not symlinked
list directory abc in directory def

this should fail with a syntax error as "are" is plural; a rule needs to be added to make it fail:
list directory abc that are in def

interpretation fails for one-word sentences

If there's a one-word sentence like:

  1. Send.
  2. Yes.

the interpreter fails even if an empty node is set up just to trigger semantic validation (as there must be two nodes to trigger combine_nodes). Setting up an empty bison rule in the framework was possible (though it requires coding the action manually) using the '%empty' string as the right-hand symbol, and even the semantic validation could be set up correctly (rule_2_rule_map, depolex).
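
A sketch of such a rule with made-up symbol names (as noted above, the %empty action has to be coded by hand):

/* Illustrative grammar fragment: even a one-word sentence yields two
   nodes this way, so combine_nodes can fire and semantic validation
   (rule_2_rule_map, depolex) gets triggered. */
sentence:
    word empty_node
    ;
empty_node:
    %empty { /* hand-coded action: set up an empty node */ }
    ;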

how to merge ML-inferred grammar and manual grammar rules

It needs to be figured out how to protect manually crafted grammar rules from being merged with inferred ones.

Example: grammar rules that process logical expressions are crafted carefully so that appropriate semantic dependencies can be built based on them. In order to protect them from being merged with ML-inferred rules, it should be possible to indicate that new rules based on the terminals of such protected rules must not be merged into the grammar.
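
A sketch of one way to do that, with an illustrative rule representation (none of these names exist in the code base): rules flagged as protected contribute their terminals to a blocklist, and inferred rules built on those terminals are skipped during the merge:

#include <set>
#include <string>
#include <vector>

//Illustrative only: 'rule' and the protected-terminal set are assumptions
//about how such a merge guard could look, not the actual data model.
struct rule{std::string lhs;std::vector<std::string> rhs;};

bool touches_protected(const rule& r,const std::set<std::string>& protected_terminals){
    for(const auto& symbol:r.rhs)
        if(protected_terminals.count(symbol)) return true;
    return false;
}

//Inferred rules built on the terminals of protected rules are skipped.
void merge(std::vector<rule>& grammar,const std::vector<rule>& inferred,
           const std::set<std::string>& protected_terminals){
    for(const auto& r:inferred)
        if(!touches_protected(r,protected_terminals))
            grammar.push_back(r);
}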

single line comments not allowed in functor definitions

Since new lines are not allowed in JSON structures, transgraph::apply_json_escapes() gets rid of them, which results in incorrect code if there are single-line comments (js: //, sh: #) in it, simply because the following code line gets commented out due to the missing new line. The workaround until the fix is to use block comments in the functor definitions.
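
For illustration, a made-up two-line shell functor body like

echo start # log the start
echo done

collapses into the single line 'echo start # log the start echo done' once the new line is stripped, so the second command is swallowed by the comment.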

drop support for interpreting graphs with more than one head

There is some code in interpreter::find_dependencies_for_functor() (the one for non-optional deps) that was meant to support graphs with more than one head. A minimal example for that is when nodes A and B are head (main) nodes and C is a dependency of both, but A and B are not related to each other at all. The course of development did not follow that idea, and now it poses a problem for nodes whose functor has more than one d_key. Let's extend the previous example to a diamond shape where A is the root, B and C are dependencies of A, but B is a dependency of A's functor with d_key 1 while C is a dependency of A's functor with d_key 2, and D is a dependency of both B and C. When this function processes the dependencies of D (if any), in the end it checks the node links of D and finds either B or C, depending on which direction is processed first. So find_dependencies_for_node() will be called for node D, registering B or C as its main node and connecting the nodes like A->B->D->C (or A->C->D->B), which is wrong since it should be either A->B->D or A->C->D.
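
The diamond from the example (edges point from head to dependency; the A-B edge belongs to A's functor with d_key 1, the A-C edge to d_key 2):

  A
 / \
B   C
 \ /
  D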

optional dependency path resolution conflicts when more than 1 path leads to one real dependency

As the subject indicates, when more than one optional dependency path is found that leads to the same real dependency node, there's a conflict in the resolution. To avoid that, a rule should be introduced that takes only the first optional dependency path found into account. So e.g. when the sentence "show location of abc" is evaluated, where "of" has two dependencies, a constant "CON" and "RESTAURANT", with the constant being the first dependency and the restaurant the second, the first optional path leading from "show" to "abc" will be taken into account: show(location(of(con(abc)))). So this has an effect on the design of dependencies in depolex as well.

semantic rule substeps need to be ordered according to their number of dependencies

Currently, in each step, one substep is expected to get executed if the step itself is not optional. If there are e.g. two substeps, and the first one has fewer dependencies than the other but can be validated in the current interpretation, then the interpretation stops checking further in is_valid_combination() after the successful validation of the first one. The workaround is to enter the substeps so that the substep with the largest number of dependencies comes first, followed by the rest in descending order.

sqlite db query results are slow in case of big selections

The title says it all. The reason is that the query_result->insert() method, which takes care of buffering field names and values belonging to the same row id, checks before inserting a new row whether it is unique, and the check itself is slow.
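
A possible fix (a sketch only; the member and parameter names are assumptions, not the actual query_result interface) is to back the uniqueness check with a hash set, turning the per-insert check into an amortized O(1) lookup:

#include <string>
#include <unordered_set>
#include <vector>

//Illustrative sketch of a buffered result with a constant-time
//duplicate check.
class query_result{
    std::vector<std::string> rows;//buffered rows
    std::unordered_set<std::string> seen;//keys of rows already buffered
public:
    void insert(const std::string& row_key,const std::string& row){
        if(!seen.insert(row_key).second) return;//duplicate row, skip
        rows.push_back(row);
    }
};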

tagging words for question words supports tagging only one of them

When tagging a word in indicative mood for later questions like:

sparser->add_feature_to_leaf(ENG_N_Sg,"N",std::string("qw_what"));

where the tag was set up like:

insert into FUNCTOR_TAGS values('FILEENGN', '1', 'qw_what', '1', 'qword', 'what');

the problem arises when there are two words, or even phrases, that can answer the same question word like 'what'. E.g. (an imperfect example):

Peter is a gentleman and a pirate at the same time.

Here 'gentleman' and 'pirate' would be tagged with the question word 'what', which is not really a problem as both can be an answer to the question:

What is Peter?

However, there are certainly cases when these tagged words collide or interfere. One solution could be to allow tags to have a technical suffix, e.g. the node id number, so tagging would look like:

std::string node_id=...; //get the node id
sparser->add_feature_to_leaf(ENG_N_Sg,"N",std::string("qw_what#"+node_id));

This requires the logic checking for allowed tags in FUNCTOR_TAGS to allow such suffixing, and the query logic also needs to be adjusted to be able to look for such suffixed tags.

To take it to a somewhat higher level: nodes (i.e. subtrees) could be made taggable as well, to be able to answer a question with a part of a previous sentence.

Environment variables are not transferred to forthcoming commands

This happens because each human command translated to a shell script gets executed in a child process. Environment variables set in the shell script are therefore specific to that child process environment, and there is no way for a child process to set environment variables in its parent process environment so that they would later be inherited by the forthcoming child processes.

The solution is that environment variables occurring in a shell script are read and stored in the parent process during translation using getenv() and putenv(). This way, the environment variables are inherited by the child processes and can be used as a communication channel between subsequent commands.
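
A minimal sketch of that idea, assuming a hypothetical helper that the translator calls for each NAME=value assignment it finds in the generated script (the helper name is an assumption; putenv() and strdup() are standard POSIX calls):

#include <cstdlib>
#include <cstring>
#include <string>

//Hypothetical helper: called by the translator for each NAME=value
//assignment found in the generated shell script, so that the variable
//lands in the parent environment and is inherited by later children.
void export_to_parent_env(const std::string& name,const std::string& value){
    std::string entry=name+"="+value;
    //putenv() keeps a pointer to the passed string, so it must stay
    //alive for the lifetime of the process; strdup() leaks on purpose.
    putenv(strdup(entry.c_str()));
}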

Introduce references

Currently, it's not possible to handle defining (restrictive) relative clauses, because it's not yet technically solved that a relative pronoun like 'that' or 'whose' can be set up as a dependency which refers back, e.g.: 'FILE,1,[...],THAT,1'. Such an entry now only sets up a dependency between 'file' and 'that', but this is not enough, as the attributes that 'file' can have won't be found through 'that' unless a reference is created between the two, so that the word 'that' gets replaced/substituted by 'file'. If that's solved, sentences like 'list files that are (bigger than 20MB)' or 'list files whose (name begins with x)' could be handled. It must be kept in mind that the reference implementation shall be such that it can also be used to refer to a part of the context. Furthermore, when the sentence gets interpreted (but not yet compiled to another language) and each functor/d_key is put into a sequence that shows in which order they should be applied to their arguments, the reference must be apparent.

Applying basic logic

As a first thought, the currently produced call hierarchy is already formal enough, but quantifiers are not captured correctly as they appear as arguments. So either the functor implementations produce a logical form that can be evaluated, or the functors themselves implement the logic. However, for the latter to work, e.g. 'if a then b' requires that a() and b() return bool, so each functor needs to be implemented accordingly, while in the former case they may be generated if the produced call hierarchy is made available to the functors.
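
For illustration, a sketch of the latter option with hypothetical functor signatures: 'if a then b' implemented as material implication over two bool-returning functors:

#include <functional>
#include <iostream>

//Illustrative only: if every functor returns bool, 'if a then b' can be
//implemented directly as material implication.
bool if_then(const std::function<bool()>& a,const std::function<bool()>& b){
    return !a()||b();//b() is only evaluated when a() holds
}

int main(){
    auto a=[]{return true;};//hypothetical condition functor
    auto b=[]{return true;};//hypothetical consequence functor
    std::cout<<if_then(a,b)<<std::endl;//prints 1
}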

the algorithm handling syntactically wrong sentences is greedy

It is greedy in the sense that it tries to collect arguments for the parameters of the current node of the parse tree for as long as it can. That means that if a parameter of a functor can take more than one argument (like an array), the algorithm will try to satisfy that, which may exhaust the possibility of fulfilling the requirements of other parameters. E.g. the sentence:

"üzenem xyznek hogy helló"

results in an interpretation where "xyznek" is analysed as a constant. However, as the algorithm tries to collect arguments for the parameters of the main verb, it collects both "xyznek" and "helló" as if they were first and last names (if the functor is set up that way). So no argument is left for "hogy" and thus the expression is not satisfied. One solution could be to try to satisfy the minimal requirements of all functors involved.

semantic checks when setting nodes?

The problem of not being able to introduce a semantic check arises when e.g. one would like to disallow using the singular form of a noun in a certain context but allow it in another one, as in the following sentences:

  1. show products
  2. show product x
  3. show product details for x
  4. show product -> this one should not work

Syntactically this cannot be grasped, as forcing all verb + singular noun combinations to have a constant does not make sense, being way too broad. Semantically it can't be controlled either: when the combination that 2. and 4. have in common ("show product") is built, the phrases that would distinguish them are not yet present. Sure, a workaround could be found by restructuring the grammar, but this already shows that there are certain cases when semantic checks need to be carried out for rules that just set one node and do not create a combination of two.

The solution may be that when set_node_info() is called, the rule_to_rule_map needs to be checked for entries having a matching parent_symbol and head_root_symbol but a NULL non_head_root_symbol, and then either combine_nodes() is adapted so that it can be called from set_node_info() without passing in right_node to carry out the semantic checks, or a whole new logic is written (see the sketch below).
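
A sketch of the first option; all signatures below are assumptions for illustration, the real set_node_info()/combine_nodes() interfaces may differ:

#include <string>

//Stand-in declarations, not the actual framework API.
struct node{std::string root_symbol;};
bool rule_to_rule_map_has_entry(const std::string& parent_symbol,
                                const std::string& head_root_symbol,
                                const char* non_head_root_symbol);
void combine_nodes(const std::string& parent_symbol,node* left_node,node* right_node);

void set_node_info(const std::string& parent_symbol,node* head_node){
    //Entries with a NULL non_head_root_symbol stand for rules that set
    //a single node instead of combining two.
    if(rule_to_rule_map_has_entry(parent_symbol,head_node->root_symbol,nullptr))
        //combine_nodes() adapted to tolerate a missing right node and
        //to carry out only the semantic checks in that case.
        combine_nodes(parent_symbol,head_node,nullptr);
}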

hardcoded package name in android sqlite db handler

The jni_db class in jni_db.cpp has a member variable cl_dbhelper that stores a reference to the Java database handler class, which is looked up by the Android package name. When the Java database handler class is used in another Android package, the system won't find it.

only negative branching is available in depolex and rule_to_rule_map

It's currently not possible to set up dependencies or semantic rules in a way that branches on success. In the case of semantic rules this means, e.g., that it is not possible to restrict the interpretation of the sentence "list files in directories", which currently evaluates to true. To be able to restrict the semantics so that whenever a plural noun appears in a PP it has to refer to some constants, as in "list files in directories abc def", positive branching should be made possible for substeps. This implies introducing a new field in rule_to_rule_map: instead of having only 'substep' for failures, there would be 'failure_substep' and 'success_substep'. Similarly, in depolex, instead of d_failover alone, another field is needed for success, e.g. 'd_success'.

desktop segfault [old TODO, may not happen any more]

Find out why printing ENG_Con.expression.lexeme leads to a segfault in the case of "list directory abc" but not in the case of "list abc" at the rule ENG_1Con->ENG_Con, when it's accessed in the action like:
std::cout<<"ENG_1Con->ENG_Con:"<<ENG_Con.expression.lexeme<<std::endl;

no language preference can be determined on android 8.0 (oreo)

For whatever reason, the system event handler does not get back the data about the user's language preference, and when the data is blindly extracted from the result (pointing to null), a NullPointerException is raised. This does not seem to be a problem on 6.0 or 7.0 systems.

dependent functor/d_key pair counted as found for each functor as many times as it is found for different functors of the same node

To make it clear with an example: if IN1 is a dependency of both LIST1 and LIST2 and it's found for both functors (once for LIST1 and once for LIST2) during the longest matching semantic rule calculation, then it's accumulated and shows up in is_longest_match_for_semantic_rules_found() as having been found twice! Currently, a fix takes care of avoiding that for the functor that is the argument of the first call of the recursive method find_dependencies_for_node(), but that logic should be extended to all levels of recursion. Fixing that has the corollary that the algorithm calculating the longest match needs to be adjusted as well.

add possibility to create a full-featured node for an empty terminal

Issue #37 got fixed, which proved that an empty terminal can be handled, but if one wanted to create a node for a concealed word, there's currently no possibility for it.

Example:

Yes.

Let's say the surface form 'yes' means 'yes, send' confirming a previous intent.

In this case, it could be useful to create a node substituting 'send' with a general verb like 'do' to get a verb into the sentence. So a parse tree would be created for 'yes, do' for the sentence 'yes'. The depolex entry for 'do' could then look like do(yes), and just as in the case of 'no', it could be do(no). In the end, the functor of 'do' could be used to confirm or deny a previous intent.

It seems that this also requires context handling, as verb phrase ellipsis can only be resolved by finding the verb missing from the currently processed sentence in another sentence of the context.

The steps are probably as follows:

  1. creating an analysis for a sentence without a verb by means of an empty terminal
  2. context handling
  3. finding the missing verb in the context and putting it in the place in the sentence where the empty terminal indicates it. This is simply about reconstructing the sentence, with the appropriate verb, as a string.

The problem with 3. is that finding a verb together with its prefix is already pretty difficult (it requires that the prefix is recorded as a feature of the verb, not as a standalone category with a stem). But even when found, it's complicated to put a verb with a prefix in the right place to create a correct sentence; without a prefix it's somewhat easier due to 1. However, constructing a correct sentence may be unnecessary, since the constructed sentence is either unambiguous or ambiguous: if it's unambiguous, the algorithm already developed for interpreting syntactically incorrect sentences will fit the bill; if it's ambiguous, that algorithm will fail and the user can be asked for clarification. All in all, it may be enough to find the verb with its prefix, concatenate them with the verbless sentence, and just let the algorithm interpret the constructed, syntactically incorrect sentence.

Generate dependencies from ML-inferred grammar

Good question whether this is possible at all. Options:

  1. As a first thought, a set of rules describing which grammatical relation (function) dominates/governs which other grammatical relation(s) is necessary. Such information may be extracted from the question tests that are used when analysing a sentence to find out what modifies what, e.g.:
    "Peter is buying a black coat."
    "What is Peter buying?" -> "a black coat" is a dependency of "buying"
    "Who is buying a black coat?" -> "Peter" is a dependency of "buying"
    "What is the coat like?" -> "black" is a dependency of "coat"
    Generalising these to rules like 'the object in sentences with the same structure is an argument of the predicate' (or 'an adjective phrase is an argument of the object') could produce the set of rules to guide the logic of a new tool that generates the dependencies to be stored in depolex. The problem is that such analyses must be carried out manually for each sentence structure found in the ML corpus.
  2. Looking up existing dependency graphs for the language in question and converting them to depolex form.
  3. Making use of https://universaldependencies.org (https://universaldependencies.org/u/dep/).
  4. An intermediate solution may be to generate dependencies generalised by grammatical categories (noun, adjective, etc.), which seems easier. That would at least allow interpreting far more sentences than manually crafted parse trees do, although it would only be sufficient to parse sentences, not to act upon them (e.g. imperatives), as the words belonging to the same grammatical category would share the functor of that category's dependency. Nevertheless, for tagging constituents/phrases in indicative sentences that is sufficient, as such functors could add c_values to the analyses_deps, which also means that asking questions and information retrieval would work as well. Afterthought: generating the dependencies this way would reflect the grammar 1:1. It may work, though, when combined with 1., but then 1. in itself may be better.

semantic rules for constants?

A semantic rule for (ENG_VBAR2->ENG_VBAR1 + ENG_PP) that checks for a noun in the head (main) subtree and a preposition in the non-head (dependent) subtree is e.g. good for the sentence 'list files in directory abc' but not for 'list abc in def'. The problem is that a failover rule would be necessary that checks for a constant ('abc' in the head subtree) and a preposition in the non-head subtree. However, constants should not be customized with a signature/interface in depolex. Instead, the grammatical category they are later transformed into (e.g. N as in noun) should somehow be looked up, and not the grammatical category 'constant' which they carry in their leaf node.
