
geo-question-parser's Introduction

Installation (Miniconda on Windows)

Download and extract the parser source code into a folder of your choice, for example “D:/qparserSrc”.

Follow the steps below to set up the Python environment.

  1. Install the 64-bit version of Miniconda from https://repo.anaconda.com/miniconda/.

  2. Open a new Anaconda Prompt window. Create a new conda environment at the path “D:/condaEnv/qparserSOA”. This creates the conda environment inside the folder “D:/condaEnv/qparserSOA”.

conda create -p D:/condaEnv/qparserSOA python=3.9.7
  3. Activate the new environment:
conda activate D:/condaEnv/qparserSOA
  4. Move to the environment folder “D:/condaEnv/qparserSOA”. Packages from GitHub will be installed here.
d:
cd D:\condaEnv\qparserSOA
  5. Install the spaCy package from conda-forge:
conda install -c conda-forge spacy
  6. Install the spaCy trained pipeline. If the installation throws an error, try executing the command again.
python -m spacy download en_core_web_sm
  7. Install the remaining packages from conda-forge:
conda install -c conda-forge antlr4-python3-runtime=4.9.3 word2number pyzmq 
  8. If needed, install the checklist package:
pip install checklist
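
Optionally, you can verify the environment with a quick import check. This is only a sanity check, assuming all of the packages above were installed into the active environment:
python -c "import antlr4, zmq, word2number, spacy; spacy.load('en_core_web_sm'); print('Environment OK')"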

Test running the parser

  1. In the anaconda prompt, make sure the “D:/condaEnv/qparserSOA” environment is activated.
  2. Move to the folder “D:/qparserSrc/test”:
cd D:\qparserSrc\test
  3. Execute the "Test.py" script:
python Test.py
  4. The output of the script should look like the following:
============================== TESTING retrieval dataset
============================== TESTING GeoAnQu dataset
============================== TESTING MS student dataset

Process finished with exit code 0

Running the parser as a service (SOA)

  1. Make sure “D:/condaEnv/qparserSOA” is activated in the anaconda prompt.
  2. In the anaconda prompt, navigate to the folder “D:/qparserSrc”. This folder should contain the batch file “runAsyncWorker.bat”.
  3. Run the batch file by typing
runAsyncWorker.bat
  4. If the server starts correctly, you should see messages like the following:
error loading _jsonnet (this is expected on Windows), treating config.json as plain json
INFO: Bound broker frontend to 'tcp://127.0.0.1:5570' in method 'QparserBroker.run'
INFO: Bound broker backend to an inter-process port 'inproc://backend' in method 'QparserBroker.run'
INFO: Started worker '0' on a inter-process socket 'inproc://backend' in method 'QparserWorker.run'
INFO: Started worker '1' on a inter-process socket 'inproc://backend' in method 'QparserWorker.run'
INFO: Started worker '2' on a inter-process socket 'inproc://backend' in method 'QparserWorker.run'
INFO: Started the poller in the broker in method 'QparserBroker.run'

To stop the server, press “Ctrl+C” in the Anaconda Prompt and enter ‘Y’ when asked “Terminate batch job (Y/N)?”.

Two parameters can be changed in the batch file “runAsyncWorker.bat”. “FRONT_PORT” sets the port the server binds to. “INST_COUNT” sets the number of concurrent worker threads, i.e. the number of requests the server can handle simultaneously without them blocking each other. For example, if “INST_COUNT” is set to 1, only one request is processed at a time and all other incoming requests are queued until the current request is handled. “INST_COUNT” can be any integer greater than 0.

Running the test client for the parser service

  1. Make sure the parser server is up and running.
  2. Open a new Anaconda prompt window.
  3. In the newly opened prompt window, activate the environment “D:/condaEnv/qparserSOA”.
  4. Navigate to the folder “D:/qparserSrc”.
  5. Execute the "asyncClient.py" script:
python asyncClient.py
  6. The expected output should look like the following:
Starting the client
Setting the client poller
Sending a request to the remove service
Waiting for a reply ...
Client received a reply: {"question": "What is the shortest path through my workplace, a gym and a supermarket 
from my home in Amsterdam", "placename": ["Amsterdam"], "replaceQ": "What is shortest path through workplace , 
gym and supermarket from home extent", "network": ["shortest path"], "object": ["supermarket", "workplace", 
"home", "gym"], "ner_Question": "what is network0 through object1 , object3 and object0 from object2 extent", 
"parseTreeStr": "(start what is (measure (networkC network 0) through (destination (objectC object 1) (objectC 
object 3) and (objectC object 0)) from (origin (objectC object 2))) (extent extent))", "cctrans": {"types": 
[{"type": "object", "id": "0", "keyword": "home", "cct": "R(Obj,_)"}, {"type": "object", "id": "1", "keyword": 
"supermarket", "cct": "R(Obj,_)"}, {"type": "object", "id": "2", "keyword": "gym", "cct": "R(Obj,_)"}, {"type": 
"object", "id": "3", "keyword": "workplace", "cct": "R(Obj,_)"}, {"type": "network", "id": "4", "keyword": 
"shortest path", "cct": "R(Obj*Obj,Reg)"}, {"type": "object", "id": "5", "keyword": "Amsterdam", "cct": 
"R(Obj,_)"}], "extent": ["5"], "transformations": [{"before": ["1", "2", "3", "0"], "after": ["4"]}]}, "valid": 
"T", "query": {"afterId": "4", "after": "R(Obj*Obj,Reg)", "before": ["R(Obj,_)", "R(Obj,_)", "R(Obj,_)", 
"R(Obj,_)"]}, "queryEx": {"after": {"id": "4", "cct": "R(Obj*Obj,Reg)"}, "before": [{"after": {"id": "1", "cct":
 "R(Obj,_)"}}, {"after": {"id": "2", "cct": "R(Obj,_)"}}, {"after": {"id": "3", "cct": "R(Obj,_)"}}, {"after": 
 {"id": "0", "cct": "R(Obj,_)"}}]}}

Process finished with exit code 0
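
For reference, a minimal hand-written client along the lines of asyncClient.py could look as follows. This is only a sketch: it assumes the broker frontend (tcp://127.0.0.1:5570 by default, see “FRONT_PORT”) accepts a plain JSON string on a ZeroMQ DEALER socket and replies with a JSON string; consult asyncClient.py for the actual message framing.

import json
import zmq

# Sketch of a synchronous client for the parser service; asyncClient.py is the
# authoritative example and may frame messages differently.
context = zmq.Context()
socket = context.socket(zmq.DEALER)
socket.connect("tcp://127.0.0.1:5570")
request = {
    "question": "What is the shortest path through my workplace, a gym and a supermarket from my home in Amsterdam",
    "placename": ["Amsterdam"],
    "replaceQ": "What is the shortest path through my workplace, a gym and a supermarket from my home extent"
}
socket.send_string(json.dumps(request))
print(socket.recv_string())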

Usage

from QuestionParser import *
from TypesToQueryConverter import *

qBlock = '{"question": "What is the  shortest path through my workplace,' + \
            ' a gym and a supermarket from my home in Amsterdam","placename":' + \
            '["Amsterdam"],"replaceQ": "What is the  shortest path through my' + \
            ' workplace, a gym and a supermarket from my home extent"}'

# identify types and transformations steps
parser = QuestionParser()
qParsed = parser.parseQuestionBlock(qBlock)
# annotate types with cct expressions and generate a query
cctAnnotator = TQConverter()
cctAnnotator.cctToExpandedQuery(qParsed, True, True)

Structure

See the Grammar folder for details of the functional grammar (GeoAnQu.g4) and the Dictionary folder for the concept dictionary.

License

CC BY-NC-ND 4.0

geo-question-parser's People

Contributors

e-nyamsuren, haiqixu, nsbgn

geo-question-parser's Issues

Extract functional roles from question formulator output

As discussed in #1, the issues of constructing a question and of extracting functional roles should be separated. For this, we need to figure out how to replace or adapt the ANTLR parser for the recognition of functional roles.

Issue #5 discusses changing the output of blocks to simplify this step. The issue you're reading now is about taking that output and actually producing the functional roles/transformations.

Ideally, the information needed to both show question blocks and extract functional roles from their output would be declared in a single grammar file. This is ideal because it would mean that phrases and their functional roles are kept in a single place; and that the procedural code to generate the blocks and transformation extraction can be kept separate from the declarative code for the grammar. This would allow those who edit the grammar to focus on the important bits. It may or may not be possible; I will need a better understanding of the blocks & parser.

Connect question elements to CCT operators

Eventually, it is my understanding that the operators of the cct language should inform the queries --- not just the CCT types. For this, semantic markers in the questions should be connected to CCT operators.

Set up JavaScript tooling

Dependencies: #2

At the moment, plain JavaScript is used with a copy of blockly to create the interface. This was fine for local testing, but if we are to maintain a version of this that is exposed to the outside world, we need to streamline the process (reasons can be reviewed on Blockly's 'get started' page).

The JavaScript tooling ecosystem is fragmented: it's easy to be overwhelmed by the plethora of package managers, build tools, module bundlers, etcetera. I have opted for NPM as our package manager and Parcel as our bundler --- this struck me as easiest. For now, I don't think we need additional build tools. However, I'm open for persuasion towards Gulp, Grunt, Yarn, Webpack, Rollup, etcetera. Additionally, we will use TypeScript to protect our sanity.

Note that you can install npm via conda-forge. I mention this for the benefit of Windows-using colleagues: since we use Conda elsewhere anyway, it's an easy way to install and keep track of the development environment.
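
For example, the conda-forge package that provides npm is nodejs (assuming the same conda setup as in the installation instructions above):
conda install -c conda-forge nodejs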

Eliminate grammar parser/blockly interface overlap

The pressing issues with this part of the pipeline are with robustness, scalability and testing. For the final product, we need a lot of simplifications. To organize and document the development, I will be tracking that in the issue tracker here.

Currently, if I understand correctly, the procedure can be roughly sketched as follows. I will edit as I go along; please comment if I am mistaken.

  1. The question is cleaned. nltk is used to detect and clean adjectives like 'nearest', so that the important nouns can be isolated and recognized in subsequent steps.
  2. Important words in the questions are annotated.
    1. Recognize concepts, amounts and proportions via a pre-defined concept dictionary.
    2. Recognize place names via ELMO-based named entity recognition (NER) from allennlp.
    3. Recognize times, quantities and dates via NER from spaCy (a sketch of this call follows the list).
  3. Extract functional roles based on syntactic structures and connective words, via a grammar implemented in ANTLR. This yields a parse tree.
  4. Convert parse trees into transformations between concept types.
    1. Find input and output concept types by matching questions to categories that are associated with concept transformation models.
    2. The order of concept types is derived from the function of each phrase in which they occur: subcondition is calculated before the condition, etcetera. A table is generated that calculates the order for each functional part, which is then itself combined in a rule-based way (see Algorithm 1 in the paper).
  5. Transform concept types into cct types via manually constructed rules based on the concepts/extents/transformations that were found in previous steps.
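
As an illustration of the spaCy part of step 2 (the sketch referenced above), the library call is roughly the following. This only sketches the spaCy usage; the parser's actual annotation code also involves the concept dictionary and the allennlp-based place-name recognition, and the example question is made up for illustration.

import spacy

# Sketch only: spaCy NER as used in step 2.3 for times, quantities and dates.
nlp = spacy.load("en_core_web_sm")
doc = nlp("What is the shortest path from my home to a supermarket within 2 km in Amsterdam")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. '2 km' -> QUANTITY, 'Amsterdam' -> GPE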

The issue is that this is rather fragile; it depends (among other things) on:

  • All concepts and entities being annotated properly.
  • Having a complete rule set for converting concept types into CCT types.

We have chosen blockly to constrain the natural language at the user end, in such a way that the questions that may be presented to the parser are questions that the parser can handle. However, this only formats the question to reduce the problems of an otherwise unchanged natural language processing pipeline. As discussed in the meeting and elsewhere:

  1. Given that we already know the type of entity when constructing a query via blockly instead of freeform text, we will no longer need named entity recognition or question cleaning. This would strip out the nltk, spaCy, and allennlp packages, tremendously simplifying the process.
  2. To guarantee robustness, the visual blocks need to be in perfect accordance with the parser. For this, they should be automatically constructed from one common source of truth.
  3. In fact, given that the blockly-constructed query can output something different from what's written on the blocks, we might even forgo the natural language parser completely, in favour of JSON output at the blockly level (or another format that is easily parsed). This would eliminate even the ANTLR parser, further reducing complexity. The downside is that we would no longer be able to parse freeform text (though that would be impacted by the removal of named entity recognition anyway). We could describe this with JSON Schema to really pin it down (a sketch follows this list).
  4. To make sure that no regressions are introduced, we should have expected output for every step (that is, not just expected output from the whole pipeline).
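
As a purely illustrative sketch of point 3 above, the blockly output could be validated against a JSON Schema before any further processing. The field names below are taken from the current question block format (question, placename, replaceQ); the eventual blockly output may of course look different.

import json
import jsonschema  # the "jsonschema" package from PyPI/conda-forge

# Hypothetical schema; the real structure of the blockly output is still to be decided.
schema = {
    "type": "object",
    "required": ["question", "placename", "replaceQ"],
    "properties": {
        "question": {"type": "string"},
        "placename": {"type": "array", "items": {"type": "string"}},
        "replaceQ": {"type": "string"},
    },
}

block_output = json.loads('{"question": "...", "placename": ["Amsterdam"], "replaceQ": "..."}')
jsonschema.validate(instance=block_output, schema=schema)  # raises ValidationError if invalid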

This would make this repository not so much a geo-question-parser as a geo-question-formulator. This is good, because the code right now is very complex and closely fitted to the specific questions in the original corpus, which isn't acceptable in a situation where users can pose their own questions.

Note: If we simplify to this extent, it might be nice to use rdflib.js to output a transformation graph directly, but that is for later.

The process would thus become:

  1. In blockly, construct JSON that represents a question.
  2. Convert that question into transformations between concept types.
    1. Find input and output concept types by matching questions to transformation categories.
    2. Find concept type ordering.
  3. Transform concept types into cct types via rules.

I'm not sure to what extent we can still simplify step 2. Depending on how much code would be left, it would be nice to port/rewrite it in JavaScript, alongside blockly, so that we can visualize most things client-side with minimal moving parts.

Structure blocks according to grammar

We claim to be able to extract a lot of information from a question. However, a block whose text field can be left empty, or whose text field occurs on its own without contextual information to constrain it, or multiple variants of a block that carry only syntactic differences, all indicate to me that we haven't pinned down exactly what information is contained in a block, and that we're handwaving away the extraction of that information by pointing to the ANTLR parser.

This is a problem because the parser is hard to verify and test systematically, since it is much less constrained than the blocks.

Also, hiding blocks makes it hard for the user to understand the space of possibilities. We can disable blocks, but I don't think we should hide them. Of course, this is more feasible when the set of blocks is smaller.

That's why I think we should systematize the set of blocks a bit more. This would also help with issue #6.

Merge Blockly interface with this repository

A Blockly interface is used to constrain natural language to a form that can be handled by Haiqi's grammar. There is presently an overlap between the two, which is (one of the...) reasons that maintenance is hard. More information at #1.

For this reason, the interface should be built from a common source, and thus the copy of the interface at https://github.com/HaiqiXu/haiqixu.github.io or https://github.com/quangis/quangis-web should be grafted into this repository.

Determine and produce structured question formulator output

As mentioned in #1, we need to get rid of the overlap between the grammar and the Blockly interface.

For this, the blocks should generate not another semi-natural language question that needs to pass through the whole parsing pipeline again, but rather, a structure from which functional roles are derived directly.

For example, the machine learning libraries that were used to recognize entities should be removed in favour of constraining the blocks in such a way that we already know the entity types. This will speed up the process and simplify the environment.

What other information should be conveyed by the block's output to be able to recognize functional roles? That should inform the details of the structure.

Once we have a better idea of what that structure should look like, we can generate it from blocks; see this page for more information.

Set up testing

Dependencies: #3

At the moment, there are no automated tests for this part of the pipeline. Any improvements made to one part might cause a regression elsewhere (see also #1).

We will need tests at these levels:

  1. A tool that builds blocks from a natural language question, to test whether all corpus questions can still be built as blocks.
  2. Tests to check whether functional roles are correctly extracted from blocks.
  3. Tests to check whether transformation graphs with core concepts are correctly generated from functional roles.
  4. Tests to check whether transformation graphs with CCT types are correctly identified from transformation graphs with core concepts.

I'm not yet familiar with JavaScript's testing ecosystem. Updates will be tracked here.

Dynamic blocks for irrelevant grammatical variants

This is a low priority issue that ties into issue #7. We can use mutators to make dynamic blocks. We could use this to avoid users having to explicitly choose irrelevant grammatical variants.

We could automatically adapt 'fewer than' to 'less than', or automatically add connectives like 'and/or' when stacking relationships.
