seantater / uncc2014watsonsim


Open-domain question answering system from UNC Charlotte

Home Page: http://blog.watsonphd.com

License: GNU General Public License v2.0

Shell 0.80% Python 12.55% PLSQL 0.54% Java 83.25% Prolog 0.13% Scala 0.52% JavaScript 0.61% HTML 1.09% CSS 0.52%

uncc2014watsonsim's Introduction

Watsonsim Question Answering System

Quick Intro

Watsonsim works using a pipeline of operations on questions, candidate answers, and their supporting passages. In many ways it is similar to IBM's Watson and Petr's YodaQA, and less similar to logic-based systems like OpenCog or Wolfram Alpha. Even so, it differs significantly from Watson and YodaQA:

We don't use a standard UIMA pipeline, which is a product of our student-project history. Sometimes this is a hindrance, but typically it has little impact, and we suspect it reduces the learning overhead and boilerplate code.
Unlike YodaQA, we target Jeopardy! questions, but we do incorporate YodaQA's method of Lexical Answer Type (LAT) checking in addition to our own.
Our framework is rather heavyweight in terms of computation. Depending on which modules are enabled, it can take between about 1 second and 2 minutes to answer a question. Indri improves accuracy; it is now optional, but we highly recommend it. (We are investigating alternatives as well.)
We include (relatively) large amounts of preprocessed article text from Wikipedia as our inputs. Be prepared to use about 100 GB of space if you want to try it out at its full power.

Installing the Simulator

Use git to clone this repository, as in: git clone https://github.com/SeanTater/uncc2014watsonsim.git
Install Java 8, either:
- Bundled with Eclipse
- or on Ubuntu utopic+: sudo apt-get install openjdk-8-jdk
- or on Fedora 20+: yum install java-1.8.0-openjdk
- or on Windows, Mac, all others
Install the libSVM machine learning library (native)
- For Ubuntu and Fedora: install libsvm-java
- otherwise, for Windows follow some instructions
Download Gradle (just unzip it; keep in mind it updates very often)
Download the latest data and place them in the data/ directory
Copy the configuration file config.properties.sample to config.properties and customize to your liking
Run gradle eclipse -Ptarget in uncc2014watsonsim/ to download platform-independent dependencies and create an Eclipse project.
Possibly enable some Optional Features

Running the Simulator

We recommend running the simulator with Gradle:

gradle run -Ptarget=WatsonSim

But, if you prefer, you can also use Eclipse. First create a project.

gradle eclipse -Ptarget

Then you can run WatsonSim.java directly.

There are a few other features as well:

# Generate statistics reports for accuracy and other measurements
gradle run -Ptarget=scripts.ParallelStats
# Regenerate the Indri, Lucene, SemanticVectors, Bigram and Edge indices
gradle run -Ptarget=index.Reindex

Technologies Involved

This list isn't exhaustive, but it should be a good overview.

Search
- Text search from Lucene and Indri (Terrier upcoming)
- Web search from Bing (Google is in the works)
- Relational queries using PostgreSQL and SQLite
- Linked data queries using Jena
Sources
- Text from all the articles in Wikipedia, Simple Wikipedia, Wiktionary, and Wikiquote
- Linked data from DBpedia, used for LAT detection
- Wikipedia pageviews organized by article
- Source, target, and label from all links in Wikipedia
Machine learning with Weka and libSVM
Text parsing and dependency generation from CoreNLP and OpenNLP
Parsing logic in Prolog (with TuProlog)

Notes:

You should probably consider using PostgreSQL if you scale this project to more than a few cores or to any distributed environment. The project should support both SQLite and PostgreSQL nicely.
The data is sizable and growing, especially for statistics reports; 154.5 GB as of the time of this writing.
Can't find libindri-jni? Make sure you enabled Java and SWIG and had the right dependencies when compiling Indri.

Giving Back

Do you like this project? Then help make it better! We can use all kinds of help, whether you're a scientist, an engineer, or just a curious user!

Also, you may be interested to read (or to cite!) our paper:

@TechReport{GallagherTR2014,
author = {Gallagher, Sean and Zadrozny, Wlodek W. and Shalaby, Walid and Avadhani, Adarsh},
title = {Watsonsim: Overview of a Question Answering Engine},
institution = {University of North Carolina at Charlotte},
month = {December},
year = {2014},
}

uncc2014watsonsim's Issues

Add charts to the performance logs

I'd like to see:

rank by time
[top/questions] by time
[top3/questions] by time

Right now, it doesn't have a response threshold, but in the future:

[top/questions] by response_threshold

Need to run statistics on FITB and mixed questions

Machine Learning Prebuilt Logistic Regression Scorer

We need Walid to create a new model for Google for this.

Generate Deployable Artefact

We want users to be able to just download and run some version of the program. How much do we have to do for that to work?

automatically generate arff files for ML

We need to generate ARFF files for ML automatically from the baseline database of results and passages, and add to them the output of whatever scorers are there.
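
As a rough illustration of the target format, here is a minimal sketch that builds ARFF text by hand, without depending on Weka. The class name, the feature names, and the idea of a single true/false class column named "correct" are assumptions made for this example, not the project's actual schema.

```java
import java.util.Arrays;
import java.util.List;

final class ArffWriter {
    /**
     * Build ARFF text with numeric feature columns and a nominal {true,false}
     * class column. Feature names are assumed to contain no spaces or commas.
     */
    static String toArff(String relation, List<String> featureNames,
                         List<double[]> rows, List<Boolean> correct) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (String name : featureNames) {
            sb.append("@attribute ").append(name).append(" numeric\n");
        }
        sb.append("@attribute correct {true,false}\n\n@data\n");
        for (int i = 0; i < rows.size(); i++) {
            for (double v : rows.get(i)) {
                // ARFF marks missing values with "?"; NaN is treated as missing here.
                sb.append(Double.isNaN(v) ? "?" : Double.toString(v)).append(',');
            }
            sb.append(correct.get(i)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical feature names and values, for illustration only.
        System.out.print(toArff("answers",
            Arrays.asList("lucene_rank", "indri_score"),
            Arrays.asList(new double[] {1.0, 0.5}, new double[] {2.0, Double.NaN}),
            Arrays.asList(true, false)));
    }
}
```

A scorer that has no value for an answer (for example, a missing Google or Bing score) can pass NaN, which this sketch writes as a missing value rather than a number.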

Setup Google Query Queueing

Optimally, this should be done using the ruby server and gradle. It needs to be simple.

5 questions not 100 in GenerateSearchResultDataset

It only retrieves 5 questions at a time, not 100! See uncc2014watsonsim/src/main/java/uncc2014watsonsim/sources/GenerateSearchResultDataset.java line 51.
Also make sure the solution will not impact uncc2014watsonsim/src/main/java/uncc2014watsonsim/DBQuestionSource.java line 52.

Elementary Passage Retrieval

We should be able to take the full text of a ResultSet and generate a List<Passage>, where each Passage is maybe 4 sentences long and has at least one of the words from the question.
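
One possible shape for this, as a sketch only: split the text into sentences with the JDK's BreakIterator, group them into fixed-size windows, and keep the windows that share at least one word with the question. The class and method names are hypothetical, plain strings stand in for the project's ResultSet and Passage types, and no stopword filtering is done.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class PassageSplitter {
    /** Lowercased word tokens of a text (letters and digits only). */
    static Set<String> words(String text) {
        Set<String> out = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) out.add(w);
        }
        return out;
    }

    /** Split text into sentences using the JDK's BreakIterator. */
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) out.add(s);
        }
        return out;
    }

    /** Group sentences into windows; keep windows that share a word with the question. */
    static List<String> passages(String fullText, String question, int sentencesPerPassage) {
        Set<String> questionWords = words(question);
        List<String> sents = sentences(fullText);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < sents.size(); i += sentencesPerPassage) {
            int end = Math.min(i + sentencesPerPassage, sents.size());
            String passage = String.join(" ", sents.subList(i, end));
            Set<String> overlap = words(passage);
            overlap.retainAll(questionWords);
            if (!overlap.isEmpty()) out.add(passage);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Paris is in France. It is large. Rome is in Italy. It is old.";
        System.out.println(passages(text, "Rome", 2));
    }
}
```

Because common words such as "is" would match almost every window, a real version would probably want to drop stopwords from the question before comparing.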

Query Generation: Indri

Search Queries should be made that target Indri specifically, in order to weed out the question itself and to improve the rank of the correct answers.

Improve comments

This will require many steps, but it desperately needs to be done.
Let's start with:

[ ]: Answer
[ ]: Passage
[ ]: Question
[ ]: Scorer
[ ]: PassageScorer
[ ]: AnswerScorer
[ ]: Learner

Retrain with ScorerAda

Remember to retrain the ML models.

Improve ML retraining automation

Walid has already written a lot of this code, but WekaLearner, WekaTee, and the scorer.* classes are not working as one team. We need to integrate them.

Model data for Weka

This sample data is available for everyone to use in Weka if you like

This one is smaller
https://dl.dropboxusercontent.com/u/92563044/100factoid_10_bing.arff

And this is larger
https://dl.dropboxusercontent.com/u/92563044/1000factoid_04.arff

resolve problem with mixing of results and passages scores

multiple files with same name but different case

Checkout of branches on my Windows machine seems to be having problems because multiple files with the same name but different case have been checked in from a non-Windows machine. Here's the error:

error: The following untracked working tree files would be overwritten by checkout:
src/main/java/uncc2014watsonsim/uima/types/queryString.java
src/main/java/uncc2014watsonsim/uima/types/queryString_Type.java
src/main/java/uncc2014watsonsim/uima/types/searchResult.java
src/main/java/uncc2014watsonsim/uima/types/searchResult_Type.java
Please move or remove them before you can switch branches.

Will someone with a Linux machine please try to resolve this? Classes should start with a capital letter. I had an issue with Searcher.java as well. It is not showing up now but it might still be a problem.

Examine using a database for J! questions and stored results

First we used CSV, which failed. XML was never finished. JSON functions but keeping up with the versions is real work. (There are at least 8 now.) I think we can do better, and merge the results with the stored query results by using a SQLite database. We can also put it on github.

GenerateSearchResultDataset gives exception after changing scores into enum

Trying to access scores through get(String s) gives an exception, since scores are now defined as an enum. We must also account for the absence of Google/Bing scores.

PercentFilteredWordsInCommon always returns NaN
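
The cause is not stated here. One common way a percentage scorer ends up at NaN is a 0/0 floating-point division, for example when filtering removes every word so that the denominator is always zero. The sketch below only demonstrates that arithmetic and a guard; the class and method names are hypothetical and this is not the scorer's actual code.

```java
final class SafeRatio {
    /** Unguarded ratio: with total == 0 and common == 0 this is 0.0 / 0.0, which is NaN. */
    static double naive(int common, int total) {
        return (double) common / total;
    }

    /** Guarded ratio: an empty denominator yields 0.0 instead of NaN. */
    static double guarded(int common, int total) {
        return total == 0 ? 0.0 : (double) common / total;
    }

    public static void main(String[] args) {
        System.out.println(naive(0, 0));
        System.out.println(guarded(0, 0));
    }
}
```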

Determine why WekaLearner throws ArrayIndexOutOfBoundsException for learners other than Logistic

Add Bing as a passage searcher

Finish scorerIrene

Right now, scorerIrene does not compile, but it should. That way it can be used for the next ML model.

Generate new documentation

We need something made this month, even if it is not perfect.

Query Generation: Lucene

We should target Lucene for weeding out the original question and for improving correct answer rank as well.

Query Generation: Google

Decide what should be done about Google in order to improve correct answer rankings

Investigate Wikipedia Redirects

I tried it already but I probably didn't give it enough of a chance. It would do better if it knew which answers were redirect-generated.

Generated URL Based Score

"Score" may be misleading. Create some numerical interpretation where distinct values are assigned to each of the top level domains, .edu .com .net .org .gov and whatever others appear meaningful.

Any thought of other possible scores would be appreciated and should be added as new issues.
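
A minimal sketch of the URL idea, assuming the URL of each result is available as a string. The class name and the specific numbers are arbitrary placeholders chosen for this example; they are distinct ids, not learned weights.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class TldFeature {
    // Hypothetical values: distinct ids per top-level domain, in no meaningful order.
    static final Map<String, Integer> TLD_VALUES = new HashMap<>();
    static final int OTHER = 0;
    static {
        TLD_VALUES.put("edu", 1);
        TLD_VALUES.put("gov", 2);
        TLD_VALUES.put("org", 3);
        TLD_VALUES.put("com", 4);
        TLD_VALUES.put("net", 5);
    }

    /** Map a URL to the id of its top-level domain, or OTHER if unknown or unparsable. */
    static int tldValue(String url) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) return OTHER;
            String tld = host.substring(host.lastIndexOf('.') + 1).toLowerCase(Locale.ROOT);
            return TLD_VALUES.getOrDefault(tld, OTHER);
        } catch (IllegalArgumentException e) {
            return OTHER; // malformed URL
        }
    }

    public static void main(String[] args) {
        System.out.println(tldValue("http://www.uncc.edu/about"));
    }
}
```

Since the ids carry no natural ordering, it may make more sense to hand this to the learner as a nominal attribute than as a number.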

Separate Factoid stats from FITB and other categories

This is mostly a concern with watson-server.

Invalid casts in NGram scorer

NGram throws casting errors and won't run for any passages. Probably it should not be casting an ArrayList to a Set. Maybe think about using one or the other the whole way.
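
For reference, a cast such as (Set<String>) someArrayList compiles but throws ClassCastException at run time, because ArrayList does not implement Set. A sketch of the "use one or the other the whole way" approach, with hypothetical class and method names, copies the list into a HashSet instead of casting:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class NGramSets {
    /** All word n-grams of the given size, joined by single spaces. */
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    /** Count the distinct n-grams that two token lists share. */
    static int commonNGrams(List<String> a, List<String> b, int n) {
        // new HashSet<>(list) converts the list; it does not cast it.
        Set<String> shared = new HashSet<>(ngrams(a, n));
        shared.retainAll(new HashSet<>(ngrams(b, n)));
        return shared.size();
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("the", "quick", "brown", "fox");
        List<String> b = Arrays.asList("a", "quick", "brown", "dog");
        System.out.println(commonNGrams(a, b, 2));
    }
}
```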

Investigate Stanford NLP and possible faster NLP parsers

Some are 100 fold faster, and some 1000 fold! That could be a very worthwhile speedup!

It does come at a cost, but any NLP is probably better than none.

Consider making Correct scorer numeric

This would prevent Logistic Regression from being a viable method. But it may allow gaining much more out of the data we have. Some answers may be wrong but much closer.

Sort the FITB rank

The FITB answers are not sorted by rank so the highest ranked item isn't always returned first.
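
A sketch of the fix, assuming each answer carries a single combined score where higher is better. The Scored class is a hypothetical stand-in for the project's answer type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class RankedAnswers {
    /** Hypothetical stand-in for an answer: candidate text plus its combined score. */
    static final class Scored {
        final String text;
        final double score;

        Scored(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    /** Return a copy ordered best-first (highest score at index 0). */
    static List<Scored> bestFirst(List<Scored> answers) {
        List<Scored> sorted = new ArrayList<>(answers);
        sorted.sort(Comparator.comparingDouble((Scored a) -> a.score).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<Scored> answers = Arrays.asList(
            new Scored("a", 0.2), new Scored("b", 0.9), new Scored("c", 0.5));
        System.out.println(bestFirst(answers).get(0).text);
    }
}
```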

query translation (prune results with ":" & "list" in titles)

Modify the queries to Lucene and Indri to prune irrelevant results, e.g., results that contain ":" or "list" in the wiki document title.

Remove Wikipedia "Category:" and "List of"'s

Right now, tons of results are Wikipedia "Category:" and "List of" articles. Those are terrible answers. We should probably consider getting rid of those articles from the source rather than spending time indexing, querying, and later removing them.
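
Whether the pruning happens at query time or before indexing, the title test itself could look something like the sketch below. The class and method names are hypothetical, and the rule is deliberately narrow: it only matches titles that start with "Category:" or "List of ".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

final class TitleFilter {
    /** True for titles that start with "Category:" or "List of " (case-insensitive). */
    static boolean isUnwanted(String title) {
        String t = title.trim().toLowerCase(Locale.ROOT);
        return t.startsWith("category:") || t.startsWith("list of ");
    }

    /** Keep only the titles that are not flagged as unwanted. */
    static List<String> keepUseful(List<String> titles) {
        return titles.stream().filter(t -> !isUnwanted(t)).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(keepUseful(Arrays.asList("Category:Rivers", "List of rivers", "Nile")));
    }
}
```

A broader rule, such as dropping any title that contains ":", would also discard legitimate article titles that happen to include a colon, so it is worth checking against real titles before widening the test.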

Trim Mediawiki data before indexing, rather than after

Modify Lucene and Indri scores to be 1-based rather than 0-based

Question in Passage Scorer should return a range (not just 0 or 1)
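
One way to get a graded value, sketched under the assumption that the scorer currently reports only whether the question text appears in the passage: return the fraction of distinct question words that also occur in the passage. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class QuestionOverlap {
    /** Lowercased word tokens of a text (letters and digits only). */
    static Set<String> words(String text) {
        Set<String> out = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) out.add(w);
        }
        return out;
    }

    /** Fraction of distinct question words that also occur in the passage, in [0, 1]. */
    static double score(String question, String passage) {
        Set<String> q = words(question);
        if (q.isEmpty()) return 0.0; // nothing to match against
        Set<String> found = new HashSet<>(q);
        found.retainAll(words(passage));
        return (double) found.size() / q.size();
    }

    public static void main(String[] args) {
        System.out.println(score("red fox", "the fox ran"));
    }
}
```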

Retrain with 1000 Bing questions

It's already in the works but there are still steps left to complete.

Change CachingScorer to detect DB presence

This would make it a lot easier on people. Even better if it made the file but that may be too much.

integrate question results scoring using logistic regression

Added some enhancements for Query Translation; getting SQL errors when working with the code

Rahul wrote code for Query Translation, which works correctly in IndriGUI. I integrated that part into the Indri and Lucene Researcher, but I am getting an SQL error. This is what the error looks like:

Enter the jeopardy text:
Who is the Current President of United States
This is a FACTOID Question
java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such table: documents)
at org.sqlite.DB.newSQLException(DB.java:383)
at org.sqlite.DB.newSQLException(DB.java:387)
at org.sqlite.DB.throwex(DB.java:374)
at org.sqlite.NativeDB.prepare(Native Method)
at org.sqlite.DB.prepare(DB.java:123)
at org.sqlite.PrepStmt.<init>(PrepStmt.java:42)
at org.sqlite.Conn.prepareStatement(Conn.java:404)
at org.sqlite.Conn.prepareStatement(Conn.java:399)
at org.sqlite.Conn.prepareStatement(Conn.java:383)
at uncc2014watsonsim.SQLiteDB.prep(SQLiteDB.java:43)
at uncc2014watsonsim.search.Searcher.fillFromSources(Searcher.java:55)
at uncc2014watsonsim.search.LuceneSearcher.runQuery(LuceneSearcher.java:80)
at uncc2014watsonsim.WatsonSim.main(WatsonSim.java:50)
Exception in thread "main" java.lang.RuntimeException: Can't prepare an SQL statement "select title, text from documents where docno=?;"
at uncc2014watsonsim.SQLiteDB.prep(SQLiteDB.java:46)
at uncc2014watsonsim.search.Searcher.fillFromSources(Searcher.java:55)
at uncc2014watsonsim.search.LuceneSearcher.runQuery(LuceneSearcher.java:80)
at uncc2014watsonsim.WatsonSim.main(WatsonSim.java:50)

Detect and apply tags to Question()'s, changing Question as necessary

Change the statistics integration test to include accuracy where the question type is factoid, and where it is not