uncc2014watsonsim's Introduction

Quick Intro

Watsonsim works using a pipeline of operations on questions, candidate answers, and their supporting passages. In many ways it is similar to IBM's Watson and Petr's YodaQA, and much less similar to logic-based systems such as OpenCog or Wolfram Alpha. Still, it differs significantly even from Watson and YodaQA:

  • We don't use a standard UIMA pipeline, which is a product of our student-project history. Sometimes this is a hindrance but typically it has little impact. We suspect it reduces the learning overhead and boilerplate code.
  • Unlike YodaQA, we target Jeopardy! questions, but we do incorporate YodaQA's method of Lexical Answer Type (LAT) checking in addition to our own.
  • Our framework is rather heavyweight in terms of computation. Depending on which modules are enabled, it can take between about 1 second and 2 minutes to answer a question. Indri improves accuracy; it is now an optional feature, but one we highly recommend. (We are investigating alternatives as well.)
  • We include (relatively) large amounts of preprocessed article text from Wikipedia as our inputs. Be prepared to use about 100GB of space if you want to try it out at its full power.

Installing the Simulator

  • Use git to clone this repository, as in: git clone https://github.com/SeanTater/uncc2014watsonsim.git
  • Install Java 8, either:
  • Install the libSVM machine learning library (native)
  • Download Gradle (just unzip it; keep in mind it updates very often)
  • Download the latest data and place them in the data/ directory
  • Copy the configuration file config.properties.sample to config.properties and customize to your liking
  • Run gradle eclipse -Ptarget in uncc2014watsonsim/ to download platform-independent dependencies and create an Eclipse project.
  • Possibly enable some Optional Features

Running the Simulator

We recommend running the simulator with Gradle:

gradle run -Ptarget=WatsonSim

But, if you prefer, you can also use Eclipse. First create a project.

gradle eclipse -Ptarget

Then you can run WatsonSim.java directly.

There are a few other features as well:

# Generate statistics reports for accuracy and other measurements
gradle run -Ptarget=scripts.ParallelStats
# Regenerate the Indri, Lucene, SemanticVectors, Bigram and Edge indices
gradle run -Ptarget=index.Reindex

Technologies Involved

This list isn't exhaustive, but it should be a good overview:

  • Search
    • Text search from Lucene and Indri (Terrier upcoming)
    • Web search from Bing (Google is in the works)
    • Relational queries using PostgreSQL and SQLite
    • Linked data queries using Jena
  • Sources
    • Text from all the articles in Wikipedia, Simple Wikipedia, Wiktionary, and Wikiquote
    • Linked data from DBpedia, used for LAT detection
    • Wikipedia pageviews organized by article
    • Source, target, and label from all links in Wikipedia
  • Machine learning with Weka and libSVM
  • Text parsing and dependency generation from CoreNLP and OpenNLP
  • Parsing logic in Prolog (with TuProlog)

Notes:

  • You should probably consider using PostgreSQL rather than SQLite if you scale this project to more than a few cores, or to any distributed environment. The project should support both database engines nicely.
  • The data is sizable and growing, especially for statistics reports; 154.5 GB as of the time of this writing.
  • Can't find libindri-jni? Make sure you enabled Java and SWIG and had the right dependencies when compiling Indri.

Giving Back

Do you like this project? Then help make it better! We can use all kinds of help, whether you're a scientist, an engineer, or just a curious user!

Also, you may be interested to read (or to cite!) our paper:

@TechReport{GallagherTR2014,
author = {Gallagher, Sean and Zadrozny, Wlodek W. and Shalaby, Walid and Avadhani, Adarsh},
title = {Watsonsim: Overview of a Question Answering Engine},
institution = {University of North Carolina at Charlotte},
month = {December},
year = {2014},
}

uncc2014watsonsim's People

Contributors

adaava, bhavnasiyeshvant, csteph16, dhaval257, ipate258, jvujjini, kenoverholt, pavan27, rahulpedduri, seantater, thestephenstanton, unimpossible, varshadevadas, walid-shalaby, wlodz

uncc2014watsonsim's Issues

Add charts to the performance logs

I'd like to see:

  • rank by time
  • [top/questions] by time
  • [top3/questions] by time

Right now, it doesn't have a response threshold, but in the future:

  • [top/questions] by response_threshold
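
As a rough illustration of the quantities to chart, the sketch below computes the fraction of questions whose correct answer lands in the top 1 or top 3. The class name, the method name, and the convention of 0-based ranks with -1 for "correct answer not found" are all assumptions made for this example, not part of the existing code.

```java
class RankStats {
    /**
     * Fraction of questions whose correct answer appears within the first k ranks.
     * Assumption: ranks are 0-based, and -1 means the correct answer was never found.
     */
    static double topK(int[] correctRanks, int k) {
        if (correctRanks.length == 0) {
            return 0.0;
        }
        int hits = 0;
        for (int rank : correctRanks) {
            if (rank >= 0 && rank < k) {
                hits++;
            }
        }
        return (double) hits / correctRanks.length;
    }

    public static void main(String[] args) {
        int[] ranks = {0, 2, -1, 5}; // hypothetical results for four questions
        System.out.println("top/questions  = " + topK(ranks, 1));
        System.out.println("top3/questions = " + topK(ranks, 3));
    }
}
```

Logging one such pair of values per run, together with a timestamp, would give the "by time" series listed above.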

Generate Deployable Artefact

We want users to be able to just download and run some version of the program. How much do we have to do for that to work?

5 questions not 100 in GenerateSearchResultDataset

It only retrieves 5 questions at a time, not 100! See uncc2014watsonsim/src/main/java/uncc2014watsonsim/sources/GenerateSearchResultDataset.java line 51.
Also make sure the solution does not impact uncc2014watsonsim/src/main/java/uncc2014watsonsim/DBQuestionSource.java line 52.

Elementary Passage Retrieval

We should be able to take the full text of a ResultSet and generate a List of Passages, where each Passage is roughly 4 sentences long and contains at least one of the words from the question.
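
One possible shape for this, sketched on plain strings rather than the project's ResultSet and Passage classes (whose fields are not shown here): split the text into sentences with the JDK's BreakIterator, group them into fixed-size windows, and keep the windows that share at least one word with the question. The class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PassageWindows {
    /** Split text into trimmed sentences using the JDK sentence BreakIterator. */
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    /** Lowercased set of the words in a string. */
    static Set<String> words(String text) {
        Set<String> out = new HashSet<>(
                Arrays.asList(text.toLowerCase(Locale.ENGLISH).split("\\W+")));
        out.remove(""); // split() can yield a leading empty token
        return out;
    }

    /** Windows of up to windowSize sentences that share a word with the question. */
    static List<String> passages(String fullText, String question, int windowSize) {
        Set<String> questionWords = words(question);
        List<String> sents = sentences(fullText);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < sents.size(); i += windowSize) {
            int end = Math.min(i + windowSize, sents.size());
            String window = String.join(" ", sents.subList(i, end));
            Set<String> overlap = words(window);
            overlap.retainAll(questionWords);
            if (!overlap.isEmpty()) {
                out.add(window);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Paris is in France. Berlin is in Germany.";
        System.out.println(passages(text, "What is the capital of Germany?", 4));
    }
}
```

Because this follows the "at least one word" rule literally, common words such as "is" or "the" also count as matches; a stop-word list would probably be needed before the filter becomes selective.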

Query Generation: Indri

Search Queries should be made that target Indri specifically, in order to weed out the question itself and to improve the rank of the correct answers.

Improve comments

This will require many steps, but it desperately needs to be done.
Let's start with:

  • [ ]: Answer
  • [ ]: Passage
  • [ ]: Question
  • [ ]: Scorer
  • [ ]: PassageScorer
  • [ ]: AnswerScorer
  • [ ]: Learner

Improve ML retraining automation

Walid has already written a lot of this code, but WekaLearner, WekaTee, and the scorer.* classes are not working as one team. We need to integrate them.

multiple files with same name but different case

Checkout of branches on my Windows machine seems to be having problems because multiple files with the same name but different case have been checked in from a non-Windows machine. Here's the error:

error: The following untracked working tree files would be overwritten by checkout:
src/main/java/uncc2014watsonsim/uima/types/queryString.java
src/main/java/uncc2014watsonsim/uima/types/queryString_Type.java
src/main/java/uncc2014watsonsim/uima/types/searchResult.java
src/main/java/uncc2014watsonsim/uima/types/searchResult_Type.java
Please move or remove them before you can switch branches.

Will someone with a Linux machine please try to resolve this? Classes should start with a capital letter. I had an issue with Searcher.java as well. It is not showing up now but it might still be a problem.

Examine using a database for J! questions and stored results

First we used CSV, which failed. XML was never finished. JSON works, but keeping up with the versions is real work. (There are at least 8 now.) I think we can do better, and merge the results with the stored query results by using a SQLite database. We can also put it on GitHub.

Finish scorerIrene

Right now, scorerIrene does not compile, but it should. That way it can be used for the next ML model.

Query Generation: Lucene

We should target Lucene for weeding out the original question and for improving correct answer rank as well.

Investigate Wikipedia Redirects

I tried it already, but I probably didn't give it enough of a chance. It would do better if it knew which answers were redirect-generated.

Generated URL Based Score

"Score" may be misleading. Create some numerical interpretation where distinct values are assigned to each of the top level domains, .edu .com .net .org .gov and whatever others appear meaningful.

Any thought of other possible scores would be appreciated and should be added as new issues.
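
A minimal sketch of the top-level-domain idea follows. The numeric values are placeholders chosen only to be distinct, and the class name, the method name, and the choice of 0.0 for unknown or unparseable URLs are assumptions for illustration.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class TldScore {
    // Placeholder values: the only requirement is that each TLD gets a distinct number.
    static final Map<String, Double> VALUES = new HashMap<>();
    static {
        VALUES.put("edu", 1.0);
        VALUES.put("gov", 2.0);
        VALUES.put("org", 3.0);
        VALUES.put("com", 4.0);
        VALUES.put("net", 5.0);
    }

    /** Numeric value for the URL's top-level domain, or 0.0 if unknown or unparseable. */
    static double score(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return 0.0; // not a syntactically valid URI
        }
        if (host == null) {
            return 0.0; // e.g. a relative reference with no host part
        }
        String tld = host.substring(host.lastIndexOf('.') + 1).toLowerCase(Locale.ROOT);
        return VALUES.getOrDefault(tld, 0.0);
    }

    public static void main(String[] args) {
        System.out.println(score("http://www.uncc.edu/page"));
        System.out.println(score("https://example.com/"));
    }
}
```

Since the values carry no natural order, a learner that treats them as a single numeric feature may read meaning into their magnitude; one indicator feature per domain is a common alternative.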

Invalid casts in NGram scorer

NGram throws casting errors and won't run for any passages. Probably it should not be casting an ArrayList to a Set. Maybe think about using one or the other the whole way.
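
The scorer's actual code is not shown here, so the following is only a sketch of the general fix: an ArrayList can never be cast to a Set (the cast fails at runtime with a ClassCastException), but it can be copied into one. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NGramSets {
    /** All contiguous n-grams of the token list, joined by single spaces. */
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    /** Number of distinct n-grams that the two token lists share. */
    static int commonNGrams(List<String> a, List<String> b, int n) {
        // Copy each list into a HashSet instead of casting the list to Set.
        Set<String> shared = new HashSet<>(ngrams(a, n));
        shared.retainAll(new HashSet<>(ngrams(b, n)));
        return shared.size();
    }

    public static void main(String[] args) {
        List<String> question = Arrays.asList("the", "quick", "brown", "fox");
        List<String> passage = Arrays.asList("a", "quick", "brown", "dog");
        System.out.println(commonNGrams(question, passage, 2));
    }
}
```

Keeping the n-grams in a Set from the start would avoid the copy, at the cost of losing duplicate counts and ordering.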

Consider making Correct scorer numeric

This would prevent Logistic Regression from being a viable method. But it may allow gaining much more out of the data we have. Some answers may be wrong but much closer.

Sort the FITB rank

The FITB answers are not sorted by rank so the highest ranked item isn't always returned first.
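
A small sketch of the intended ordering, assuming each answer carries a numeric score where higher means better. The Answer class below is a stand-in defined only for this example, not the project's own class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class RankedAnswers {
    /** Stand-in for the project's answer type (hypothetical fields). */
    static class Answer {
        final String text;
        final double score;

        Answer(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    /** Returns a new list sorted so that the highest-scoring answer comes first. */
    static List<Answer> sortByScore(List<Answer> answers) {
        List<Answer> sorted = new ArrayList<>(answers);
        sorted.sort(Comparator.comparingDouble((Answer a) -> a.score).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<Answer> answers = Arrays.asList(
                new Answer("low", 0.1), new Answer("high", 0.9), new Answer("mid", 0.5));
        for (Answer a : sortByScore(answers)) {
            System.out.println(a.text + " " + a.score);
        }
    }
}
```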

Remove Wikipedia "Category:" and "List of"'s

Right now, tons of results are Wikipedia "Category:" and "List of" articles. Those are terrible answers. We should probably consider removing those articles from the source rather than spending time indexing, querying, and later removing them.
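
A title-based filter along these lines could run before indexing. This is a sketch that assumes article titles are available as plain strings at that stage; the class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class TitleFilter {
    /** True for Wikipedia category pages and "List of ..." articles. */
    static boolean isListOrCategory(String title) {
        String t = title.trim();
        return t.startsWith("Category:") || t.startsWith("List of ");
    }

    /** Keeps only the titles that are neither categories nor lists. */
    static List<String> keepArticles(List<String> titles) {
        return titles.stream()
                .filter(t -> !isListOrCategory(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "Category:Rivers of Europe", "List of rivers of Europe", "Danube");
        System.out.println(keepArticles(titles));
    }
}
```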

Added some enhancements for Query Translation. Getting SQL Errors working with the code

Rahul wrote code for Query Translation, which works correctly in IndriGUI. I integrated that part into the Indri and Lucene Researcher, but I am getting an SQL error. This is what the error looks like:

Enter the jeopardy text:
Who is the Current President of United States
This is a FACTOID Question
java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such table: documents)
at org.sqlite.DB.newSQLException(DB.java:383)
at org.sqlite.DB.newSQLException(DB.java:387)
at org.sqlite.DB.throwex(DB.java:374)
at org.sqlite.NativeDB.prepare(Native Method)
at org.sqlite.DB.prepare(DB.java:123)
at org.sqlite.PrepStmt.<init>(PrepStmt.java:42)
at org.sqlite.Conn.prepareStatement(Conn.java:404)
at org.sqlite.Conn.prepareStatement(Conn.java:399)
at org.sqlite.Conn.prepareStatement(Conn.java:383)
at uncc2014watsonsim.SQLiteDB.prep(SQLiteDB.java:43)
at uncc2014watsonsim.search.Searcher.fillFromSources(Searcher.java:55)
at uncc2014watsonsim.search.LuceneSearcher.runQuery(LuceneSearcher.java:80)
at uncc2014watsonsim.WatsonSim.main(WatsonSim.java:50)
Exception in thread "main" java.lang.RuntimeException: Can't prepare an SQL statement "select title, text from documents where docno=?;"
at uncc2014watsonsim.SQLiteDB.prep(SQLiteDB.java:46)
at uncc2014watsonsim.search.Searcher.fillFromSources(Searcher.java:55)
at uncc2014watsonsim.search.LuceneSearcher.runQuery(LuceneSearcher.java:80)
at uncc2014watsonsim.WatsonSim.main(WatsonSim.java:50)

Question Analysis: Classification

We need to be able to distinguish between types of questions.

  • Detect and apply tags to Question()'s, changing Question as necessary
  • Change the statistics integration test to include accuracy where the question type is factoid, and where not

Wikiquotes

Add Wikiquotes to the pipeline. I have the TREC-formatted and indexed copies of the files.

I just need to put them in the right places.

create a baseline DB

Create a baseline DB with results and passages, along with the search engines' ranks and scores.
