Git Product home page Git Product logo

forager's Introduction

Forager

Build Status

Forager is a proof-of-concept search engine in Clojure. I am writing it for my NLP independent study's final project. The goal is to get a nuts-and-bolts understanding of language modeling (n-grams and tokenizing), indexing with term frequency-inverse document frequency (tf-idf), and retrieving documents with Boolean queries.

Features

  • interactive interface via a Clojure REPL
  • indexing of documents via tf-idf
  • Boolean query operators (AND, OR, NOT)
    • NEAR or WITHIN a certain distance (low priority)
  • methods to evaluate information retrieval precision and recall

Optional Features

If time allows, these features would be nice to have:

  • compressed indexing
  • NEAR or WITHIN a certain distance operators

Data

Initially, Forager will work on the plain-text short stories of Rider Haggard, as the Coursera NLP course provides the data and a few sample queries to evaluate Forager on.

If time allows, implementing support for one of the following data sets:

Interface

I haven't put much thought towards this yet, but I imagine interaction will be something like this, returning the document identifier and an excerpt in some kind of structure:

(index "path/to/reut2-000.sgm")

(query "butter")
;; => FROM `EC MINISTERS CONSIDER BIG AGRICULTURE PRICE CUTS'
;; => "Routine sales of BUTTER were made."

(query (AND "butter" "cereal"))
;; => FROM `EC MINISTERS CONSIDER BIG AGRICULTURE PRICE CUTS'
;; => "...16 mln tonnes of unwanted CEREALS, over one mln tonnes of BUTTER..."

Background

Indexing

In small-scale information retrieval problems, one can create a matrix of documents and term frequencies. However, this requires a lot of memory to construct a matrix that's the #documents by all keywords. An alternative, is to create an inverted index, which is a dictionary of keywords, where each keyword points to a sorted list of document IDs.

WITHIN Operator

One way to implement a k-word proximity search is with a divide-and-conquer approach, as described by this article. The algorithm proposed in that article is:

  1. Find the median v between the two keywords
  2. Scan the list of keyword positions and divide the list into two smaller lists, L (positions < v) and R (positions > v). Keep the largest positions of each keyword in L and the smallest positions of each keyword in R.
  3. Find the minimal intervals which lie on both L and R with the plane-sweep algorithm (this scans from left-to-right and finds intervals [left-start, right-end] containing all keywords).
  4. If L or R contains all k kewords, recursively find minimal intervals in that list.

References

Evaluation

  • Performance on Coursera data set
  • Performance on test queries

Author

Jon-Michael Deldin, [email protected].

forager's People

Contributors

jmdeldin avatar

Stargazers

Bruno do Nascimento Maciel avatar Cora Sutton avatar Ulan Sametov avatar

Watchers

 avatar aliraza avatar James Cloos avatar  avatar

Forkers

abhilater

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.