Information Retrieval

This project includes the following features:

  • Indexing of Wikipedia data into Apache Lucene
  • Ability to search Wikipedia data using Apache Lucene
  • Distributed extraction of results using Apache Spark

Installation

  1. Clone this repository

  2. Pull the Docker image iisas/hadoop-spark-pig-hive and start a container with the following command:

docker run -it -p 12345:12345 -p 8088:8088 -p 8080:8080 -p 8042:8042 -p 8081:8081 -p 19888:19888 iisas/hadoop-spark-pig-hive:2.9.2 bash

  3. Once the container is running, run the Maven command to build and package the project:

mvn clean package

  4. Copy the jar file into the container:

docker cp target/information-retrival-project-1.0-SNAPSHOT.jar <container_id>:/information-retrival.jar

  5. Submit the jar file using spark-submit:

spark-submit --master local --executor-memory 4g --packages com.databricks:spark-xml_2.12:0.15.0 --class project.spark.SparkMain information-retrival.jar "<path-to-wikipedia-dump.xml>" "<path-to-output-directory>"

Here is an example of <path-to-wikipedia-dump.xml>:

file:////datasets/en-wiki-pages-articles.xml

Here is an example of <path-to-output-directory>:

/output

  6. Once the job has finished, copy the output directory to the host machine:

docker cp <container_id>:/<path-to-output-directory> <path-to-local-output-directory>

  7. Run the Main class to index the data into Apache Lucene, passing the index directory and the output directory as arguments:

.\results\index\ .\results\output\
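
The host-side steps above could be collected into a small helper script. The following is only a sketch under stated assumptions: the variable names (CONTAINER_ID, DUMP_PATH, OUTPUT_DIR, LOCAL_OUTPUT_DIR, DRY_RUN), the function names, and the default local path are hypothetical and do not come from the project. It also assumes the jar is referenced by the absolute path it was copied to. By default the script only prints the commands; nothing is executed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch of the extraction workflow; prints commands unless DRY_RUN=0.
set -eu

# Hypothetical settings; replace the placeholders with real values.
CONTAINER_ID="${CONTAINER_ID:-<container_id>}"
DUMP_PATH="${DUMP_PATH:-file:////datasets/en-wiki-pages-articles.xml}"
OUTPUT_DIR="${OUTPUT_DIR:-/output}"
LOCAL_OUTPUT_DIR="${LOCAL_OUTPUT_DIR:-./results/output}"
DRY_RUN="${DRY_RUN:-1}"

JAR="target/information-retrival-project-1.0-SNAPSHOT.jar"

# Print a command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Build the jar on the host and copy it into the running container.
build_and_copy() {
  run mvn clean package
  run docker cp "$JAR" "$CONTAINER_ID:/information-retrival.jar"
}

# Print the spark-submit command to paste into the container's shell.
spark_submit_cmd() {
  echo "spark-submit --master local --executor-memory 4g" \
    "--packages com.databricks:spark-xml_2.12:0.15.0" \
    "--class project.spark.SparkMain /information-retrival.jar" \
    "\"$DUMP_PATH\" \"$OUTPUT_DIR\""
}

# Copy the job output from the container back to the host.
fetch_output() {
  run docker cp "$CONTAINER_ID:$OUTPUT_DIR" "$LOCAL_OUTPUT_DIR"
}

# Dispatch on the first argument; with no argument, show usage.
case "${1:-help}" in
  build)  build_and_copy ;;
  submit) spark_submit_cmd ;;
  fetch)  fetch_output ;;
  *)      echo "usage: $0 {build|submit|fetch}" ;;
esac
```

Saved under a name such as run-extraction.sh (also hypothetical), the script would be called with build before the Spark job, submit to print the command for the container shell, and fetch once the job has finished.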

Usage

The Spark-Lucene Information Retrieval system is used from the command line. To search for a term, type it on the command line, and the system returns a list of relevant results.

Contribute

If you would like to contribute to this project, please submit a pull request.

Contributors

nathanaelbayle
