This project includes the following features:
- Indexing of Wikipedia data into Apache Lucene
- Ability to search Wikipedia data using Apache Lucene
- Distributed extraction of results using Apache Spark
- Clone this repository
- Pull the Docker image iisas/hadoop-spark-pig-hive and run it using the following command:
docker run -it -p 12345:12345 -p 8088:8088 -p 8080:8080 -p 8042:8042 -p 8081:8081 -p 19888:19888 iisas/hadoop-spark-pig-hive:2.9.2 bash
- Once the container is running, run the following Maven command to build and package the project:
mvn clean package
- Copy the jar file into the running container:
docker cp target/information-retrival-project-1.0-SNAPSHOT.jar <container_id>:/information-retrival.jar
- Submit the jar file using spark-submit:
spark-submit --master local --executor-memory 4g --packages com.databricks:spark-xml_2.12:0.15.0 --class project.spark.SparkMain information-retrival.jar "<path-to-wikipedia-dump.xml>" "<path-to-output-directory>"
Here is an example of <path-to-wikipedia-dump.xml>:
file:////datasets/en-wiki-pages-articles.xml
Here is an example of <path-to-output-directory>:
/output
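For reference, here is a minimal sketch of what a driver class like project.spark.SparkMain might do with the spark-xml package: each <page> element of the dump becomes one row, and selected fields are written out for later indexing. The class name SparkMainSketch, the selected columns, and the JSON output format are assumptions for illustration, not the project's actual schema.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkMainSketch {
    public static void main(String[] args) {
        String inputPath = args[0];   // the Wikipedia dump XML
        String outputPath = args[1];  // directory for extracted records

        // The master is supplied by spark-submit (--master local).
        SparkSession spark = SparkSession.builder()
                .appName("wiki-extract")
                .getOrCreate();

        // Each <page> element of the dump becomes one row.
        Dataset<Row> pages = spark.read()
                .format("com.databricks.spark.xml")
                .option("rowTag", "page")
                .load(inputPath);

        // Keep only the fields needed for indexing (assumed column names).
        pages.selectExpr("title", "revision.text as text")
                .write()
                .json(outputPath);

        spark.stop();
    }
}
```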
- Once the job is finished, copy the output directory to the host machine:
docker cp <container_id>:/<path-to-output-directory> <path-to-local-output-directory>
- Run the Main class to index the data into Apache Lucene, passing the index directory and the Spark output directory as arguments, e.g.:
.\results\index\ .\results\output\
The Spark-Lucene Information Retrieval system can be accessed via the command line. To search for a term, simply type the term in the command line, and the system will return a list of relevant results.
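The index-then-search flow can be sketched as follows. This is an illustrative, self-contained example using an in-memory directory and hypothetical field names ("title", "text"); the project's actual Main class, field names, and on-disk index layout may differ.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSearchSketch {

    // Index a couple of hypothetical extracted pages into an in-memory directory.
    static Directory buildIndex() throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            writer.addDocument(page("Apache Lucene", "Lucene is a full-text search library."));
            writer.addDocument(page("Apache Spark", "Spark is a distributed data processing engine."));
        }
        return dir;
    }

    static Document page(String title, String text) {
        Document doc = new Document();
        doc.add(new StringField("title", title, Field.Store.YES));
        doc.add(new TextField("text", text, Field.Store.YES));
        return doc;
    }

    // Parse the user's term against the "text" field and print ranked results.
    static long search(Directory dir, String term) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("text", new StandardAnalyzer()).parse(term);
            TopDocs hits = searcher.search(query, 10);
            for (ScoreDoc sd : hits.scoreDocs) {
                Document doc = searcher.doc(sd.doc);
                System.out.println(doc.get("title") + "  (score=" + sd.score + ")");
            }
            return hits.totalHits.value;
        }
    }

    public static void main(String[] args) throws Exception {
        search(buildIndex(), args.length > 0 ? args[0] : "search");
    }
}
```

In the real project, the indexing step would read the Spark output directory and the search loop would read terms from the command line, but the Lucene calls follow the same pattern.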
If you would like to contribute to this project, please submit a pull request.