oodt-solr
---------

The installation, configuration and execution of this project are divided into five basic steps:

1. Installing and configuring the tika-parser and running it to generate JSON files to be posted to Solr.
2. Configuring Solr.
3. Installing and configuring OODT and the custom-workflow, which ingests the input JSON files and sends them to Solr for indexing.
4. Running launch.py to post the data to Solr using the OODT Workflow Manager and Crawler.
5. Compiling and executing the queries App to get the answers to the assignment questions.


1a. Steps for Installing and Configuring tika-parser:
-----------------------------------------------------
Compile and package using the command:
cd src/tika-parser
mvn package

Run the tika-parser App with the following command (assuming you are in the tika-parser directory):
java -cp "target/tika-parser-1.0-SNAPSHOT.jar:./target/lib/*" com.parse.tika.App /path/to/input/tsv/dir 1

Command Explanation
java -cp $CLASSPATH$ com.parse.tika.App [input folder] [deduplication switch]

[input folder] - String: Path to the folder containing all the TSV files
[deduplication switch] - int: Selector to choose whether to execute the crawler with or without deduplication. 0 => without, 1 => with.


2a. Configuring Solr:
---------------------
Copy the schema.xml file from src/files-to-copy/solr to the conf directory of your Solr installation.
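For example, assuming a stock single-core Solr layout where the active core is collection1 and $SOLR_HOME points at your Solr home directory (adjust the destination to match your installation):

cp src/files-to-copy/solr/schema.xml $SOLR_HOME/collection1/conf/schema.xml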


3a. Steps for installing oodt:
------------------------------
Install the RADiX version of OODT in the src directory by following the instructions at: https://cwiki.apache.org/confluence/display/OODT/RADiX+Powered+By+OODT

Note:
1. If the version of OODT in pom.xml is lower than 0.7, update it to 0.7 or higher before building the source.
2. In the default installation, cas-pge-<version>.jar may not be present in the workflow and filemgr homes; if so, copy that file from [PGE_HOME]/lib to both [WORKFLOW_HOME]/lib and [FILEMGR_HOME]/lib. Also copy cas-filemgr-<version>.jar from [FILEMGR_HOME]/lib to [WORKFLOW_HOME]/lib and [RESMGR_HOME]/lib (see the example commands after this list).
3. If using Solr with Tomcat, make sure the port of the OODT Tomcat server does not conflict. If both Solr's and OODT's Tomcat servers are using the same port, change the Connector and Server ports of the OODT installation in oodt/tomcat/conf/server.xml and oodt/tomcat/conf/server-minimal.xml.
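For example, assuming the [*_HOME] placeholders above are exported as environment variables pointing at the corresponding directories of your deployment:

cp $PGE_HOME/lib/cas-pge-*.jar $WORKFLOW_HOME/lib/
cp $PGE_HOME/lib/cas-pge-*.jar $FILEMGR_HOME/lib/
cp $FILEMGR_HOME/lib/cas-filemgr-*.jar $WORKFLOW_HOME/lib/
cp $FILEMGR_HOME/lib/cas-filemgr-*.jar $RESMGR_HOME/lib/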


3b. Steps for installing and configuring custom-pge with oodt (provided you are in the src folder):
---------------------------------------------------------------------------------------------------
1. Compiling custom-pge
Follow the steps from here: https://cwiki.apache.org/confluence/display/OODT/CAS-PGE+Learn+by+Example
Especially step 4.
cd /usr/local/src/custom-pge
mvn package
cp target/custom-pge-1.0.jar /oodt/$WORKFLOW_HOME/lib
cp target/custom-pge-1.0.jar /oodt/$RESMGR_HOME/lib
cp target/custom-pge-1.0.jar /oodt/$FILEMGR_HOME/lib
(check that custom-pge-1.0.jar has been copied into the directories mentioned above)

2. Configure custom-workflow
Note: This directory contains the workflow that posts the JSON files to Solr for indexing. It is also assumed that ETLlib is already installed on disk.
Modify the custom-workflow/pge-configs/PGEConfig.xml as following:
i.   Update <metadata key="JobDir" val="/Users/harshsingh/Documents/Codes/IR/solroodt"/> so that val points to the path of your OODT installation.
ii.  In <cmd>echo [FileLocation]/[Filename] | ./etllib/bin/poster -v -u http://localhost:8080/solr/update/json?commit=true</cmd>, update './etllib/bin/poster' to the location of the poster executable in your ETLlib installation and 'http://localhost:8080/solr/update/json?commit=true' to the actual URI of your Solr installation (see the example below).
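For example, a hypothetical setup with this project under /opt/solroodt and ETLlib under /opt/etllib (both paths and the Solr URI are placeholders for your own installation):

<metadata key="JobDir" val="/opt/solroodt"/>
<cmd>echo [FileLocation]/[Filename] | /opt/etllib/bin/poster -v -u http://localhost:8080/solr/update/json?commit=true</cmd>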

3. Configure oodt Crawler
Note: the files mentioned below are modified to integrate the custom-workflow (example commands follow this list).
i.   Copy and replace crawler_launcher from src/files-to-copy/crawler/bin to /oodt/$CRAWLER_HOME/bin
ii.  Copy and replace action-beans.xml from src/files-to-copy/crawler/policy to /oodt/$CRAWLER_HOME/policy
iii. Copy and replace cmd-line-options.xml from src/files-to-copy/crawler/policy to /oodt/$CRAWLER_HOME/policy
iv.  Copy mime-extractor-map.xml from src/files-to-copy/crawler/policy to /oodt/$CRAWLER_HOME/policy
v.   Copy mimetypes.xml from src/files-to-copy/crawler/policy to /oodt/$CRAWLER_HOME/policy
vi.  Copy tikaextractor.config from src/files-to-copy/crawler/policy to /oodt/$CRAWLER_HOME/policy
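For example, assuming $CRAWLER_HOME is set and, as in the paths above, resolves beneath /oodt:

cp src/files-to-copy/crawler/bin/crawler_launcher /oodt/$CRAWLER_HOME/bin/
cp src/files-to-copy/crawler/policy/action-beans.xml /oodt/$CRAWLER_HOME/policy/
cp src/files-to-copy/crawler/policy/cmd-line-options.xml /oodt/$CRAWLER_HOME/policy/
cp src/files-to-copy/crawler/policy/mime-extractor-map.xml /oodt/$CRAWLER_HOME/policy/
cp src/files-to-copy/crawler/policy/mimetypes.xml /oodt/$CRAWLER_HOME/policy/
cp src/files-to-copy/crawler/policy/tikaextractor.config /oodt/$CRAWLER_HOME/policy/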

4. Configure oodt Workflow Manager
Note: the files mentioned below are modified to integrate the custom-workflow (example commands follow this list).
i.   Copy and replace events.xml from src/files-to-copy/workflow/policy to /oodt/$WORKFLOW_HOME/policy
ii.  Copy and replace workflow-instance-met.xml from src/files-to-copy/workflow/policy to /oodt/$WORKFLOW_HOME/policy
iii. Copy CustomWorkflow.workflow.xml from src/files-to-copy/workflow/policy to /oodt/$WORKFLOW_HOME/policy
iv.  Copy tasks.xml from src/files-to-copy/workflow/policy to /oodt/$WORKFLOW_HOME/policy
	 Also, update <property name="PGETask_ConfigFilePath" value="/Users/harshsingh/Documents/Codes/IR/solroodt/custom-workflow/pge-configs/PGEConfig.xml" envReplace="true"/> so that value points to the actual path of PGEConfig.xml.
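For example, assuming $WORKFLOW_HOME is set and resolves beneath /oodt as in the paths above:

cp src/files-to-copy/workflow/policy/events.xml /oodt/$WORKFLOW_HOME/policy/
cp src/files-to-copy/workflow/policy/workflow-instance-met.xml /oodt/$WORKFLOW_HOME/policy/
cp src/files-to-copy/workflow/policy/CustomWorkflow.workflow.xml /oodt/$WORKFLOW_HOME/policy/
cp src/files-to-copy/workflow/policy/tasks.xml /oodt/$WORKFLOW_HOME/policy/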


4a. Executing launch.py
(this step crawls the JSON files and posts them to Solr for indexing using the OODT Workflow Manager & Crawler):
------------------------------------------------------------------------------------------------------
Prerequisites before executing launch.py (a typical startup sequence follows this list):
i.   Solr is running (if it runs on port 8080 under Tomcat, you need to change the OODT Tomcat port; see note 3 in section 3a).
ii.  OODT is running (File Manager on port 9000, Workflow Manager on port 9001, Resource Manager on port 9002).
iii. The OODT Batch Stub is running on port 2001.
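A typical startup sequence for these prerequisites, assuming a RADiX deployment rooted at /oodt (script names and locations may differ in your build):

cd /oodt/bin
./oodt start                # starts the File Manager, Workflow Manager and Resource Manager
cd /oodt/resmgr/bin
./batch_stub 2001           # starts the OODT Batch Stub on port 2001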

Assuming you are in the src directory, run the following command:

python launch.py -i /path/to/json/files

where: /path/to/json/files points to the directory containing the JSON files.
Note: The files generated by tika-parser are in src/tika-parser/output.


5a. Executing the queries App to get query answers:
----------------------------------------------------
Compile and package using the command:
cd src/queries
mvn package

Run the queries App with the following command (assuming you are in the queries directory):
java -cp "target/queries-1.0-SNAPSHOT.jar:./target/lib/*" com.solr.queries.App "http://localhost:8080/solr/collection1/" " *" " *"

Command Explanation
java -cp $CLASSPATH$ com.solr.queries.App [solr url] [period start date] [period end date]
[solr url] - String: e.g. http://localhost:8080/solr/collection1/
[period start date] - String: " *" to start at the beginning, or a date in YYYY-MM-DDThh:mm:ssZ format
[period end date] - String: " *" to end at the last date, or a date in YYYY-MM-DDThh:mm:ssZ format

NOTE: Do not forget the space before the "*" in " *".
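For example, to query the period from the start of 2015 onward (the URL and date here are illustrative):

java -cp "target/queries-1.0-SNAPSHOT.jar:./target/lib/*" com.solr.queries.App "http://localhost:8080/solr/collection1/" "2015-01-01T00:00:00Z" " *"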

Important Debugging note:
-------------------------
If you get an error while executing any of the above steps, the following checks may help you solve the problem:
1. Check that Solr and the OODT services (File Manager, Workflow Manager, Batch Stub) are running properly.
2. Check that all the above-mentioned jar files are copied correctly to their assigned places.
3. Check that the configuration files are properly updated.
4. Check that the input and output directories exist and have the proper read and write permissions.
5. If you face compile-time errors, check that all the libraries are being downloaded by Maven.
