harshs08 / oodt-solr
License: Apache License 2.0
The installation, configuration and execution of this project are divided into 5 basic steps:

1. Installing and configuring tika-parser and running it to generate the JSON files to be posted to Solr.
2. Configuring Solr.
3. Installing and configuring OODT and the custom-workflow, which ingests the input JSON files and sends them to Solr for indexing.
4. Running launch.py to post the data to Solr using the OODT Workflow Manager and Crawler.
5. Compiling and executing the queries App to get the answers to the assignment questions.

1a. Steps for Installing and Configuring tika-parser:
-----------------------------------------------------
Compile and package using the commands:

    cd src/tika-parser
    mvn package

Run the tika-parser App using the following command, assuming you are in the tika-parser directory:

    java -cp "target/tika-parser-1.0-SNAPSHOT.jar:./target/lib/*" com.parse.tika.App /path/to/input/tsv/dir 1

Command Explanation

    java -cp $CLASSPATH$ com.parse.tika.App [input folder] [deduplication switch]

    [input folder] - String: Path to the folder containing all the TSV files.
    [deduplication switch] - int: Selector to choose whether to execute the crawler with or without deduplication. 0 => Without, 1 => With.

2a. Configuring Solr:
---------------------
Copy the schema.xml file from src/files-to-copy/solr to the conf directory of your Solr installation.

3a. Steps for installing OODT:
------------------------------
Install the RADiX version of OODT in the src directory using the following link:
https://cwiki.apache.org/confluence/display/OODT/RADiX+Powered+By+OODT

Note:
1. Before building the source, check the OODT version in pom.xml; if it is lower than 0.7, update it to 0.7 or higher.
2. In a default installation, cas-pge-<version>.jar may not be present in the Workflow Manager and File Manager homes. If so, copy that file from [PGE_HOME]/lib to both [WORKFLOW_HOME]/lib and [FILEMGR_HOME]/lib. Also copy cas-filemgr-<version>.jar from [FILEMGR_HOME]/lib to [WORKFLOW_HOME]/lib and [RESMGR_HOME]/lib.
3. If using Solr with Tomcat, make sure the port of the OODT Tomcat server does not conflict with it. If Solr's Tomcat server and OODT's Tomcat server are using the same port, change the Connector and Server ports of the OODT installation in oodt/tomcat/conf/server.xml and oodt/tomcat/conf/server-minimal.xml.

3b. Steps for installing and configuring custom-pge with OODT (provided you are in the src folder):
---------------------------------------------------------------------------------------------------
1. Compiling custom-pge
   Follow the steps from here, especially step 4:
   https://cwiki.apache.org/confluence/display/OODT/CAS-PGE+Learn+by+Example

       cd /usr/local/src/custom-pge
       mvn package
       cp target/custom-pge-1.0.jar /oodt/$WORKFLOW_HOME/lib
       cp target/custom-pge-1.0.jar /oodt/$RESMGR_HOME/lib
       cp target/custom-pge-1.0.jar /oodt/$FILEMGR_HOME/lib

   (Check that custom-pge-1.0.jar has been copied to the directories mentioned above.)

2. Configure custom-workflow
   Note: This directory contains the workflow which posts the JSON files to Solr for indexing. It is also assumed that ETLlib is already installed on the disk.
   Modify custom-workflow/pge-configs/PGEConfig.xml as follows:
   i.  Update <metadata key="JobDir" val="/Users/harshsingh/Documents/Codes/IR/solroodt"/> to the path of the OODT installation.
   ii. In the following command, update '/etllib/bin/poster' to the location of the poster executable in ETLlib, and 'http://localhost:8080/solr/update/json?commit=true' to the actual URI of the Solr installation:
       <cmd>echo [FileLocation]/[Filename] | ./etllib/bin/poster -v -u http://localhost:8080/solr/update/json?commit=true</cmd>

3. Configure OODT Crawler
   Note: The files mentioned below have been modified to integrate the custom-workflow.
   i.   Copy and replace crawler_launcher from src/files-to-copy/crawler/bin to /oodt/$CRAWLER_HOME/bin
   ii.  Copy and replace action-beans.xml from src/files-to-copy/crawler/policy to /oodt/$CRAWLER_HOME/policy
   iii. Copy and replace cmd-line-options.xml from src/files-to-copy/crawler/policy to /oodt/$CRAWLER_HOME/policy
   iv.  Copy mime-extractor-map.xml from src/files-to-copy/crawler/policy to /oodt/$CRAWLER_HOME/policy
   v.   Copy mimetypes.xml from src/files-to-copy/crawler/policy to /oodt/$CRAWLER_HOME/policy
   vi.  Copy tikaextractor.config from src/files-to-copy/crawler/policy to /oodt/$CRAWLER_HOME/policy

4. Configure OODT Workflow Manager
   Note: The files mentioned below have been modified to integrate the custom-workflow.
   i.   Copy and replace events.xml from src/files-to-copy/workflow/policy to /oodt/$WORKFLOW_HOME/policy
   ii.  Copy and replace workflow-instance-met.xml from src/files-to-copy/workflow/policy to /oodt/$WORKFLOW_HOME/policy
   iii. Copy CustomWorkflow.workflow.xml from src/files-to-copy/workflow/policy to /oodt/$WORKFLOW_HOME/policy
   iv.  Copy tasks.xml from src/files-to-copy/workflow/policy to /oodt/$WORKFLOW_HOME/policy
        Also, update <property name="PGETask_ConfigFilePath" value="/Users/harshsingh/Documents/Codes/IR/solroodt/custom-workflow/pge-configs/PGEConfig.xml" envReplace="true"/> to the actual path of PGEConfig.xml.

4a. Executing launch.py (this step crawls the JSON files and posts them to Solr for indexing using the OODT Workflow Manager & Crawler):
------------------------------------------------------------------------------------------------------
Prerequisites before executing launch.py:
i.   Solr is running (if it is running on port 8080 on Tomcat, you need to change the port of the OODT Tomcat server; see note 3 under 3a).
ii.  OODT is running (File Manager on port 9000, Workflow Manager on port 9001 and Resource Manager on port 9002).
iii. OODT Batch Stub is running on port 2001.

Assuming you are in the src directory, run the following command:

    python launch.py -i /path/to/json/files

where /path/to/json/files points to the directory containing the JSON files.
Note: The files generated by tika-parser are located in src/tika-parser/output.

5a. Executing the results.java to get query answers:
----------------------------------------------------
Compile and package using the commands:

    cd src/queries
    mvn package

Run the queries App using the following command, assuming you are in the queries directory:

    java -cp "target/queries-1.0-SNAPSHOT.jar:./target/lib/*" com.solr.queries.App "http://localhost:8080/solr/collection1/" " *" " *"

Command Explanation

    java -cp $CLASSPATH$ com.solr.queries.App [solr url] [period start date] [period end date]

    [solr url] - String: http://localhost:8080/solr/collection1/
    [period start date] - String: " *" if you want to start at the beginning, or a date in YYYY-MM-DDThh:mm:ssZ format.
    [period end date] - String: " *" if you want to end at the last date, or a date in YYYY-MM-DDThh:mm:ssZ format.

NOTE: Do not forget the space before the "*" in " *".

Important Debugging note:
-------------------------
If you get an error while executing any of the steps above, the following checks may help you solve the problem:
1. Check that Solr, OODT, the OODT File Manager, the OODT Workflow Manager and the OODT Batch Stub are running properly.
2. Check that all the jars and files mentioned above have been copied correctly to their assigned places.
3. Check that the configuration files have been updated properly.
4. Check that the paths provided for the input and output directories exist and have the proper read and write permissions.
5. If you face compile-time errors, check that all the libraries are being downloaded by Maven.
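
The first debugging check (are Solr and the OODT services actually running?) can be sketched as a small port probe. This is only an illustration and not part of the project: the names SERVICES, is_listening and check_services are hypothetical, the host is assumed to be localhost, the port numbers are the ones listed in the prerequisites for launch.py, and the script assumes Python 3. An open port only shows that something is listening there, not that the service is healthy.

```python
import socket

# Port numbers come from the prerequisites for launch.py; the service labels
# and the default host ("localhost") are assumptions for a local setup.
SERVICES = {
    "Solr (Tomcat)": 8080,
    "OODT File Manager": 9000,
    "OODT Workflow Manager": 9001,
    "OODT Resource Manager": 9002,
    "OODT Batch Stub": 2001,
}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def check_services(host="localhost", services=SERVICES):
    """Map each service name to True (port open) or False (port closed)."""
    return {name: is_listening(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{'OK  ' if up else 'DOWN'}  {name}")
```

Running the script before launch.py prints one OK or DOWN line per service, which narrows down which component to restart.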