
Spark Installation Guide

Requirements

To install Karma-Spark, you need to clone this repo and Web-Karma first. Open a shell (command prompt) and use git clone to get these two projects:

	git clone https://github.com/usc-isi-i2/Karma-Spark.git
	git clone https://github.com/usc-isi-i2/Web-Karma.git

Install Karma

Karma is written in Java and runs on Mac, Linux, and Windows. To install Web-Karma, you will need Java 1.7 and Maven 3.0.

Java 1.7

Download from http://www.oracle.com/technetwork/java/javase/downloads/index.html. Make sure the JAVA_HOME environment variable points to JDK 1.7.
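
As a quick check (the /usr/libexec/java_home helper below is macOS-specific; on Linux, set JAVA_HOME to your JDK 1.7 install directory directly):

	export JAVA_HOME=$(/usr/libexec/java_home -v 1.7)   # macOS only
	echo $JAVA_HOME
	java -version   # should report 1.7.x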

Maven 3.0

Make sure that the M2_HOME and M2 environment variables are set as described in the Maven installation instructions: http://maven.apache.org/download.cgi
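
A minimal sketch of that setup, assuming Maven was untarred into /usr/local/apache-maven-3.0.5 (adjust the path to your install):

	export M2_HOME=/usr/local/apache-maven-3.0.5
	export M2=$M2_HOME/bin
	export PATH=$M2:$PATH
	mvn -version   # should report Maven 3.0.x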

When system requirements are satisfied, run the following commands:

	cd Web-Karma
	git checkout development
	mvn clean install -DskipTests

“BUILD SUCCESS” indicates the build succeeded. Then type:

	cd karma-spark
	mvn package -P shaded -Dmaven.test.skip=true

“BUILD SUCCESS” indicates the build succeeded. Then type:

	cd target
	ls

Locate the karma-spark-0.0.1-SNAPSHOT-shaded.jar file and note down its absolute path, for example ……/Web-Karma/karma-spark/target/karma-spark-0.0.1-SNAPSHOT-shaded.jar
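
For later commands it can help to keep this path in a shell variable; the name KARMA_JAR below is just an illustration, not something the build defines:

	export KARMA_JAR=$(pwd)/karma-spark-0.0.1-SNAPSHOT-shaded.jar
	echo $KARMA_JAR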

Install Spark

  1. Download and untar the Cloudera Hadoop distribution into <hadoop_dir>
  2. Download and untar the Cloudera Spark 1.5 distribution into <spark_dir>
  3. Set up the Spark configuration. Create a new file <spark_dir>/conf/spark-env.sh with the following content, replacing the SPARK_HOME and HADOOP_HOME paths with your own:
	#!/usr/bin/env bash

	# This file is sourced when running various Spark programs.
	# Copy it as spark-env.sh and edit that to configure Spark for your site.

	# Options read when launching programs locally with
	# ./bin/run-example or ./bin/spark-submit
	# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
	export HADOOP_HOME=/Users/karma/hadoop-2.6.0-cdh5.5.0
	export SPARK_HOME=/Users/karma/spark-1.5.0-cdh5.5.0

	export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
	export PATH=$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH
	export SPARK_DIST_CLASSPATH=$HADOOP_HOME/etc/hadoop:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/contrib/capacity-scheduler/*.jar:$HADOOP_HOME/share/hadoop/tools/lib/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/tools/lib/*:$SPARK_HOME/lib/*
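
To sanity-check the configuration, you can source the file and confirm the Hadoop binary resolves (hadoop version is the standard Hadoop CLI subcommand):

	source <spark_dir>/conf/spark-env.sh
	which hadoop
	hadoop version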

Also add <spark_dir>/bin to your PATH variable. Example:

	export PATH=/Users/karma/spark-1.5.0-cdh5.5.0/bin:$PATH

At the command prompt, type the following commands; you should see “Spark assembly has been built with Hive, including Datanucleus jars on classpath”.

	cd ..
	spark-submit
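
To double-check the installation itself, you can also print the Spark version:

	spark-submit --version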

Configuring workflow

	cd Karma-Spark

Using make.sh

requirements.txt specifies the pip packages and versions required for running the workflow. You can install them by running:

    pip install -r requirements.txt

Now you have all the dependencies. Next you can run make.sh; for this you have to configure the path variable in the script. Point path to the site-packages directory where all the Python libraries live. A virtual environment is highly recommended.

How to install and configure virtual environment

Install virtualenvwrapper using:

	pip install virtualenvwrapper

Now add the following environment variables to your shell startup file (e.g. ~/.bash_profile):

	export WORKON_HOME=/Users/danny/Desktop/.virtualenvs
	export PROJECT_HOME=/Users/danny/Desktop/Devel
	source /opt/local/Library/Frameworks/Python.framework/Versions/2.7/bin/virtualenvwrapper.sh
	export VIRTUALENVWRAPPER_PYTHON=/opt/local/bin/python

Now create a virtual environment using:

	mkvirtualenv env1
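
Activate the environment whenever you work on the project; workon and deactivate are standard virtualenvwrapper commands:

	workon env1    # activate
	deactivate     # leave the environment when done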

Refer to the virtualenvwrapper documentation for any issues regarding virtual environments.

Check where the site-packages directory for your virtualenv is; for example, it may look like:

	path=/Users/rajagopal/Desktop/github_repos/.virtualenvs/env1/lib/python2.7/site-packages
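
If you are unsure of the location, you can ask Python directly from inside the activated virtualenv (distutils.sysconfig.get_python_lib is part of the Python 2.7 standard library):

	workon env1
	python -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())"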

After setting the path, run make.sh. This creates a python-lib.zip in the lib folder containing all the .py files required for running the workflow.
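
For example (assuming make.sh sits in the repository root):

	./make.sh
	ls lib/python-lib.zip    # verify the archive was created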

Using makeSpark.sh for development purposes

Edit the makeSpark.sh file and change the first line, entity=~/blahblah/lib/python2.7/site-packages/dig-entity-merger, to point at your dig-entity-merger site-packages path. Similarly, update the paths for the other variables, digsparkutil and digStylometry.

Now Run:

	./makeSpark.sh

This bundles all the files required for supporting the workflow into lib/python-lib.zip.
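
You can inspect the resulting bundle with unzip:

	unzip -l lib/python-lib.zip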

Test with an example

Go to the directory where you cloned Karma-Spark, then open the workflow-samples folder. It contains sample code for running workflows.

  • karmaWorkflow.py applies a Karma model to a JSON file and saves the JSON-LD output
  • karmaWorkflowCSV.py applies a Karma model to a CSV file and outputs the data as N3
  • karmaReduceWorkflow.py shows how to apply multiple models to a dataset and then combine/reduce the results
  • karmaContextWorkflow.py shows how to apply a context to a JSON-LD file
  • karmaFramerWorkflow.py applies multiple models to a JSON input dataset, reduces the results, then uses the framer to output multiple frames/roots and publishes them to Elasticsearch

The instructions on how to run each workflow are in the comments at the top of its Python file.
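
For orientation, a typical invocation looks roughly like the sketch below; the --jars path, the --py-files archive, and the input/output arguments are placeholders, and the authoritative command for each sample is the one in that file's header comments:

	spark-submit \
	    --jars <path-to>/karma-spark-0.0.1-SNAPSHOT-shaded.jar \
	    --py-files lib/python-lib.zip \
	    workflow-samples/karmaWorkflow.py <input> <output>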
