Git Product home page Git Product logo

spark-py-submit's Introduction

Spark Py Submit

A python library that can submit spark job to spark cluster and standalone using rest API

Note: It Currently supports the CDH(5.6.1) and HDP(2.3.2.0-2950,2.4.0.0-169) yarn cluster
The Library is Inspired from: github.com/bernhard-42/spark-yarn-rest-api

Getting Started:

Use the library

# Import the SparkJobHandler
from spark_job_handler import SparkJobHandler

...

logger = logging.getLogger('TestLocalJobSubmit')
# Create a spark JOB
# jobName:           name of the Spark Job
# jar:               location of the Jar (local/hdfs)
# run_class:         entry class of the appliaction
# hadoop_rm:         hadoop resource manager host ip
# hadoop_web_hdfs:   hadoop web hdfs ip
# hadoop_nn:         hadoop name node ip (Normally same as of web_hdfs)
# env_type:          env type is CDH or HDP
# local_jar:         flag to define if a jar is local (Local jar gets uploaded to hdfs)
# spark_properties:  custom properties that need to be set
sparkJob = SparkJobHandler(logger=logger, job_name="test_local_job_submit",
            jar="./simple-project/target/scala-2.10/simple-project_2.10-1.0.jar",
            run_class="IrisApp", hadoop_rm='rma', hadoop_web_hdfs='nn', hadoop_nn='nn',
            env_type="CDH", local_jar=True, spark_properties=None)
trackingUrl = sparkJob.run()
print "Job Tracking URL: %s" % trackingUrl
The above code starts an spark application using the local jar (simple-project/target/scala-2.10/simple-project_2.10-1.0.jar)
For more example see the test_spark_job_handler.py

Build the simple-project

$ cd simple-project
$ sbt package;cd ..

The above steps will create the target jar as: ./simple-project/target/scala-2.10/simple-project_2.10-1.0.jar

Update the nodes Ip in test:

Add the node IP for hadoop resource manager and Name node in the test_cases:
* rm: Resource Manager * nn: Name Node

load the data and make it available to HDFS:

$ wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

upload data to the HDFS:

$ python upload_to_hdfs.py <name_nodei_ip> iris.data /tmp/iris.data

Run the test cases:

Make the simple-project jar available in HDFS to test remote jar:

$ python upload_to_hdfs.py <name_nodei_ip> simple-project/target/scala-2.10/simple-project_2.10-1.0.jar /tmp/test_data/simple-project_2.10-1.0.jar

Run the test:

$ python test_spark_job_handler.py

Utility:

  • upload_to_hdfs.py: upload local file to hdfs file system

Notes:

The Library is still in early stage and need testing, bug-fixing and documentation
Before running, follow the below steps:
* Update the ResourceManager,NameNode and WebHDFS Port if required in settings.py
* Make the spark-jar available in hdfs as: hdfs:/user/spark/share/lib/spark-assembly.jar
For Contribution Please Create Issue corresponding PR at: github.com/s8sg/spark-py-submit

spark-py-submit's People

Contributors

s8sg avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.