spark-cookbook

Chef cookbook to install an Apache Spark master and slaves in standalone mode.

Spark is installed and run as ['spark']['user']; that user's home directory is the installation path.

In standalone mode, Spark slaves are started from the master over SSH connections. If you add slaves later, append them to ['spark']['slaves'] and re-run the recipe on the master to update the list of slaves.

Supported Platforms

Only tested on Debian. Please file a ticket if you find incompatibilities with your platform.

Attributes

| Key | Type | Description | Default |
| --- | --- | --- | --- |
| `["chef"]["data_bag_secret_path"]` | String | Path to the secret file used to decrypt data bag items. | `/var/chef/encrypted_data_bag_secret` |
| `["spark"]["master_host"]` | String | Hostname of the Spark master. | `localhost` |
| `["spark"]["master_port"]` | String | Spark master port. | `7077` |
| `["spark"]["bin_url"]` | String | URL to download the Spark binary archive from. | `http://d3kbcqa49mib13.cloudfront.net/spark-1.0.1-bin-hadoop2.tgz` |
| `["spark"]["bin_checksum"]` | String | SHA-256 checksum used by Chef to cache the file. Set it if you change `["spark"]["bin_url"]`. | |
| `["spark"]["install_dir"]` | String | Where to install Spark; also the home directory of `['spark']['user']`. | `/opt/local/spark` |
| `["spark"]["user"]` | String | Spark runtime user name. | `spark` |
| `["spark"]["group"]` | String | Spark runtime group. | `spark` |
| `["spark"]["slaves"]` | Array[String] | List of slave hostnames. You probably want private-network hostnames or IP addresses. | `[]` |
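To illustrate how the master attributes fit together, here is a plain-Ruby sketch. The nested hash is a stand-in for Chef's node object (not the cookbook's actual code), and the hostnames are illustrative; it shows how `['spark']['master_host']` and `['spark']['master_port']` combine into the standard `spark://` URL that slaves connect to:

```ruby
# Plain-Ruby stand-in for Chef's node attribute object (illustrative only).
node = Hash.new { |h, k| h[k] = Hash.new(&h.default_proc) }

node['spark']['master_host'] = 'spark-master.internal'  # assumed hostname
node['spark']['master_port'] = '7077'                   # cookbook default
node['spark']['slaves']      = ['10.0.0.11', '10.0.0.12']

# Slaves reach the master at the standard spark:// URL.
master_url = "spark://#{node['spark']['master_host']}:#{node['spark']['master_port']}"
puts master_url  # => spark://spark-master.internal:7077
```

In a real run you would set these attributes in a role, environment, or wrapper cookbook rather than in Ruby directly.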

You can also set spark-env.sh environment variables with ['spark']['env']['<lowercase key>']. For example, setting ['spark']['env']['spark_local_ip'] yields the following entry in spark-env.sh:

SPARK_LOCAL_IP = 'ec2-ip-xxx.internal'
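A minimal sketch of that convention, assuming the recipe simply upcases each lowercase attribute key when rendering spark-env.sh (the key names and values below are illustrative):

```ruby
# Map lowercase attribute keys to spark-env.sh variable assignments
# (assumption: the template upcases each key; values are illustrative).
env = {
  'spark_local_ip'     => 'ec2-ip-xxx.internal',
  'spark_worker_cores' => '4'
}

lines = env.map { |key, value| "#{key.upcase}=#{value}" }
puts lines
# SPARK_LOCAL_IP=ec2-ip-xxx.internal
# SPARK_WORKER_CORES=4
```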

Here are the supported parameters:

# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_LIBRARY, to point to your libmesos.so if you use Mesos

# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)
# - SPARK_YARN_APP_NAME, The name of your application (Default: Spark)
# - SPARK_YARN_QUEUE, The hadoop queue to use for allocation requests (Default: ‘default’)
# - SPARK_YARN_DIST_FILES, Comma separated list of files to be distributed with the job.
# - SPARK_YARN_DIST_ARCHIVES, Comma separated list of archives to be distributed with the job.

# Options for the daemons used in the standalone deploy mode:
# - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

Data bags

You must provide a keypair (RSA or DSA) in a data bag item:

spark
  ssh_key
    type: "rsa" or "dsa"
    private_key: the private key
    public_key: the public key

You can generate a keypair with:

ssh-keygen -t dsa -f spark_key

Put the content of the spark_key.pub file in public_key, the content of the spark_key file in private_key, and set type to dsa.
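Put together, the ssh_key item in the spark data bag can be sketched as the following Ruby hash before upload. The key strings here are truncated placeholder text, not real key material:

```ruby
require 'json'

# Sketch of the "ssh_key" item in the "spark" data bag.
# The key strings below are truncated placeholders, not real keys.
item = {
  'id'          => 'ssh_key',
  'type'        => 'dsa',
  'private_key' => "-----BEGIN DSA PRIVATE KEY-----\n...\n-----END DSA PRIVATE KEY-----\n",
  'public_key'  => 'ssh-dss AAAA... spark'
}

puts JSON.pretty_generate(item)
```

Remember this item must be encrypted with the secret referenced by ["chef"]["data_bag_secret_path"].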

Usage

Include spark::master and/or spark::slave in your node's run_list:

{
  "run_list": [
    "recipe[spark::master]"
  ]
}

Here is a more concrete example that also configures Java and Scala:

{
  "java": {
    "install_flavor": "oracle",
    "jdk_version": "8",
    "oracle": {
      "accept_oracle_download_terms": true
    }
  },
  "scala": {
    "version": "2.10.4",
    "home": "/usr/lib/scala",
    "checksum": "b46db638c5c6066eee21f00c447fc13d1dfedbfb60d07db544e79db67ba810c3",
    "url": "http://www.scala-lang.org/files/archive/scala-2.10.4.tgz"
  },
  "spark": {
    "slaves": ["ip-xxx-xxx-xxx-xxx.eu-west-1.compute.internal"]
  },
  "run_list": [
    "recipe[spark::master]"
  ]
}

Contributing

  1. Fork the repository on GitHub
  2. Create a named feature branch (e.g. add-new-recipe)
  3. Write your change
  4. Write tests for your change (if applicable)
  5. Run the tests, ensuring they all pass
  6. Submit a Pull Request

License and Authors

Copyright (C) 2014 Antonin Amand

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

spark-cookbook's People

Contributors

gwik
