Apache Spark Standalone Cluster on Docker

This project now has its own article on the Towards Data Science blog on Medium! ✨

This project gives you an Apache Spark cluster in standalone mode with a JupyterLab interface, all running on Docker. Learn Apache Spark through its Scala, Python (PySpark) and R (SparkR) APIs by running the provided Jupyter notebooks, which contain examples of how to read, process and write data.

TL;DR

curl -LO https://raw.githubusercontent.com/andre-marcos-perez/spark-standalone-cluster-on-docker/master/docker-compose.yml
docker-compose up

Quick Start

Cluster overview

| Application | URL | Description |
| --- | --- | --- |
| JupyterLab | localhost:8888 | Cluster interface with built-in Jupyter notebooks |
| Apache Spark Master | localhost:8080 | Spark Master node |
| Apache Spark Worker I | localhost:8081 | Spark Worker node with 1 core and 512m of memory (default) |
| Apache Spark Worker II | localhost:8082 | Spark Worker node with 1 core and 512m of memory (default) |
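
Once the cluster is up, the four web interfaces listed above can be probed from the host. The following is a minimal sketch, not part of the project: it assumes plain `http://` on the listed ports, and the names `CLUSTER_UIS` and `is_reachable` are made up for illustration.

```python
import http.client
import urllib.error
import urllib.request

# Addresses from the cluster overview table; the http:// scheme is an assumption.
CLUSTER_UIS = {
    "JupyterLab": "http://localhost:8888",
    "Apache Spark Master": "http://localhost:8080",
    "Apache Spark Worker I": "http://localhost:8081",
    "Apache Spark Worker II": "http://localhost:8082",
}

# Ignore any system proxy settings, since every target is on localhost.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at url, even with an error status."""
    try:
        with _opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded (for example with 403 or 404), so the port is live.
        return True
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed response.
        return False


if __name__ == "__main__":
    for name, url in CLUSTER_UIS.items():
        status = "up" if is_reachable(url) else "down"
        print(f"{name:<24} {url:<24} {status}")
```

An HTTP error status is treated as "up" on purpose: the check only asks whether something is listening on the port, not whether the page loads without authentication.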

Prerequisites
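
The Tech Stack section lists Docker Engine 1.13.0+ and Docker Compose 1.10.0+ as the minimum infrastructure versions. A rough way to check a machine against those minimums is sketched below; the helper names are hypothetical, and the script simply parses the first `MAJOR.MINOR.PATCH` triple printed by `docker --version` and `docker-compose --version`.

```python
import re
import subprocess

# Minimum versions taken from the Tech Stack section.
MINIMUMS = {"Docker Engine": "1.13.0", "Docker Compose": "1.10.0"}


def parse_version(text: str) -> tuple:
    """Extract the first MAJOR.MINOR.PATCH triple from text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically, so that 1.9.x sorts below 1.10.x."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(command: list) -> str:
    """Run a version command and return its raw standard output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Note: `docker --version` reports the client version, used here as a proxy.
    commands = {
        "Docker Engine": ["docker", "--version"],
        "Docker Compose": ["docker-compose", "--version"],
    }
    for name, command in commands.items():
        try:
            output = installed_version(command).strip()
            verdict = "ok" if meets_minimum(output, MINIMUMS[name]) else "too old"
            print(f"{name}: {output} ({verdict}, needs {MINIMUMS[name]}+)")
        except (OSError, subprocess.CalledProcessError, ValueError) as error:
            print(f"{name}: could not determine version ({error})")
```

Comparing tuples of integers rather than raw strings matters here, because a plain string comparison would rank "1.9.1" above "1.10.0".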

Build from Docker Hub

  1. Download the source code or clone the repository;
  2. Edit the docker compose file to set the tech stack versions you want, checking them against the supported application versions;
  3. Build the cluster;
docker-compose up
  4. Run Apache Spark code using the provided Jupyter notebooks with Scala, PySpark and SparkR examples;
  5. Stop the cluster by typing ctrl+c.
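
The provided notebooks are the reference for running code on the cluster. For orientation only, here is a sketch of how a PySpark session might be pointed at the standalone master from a JupyterLab notebook. The master hostname `spark-master` is an assumption (check the service name in your docker compose file), 7077 is the default port of a Spark standalone master, and the memory and core values mirror the worker defaults from the cluster overview. Both function names are hypothetical.

```python
def cluster_conf(master_host="spark-master", master_port=7077,
                 executor_memory="512m", executor_cores=1):
    """Build Spark settings for the standalone cluster as a plain dict.

    master_host is an assumed service name; the memory and core defaults
    mirror the worker defaults (1 core, 512m) from the cluster overview.
    """
    return {
        "spark.master": f"spark://{master_host}:{master_port}",
        "spark.executor.memory": executor_memory,
        "spark.executor.cores": str(executor_cores),
    }


def build_session(app_name="example", **overrides):
    """Create or reuse a SparkSession configured with cluster_conf()."""
    # Imported lazily so cluster_conf() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in cluster_conf(**overrides).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Example use inside a notebook running in the JupyterLab container:
# spark = build_session("read-process-write")
# spark.range(5).show()
# spark.stop()
```

Keeping the settings in a plain dictionary makes them easy to inspect or override, for example `cluster_conf(executor_memory="1g")` after raising the worker memory in the compose file.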

Build from your local machine

Note: Local build is currently supported only on Linux distributions.

  1. Download the source code or clone the repository;
  2. Move to the build directory;
cd build
  3. Edit the build.yml file to set the tech stack versions you want;
  4. Match those versions in the docker compose file;
  5. Build the images;
chmod +x build.sh ; ./build.sh
  6. Build the cluster;
docker-compose up
  7. Run Apache Spark code using the provided Jupyter notebooks with Scala, PySpark and SparkR examples;
  8. Stop the cluster by typing ctrl+c.

Tech Stack

  • Infrastructure

| Component | Version |
| --- | --- |
| Docker Engine | 1.13.0+ |
| Docker Compose | 1.10.0+ |
| Python | 3.7.3 |
| Scala | 2.12.11 |
| R | 3.5.2 |

  • Jupyter Kernels

| Component | Version | Provider |
| --- | --- | --- |
| Python | 2.1.4 | Jupyter |
| Scala | 0.10.0 | Almond |
| R | 1.1.1 | IRkernel |

  • Applications

| Component | Version | Docker Tag |
| --- | --- | --- |
| Apache Spark | 2.4.0 \| 2.4.4 \| 3.0.0 | <spark-version>-hadoop-2.7 |
| JupyterLab | 2.1.4 | <jupyterlab-version>-spark-<spark-version> |

The Apache Spark R API (SparkR) is only supported on Apache Spark version 2.4.4. Full list can be found here.
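
The Docker tag patterns in the Applications table can be expanded mechanically when editing the docker compose file. The helper below is an illustrative sketch, not part of the project: the function names are invented, and the version lists are copied from the table and the SparkR note above.

```python
# Versions copied from the Applications table and the SparkR note.
SPARK_VERSIONS = ("2.4.0", "2.4.4", "3.0.0")
SPARKR_VERSIONS = ("2.4.4",)
JUPYTERLAB_VERSION = "2.1.4"


def _check_spark_version(spark_version: str) -> None:
    """Raise ValueError for a Spark version that is not in the table."""
    if spark_version not in SPARK_VERSIONS:
        raise ValueError(
            f"unsupported Spark version {spark_version!r}; "
            f"expected one of {', '.join(SPARK_VERSIONS)}"
        )


def spark_tag(spark_version: str) -> str:
    """Expand <spark-version>-hadoop-2.7."""
    _check_spark_version(spark_version)
    return f"{spark_version}-hadoop-2.7"


def jupyterlab_tag(spark_version: str,
                   jupyterlab_version: str = JUPYTERLAB_VERSION) -> str:
    """Expand <jupyterlab-version>-spark-<spark-version>."""
    _check_spark_version(spark_version)
    return f"{jupyterlab_version}-spark-{spark_version}"


def supports_sparkr(spark_version: str) -> bool:
    """Return True if SparkR is listed as supported for this Spark version."""
    return spark_version in SPARKR_VERSIONS


if __name__ == "__main__":
    for version in SPARK_VERSIONS:
        sparkr = "yes" if supports_sparkr(version) else "no"
        print(f"Spark {version}: {spark_tag(version)}, "
              f"{jupyterlab_tag(version)}, SparkR: {sparkr}")
```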

Contributing

We'd love some help. To contribute, please read this file.

Starring us on GitHub is also an awesome way to show your support ⭐
