pyspark-playground

About the repo

I started this repo because I wanted to learn PySpark. However, I didn't want to use Jupyter notebooks, as is typically the case in the examples I came across.

Therefore, I started by setting up a Spark cluster using Docker.

Running the code (Spark standalone cluster)

You can run the Spark standalone cluster with:

make run

or with 3 workers using:

make run-scaled
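Under the hood, `make run-scaled` presumably wraps a Docker Compose scale invocation along these lines (a sketch only — the actual service name, here assumed to be `spark-worker`, and the compose file are defined in the repo's Makefile):

```shell
# Illustrative only: start the cluster with the worker service
# scaled to 3 replicas. Check the Makefile for the real command.
docker compose up -d --scale spark-worker=3
```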

You can submit Python jobs with the command:

make submit app=dir/relative/to/spark_apps/dir

e.g.

make submit app=data_analysis_book/chapter03/word_non_null.py
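The `make submit` target presumably wraps a `spark-submit` call executed inside the master container, roughly like the following (the container name `da-spark-master`, the master URL, and the in-container apps path are assumptions — see the Makefile for the actual values):

```shell
# Sketch of what `make submit` likely runs; names and paths are assumptions.
docker exec da-spark-master spark-submit \
  --master spark://spark-master:7077 \
  ./apps/data_analysis_book/chapter03/word_non_null.py
```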

There are a number of commands for building the standalone cluster; check the Makefile to see them all. The simplest one is:

make build

Web UIs

The master node's web UI can be accessed at localhost:9090, and the Spark history server at localhost:18080.

Running the code (Spark on Hadoop Yarn cluster)

Before running, check the virtual disk size that Docker assigns to its containers. In my case, I needed to assign about 70 GB. You can run Spark on the Hadoop Yarn cluster with:

make run-yarn

or with 3 data nodes:

make run-yarn-scaled

You can submit an example job to test the setup:

make submit-yarn-test

which will submit the pi.py example in cluster mode.
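The test target presumably boils down to a cluster-mode `spark-submit` against YARN, along these lines (the container name and the path to the bundled `pi.py` example are assumptions — check the Makefile):

```shell
# Sketch of a cluster-mode YARN submission; names and paths are assumptions.
docker exec da-spark-yarn-master spark-submit \
  --master yarn \
  --deploy-mode cluster \
  ./examples/src/main/python/pi.py
```

In cluster mode, the driver runs inside a YARN container rather than in the submitting process, which is why the job's output appears in the YARN application logs rather than in your terminal.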

You can also submit a custom job:

make submit-yarn-cluster app=data_analysis_book/chapter03/word_non_null.py

There are a number of commands for building the cluster; check the Makefile to see them all. The simplest one is:

make build-yarn

Web UIs

You can access several web UIs. The one I found most useful is the NameNode UI:

http://localhost:9870

Other UIs:

  • ResourceManager - localhost:8088
  • Spark history server - localhost:18080

About the branch expose-docker-hostnames-to-host

The branch expose-docker-hostnames-to-host contains the shell scripts, templates, and Makefile modifications required to expose the worker nodes' web interfaces to the host machine. To run the cluster, do the following.

  1. Run:

make run-ag n=3

which will generate a Docker Compose file with 3 worker nodes, along with the appropriate yarn and hdfs site files, in the yarn-generated folder.

  2. Register the Docker hostnames with /etc/hosts:

sudo make dns-modify o=true

which will create a backup folder with your original hosts file.
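The entries added to /etc/hosts map the generated container hostnames to the loopback address so that links in the Hadoop web UIs resolve from the host. The exact hostnames depend on the generated compose file; the lines likely resemble this sketch:

```shell
# Illustrative /etc/hosts additions — actual hostnames come from the
# generated compose file, not from this example.
127.0.0.1   node-master
127.0.0.1   node-worker1
127.0.0.1   node-worker2
```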

Once you are done and terminate the cluster, restore your original hosts file with:

sudo make dns-restore

For more information read the story I published on Medium here.

Stories published on Medium

  1. Setting up a standalone Spark cluster can be found here.
  2. Setting up Hadoop Yarn to run Spark applications can be found here.
  3. Using hostnames to access Hadoop resources can be found here.

About the book_data directory

The official repo of the book Data Analysis with Python and PySpark can be found here: https://github.com/jonesberg/DataAnalysisWithPythonAndPySpark.

I did not include the files from that repo, as some of them exceed GitHub's 50 MB file size limit. At the time of writing, the repo contains a Dropbox link to the files; I suggest you download them from there.

The book_data directory contains some files from this repo: https://github.com/maprihoda/data-analysis-with-python-and-pyspark, which was also created by a reader of the book.
