PySpark Cookbook

This is the code repository for PySpark Cookbook, published by Packt.

Over 60 recipes for implementing big data processing and analytics using Apache Spark and Python

What is this book about?

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. The PySpark Cookbook presents effective and time-saving recipes for leveraging the power of Python and putting it to use in the Spark ecosystem.

This book covers the following exciting features:

Configure a local instance of PySpark in a virtual environment
Install and configure Jupyter in local and multi-node environments
Create DataFrames from JSON and a dictionary using pyspark.sql
Explore regression and clustering models available in the ML module
Use DataFrames to transform data used for modeling
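
As a rough illustration of the DataFrame-creation feature listed above, the following sketch builds one DataFrame from Python dictionaries and another from JSON strings using pyspark.sql. It is not taken from the book: the sample records, the application name, and the helper names `to_json_lines` and `build_dataframes` are assumptions made for this example, and it presumes a local Spark installation.

```python
import json

# Hypothetical sample records; not taken from the book.
records = [
    {"id": 1, "name": "alice", "score": 3.5},
    {"id": 2, "name": "bob", "score": 4.0},
]


def to_json_lines(rows):
    """Serialize each dictionary to one JSON document per string."""
    return [json.dumps(row) for row in rows]


def build_dataframes(rows):
    """Create one DataFrame from dictionaries and one from JSON strings."""
    # Imported lazily so the helpers above work without PySpark installed.
    from pyspark.sql import Row, SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")        # run Spark locally on all cores
        .appName("readme-sketch")  # hypothetical application name
        .getOrCreate()
    )
    # Dictionaries are converted to Row objects before building the DataFrame.
    df_from_dicts = spark.createDataFrame([Row(**row) for row in rows])
    # spark.read.json accepts an RDD of JSON strings as well as file paths.
    json_rdd = spark.sparkContext.parallelize(to_json_lines(rows))
    df_from_json = spark.read.json(json_rdd)
    return spark, df_from_dicts, df_from_json
```

In an environment with PySpark installed, calling `build_dataframes(records)` should return the session and both DataFrames; calling `show()` on either DataFrame displays its rows, and `spark.stop()` releases the session afterwards.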

If you feel this book is for you, get your copy today!

Instructions and Navigations

All of the code is organized into folders. For example, Chapter02.

The code will look like the following:

if [ "${_check_R_req}" = "true" ]; then
 checkR
fi

Following is what you need for this book: The PySpark Cookbook is for you if you are a Python developer looking for hands-on recipes for using the Apache Spark 2.x ecosystem in the best possible way. A thorough understanding of Python (and some familiarity with Spark) will help you get the best out of the book.

With the following software and hardware, you can run all of the code files in the book (Chapters 1-8).

Software and Hardware List

Chapter	Software required	OS required
1-8	Apache Spark, Python, Jupyter, Cloudera QuickStart VM	Linux distro (preferably Ubuntu >14.04)
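
Before working through the chapters, it may help to confirm that some of the tools in the list above are visible on your machine. The sketch below is only one possible check and is not part of the book's code: the function name `check_prerequisites` and the choice to look for the `spark-submit` and `jupyter` executables on the PATH are assumptions, and it does not verify the Cloudera QuickStart VM.

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    """Report which tools from the software list are visible on this machine."""
    return {
        # Version of the interpreter running this script.
        "python_version": "{}.{}".format(*sys.version_info[:2]),
        # True if the pyspark package can be imported from this environment.
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
        # True if the corresponding executables are found on the PATH.
        "spark_submit_on_path": shutil.which("spark-submit") is not None,
        "jupyter_on_path": shutil.which("jupyter") is not None,
    }


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print("{}: {}".format(name, value))
```

A `False` entry only means the tool was not found from the current environment; it may still be installed elsewhere, for example inside a virtual environment that is not activated.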

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. Click here to download it.

Get to Know the Authors

Denny Lee is a technology evangelist at Databricks. He is a hands-on data science engineer with 15+ years of experience. His key focus is solving complex large-scale data problems, providing not only architectural direction but also hands-on implementation of such systems. He has extensive experience in building greenfield teams as well as being a turnaround/change catalyst. Prior to joining Databricks, he was a senior director of data science engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).

Tomasz Drabas is a data scientist specializing in data mining, deep learning, machine learning, choice modeling, natural language processing, and operations research. He is the author of Learning PySpark and Practical Data Analysis Cookbook. He has a PhD from the University of New South Wales, School of Aviation. His research areas are machine learning and choice modeling for airline revenue management.

Other books by the authors

Suggestions and Feedback

Click here if you have any feedback or suggestions.
