The dsc-machine-learning-with-spark-lab from jesrodriguez0816

Machine Learning with Spark - Lab

Introduction

Previously you saw how to manipulate data with Spark DataFrames as well as create machine learning models. In this lab, you're going to practice loading data, manipulating it, preparing visualizations, and fitting models to it with the Spark MLlib framework. Let's get started!

Objectives

In this lab you will:

Load and manipulate data using Spark DataFrames
Create a Spark ML pipeline that transforms data and runs over a grid of hyperparameters

The Data

This dataset is from a Taiwanese financial company, and the task is to determine which individuals are going to default on their credit card based on characteristics such as limit balance, past payment history, age, marital status, and sex.

You'll use the file credit_card_default.csv, which comes from the UCI ML Repository.

Initial Data Exploration

Get started by writing the relevant import statement and creating a local SparkSession called spark, then use that SparkSession to read credit_card_default.csv into a Spark SQL DataFrame.

# import necessary libraries

# initialize Spark Session
spark = None

# read in csv to a spark dataframe
spark_df = None

Use .head() to display the first 5 records, and print out the schema.

# Display the first 5 records

# Print out the schema

It looks like we have three non-numeric features. For each non-numeric (string) feature, select and show all distinct categories.

# Select and show all distinct categories

Interesting... It looks like we have some extraneous values in our categories. For example, both EDUCATION and MARRIAGE have a category 0.

Let's create some visualizations of each of these to determine just how many of them there are.

Create bar plots of the variables EDUCATION and MARRIAGE to see how the records are distributed between the categories.

Hint

To create a bar plot, you need to group by the category (.groupBy()) and then aggregate by the count in that category (.count()). That will result in a small DataFrame containing EDUCATION and count columns.

Then the easiest way to create a bar plot is to call .toPandas() to make that small Spark SQL DataFrame into a pandas DataFrame, and call .plot() on the pandas DataFrame.

# Create bar plot of EDUCATION

# Create bar plot of MARRIAGE

Binning

It looks like there are barely any records in the 0, 5, and 6 categories. Let's go ahead and bin (combine) those with the current Other records into a single catch-all Other category for both EDUCATION and MARRIAGE.

The approach we'll use is similar to the CASE WHEN technique in SQL. If this were a SQL query, it would look something like this:

SELECT CASE
       WHEN EDUCATION = '0' THEN 'Other'
       WHEN EDUCATION = '5' THEN 'Other'
       WHEN EDUCATION = '6' THEN 'Other'
       ELSE EDUCATION
       END AS EDUCATION
  FROM credit_card_default;

With Spark SQL DataFrames, this is achieved using DataFrame.withColumn() in conjunction with when() from pyspark.sql.functions and Column.otherwise(), each of which is described in the PySpark API documentation.

# Bin EDUCATION categories

# Bin MARRIAGE categories

# Select and show all distinct categories for EDUCATION and MARRIAGE again

Let's also re-create the plots from earlier, now that the data has been binned:

# Plot EDUCATION

# Plot MARRIAGE

Much better. Now, let's do a little more investigation into our target variable before diving into the machine learning aspect of this project.

Class Balance Exploration

Let's first look at the overall distribution of class balance of the default column (the target for our upcoming machine learning process).

Create a bar plot to compare the number of non-defaults (0) vs. defaults (1). Consider customizing your plot labels as well, since 0 and 1 are not particularly understandable values.

# Group and aggregate target data

# Plot target data

Looks like we have a fairly imbalanced dataset.

Let's also visualize the difference in default rate between males and females in this dataset. Group by both default and SEX and visualize the comparison.

# Group and aggregate target and sex data

# Plot target and sex data

It looks like males have an ever so slightly higher default rate than females, and also represent a smaller proportion of the dataset.

On to the Machine Learning!

Now, it's time to build a PySpark machine learning pipeline for the data. You will need:

3 StringIndexers
- One for each categorical feature
- See the StringIndexer entry in the pyspark.ml.feature documentation
A OneHotEncoder
- To encode the newly indexed strings into one-hot (binary indicator) vectors
- See the OneHotEncoder entry in the pyspark.ml.feature documentation
A VectorAssembler
- To combine all features into a single vector column
- See the VectorAssembler entry in the pyspark.ml.feature documentation

All of these initialized stages should be stored in a list called stages.

# Import the necessary classes


# Create the string indexers and determine the names of the numeric
# and indexed columns. Note that ID is an identifier and should NOT
# be included in the numeric columns


# Create a OneHotEncoder to encode the indexed string features


# Determine the names of the final list of features going into the model


# Create a VectorAssembler to combine all features


# Assemble a list of stages that includes all indexers, the one-hot
# encoder, and the vector assembler

Great! Now let's see if that worked. Let's investigate how it transforms your dataset. Put all of the stages in a Pipeline and fit it to your data. Look at the features column. Did you obtain the number of features you expected?

# Import relevant class


# Instantiate a pipeline using stages list


# Fit and transform the data using the pipeline, then look at
# the size of the array in the 'features' column

Answer

The pipeline should have produced a sparse vector with 26 features.

This comes from:

20 numeric features
3 one-hot encoded features with dropLast=True, containing
- 1 SEX feature
- 3 EDUCATION features
- 2 MARRIAGE features
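The arithmetic behind a count like this can be checked with a small helper. The function name expected_feature_count is hypothetical, and the category counts (2 for SEX, 4 for EDUCATION, 3 for MARRIAGE after binning) are an assumption inferred from the breakdown above: with dropLast=True each feature contributes one fewer column than it has categories.

```python
def expected_feature_count(n_numeric, category_counts, drop_last=True):
    """Return the assembled vector length for numeric plus one-hot features.

    category_counts maps each categorical feature to its number of distinct
    categories. With drop_last=True the encoder emits k - 1 columns for a
    feature with k categories; with drop_last=False it emits k.
    """
    per_feature = [k - 1 if drop_last else k for k in category_counts.values()]
    return n_numeric + sum(per_feature)


# Assumed category counts after binning (inferred, not read from the data)
lab_categories = {"SEX": 2, "EDUCATION": 4, "MARRIAGE": 3}

print(expected_feature_count(20, lab_categories, drop_last=True))   # 26
print(expected_feature_count(20, lab_categories, drop_last=False))  # 29
```

If the size of your features vector differs from what you expect, comparing it against both settings is a quick way to see whether the dropLast option or the list of numeric columns is the cause.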

Fitting Machine Learning Models

That looks good! Now let's go ahead and fit different machine learning models to the data. To evaluate these models, you should use the BinaryClassificationEvaluator.

from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(
    rawPredictionCol='prediction',
    labelCol='default',
    metricName='areaUnderROC'
)

Logistic Regression

First, we'll try a LogisticRegression (documented under pyspark.ml.classification):

split the data into a train and test set. The basic structure of this is:

train, test = df.randomSplit(weights=[0.8, 0.2], seed=1)

make sure you replace df with the actual name of your prepared dataframe
instantiate a logistic regression with standardization=True and add it to the stages list
instantiate a new Pipeline estimator with all of the stages
fit the pipeline on the training data
transform both train and test data using the pipeline
use evaluator to evaluate performance on train vs. test

from pyspark.ml.classification import LogisticRegression

# Your code here

Looks like the defaults for LogisticRegression are not overfitting, since the train and test metrics are pretty similar.

Still, let's try a CrossValidator + ParamGridBuilder approach (both are documented under pyspark.ml.tuning) with a few different regularization parameters.

We'll use these regularization parameters:

[0.0, 0.01, 0.1, 1.0]

In the cell below:

instantiate a ParamGridBuilder that tests out the regParam values listed above
instantiate a CrossValidator that uses the param grid you just created as well as evaluator and the pipeline you created earlier
fit the CrossValidator on the full DataFrame
display the metrics for all models, and identify the best model parameters

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
# Your code here

Now try this again with other classifiers. Try to create a function that will allow you to easily test different models with different parameters. You can find all of the available classification model options in the pyspark.ml.classification documentation.

This function is optional, but it should allow for your code to be far more D.R.Y. The function should return the fitted cross-validated classifier as well as print out the AUC of the best-performing model and the best parameters.

# Create a function to cross validate different classifiers with different parameters

Now train one other classifier that is not a LogisticRegression. Use a ParamGridBuilder to try out some relevant parameters.

# Your code here
# ⏰ This cell may take a long time to run

And one more:

# Your code here
# ⏰ This cell may take a long time to run

Which classifier turned out to be the best overall?

# Your answer here
"""

""";

Level Up (Optional)

Create ROC curves for each of these models
Try the multi-layer perceptron classifier algorithm. You will soon learn about what this means in the neural network section!

Stop the Spark Session

spark.stop()

Summary

If you've made it this far, congratulations! Spark is an in-demand skill, but it is not particularly easy to master. In this lab, you fit several machine learning pipelines for a classification problem. If you want to take your Spark skills to the next level, connect to a distributed cluster using a service like AWS or Databricks and perform these Spark operations in the cloud.
