Previously you saw how to manipulate data with Spark DataFrames and how to create machine learning models. In this lab, you're going to practice loading data, manipulating it, preparing visualizations, and fitting models to it with the Spark MLlib framework. Let's get started!
In this lab you will:
- Load and manipulate data using Spark DataFrames
- Create a Spark ML pipeline that transforms data and runs over a grid of hyperparameters
This dataset is from a Taiwanese financial company, and the task is to determine which individuals are going to default on their credit card based on characteristics such as limit balance, past payment history, age, marital status, and sex.
You'll use the file `credit_card_default.csv`, which comes from the UCI ML Repository.
Get started by writing the relevant import statement and creating a local SparkSession called `spark`, then use that SparkSession to read `credit_card_default.csv` into a Spark SQL DataFrame.
# import necessary libraries
# initialize Spark Session
spark = None
# read in csv to a spark dataframe
spark_df = None
Use `.head(5)` to display the first 5 records (called without an argument, `.head()` returns only the first row), and print out the schema.
# Display the first 5 records
# Print out the schema
It looks like we have three non-numeric features. For each non-numeric (`string`) feature, select and show all distinct categories.
# Select and show all distinct categories
Interesting... it looks like we have some extraneous values in our categories. For example, both `EDUCATION` and `MARRIAGE` have a category `0`.
Let's create some visualizations of each of these to determine just how many of them there are.
Create bar plots of the variables `EDUCATION` and `MARRIAGE` to see how the records are distributed between the categories.
Hint: To create a bar plot, you need to group by the category (`.groupBy()`) and then aggregate by the count in that category (`.count()`). For `EDUCATION`, that will result in a small DataFrame containing `EDUCATION` and `count` columns.
Then the easiest way to create a bar plot is to call `.toPandas()` to make that small Spark SQL DataFrame into a pandas DataFrame, and call `.plot()` on the pandas DataFrame.
# Create bar plot of EDUCATION
# Create bar plot of MARRIAGE
It looks like there are barely any records in the `0`, `5`, and `6` categories. Let's go ahead and bin (combine) those with the current `Other` records into a single catch-all `Other` category for both `EDUCATION` and `MARRIAGE`.
The approach we'll use is similar to the `CASE WHEN` technique in SQL. If this were a SQL query, it would look something like this:
SELECT CASE
WHEN EDUCATION = '0' THEN 'Other'
WHEN EDUCATION = '5' THEN 'Other'
WHEN EDUCATION = '6' THEN 'Other'
ELSE EDUCATION
END AS EDUCATION
FROM credit_card_default;
With Spark SQL DataFrames, this is achieved using `.withColumn()` on the DataFrame in conjunction with `when()` (from `pyspark.sql.functions`) and `.otherwise()` (a method on the column expression that `when()` returns).
# Bin EDUCATION categories
# Bin MARRIAGE categories
# Select and show all distinct categories for EDUCATION and MARRIAGE again
Let's also re-create the plots from earlier, now that the data has been binned:
# Plot EDUCATION
# Plot MARRIAGE
Much better. Now, let's do a little more investigation into our target variable before diving into the machine learning aspect of this project.
Let's first look at the overall class balance of the `default` column (the target for our upcoming machine learning process).
Create a bar plot to compare the number of defaults (`1`) vs. non-defaults (`0`). Consider customizing your plot labels as well, since `0` and `1` are not particularly understandable values.
# Group and aggregate target data
# Plot target data
Looks like we have a fairly imbalanced dataset.
Let's also visualize the difference in default rate between males and females in this dataset. Group by both `default` and `SEX` and visualize the comparison.
# Group and aggregate target and sex data
# Plot target and sex data
It looks like males have an ever so slightly higher default rate than females, and also represent a smaller proportion of the dataset.
Now it's time to feed the data into a PySpark machine learning pipeline. You will need:
- 3 `StringIndexer`s
  - One for each categorical feature
- A `OneHotEncoder`
  - To encode the newly indexed strings into one-hot vectors
- A `VectorAssembler`
  - To combine all features into one `SparseVector`

All of these initialized pipeline stages should be stored in a list called `stages`.
# Import the necessary classes
# Create the string indexers and determine the names of the numeric
# and indexed columns. Note that ID is an identifier and should NOT
# be included in the numeric columns
# Create a OneHotEncoder to encode the indexed string features
# Determine the names of the final list of features going into the model
# Create a VectorAssembler to combine all features
# Assemble a list of stages that includes all indexers, the one-hot
# encoder, and the vector assembler
Great! Now let's see if that worked. Let's investigate how it transforms your dataset. Put all of the stages in a Pipeline and fit it to your data. Look at the features column. Did you obtain the number of features you expected?
# Import relevant class
# Instantiate a pipeline using stages list
# Fit and transform the data using the pipeline, then look at
# the size of the array in the 'features' column
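Answer:
The pipeline should have produced a sparse vector with 29 features.
This comes from:
- 20 numeric features
- 3 one-hot encoded features with `dropLast=True`, containing
  - 1 SEX feature
  - 3 EDUCATION features
  - 2 MARRIAGE features

Note that the components listed here sum to 20 + 1 + 3 + 2 = 26, whereas a size of 29 corresponds to 2 + 4 + 3 one-hot slots (one slot per category, with none dropped), so check the breakdown against the vector size you actually observe.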
That looks good! Now let's go ahead and fit data to different machine learning models. To evaluate these models, you should use the `BinaryClassificationEvaluator`.
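from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Note: 'prediction' holds the hard 0/1 class labels; the evaluator's
# default input column is 'rawPrediction'
evaluator = BinaryClassificationEvaluator(
    rawPredictionCol='prediction',
    labelCol='default',
    metricName='areaUnderROC'
)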
First, we'll try a `LogisticRegression`:
- split the data into a train and test set. The basic structure of this is: `train, test = df.randomSplit(weights=[0.8, 0.2], seed=1)`
  - make sure you replace `df` with the actual name of your prepared dataframe
- instantiate a logistic regression with `standardization=True` and add it to the stages list
- instantiate a new Pipeline estimator with all of the stages
- fit the pipeline on the training data
- transform both train and test data using the pipeline
- use `evaluator` to evaluate performance on train vs. test
from pyspark.ml.classification import LogisticRegression
# Your code here
Looks like the defaults for `LogisticRegression` are working pretty well, since the train and test metrics are pretty similar.
Still, let's try a `CrossValidator` + `ParamGridBuilder` approach with a few different regularization parameters.
We'll use these regularization parameters:
[0.0, 0.01, 0.1, 1.0]
In the cell below:
- instantiate a `ParamGridBuilder` that tests out the `regParam` values listed above
- instantiate a `CrossValidator` that uses the param grid you just created as well as `evaluator` and the pipeline you created earlier
- fit the `CrossValidator` on the full DataFrame
- display the metrics for all models, and identify the best model parameters
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
# Your code here
Now try this again with other classifiers. Try to create a function that will allow you to easily test different models with different parameters. The available classification models are listed in the documentation for the `pyspark.ml.classification` module.
This function is optional, but it should make your code far more DRY ("don't repeat yourself"). The function should return the fitted cross-validated classifier as well as print out the AUC of the best-performing model and the best parameters.
# Create a function to cross validate different classifiers with different parameters
Now train one other classifier that is not a `LogisticRegression`. Use a `ParamGridBuilder` to try out some relevant parameters.
# Your code here
# ⏰ This cell may take a long time to run
And one more:
# Your code here
# ⏰ This cell may take a long time to run
Which classifier turned out to be the best overall?
# Your answer here
"""
""";
- Create ROC curves for each of these models
- Try the multi-layer perceptron classifier algorithm. You will soon learn about what this means in the neural network section!
spark.stop()
If you've made it this far, congratulations! Spark is an in-demand skill, but it is not particularly easy to master. In this lesson, you fit multiple different machine learning pipelines for a classification problem. If you want to take your Spark skills to the next level, connect to a distributed cluster using a service like AWS or Databricks and perform these Spark operations on the cloud.