In this lab, we'll gain experience detecting and dealing with an ANN model that is overfitting, using various regularization and hyperparameter tuning techniques!
We'll work with a large dataset of customer complaints to a bank, with the goal of predicting which product the customer is complaining about based on the text of their complaint. There are 7 possible products that we can predict, making this a multi-class classification task.
We'll start by preprocessing our dataset by tokenizing the complaints and limiting the number of words we consider to reduce dimensionality.
Once we have preprocessed our data set, we'll build a model and explore the various ways that we can reduce overfitting using the following strategies:
- Early stopping to minimize the discrepancy between train and test accuracy.
- L1 and L2 regularization.
- Dropout regularization.
- Using more data.
Let's Get Started!
Run the cell below to import everything we'll need for this lab.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical
from sklearn import preprocessing
from keras.preprocessing.text import Tokenizer
import random
random.seed(0)
Now, in the cell below, import our data into a DataFrame. The data is currently stored in `Bank_complaints.csv`.
Then, `.describe()` the dataset to get a feel for what we're working with.
df = None
In order to speed things up during the development process (and also to give us the ability to see how adding more data affects our model performance), we're going to work with a sample of our dataset rather than the whole thing. The entire dataset consists of 60,000 rows--we're going to build a model using only 10,000 items randomly sampled from this.
In the cell below:
- Get a random sample of `10000` items from our dataset (HINT: use the `df` object's `.sample()` method to make this easy)
- Reset the indexes on these samples to `range(10000)`, so that the indices for our rows are sequential and make sense.
- Store our labels, which are found in `"Product"`, in a different variable.
- Store the data, found in `"Consumer complaint narrative"`, in the variable `complaints`.
df = None
df.index = None
product = None
complaints = None
We'll keep only the 2,000 most common words and use one-hot encoding to quickly vectorize our dataset from text into a format that a neural network can work with.
In the cell below:
- Create a `Tokenizer()` object, and set the `num_words` parameter to `2000`.
- Call the tokenizer object's `fit_on_texts()` method and pass in the `complaints` variable we created above.
tokenizer = None
Now, we'll create some text sequences by calling the `tokenizer` object's `.texts_to_sequences()` method and feeding in our `complaints` object.
sequences = None
Finally, we'll convert our text data from text to a vectorized matrix.
In the cell below:
- Call the `tokenizer` object's `.texts_to_matrix` method, passing in our `complaints` variable, as well as setting the `mode` parameter equal to `'binary'`.
- Store the tokenizer's `.word_index` in the appropriate variable.
- Check the `np.shape()` of our `one_hot_results`.
one_hot_results= None
word_index = None
np.shape(one_hot_results) # Expected Results (10000, 2000)
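To make the idea of a binary text matrix concrete, here is a minimal sketch on three made-up complaints. It is a simplified stand-in written with the standard library and NumPy, not the Keras `Tokenizer` implementation: the helper `binary_bag_of_words` and the toy sentences are assumptions for illustration, and the real tokenizer also strips punctuation and reserves index 0.

```python
from collections import Counter
import numpy as np

def binary_bag_of_words(texts, num_words):
    """Simplified binary text-to-matrix encoding (illustrative, not Keras)."""
    # Count how often each lowercase, whitespace-separated token appears.
    counts = Counter(word for text in texts for word in text.lower().split())
    # Keep only the most frequent words; each one gets its own column.
    vocab = {word: i for i, (word, _) in enumerate(counts.most_common(num_words))}
    matrix = np.zeros((len(texts), num_words))
    for row, text in enumerate(texts):
        for word in text.lower().split():
            if word in vocab:
                matrix[row, vocab[word]] = 1.0  # presence only, not frequency
    return matrix, vocab

# Hypothetical complaints, used only to show the shape of the result.
toy_complaints = [
    "my card was charged twice",
    "the loan payment was late",
    "card payment declined",
]
toy_matrix, toy_vocab = binary_bag_of_words(toy_complaints, num_words=4)
print(toy_matrix.shape)  # one row per complaint, one column per kept word
```

Each row marks which of the kept words occur in a complaint, which is why our real matrix has one row per complaint and 2,000 columns.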
Now that we've tokenized and encoded our text data, we still need to one-hot encode our label data.
In the cell below:
- Create a `LabelEncoder` object, which can be found inside the `preprocessing` module.
- `fit` the label encoder we just created to `product`.
le = None
Let's check what classes our label encoder found. Run the cell below to examine a list of classes that `product` contains.
list(le.classes_)
Now, we'll need to transform `product` into a numeric vector.
In the cell below, use the label encoder's `.transform` method on `product` to create an integer encoded version of our labels.
Then, access `product_cat` to see an example of how the labels are now encoded.
product_cat = None
product_cat
Now, we need to go from integer encoding to one-hot encoding. Use the `to_categorical` function from keras to do this easily in the cell below.
product_onehot = None
product_onehot
Finally, let's check the shape of our one-hot encoded labels to make sure everything worked correctly.
np.shape(product_onehot) # Expected Output: (10000, 7)
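The two label-encoding steps (class names to integers, then integers to one-hot vectors) can be sketched with NumPy alone. This is an illustrative equivalent on a made-up label array, not the `LabelEncoder` or `to_categorical` code itself; the product names below are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical labels, standing in for the "Product" column.
toy_products = np.array(["Mortgage", "Credit card", "Student loan", "Credit card"])

# np.unique returns the sorted class names and, with return_inverse=True,
# the integer position of each label within that sorted array.
classes, integer_labels = np.unique(toy_products, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i.
one_hot_labels = np.eye(len(classes))[integer_labels]

print(classes)
print(integer_labels)
print(one_hot_labels.shape)
```

With 7 distinct products and 10,000 sampled rows, the same procedure yields the `(10000, 7)` shape we expect above.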
Now, we'll split our data into training and testing sets.
We'll accomplish this by generating a random list of 1,500 distinct indices between 1 and 9,999. Then, we'll slice these rows and store them as our test set, and delete them from the training set (it's very important to remember to remove them from the training set!).
Run the cell below to create a set of random indices for our test set.
test_index = random.sample(range(1,10000), 1500)
Now:
- Slice the `test_index` rows from `one_hot_results` and store them in `test`.
test = None
# This line returns a version of our one_hot_results that has every item with an index in test_index removed
train = np.delete(one_hot_results, test_index, 0)
Now, we'll need to repeat the splitting process on our labels, making sure that we use the same indices we used to split our data.
In the cell below:
- Slice `test_index` from `product_onehot`
- Use `np.delete` to remove `test_index` items from `product_onehot` (the syntax is exactly the same as above)
label_test = None
label_train = None
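The index-based split is easy to check on a tiny array. The sketch below uses made-up data (10 rows instead of 10,000, and names prefixed with `toy_` that are not part of the lab) to show that slicing with the sampled indices and deleting those same indices produces two disjoint sets that together cover every row.

```python
import random
import numpy as np

random.seed(0)
toy_features = np.arange(20).reshape(10, 2)   # 10 rows, 2 columns
toy_labels = np.arange(10)                    # label i belongs to row i

# Pick 3 distinct row indices for the test set.
toy_test_index = random.sample(range(10), 3)

# Fancy indexing selects the test rows; np.delete drops those same rows.
toy_test = toy_features[toy_test_index]
toy_train = np.delete(toy_features, toy_test_index, 0)
toy_label_test = toy_labels[toy_test_index]
toy_label_train = np.delete(toy_labels, toy_test_index, 0)

print(toy_test.shape, toy_train.shape)
```

Because the same index list is applied to both features and labels, every row stays paired with its own label.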
Now, let's examine the shapes of everything we just created to make sure that the dimensions match up.
In the cell below, use `np.shape` to check the shape of:
- `label_test`
- `label_train`
- `test`
- `train`
print(None) # Expected Output: (1500, 7)
print(None) # Expected Output: (8500, 7)
print(None) # Expected Output: (1500, 2000)
print(None) # Expected Output: (8500, 2000)
In the lecture, we mentioned that in deep learning, we generally keep aside a validation set, which is used during hyperparameter tuning. Then, once we have made the final model decision, the test set is used to measure the final model performance.
In this example, let's take the first 1000 cases out of the training set to become the validation set. You should do this for both `train` and `label_train`.
Run the cell below to create our validation set.
random.seed(123)
val = train[:1000]
train_final = train[1000:]
label_val = label_train[:1000]
label_train_final = label_train[1000:]
Let's rebuild a fully connected (Dense) layer network with relu activations in Keras.
Recall that we used 2 hidden layers with 50 units in the first layer and 25 in the second, both with a `relu` activation function. Because we are dealing with a multiclass problem (classifying the complaints into 7 classes), we use a softmax classifier in order to output 7 class probabilities per case.
In the cell below:
- Import `Sequential` from the appropriate module in keras.
- Import `Dense` from the appropriate module in keras.
Now, build a model with the following specifications in the cell below:
- An input layer of shape `(2000,)`
- Hidden layer 1: Dense, 50 neurons, relu activation
- Hidden layer 2: Dense, 25 neurons, relu activation
- Output layer: Dense, 7 neurons, softmax activation
model = None
model.add(None)
model.add(None)
model.add(None)
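To see what this architecture computes, here is a rough NumPy sketch of the forward pass through a 2000-50-25-7 network. It is not the Keras model: the weights are random and untrained, and the helper names (`relu`, `softmax`, `forward`) are assumptions for illustration. It does show the output shape, the fact that softmax rows sum to 1, and the number of trainable parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    # Subtract the row-wise max for numerical stability.
    shifted = x - x.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

layer_sizes = [2000, 50, 25, 7]
# One (weights, biases) pair per Dense layer, with small random weights.
params = [(rng.normal(0, 0.05, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(batch):
    hidden = batch
    for weights, biases in params[:-1]:
        hidden = relu(hidden @ weights + biases)
    out_weights, out_biases = params[-1]
    return softmax(hidden @ out_weights + out_biases)

# Four fake binary input rows, shaped like rows of one_hot_results.
toy_batch = rng.integers(0, 2, size=(4, 2000)).astype(float)
probabilities = forward(toy_batch)
n_params = sum(w.size + b.size for w, b in params)
print(probabilities.shape, n_params)
```

The parameter count works out to (2000 x 50 + 50) + (50 x 25 + 25) + (25 x 7 + 7) = 101,507, most of it in the first hidden layer. A model this large relative to 7,500 training rows has plenty of capacity to overfit.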
In the cell below, `compile` the model with the following settings:
- Optimizer is `"SGD"`
- Loss is `'categorical_crossentropy'`
- metrics is `['accuracy']`
model.compile(optimizer=None,
loss=None,
metrics=None)
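As a reminder of what these two settings measure, the sketch below computes categorical cross-entropy and accuracy by hand for two made-up predictions. The functions are illustrative NumPy versions written for this example, not the Keras implementations, and the clipping constant `eps` is an assumption.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0), then average the per-sample losses.
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

def accuracy(y_true, y_pred):
    # A prediction counts as correct when the most probable class matches.
    return float(np.mean(y_true.argmax(axis=1) == y_pred.argmax(axis=1)))

toy_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
toy_pred = np.array([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]])
print(categorical_crossentropy(toy_true, toy_pred), accuracy(toy_true, toy_pred))
```

The first sample is classified correctly and the second is not, so accuracy is 0.5. The loss also penalizes how little probability the second prediction assigned to the true class.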
Now, train the model for 120 epochs in mini-batches of 256 samples. Also pass in `(val, label_val)` to the `validation_data` parameter, so that we see how our model does on the validation set after every epoch.
model_val = model.fit(None,
None,
epochs=None,
batch_size=None,
validation_data=(None, None))
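A quick piece of arithmetic shows what a batch size of 256 means for this training set. The sketch assumes the 7,500 rows left in `train_final` after the validation split, and that the final, smaller batch is still used rather than dropped.

```python
import math

n_train = 7500      # 8500 training rows minus the 1000 held out for validation
batch_size = 256

# Number of weight updates per epoch, counting the final partial batch.
steps_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch_size)
```

Under those assumptions, each epoch performs 30 weight updates, so 120 epochs correspond to 3,600 updates in total.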
The `history` dictionary contains four entries this time: one per metric that was being monitored during training and during validation.
In the cell below:
- Store the model's `.history` inside of `model_val_dict`
- Check what `keys()` this dictionary contains
model_val_dict = None
model_val_dict.keys()
Now, let's get the final results on the training set using `model.evaluate()` on `train_final` and `label_train_final`.
results_train = None
Let's also use this function to get the results on our testing set. Call the function again, but this time on `test` and `label_test`.
results_test = None
Now, check the contents of each.
results_train # Expected Results: [0.33576024494171142, 0.89600000000000002]
results_test # Expected Results: [0.72006658554077152, 0.74333333365122478]
Let's plot the results. Let's include the training and the validation loss in the same plot. We'll do the same thing for the training and validation accuracy.
Run the cell below to visualize a plot of our training and validation loss.
plt.clf()
loss_values = model_val_dict['loss']
val_loss_values = model_val_dict['val_loss']
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, 'g', label='Training loss')
plt.plot(epochs, val_loss_values, 'g.', label='Validation loss')
plt.title('Training & validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
Interesting!
Run the cell below to visualize a plot of our training and validation accuracy.
plt.clf()
acc_values = model_val_dict['acc']
val_acc_values = model_val_dict['val_acc']
plt.plot(epochs, acc_values, 'r', label='Training acc')
plt.plot(epochs, val_acc_values, 'r.', label='Validation acc')
plt.title('Training & validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
We observe an interesting pattern here: although the training accuracy keeps increasing and the training loss keeps decreasing as we run more epochs, the validation accuracy and loss seem to plateau around the 60th epoch. This means that we're overfitting to the training data when we train for this many epochs. Luckily, you learned how to tackle overfitting in the previous lecture! For starters, it seems clear that we are training for too long. So let's stop training at the 60th epoch first (so-called "early stopping") before we move on to more advanced regularization techniques!
Now that we know that the model starts to overfit around epoch 60, we can just retrain the model from scratch, but this time only for 60 epochs! This will help us with our overfitting problem. This method is called early stopping.
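In this lab we pick the stopping epoch by eye from the plots. As a hedged illustration of how that choice could be automated, the sketch below scans a list of validation losses and returns the epoch with the lowest loss, giving up once a set number of epochs pass without improvement. The function `best_epoch`, its `patience` parameter, and the loss values are assumptions made up for this example, not part of the lab.

```python
def best_epoch(val_losses, patience=5):
    """Return the 1-based epoch with the lowest validation loss, ending the
    scan once `patience` epochs pass without any improvement."""
    best_loss = float("inf")
    best_index = 0
    for index, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_index = loss, index
        elif index - best_index >= patience:
            break  # validation loss has stopped improving
    return best_index + 1

# Hypothetical validation losses: improving at first, then creeping back up.
toy_val_loss = [1.2, 0.9, 0.8, 0.75, 0.76, 0.78, 0.80, 0.85]
print(best_epoch(toy_val_loss, patience=3))
```

Applied to a real history, the input would be the `'val_loss'` list from the history dictionary.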
In the cell below:
- Recreate the exact model we built above.
- Compile the model with the exact same hyperparameters.
- Fit the model with the exact same hyperparameters, with the exception of `epochs`. This time, set epochs to `60` instead of `120`.
Now, as we did before, get our results using `model.evaluate()` on the appropriate variables.
results_train = None
results_test = None
results_train # Expected Output: [0.58606486314137773, 0.79826666669845581]
results_test # Expected Output: [0.74768974288304646, 0.71333333365122475]
We've significantly reduced the variance, so this is already pretty good! The gap between training and test accuracy has shrunk from about 15 percentage points to about 8.5. Our test set accuracy is slightly worse, but this model should be more robust than the 120-epoch one we fitted before.
Now, let's see what else we can do to improve the result!
Let's include L2 regularization. You can easily do this in keras by passing the argument `kernel_regularizer=regularizers.l2()` to a layer, with a value for the regularization parameter lambda between the parentheses.
In the cell below:
- Recreate the same model we built before.
- In our two hidden layers (but not our output layer), add in the parameter `kernel_regularizer=regularizers.l2(0.005)` to add L2 regularization to each hidden layer.
- Compile the model with the same hyperparameters as we did before.
- Fit the model with the same hyperparameters as we did before, but this time for `120` epochs.
- Store the fitted model that the `.fit` call returns inside a variable called `L2_model`.
from keras import regularizers
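To make the effect of the lambda value concrete, here is a small NumPy sketch of an L2 penalty: lambda times the sum of squared weights, added on top of the data loss. The weight matrix and the cross-entropy value are made up for illustration, and `l2_penalty` is a hypothetical helper rather than a Keras function.

```python
import numpy as np

def l2_penalty(weights, lam):
    # lam times the sum of squared weights.
    return lam * float(np.sum(weights ** 2))

toy_weights = np.array([[0.5, -1.0], [2.0, 0.0]])
data_loss = 0.40  # hypothetical cross-entropy value

total_loss = data_loss + l2_penalty(toy_weights, lam=0.005)
print(total_loss)
```

Large weights dominate the penalty because they are squared, so the optimizer is pushed toward many small weights rather than a few large ones.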
Now, let's see how regularization has affected our model results.
Run the cell below to get the model's `.history`.
L2_model_dict = None
L2_model_dict.keys()
Let's look at the training accuracy as well as the validation accuracy for both the L2-regularized model and the model without regularization (for 120 epochs).
Run the cell below to visualize our training and validation accuracy both with and without L2 regularization, so that we can compare them directly.
plt.clf()
acc_values = L2_model_dict['acc']
val_acc_values = L2_model_dict['val_acc']
model_acc = model_val_dict['acc']
model_val_acc = model_val_dict['val_acc']
epochs = range(1, len(acc_values) + 1)
plt.plot(epochs, acc_values, 'g', label='Training acc L2')
plt.plot(epochs, val_acc_values, 'g.', label='Validation acc L2')
plt.plot(epochs, model_acc, 'r', label='Training acc')
plt.plot(epochs, model_val_acc, 'r.', label='Validation acc')
plt.title('Training & validation accuracy L2 vs regular')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
The results of L2 regularization are quite disappointing here. We notice the discrepancy between validation and training accuracy seems to have decreased slightly, but the end result is definitely not getting better.
Let's have a look at L1 regularization. Will this work better?
In the cell below:
- Recreate the same model we built above, but this time, set the `kernel_regularizer` to `regularizers.l1(0.005)` inside both hidden layers.
- Compile and fit the model exactly as we did for our L2 regularization experiment (`120` epochs).
- Store the fitted model that the `.fit` call returns inside a variable called `L1_model`.
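One way to see why L1 and L2 behave differently is to compare the gradients of their penalty terms. The sketch below does this in NumPy for three made-up weights, assuming an L1 penalty of lambda times the sum of absolute weights and an L2 penalty of lambda times the sum of squared weights.

```python
import numpy as np

toy_weights = np.array([0.01, -0.5, 2.0])
lam = 0.005

l1_value = lam * np.sum(np.abs(toy_weights))   # L1 penalty
l1_grad = lam * np.sign(toy_weights)           # same magnitude for every nonzero weight
l2_grad = 2 * lam * toy_weights                # proportional to the weight itself

print(l1_value, l1_grad, l2_grad)
```

The L1 gradient pushes every weight toward zero with the same force, however small the weight already is, which tends to drive some weights exactly to zero. The L2 gradient fades as a weight shrinks, so weights become small but rarely vanish.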
Now, run the cell below to get and visualize the model's `.history`.
L1_model_dict = L1_model.history
plt.clf()
acc_values = L1_model_dict['acc']
val_acc_values = L1_model_dict['val_acc']
epochs = range(1, len(acc_values) + 1)
plt.plot(epochs, acc_values, 'g', label='Training acc L1')
plt.plot(epochs, val_acc_values, 'g.', label='Validation acc L1')
plt.title('Training & validation accuracy with L1 regularization')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
Notice how the training and validation accuracy don't diverge as much as before! Unfortunately, the validation accuracy doesn't get much higher than 70%. It does seem like we could still improve the model by training for much longer.
To complete our comparison, let's use `model.evaluate()` again on the appropriate variables to compare results.
results_train = None
results_test = None
results_train # Expected Output: [1.3186310468037923, 0.72266666663487755]
results_test # Expected Output: [1.3541648308436076, 0.70800000031789145]
This is the smallest gap between training and test accuracy we've seen so far, but we were training for quite a while! Let's see if dropout regularization can do even better and/or be more efficient!
Dropout regularization is accomplished by adding an additional `Dropout` layer wherever we want to use it, and providing a rate between 0 and 1 for how likely any given neuron is to get "dropped out" during this layer.
In the cell below:
- Import `Dropout` from `keras.layers`
- Recreate the same network we have above, but this time without any L1 or L2 regularization
- Add a `Dropout` layer between hidden layer 1 and hidden layer 2. This should have a dropout chance of `0.3`.
- Add a `Dropout` layer between hidden layer 2 and the output layer. This should have a dropout chance of `0.3`.
- Compile the model with the exact same hyperparameters as all other models we've built.
- Fit the model with the same hyperparameters we've used above. But this time, train the model for `200` epochs.
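What a dropout rate of 0.3 does to a layer's activations during training can be sketched in a few lines of NumPy. This is an illustrative version of "inverted" dropout, where surviving units are scaled up so the expected activation stays the same. The `dropout` function and the all-ones activations are assumptions for the example, not the Keras layer.

```python
import numpy as np

def dropout(activations, rate, rng):
    # Keep each unit with probability 1 - rate, zero it otherwise.
    keep_mask = rng.random(activations.shape) >= rate
    # Scale survivors by 1 / (1 - rate) so the expected activation is unchanged.
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
toy_activations = np.ones((1000, 50))   # stand-in for hidden layer 1 outputs
dropped = dropout(toy_activations, rate=0.3, rng=rng)

# Fraction of units zeroed out, and the mean activation after scaling.
print((dropped == 0).mean(), dropped.mean())
```

Roughly 30% of the units are zeroed on each pass, and a different random subset is dropped every time. That prevents the network from relying too heavily on any single neuron. Dropout is applied only during training, which is one reason training accuracy can look worse than it really is while the model is being fit.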
Now, let's check the results from `model.evaluate` to see how this change has affected our training.
results_train = None
results_test = None
results_train # Expected Results: [0.36925017188787462, 0.88026666666666664]
results_test # Expected Results: [0.69210424280166627, 0.74333333365122478]
You can see here that the test performance has improved again! However, the variance became higher again compared to L1 regularization.
Another solution to high variance is to just get more data. We actually have more data, but took a subset of 10,000 units before. Let's now quadruple our data set, and see what happens. Note that we are really just lucky here, and getting more data isn't always possible, but this is a useful exercise in order to understand the power of big data sets.
Run the cell below to preprocess a larger sample of 40,000 rows, instead of the 10,000-row subset we used before.
df = pd.read_csv('Bank_complaints.csv')
random.seed(123)
df = df.sample(40000)
df.index = range(40000)
product = df["Product"]
complaints = df["Consumer complaint narrative"]
#one-hot encoding of the complaints
tokenizer = Tokenizer(num_words=2000)
tokenizer.fit_on_texts(complaints)
sequences = tokenizer.texts_to_sequences(complaints)
one_hot_results= tokenizer.texts_to_matrix(complaints, mode='binary')
word_index = tokenizer.word_index
np.shape(one_hot_results)
#one-hot encoding of products
le = preprocessing.LabelEncoder()
le.fit(product)
list(le.classes_)
product_cat = le.transform(product)
product_onehot = to_categorical(product_cat)
# train test split
test_index = random.sample(range(1,40000), 4000)
test = one_hot_results[test_index]
train = np.delete(one_hot_results, test_index, 0)
label_test = product_onehot[test_index]
label_train = np.delete(product_onehot, test_index, 0)
#Validation set
random.seed(123)
val = train[:3000]
train_final = train[3000:]
label_val = label_train[:3000]
label_train_final = label_train[3000:]
Now, build the first model that we built, without any regularization or dropout layers included.
Train this model for 120 epochs. All other hyperparameters should stay the same. Store the fitted model inside of `moredata_model`.
Now, finally, let's check the results returned from `model.evaluate()` to see how this model stacks up against the other techniques we've used.
results_train = None
results_test = None
results_train # Expected Output: [0.31160746300942971, 0.89160606060606062]
results_test # Expected Output: [0.56076071488857271, 0.8145]
With the same number of epochs, we were able to get a fairly similar training accuracy of 89.2%. Our test set accuracy went up from ~74% to a staggering 81.45% though, without any other regularization technique. You can still consider early stopping, L1, L2 and dropout here. It's clear that having more data has a strong impact on model performance!
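To compare the experiments side by side, the sketch below collects the expected training and test accuracies listed in this lab (rounded to four decimals) and computes the gap between them for each model. The dictionary keys are informal labels chosen for this example, and your own numbers will differ somewhat from run to run.

```python
# (training accuracy, test accuracy), taken from the expected outputs above.
results = {
    "baseline": (0.8960, 0.7433),
    "early stopping": (0.7983, 0.7133),
    "L1": (0.7227, 0.7080),
    "dropout": (0.8803, 0.7433),
    "more data": (0.8916, 0.8145),
}

# A large train/test gap is a sign of overfitting (high variance).
gaps = {name: round(train - test, 4) for name, (train, test) in results.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap:.4f}")

smallest_gap = min(gaps, key=gaps.get)
best_test = max(results, key=lambda name: results[name][1])
print(smallest_gap, best_test)
```

The comparison separates two questions: L1 regularization gives the smallest gap between training and test accuracy, while training on more data gives the highest test accuracy.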
https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Consumer_complaints.ipynb
https://catalog.data.gov/dataset/consumer-complaint-database
https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/