Regularization and Optimization Lab


In this lab, we'll gain experience detecting and dealing with an ANN model that is overfitting, using various regularization and hyperparameter tuning techniques!

Getting Started

In this lab, we'll work with a large dataset of customer complaints to a bank, with the goal of predicting what product the customer is complaining about based on the text of their complaint. There are 7 different possible products that we can predict, making this a multi-class classification task.

Preprocessing our Data Set

We'll start by preprocessing our dataset by tokenizing the complaints and limiting the number of words we consider to reduce dimensionality.

Building and Tuning our Model

Once we have preprocessed our data set, we'll build a model and explore the various ways that we can reduce overfitting using the following strategies:

  • Early stopping to minimize the discrepancy between train and test accuracy.
  • L1 and L2 regularization.
  • Dropout regularization.
  • Using more data.

Let's Get Started!

2. Preprocessing the Bank Complaints Data Set

2.1 Import the libraries and take a sample

Run the cell below to import everything we'll need for this lab.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical  # the np_utils submodule is not available in newer Keras versions
from sklearn import preprocessing
from keras.preprocessing.text import Tokenizer
import random

Now, in the cell below, import our data into a DataFrame. The data is currently stored in Bank_complaints.csv. Then, .describe() the dataset to get a feel for what we're working with.

df =  None

In order to speed things up during the development process (and also to give us the ability to see how adding more data affects our model performance), we're going to work with a sample of our dataset rather than the whole thing. The entire dataset consists of 60,000 rows--we're going to build a model using only 10,000 items randomly sampled from this.

In the cell below:

  • Get a random sample of 10000 items from our dataset (HINT: use the df object's .sample() method to make this easy)
  • Reset the indexes on these samples to range(10000), so that the indices for our rows are sequential and make sense.
  • Store our labels, which are found in "Product", in a different variable.
  • Store the data, found in "Consumer complaint narrative", in the variable complaints.
df = None
df.index = None
product = None
complaints = None

2.2 Tokenizing the Complaints

We'll keep only the 2,000 most common words and use one-hot encoding to quickly vectorize our dataset from text into a format that a neural network can work with.

In the cell below:

  • Create a Tokenizer() object, and set the num_words parameter to 2000.
  • Call the tokenizer object's fit_on_texts() method and pass in our complaints variable we created above.
tokenizer = None

Now, we'll create some text sequences by calling the tokenizer object's .texts_to_sequences() method and feeding in our complaints object.

sequences = None

Finally, we'll convert our text data from text to a vectorized matrix.

In the cell below:

  • Call the tokenizer object's .texts_to_matrix method, passing in our complaints variable, as well as setting the mode parameter equal to 'binary'.
  • Store the tokenizer's .word_index in the appropriate variable.
  • Check the np.shape() of our one_hot_results.
one_hot_results= None
word_index = None
np.shape(one_hot_results) # Expected Results (10000, 2000)
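
To make the 'binary' mode concrete, here is a minimal sketch of the same idea in plain NumPy. It is a simplified stand-in, not Keras's Tokenizer: the helper name binary_bag_of_words and the toy complaints are made up for illustration, and tokenization is assumed to be plain lowercase whitespace splitting.

```python
from collections import Counter

import numpy as np


def binary_bag_of_words(texts, num_words):
    """Return (matrix, vocab); matrix[i, j] is 1 if vocab[j] occurs in texts[i]."""
    tokenized = [text.lower().split() for text in texts]
    counts = Counter(word for doc in tokenized for word in doc)
    # Keep the num_words most frequent words; ties are broken alphabetically
    # so the result is deterministic.
    vocab = sorted(counts, key=lambda w: (-counts[w], w))[:num_words]
    column = {word: j for j, word in enumerate(vocab)}
    matrix = np.zeros((len(texts), len(vocab)), dtype=np.float32)
    for i, doc in enumerate(tokenized):
        for word in doc:
            j = column.get(word)
            if j is not None:  # words outside the vocabulary are ignored
                matrix[i, j] = 1.0
    return matrix, vocab


# Hypothetical complaints, only for illustration
toy_complaints = [
    "my card was charged twice",
    "the loan payment was lost",
    "card payment was declined",
]
toy_matrix, toy_vocab = binary_bag_of_words(toy_complaints, num_words=4)
print(toy_vocab)
print(toy_matrix)
```

Each row only records whether a word is present, not how often or in what order it appears. Keras additionally reserves column 0 and fixes the width at num_words, which is why the lab's matrix comes out as (10000, 2000).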

2.3 One-hot Encoding of the Products Column

Now that we've tokenized and encoded our text data, we still need to one-hot encode our label data.

In the cell below:

  • Create a LabelEncoder object, which can be found inside the preprocessing module.
  • fit the label encoder we just created to product.
le = None

Let's check what classes our label encoder found. Run the cell below to examine a list of classes that product contains.

list(le.classes_)


Now, we'll need to transform product into a numeric vector.

In the cell below, use the label encoder's .transform method on product to create an integer encoded version of our labels.

Then, access product_cat to see an example of how the labels are now encoded.

product_cat = None

Now, we need to go from integer encoding to one-hot encoding. Use the to_categorical method from keras to do this easily in the cell below.

product_onehot = None

Finally, let's check the shape of our one-hot encoded labels to make sure everything worked correctly.

np.shape(product_onehot) # Expected Output: (10000, 7)
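
The two-step encoding (labels to integers, integers to one-hot rows) can also be sketched with NumPy alone. This is an illustration under assumptions, not the lab's solution: the product names below are hypothetical, and np.unique stands in for the label encoder while np.eye stands in for to_categorical.

```python
import numpy as np

# Hypothetical labels, only for illustration
toy_products = np.array(["Mortgage", "Credit card", "Student loan", "Credit card"])

# np.unique sorts the distinct labels; with return_inverse=True it also
# returns each label's position in that sorted list (the integer encoding).
classes, codes = np.unique(toy_products, return_inverse=True)

# Row k of the identity matrix is the one-hot vector for class k.
onehot = np.eye(len(classes))[codes]

print(classes)
print(codes)
print(onehot.shape)
```

With 7 distinct products and 10,000 rows, the same steps give the (10000, 7) shape expected above.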

2.4 Train - test split

Now, we'll split our data into training and testing sets.

We'll accomplish this by generating a random list of 1500 different indices between 1 and 9,999. Then, we'll slice these rows and store them as our test set, and delete them from the training set (it's very important to remember to remove them from the training set!).

Run the cell below to create a set of random indices for our test set.

test_index = random.sample(range(1,10000), 1500)


  • Slice the test_index rows from one_hot_results and store them in test.
test = None

# This line returns a version of our one_hot_results that has every item with an index in test_index removed
train = np.delete(one_hot_results, test_index, 0)

Now, we'll need to repeat the splitting process on our labels, making sure that we use the same indices we used to split our data.

In the cell below:

  • Slice test_index from product_onehot
  • Use np.delete to remove test_index items from product_onehot (the syntax is exactly the same as above)
label_test = None
label_train = None

Now, let's examine the shape of everything we just created to make sure that the dimensions match up.

In the cell below, use np.shape to check the shape of:

  • label_test
  • label_train
  • test
  • train
print(None) # Expected Output: (1500, 7)
print(None) # Expected Output: (8500, 7)
print(None) # Expected Output: (1500, 2000)
print(None) # Expected Output: (8500, 2000)
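
To see why slicing and np.delete with the same index list give a clean split, here is a small self-contained sketch on toy arrays. All names prefixed with toy_ are made up for illustration, and the fixed seed is only there to make the sketch repeatable.

```python
import random

import numpy as np

random.seed(0)  # assumption: fixed seed, only so the sketch is repeatable

toy_data = np.arange(20).reshape(10, 2)  # 10 rows, 2 features; row i starts with 2 * i
toy_labels = np.arange(10)               # label i belongs to row i

toy_test_index = random.sample(range(10), 3)  # 3 distinct row indices

# Fancy indexing picks out the test rows...
toy_test = toy_data[toy_test_index]
toy_label_test = toy_labels[toy_test_index]

# ...and np.delete along axis 0 returns a copy with those same rows removed.
toy_train = np.delete(toy_data, toy_test_index, 0)
toy_label_train = np.delete(toy_labels, toy_test_index, 0)

print(toy_test.shape, toy_train.shape)
print(toy_label_test.shape, toy_label_train.shape)
```

Because both the data and the labels are split with the same toy_test_index, every row keeps its own label and no row ends up in both sets.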

3. Running the model using a validation set.

3.1 Creating the validation set

In the lecture, we mentioned that in deep learning we generally set aside a validation set, which is used during hyperparameter tuning. Then, once we have settled on the final model, the test set is used to measure the final model performance.

In this example, let's take the first 1000 cases out of the training set to become the validation set. You should do this for both train and label_train.

Run the cell below to create our validation set.

val = train[:1000]
train_final = train[1000:]
label_val = label_train[:1000]
label_train_final = label_train[1000:]

3.2 Creating, compiling and running the model

Let's rebuild a fully connected (Dense) layer network with relu activations in Keras.

Recall that we used 2 hidden layers with 50 units in the first layer and 25 in the second, both with a relu activation function. Because we are dealing with a multiclass problem (classifying the complaints into 7 classes), we use a softmax classifier in order to output 7 class probabilities per case.

In the cell below:

  • Import Sequential from the appropriate module in keras.
  • Import Dense from the appropriate module in keras.

Now, build a model with the following specifications in the cell below:

  • An input layer of shape (2000,)
  • Hidden layer 1: Dense, 50 neurons, relu activation
  • Hidden layer 2: Dense, 25 neurons, relu activation
  • Output layer: Dense, 7 neurons, softmax activation
model = None
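
Before building the network in Keras, it can help to see how large it is and what its output looks like. The sketch below is plain NumPy with random, untrained weights, so it only illustrates the layer shapes and the softmax output; it is not the Keras model itself, and the weight scale and toy batch are assumptions.

```python
import numpy as np

layer_sizes = [2000, 50, 25, 7]  # input, hidden layer 1, hidden layer 2, output

# Each Dense layer has (inputs * units) weights plus one bias per unit.
param_counts = [n_in * n_out + n_out
                for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
print(param_counts, sum(param_counts))


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.01, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

# A toy batch of 4 binary bag-of-words vectors
x = rng.integers(0, 2, size=(4, 2000)).astype(float)

h = x
for W, b in zip(weights[:-1], biases[:-1]):
    h = np.maximum(h @ W + b, 0.0)  # relu activation in the hidden layers
probs = softmax(h @ weights[-1] + biases[-1])

print(probs.shape)        # one row of 7 class probabilities per input
print(probs.sum(axis=1))  # each row sums to 1
```

Almost all of the roughly 100,000 parameters sit in the first layer, which is one reason a model like this can memorize a training set of a few thousand complaints.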

In the cell below, compile the model with the following settings:

  • Optimizer is "SGD"
  • Loss is 'categorical_crossentropy'
  • metrics is ['accuracy']

Now, train the model for 120 epochs in mini-batches of 256 samples. Also pass in (val, label_val) to the validation_data parameter, so that we see how our model does on the validation set after every epoch.

model_val = None  # replace None with a model.fit(...) call that passes validation_data=(val, label_val)
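
As a quick sanity check on these settings, we can work out how many weight updates this training run performs. The arithmetic below uses only numbers from this lab (8,500 training rows, 1,000 of them held out for validation) and assumes the last, smaller batch of each epoch is still used.

```python
import math

n_train = 8500 - 1000  # rows left in train_final after removing the validation set
batch_size = 256
epochs = 120

steps_per_epoch = math.ceil(n_train / batch_size)            # batches per epoch
last_batch = n_train - (steps_per_epoch - 1) * batch_size    # size of the final batch
total_updates = steps_per_epoch * epochs                     # gradient steps overall

print(steps_per_epoch, last_batch, total_updates)
```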

The dictionary history contains four entries this time: one per metric that was being monitored during training and during validation.

In the cell below:

  • Store the model's .history inside of model_val_dict
  • Check what keys() this dictionary contains
model_val_dict = None

Now, let's get the final results on the training set by calling model.evaluate() on train_final and label_train_final.

results_train = None

Let's also use this function to get the results on our testing set. Call the function again, but this time on test and label_test.

results_test = None

Now, check the contents of each.

results_train # Expected Results: [0.33576024494171142, 0.89600000000000002]
results_test # Expected Results: [0.72006658554077152, 0.74333333365122478]

Plotting the results

Let's plot the results. Let's include the training and the validation loss in the same plot. We'll do the same thing for the training and validation accuracy.

Run the cell below to visualize a plot of our training and validation loss.


import matplotlib.pyplot as plt
loss_values = model_val_dict['loss']
val_loss_values = model_val_dict['val_loss']

epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, 'g', label='Training loss')
plt.plot(epochs, val_loss_values, 'g.', label='Validation loss')

plt.title('Training & validation loss')
plt.xlabel('Epochs')
plt.legend()  # show the labels passed to plt.plot
plt.show()


Run the cell below to visualize a plot of our training and validation accuracy.


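# Older Keras versions log these as 'acc'/'val_acc'; newer ones use 'accuracy'/'val_accuracy'
acc_values = model_val_dict['acc']
val_acc_values = model_val_dict['val_acc']

plt.plot(epochs, acc_values, 'r', label='Training acc')
plt.plot(epochs, val_acc_values, 'r.', label='Validation acc')
plt.title('Training & validation accuracy')
plt.legend()  # show the labels passed to plt.plot
plt.show()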

We observe an interesting pattern here: although the training accuracy keeps increasing with more epochs, and the training loss keeps decreasing, the validation accuracy and loss seem to plateau around the 60th epoch. This means that we're actually overfitting to the training data when we train for as many epochs as we did. Luckily, you learned how to tackle overfitting in the previous lecture! For starters, it does seem clear that we are training too long. So let's stop training at the 60th epoch first (so-called "early stopping") before we move on to more advanced regularization techniques!

3.3 Early stopping

Now that we know that the model starts to overfit around epoch 60, we can just retrain the model from scratch, but this time only up to 60 epochs! This will help us with our overfitting problem. This method is called Early Stopping.
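
In this lab we pick the stopping epoch by eye from the plots. A common automated variant watches the validation loss and stops once it has not improved for a set number of epochs (the "patience"). The sketch below is a minimal pure-Python illustration of that rule; the function name early_stopping_epoch and the toy loss values are made up for this example.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops at the first epoch where the validation loss has not
    improved on its best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # The loss kept improving (or patience never ran out): train to the end.
    return len(val_losses), best_epoch


# Hypothetical validation losses: improving until epoch 5, then drifting up
toy_val_loss = [1.9, 1.4, 1.1, 0.9, 0.85, 0.86, 0.88, 0.87, 0.90, 0.95]
stop_epoch, best_epoch = early_stopping_epoch(toy_val_loss, patience=3)
print(stop_epoch, best_epoch)
```

Keras ships a callback built on the same idea, keras.callbacks.EarlyStopping, which can be passed to .fit() instead of retraining with a hand-picked number of epochs.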

In the cell below:

  • Recreate the exact model we did above.
  • Compile the model with the exact same hyperparameters.
  • Fit the model with the exact same hyperparameters, with the exception of epochs. This time, set epochs to 60 instead of 120.

Now, as we did before, get our results using model.evaluate() on the appropriate variables.

results_train = None
results_test = None
results_train  # Expected Output: [0.58606486314137773, 0.79826666669845581]
results_test # Expected Output: [0.74768974288304646, 0.71333333365122475]

We've significantly reduced the variance (the gap between training and test accuracy shrank from roughly 15 to roughly 8 percentage points), so this is already pretty good! Our test set accuracy is slightly worse, but this model should be more robust than the 120-epoch one we fitted before.

Now, let's see what else we can do to improve the result!

4. L2 regularization

Let's include L2 regularization. You can easily do this in keras by passing the argument kernel_regularizer=regularizers.l2(...) to a layer, with a value for the regularization parameter lambda between the parentheses.
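
To see what the regularization parameter does, here is a minimal NumPy sketch of an L2 penalty: lambda times the sum of the squared weights, added on top of the data loss. The weight matrix and the data loss value are made up for illustration.

```python
import numpy as np


def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weights."""
    return lam * np.sum(np.square(weights))


toy_weights = np.array([[0.5, -1.0],
                        [2.0, 0.0]])
lam = 0.005       # the value used in this lab
data_loss = 0.70  # hypothetical cross-entropy loss, only for illustration

penalty = l2_penalty(toy_weights, lam)
total_loss = data_loss + penalty
print(penalty, total_loss)
```

Because large weights are squared, they dominate the penalty, so the optimizer is pushed toward many small weights rather than a few large ones.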

In the cell below:

  • Recreate the same model we did before.
  • In our two hidden layers (but not our output layer), add in the parameter kernel_regularizer=regularizers.l2(0.005) to add L2 regularization to each hidden layer.
  • Compile the model with the same hyperparameters as we did before.
  • Fit the model with the same hyperparameters as we did before, but this time for 120 epochs.
  • Store the History object that the .fit call returns inside a variable called L2_model.
from keras import regularizers

Now, let's see how regularization has affected our model results.

Run the cell below to get the model's .history.

L2_model_dict = None

Let's look at the training accuracy as well as the validation accuracy for both the L2 model and the model without regularization (for 120 epochs).

Run the cell below to visualize our training and validation accuracy both with and without L2 regularization, so that we can compare them directly.


acc_values = L2_model_dict['acc'] 
val_acc_values = L2_model_dict['val_acc']
model_acc = model_val_dict['acc']
model_val_acc = model_val_dict['val_acc']

epochs = range(1, len(acc_values) + 1)
plt.plot(epochs, acc_values, 'g', label='Training acc L2')
plt.plot(epochs, val_acc_values, 'g.', label='Validation acc L2')  # dots, so validation is distinguishable from training
plt.plot(epochs, model_acc, 'r', label='Training acc')
plt.plot(epochs, model_val_acc, 'r.', label='Validation acc')
plt.title('Training & validation accuracy L2 vs regular')
plt.legend()
plt.show()

The results of L2 regularization are quite disappointing here. We notice the discrepancy between validation and training accuracy seems to have decreased slightly, but the end result is definitely not getting better.

5. L1 regularization

Let's have a look at L1 regularization. Will this work better?

In the cell below:

  • Recreate the same model we did above, but this time, set the kernel_regularizer to regularizers.l1(0.005) inside both hidden layers.
  • Compile and fit the model exactly as we did for our L2 Regularization experiment (120 epochs)
  • Store the History object that the .fit call returns inside a variable called L1_model.
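
To get a feel for how L1 differs from L2, the sketch below compares the two penalties and their gradients on the same toy weights. The weight values are made up for illustration; the penalties follow the usual definitions (lambda times the sum of absolute values for L1, lambda times the sum of squares for L2).

```python
import numpy as np

lam = 0.005
w = np.array([2.0, 0.5, 0.01, -0.01, -1.5])  # hypothetical weights

l1_penalty = lam * np.sum(np.abs(w))
l2_penalty = lam * np.sum(np.square(w))

# Gradient of each penalty with respect to the weights
l1_grad = lam * np.sign(w)  # same magnitude for every non-zero weight
l2_grad = 2 * lam * w       # proportional to the weight itself

print(l1_penalty, l2_penalty)
print(l1_grad)
print(l2_grad)
```

The L1 gradient pulls on a tiny weight just as hard as on a large one, which tends to drive small weights all the way to zero, while the L2 pull fades as a weight shrinks. Since the penalty is added to the loss, this is also likely why the loss values reported for the L1 model are noticeably higher than for the unregularized models.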

Now, run the cell below to get and visualize the model's .history.

L1_model_dict = L1_model.history

acc_values = L1_model_dict['acc'] 
val_acc_values = L1_model_dict['val_acc']

epochs = range(1, len(acc_values) + 1)
plt.plot(epochs, acc_values, 'g', label='Training acc L1')
plt.plot(epochs, val_acc_values, 'g.', label='Validation acc L1')
plt.title('Training & validation accuracy with L1 regularization')
plt.legend()  # show the labels passed to plt.plot
plt.show()

Notice how the training and validation accuracy don't diverge as much as before! Unfortunately, the validation accuracy doesn't reach rates much higher than 70%. It does seem like we can still improve the model by training much longer.

To complete our comparison, let's use model.evaluate() again on the appropriate variables to compare results.

results_train = None

results_test = None
results_train # Expected Output: [1.3186310468037923, 0.72266666663487755]
results_test # Expected Output: [1.3541648308436076, 0.70800000031789145]

Judged by the gap between training and test accuracy, this is about the best we've seen so far, but we were training for quite a while! Let's see if dropout regularization can do even better and/or be more efficient!

6. Dropout Regularization

Dropout regularization is accomplished by adding an additional Dropout layer wherever we want to use it, and providing a rate (a fraction between 0 and 1) for how likely any given neuron is to get "dropped out" at that point in the network during training.
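
The mechanics of a dropout layer during training can be sketched in a few lines of NumPy. This is an illustration of the commonly used "inverted dropout" scheme, not Keras's implementation; the function name dropout and the toy activations are assumptions.

```python
import numpy as np


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`, then rescale the rest.

    Dividing by (1 - rate) keeps the expected value of each activation
    unchanged, so nothing needs to be rescaled at prediction time.
    """
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
toy_activations = np.ones((1000, 50))  # pretend output of a 50-unit hidden layer
dropped = dropout(toy_activations, rate=0.3, rng=rng)

zero_fraction = float(np.mean(dropped == 0))
print(zero_fraction)   # close to the dropout rate of 0.3
print(dropped.mean())  # close to 1.0, the original mean
```

A different random mask is drawn for every batch, so the network cannot rely on any single neuron. When making predictions, dropout is switched off and all neurons are used.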

In the cell below:

  • Import Dropout from keras.layers
  • Recreate the same network we have above, but this time without any L1 or L2 regularization
  • Add a Dropout layer between hidden layer 1 and hidden layer 2. This should have a dropout chance of 0.3.
  • Add a Dropout layer between hidden layer 2 and the output layer. This should have a dropout chance of 0.3.
  • Compile the model with the exact same hyperparameters as all other models we've built.
  • Fit the model with the same hyperparameters we've used above. But this time, train the model for 200 epochs.

Now, let's check the results from model.evaluate to see how this change has affected our training.

results_train = None
results_test = None
results_train # Expected Results: [0.36925017188787462, 0.88026666666666664]
results_test # Expected Results: [0.69210424280166627, 0.74333333365122478]

You can see here that the test accuracy has improved again compared to the L1 model (74.3% versus 70.8%)! However, the variance did become higher again, compared to L1 regularization.

7. More Training Data?

Another solution to high variance is to just get more data. We actually have more data, but took a subset of 10,000 units before. Let's now quadruple our data set, and see what happens. Note that we are really just lucky here, and getting more data isn't always possible, but this is a useful exercise in order to understand the power of big data sets.

Run the cell below to preprocess a larger sample of 40,000 rows, instead of the 10,000 rows we worked with before.

df = pd.read_csv('Bank_complaints.csv')
df = df.sample(40000)
df.index = range(40000)
product = df["Product"]
complaints = df["Consumer complaint narrative"]

#one-hot encoding of the complaints
tokenizer = Tokenizer(num_words=2000)
tokenizer.fit_on_texts(complaints)  # build the vocabulary before vectorizing
sequences = tokenizer.texts_to_sequences(complaints)
one_hot_results = tokenizer.texts_to_matrix(complaints, mode='binary')
word_index = tokenizer.word_index

#one-hot encoding of products
le = preprocessing.LabelEncoder()
le.fit(product)  # learn the classes before transforming
product_cat = le.transform(product)
product_onehot = to_categorical(product_cat)

# train test split
test_index = random.sample(range(1,40000), 4000)
test = one_hot_results[test_index]
train = np.delete(one_hot_results, test_index, 0)
label_test = product_onehot[test_index]
label_train = np.delete(product_onehot, test_index, 0)

#Validation set
val = train[:3000]
train_final = train[3000:]
label_val = label_train[:3000]
label_train_final = label_train[3000:]

Now, build the first model that we built, without any regularization or dropout layers included.

Train this model for 120 epochs. All other hyperparameters should stay the same. Store the History object that the .fit call returns inside of moredata_model.

Now, finally, let's check the results returned from model.evaluate() to see how this model stacks up with the other techniques we've used.

results_train = None
results_test = None
results_train # Expected Output:  [0.31160746300942971, 0.89160606060606062]
results_test # Expected Output: [0.56076071488857271, 0.8145]

With the same number of epochs, we reached a very similar training accuracy of about 89.2%. Our test set accuracy, however, went up from ~74% to a staggering 81.45%, without any other regularization technique. You can still consider early stopping, L1, L2 and dropout here. It's clear that having more data has a strong impact on model performance!
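
To compare the experiments side by side, the sketch below tabulates the training and test accuracies copied (and rounded) from the expected outputs in this lab, together with the gap between them. The L2 model is left out because no evaluate() results are listed for it; your own numbers will differ from run to run.

```python
# (train accuracy, test accuracy), rounded from the expected outputs in this lab
results = {
    "baseline, 120 epochs": (0.8960, 0.7433),
    "early stopping, 60 epochs": (0.7983, 0.7133),
    "L1, 120 epochs": (0.7227, 0.7080),
    "dropout, 200 epochs": (0.8803, 0.7433),
    "more data, 120 epochs": (0.8916, 0.8145),
}

for name, (train_acc, test_acc) in results.items():
    gap = train_acc - test_acc  # a rough measure of overfitting
    print(f"{name:<28} train={train_acc:.3f}  test={test_acc:.3f}  gap={gap:.3f}")

best_test = max(results, key=lambda name: results[name][1])
smallest_gap = min(results, key=lambda name: results[name][0] - results[name][1])
print("Best test accuracy:", best_test)
print("Smallest train/test gap:", smallest_gap)
```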



