Scrubbing Our Data - Lab

Introduction

In the previous labs, we joined the data from our separate files into a single DataFrame. In this lab, we'll scrub the data to get it ready for exploration and modeling!

Objectives

You will be able to:

  • Cast columns to the appropriate data types
  • Identify and deal with null values appropriately
  • Remove unnecessary columns
  • Understand how to normalize data

Getting Started

You'll find the resulting dataset from our work in the Obtaining Data Lab stored within the file walmart_data_not_cleaned.csv.

In the cells below:

  • Import pandas and set the standard alias
  • Import numpy and set the standard alias
  • Import matplotlib.pyplot and set the standard alias
  • Import seaborn and set the alias sns (this is the standard alias for seaborn)
  • Use the IPython magic command to set all matplotlib visualizations to display inline in the notebook
  • Load the dataset stored in the .csv file into a DataFrame using pandas
  • Inspect the head of the DataFrame to ensure everything loaded correctly
# Import statements go here
# Now, load in the dataset and inspect the head to make sure everything loaded correctly
df = None
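
One possible way to fill in these cells might look like the following sketch (it assumes walmart_data_not_cleaned.csv sits in the same directory as the notebook):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Load the dataset and inspect the first few rows
df = pd.read_csv('walmart_data_not_cleaned.csv')
df.head()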

Great! Everything looks just like we left it. Now, we can begin cleaning the data.

Before we jump right into data cleaning, we'll want to consider our process first. Answer the following questions below:

What sorts of problems should we be checking for? Are there any obvious issues that we'll need to deal with? What are some other issues that this dataset could contain that might not be immediately obvious?

Write your answer below this line:


Dealing with Oversized Datasets

This dataset is quite large. Often, when starting out on a project, it's a good idea to build the model on a subset of the data so that we're not bogged down by large runtimes. Let's investigate the dataset a bit to get a feel for whether this is a good idea.

In the cell below, check how many rows this dataset contains.
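
For example (a minimal sketch):

# Number of rows in the DataFrame
len(df)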

This dataset contains 421,570 rows! That's large enough that we should consider building our model on a subset of the data to increase our speed during the modeling step. Modeling is an iterative process, and we'll likely have to fit our model multiple times as we tweak it--by subsetting our dataset, we'll protect ourselves from extremely long runtimes every time we make a small change and need to rerun our model. Once we have a prototype built, we can always add the rest of the data back in!

Subsetting our Dataset

The typical method for subsetting a dataset is to take a random sample of the data, and that's an option here. However, when we inspect the columns of our dataset in a bit, we'll notice that we have two categorical columns with very high cardinality--Store and Dept. This provides us with an opportunity to reduce dimensionality while subsampling. Instead of building a model on all the stores and departments in our dataset, we'll subset our data so that it only contains stores 1 through 10.

In the cell below, slice our dataset so that only rows with a Store value between 1 and 10 (inclusive) remain.

df = None
len(df)
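
One possible fill-in (a sketch; at this point Store still holds integer values, so a simple comparison works):

# Keep only rows for stores 1 through 10 (inclusive)
df = df[(df['Store'] >= 1) & (df['Store'] <= 10)]
len(df)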

Starting our Data Cleaning

We'll begin by dealing with the most obvious issues--data types and null values.

First, we'll check the different types of encoding that each column has, and then we'll check for null values and examine our options for dealing with them.

Checking Data Types

In the cell below, use the appropriate method to check the data type of each column.
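
For example (a minimal sketch):

# Data type of each column
df.dtypes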

Let's investigate the unique values inside of the Store and Dept columns.

In the cells below, use the appropriate DataFrame method to display all the unique values in the Store column, and in the Dept column.
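
For example (a minimal sketch):

# Unique values in each of the two high-cardinality columns
df['Store'].unique()
df['Dept'].unique()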

Categorical Data Stored as Integers

A common issue we usually check for at this stage is numeric columns that have accidentally been encoded as strings. However, in this dataset, we'll notice that although the Store and Dept columns both contain integer values, we can intuit that they are meant to be read as categorical data. We'll want to convert these columns to strings so that they will be one-hot encoded when we get around to dealing with our categorical columns.

You may be wondering why we don't just leave these columns as they are. If we did, we would accidentally be creating numeric relationships between the different stores that shouldn't be there. With numeric encoding, our model would interpret Store 2 as twice Store 1, but half of Store 4. These sorts of mathematical relationships don't make sense--we'd much rather these be treated as categories, as the dataset intends.

In the cell below, cast the Store and Dept columns to strings.
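
One possible fill-in (a sketch):

# Cast both columns to strings so they are later treated as categorical
df['Store'] = df['Store'].astype(str)
df['Dept'] = df['Dept'].astype(str)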

Numeric Data Stored as Strings

It looks like we have two columns that are encoded as strings (remember, pandas denotes string columns as object)--Date and Type.

We don't need to worry about Date--dates should obviously be encoded as strings. Let's quickly check out the Type column just to ensure that it doesn't contain numeric data.

In the cell below, get the unique values contained within the Type column.
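
For example (a minimal sketch):

# Unique values in the Type column
df['Type'].unique()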

Great job--the Type column is clearly a categorical column, and should currently be stored as a string. Once we've dealt with null values, we'll deal with this and other categorical columns by one-hot encoding them.

Let's double check the column encodings one more time to make sure that everything we did above worked correctly.

Detecting and Dealing With Null Values

Next, we'll need to check for null values. How we deal with the null values will be determined by the columns containing them, and how many null values exist in each.

In the cell below, use the appropriate pandas functionality to get a count of how many null values exist in each column in our DataFrame.
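
For example (a minimal sketch):

# Count of null values in each column
df.isna().sum()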

QUESTION: Interpret the output above. Do we know enough to have a strategy for dealing with these null values yet? Is dropping the rows a valid solution? How about dropping the columns? Can we replace them with interpolation, treat them as a categorical value, or deal with them through binning?

Write your answer below this line:


Let's investigate these columns further. In the cell below, get the top 20 value counts contained within MarkDown3.
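
For example (a minimal sketch):

# 20 most common values in MarkDown3
df['MarkDown3'].value_counts().head(20)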

Now, let's get the descriptive statistics for each of the markdown columns. We want to see where the minimum and maximum values lie.
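
One way to do this (a sketch; it assumes the columns are named MarkDown1 through MarkDown5, as the binning loop further below suggests):

# Descriptive statistics for the five MarkDown columns
df[['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']].describe()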

Okay--let's examine what we know about these columns, and come up with a solution for dealing with these null values.

  • The data contained within each column are continuously-valued floats.
  • The range is quite large, with the smallest value being around 0 or even negative in some columns, and the max being greater than 100,000.
  • There is extremely high variance in each, with the standard deviation being larger than the mean in all 5 columns.

Dealing With Null Values Through Binning

This suggests that our best bet is to bin the columns. The hard part is figuring out the right number of bins to use. Too many, and we subject ourselves to the curse of dimensionality. Too few, and we lose information from the columns that could be important.

For now, let's start with 5 bins of equal size.

In the cell below:

  • Create a binned version of each MarkDown column and add them to our DataFrame.
  • When calling pd.cut(), pass in the appropriate column as the object to be binned, the number of bins we want (5), and set the labels parameter to bins so that each bin has a clearly labeled name.

For more information on how to bin these columns using pd.cut, see the pandas documentation for this method.

bins = ['0-20%', '21-40%', '41-60%', '61-80%', '81-100%']

for i in range(1, 6):
    df["binned_markdown_" + str(i)] = None

Great! Now, let's check the .dtypes attribute of our DataFrame to see that these new categorical columns have been created.

They exist! However, they still contain null values. We need to replace all null values with a string that will represent all missing values. This is easy enough for us--we can just use the replace() method or the fillna() method on each column and replace NaN with "NaN".

In the cell below, replace all missing values inside our binned_markdown columns with the string "NaN".

NOTE: If you're unsure of how to do this, check the pandas documentation for replace.

for i in range(1, 6):
    None
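
One possible fill-in (a sketch; because pd.cut produces categorical columns, casting to object first is a simple way to let fillna() accept the new "NaN" string):

for i in range(1, 6):
    col = "binned_markdown_" + str(i)
    # Treat missing markdowns as their own category
    df[col] = df[col].astype(object).fillna("NaN")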

Great! Now, let's check if those columns still contain null values.

In the cell below, display the number of null values contained within each column of our DataFrame.

Excellent! We've now dealt with all the null values in our dataset through Coarse Classification by binning our data and treating null values as a distinct category. All that's left to do is to drop our original MarkDown columns from the DataFrame.

Note that in this step, we'll also drop the Date column, because we are going to build a generalized model and will not be making use of any time series data.

In the cell below:

  • Create a list called to_drop that contains the name of every MarkDown column we need to drop (for a challenge, try doing this with a list comprehension!)
  • Append "Date" to to_drop
  • Drop these columns (in place) from our DataFrame
  • Display the number of null values in each column again to confirm that these columns have been dropped, and that our DataFrame now contains no missing values
to_drop = None
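
One possible fill-in (a sketch; it assumes the original columns are named MarkDown1 through MarkDown5):

# Original MarkDown columns plus Date, built with a list comprehension
to_drop = ["MarkDown" + str(i) for i in range(1, 6)]
to_drop.append("Date")

df.drop(to_drop, axis=1, inplace=True)

# Confirm the columns are gone and no missing values remain
df.isna().sum()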

Checking for Multicollinearity

Before we one-hot encode our categorical columns, we'll want to quickly check the dataset for multicollinearity, since this can really mess up our model if it exists in the dataset. We want to make sure that the columns within the dataset are not highly correlated.

We'll do this by creating a correlation heatmap. We want to do this before one-hot-encoding our data because each column becomes both a row and a column in the visualization, and after one-hot encoding, this dataset will contain over 300 columns!

The seaborn documentation provides some great code samples to help us figure out how to display a Correlation Heatmap.

Check out this documentation, and then modify the code included so that it displays a Correlation Heatmap for our dataset below.

# Set the style of the visualization
sns.set(style="white")

# Create a correlation matrix
corr = None

# Generate a mask the size of our correlation matrix
mask = None
mask[np.triu_indices_from(mask)] = None

# Set up the matplotlib figure
f, ax = None

# Generate a custom diverging colormap
cmap = None

# Draw the heatmap with the mask and correct aspect ratio
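
A possible completion of the cell above, closely following the seaborn correlation-heatmap example (a sketch; the figure size and colormap values are just reasonable defaults):

# Set the style of the visualization
sns.set(style="white")

# Create a correlation matrix from the numeric columns
# (on newer pandas versions you may need df.corr(numeric_only=True))
corr = df.corr()

# Generate a mask for the upper triangle of the matrix
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})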

Interpret the Correlation Heatmap we created above to answer the following questions:

Which columns are highly correlated with the target column our model will predict? Are any of our predictor columns highly correlated enough that we should consider dropping them? Explain your answer.

Write your answer below this line:


Normalizing our Data

Now, we'll need to convert all of our numeric columns to the same scale by normalizing our dataset. Recall that we normalize our dataset by converting each numeric value to its corresponding z-score for the column, which is obtained by subtracting the column's mean from the value and then dividing by the column's standard deviation.

Since we only have five columns containing numeric data that needs to be normalized, we'll do this by hand in the cell below. This lets us avoid the errors that would come from trying to normalize a dataset whose categorical columns still contain strings. Plus, it's good practice to help us remember how normalization works!

In the cell below:

  • Normalize the following columns individually: Size, Temperature, Fuel_Price, CPI, and Unemployment by subtracting the column mean and dividing by the column standard deviation.
df.Size = None
df.Temperature = None
df.Fuel_Price = None
df.CPI = None
df.Unemployment = None
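
One possible fill-in (a sketch; the same z-score pattern applies to each of the five columns):

for col in ['Size', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment']:
    # Subtract the column mean and divide by the column standard deviation
    df[col] = (df[col] - df[col].mean()) / df[col].std()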

One-Hot Encoding Categorical Columns

For our final step, we'll need to deal with our categorical columns. Categorical data won't work for our modeling step--we'll need to convert these to numeric columns through one-hot encoding.

In the cell below, use the appropriate function in pandas to one-hot encode the dataset.

df = None
df.head()
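
One possible fill-in (a sketch):

# pd.get_dummies one-hot encodes every remaining string/categorical column
df = pd.get_dummies(df)
df.head()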

That's it! We've now successfully scrubbed our dataset--we're now ready for data exploration and modeling.

Conclusion

In this lesson, we gained practice with data cleaning by:

  • Casting columns to the appropriate data types
  • Identifying and dealing with null values appropriately
  • Removing unnecessary columns
  • Checking for and dealing with multicollinearity
  • Normalizing our data
