In today's lesson, we talked about continuous distributions (mainly normal distribution), linear regression, and how multicollinearity can impact the model. In this lab, we will test your knowledge of those things using the marketing_customer_analysis.csv
file. You can continue using the same Jupyter file. The file can be found in the files_for_lab
folder.
Use the Jupyter file from the last lab (Customer Analysis Round 3)
- Check the data types of the columns. Get the numeric data into dataframe called
numerical
and categorical columns in a dataframe calledcategoricals
. (You can use np.number and np.object to select the numerical data types and categorical data types respectively) - Now we will try to check the normality of the numerical variables visually
- Use seaborn library to construct distribution plots for the numerical variables
- Use Matplotlib to construct histograms
- Do the distributions for different numerical columns look symmetrical? Compute the skewness for each, and add a comment with your findings.
- For the numerical variables, check the multicollinearity between the features. Please note that we will use the column
total_claim_amount
later as the target variable. - If you find a pair of columns that show a high correlation between them (greater than 0.9), drop the one that is less correlated with the column
total_claim_amount
. Write code for both the correlation matrix. If there is no pair of features that have a high correlation, then do not drop any features. - Plot the heatmap of the correlation matrix after the filtering.