This repository shows how linear models like Support Vector Machines (SVM) and Logistic Regression (LR) behave on data whose features have very different variances, and how this issue can be tackled.
Variance is a measure of how far a set of data points is spread out from its mean (average) value. For data points x_1, …, x_N with mean x̄, the (population) variance is

σ² = (1/N) · Σᵢ (xᵢ − x̄)²

where σ² is the variance, the summation runs over all data points, xᵢ is the i-th data point, x̄ is the mean of the data points, and N is the number of data points.
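As a quick sanity check, the formula can be computed directly with NumPy (the sample numbers are illustrative):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.sum() / len(x)                  # x_bar
var = ((x - mean) ** 2).sum() / len(x)   # average squared deviation from the mean

# Matches NumPy's built-in population variance.
assert np.isclose(var, np.var(x))
print(mean, var)  # 5.0 4.0
```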
Consider three features f1, f2, f3 with different variances, where one feature, f3, is highly correlated with the output label y. Typically, f3 should get the highest feature importance among the three. But this does not happen: because the features have different variances, the linear models cannot assign interpretable feature importances.
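A toy dataset with this property can be generated as follows (a sketch only — the scales and the way y is derived are illustrative assumptions, not the repository's actual data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Three features with very different variances.
f1 = rng.normal(0, 1.0, n)     # moderate variance, unrelated to y
f2 = rng.normal(0, 100.0, n)   # huge variance, unrelated to y
f3 = rng.normal(0, 0.01, n)    # tiny variance, drives the label

# Binary label determined almost entirely by f3.
y = (f3 + rng.normal(0, 0.001, n) > 0).astype(int)

X = np.column_stack([f1, f2, f3])
```

Here f3 is by far the most correlated with y, yet its raw variance is orders of magnitude smaller than f2's.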
After fitting the linear models (Logistic Regression and Support Vector Machines) on the raw data, the feature importances below are obtained.
For Logistic Regression
Feature f1 score is : 11070.275564782807
Feature f2 score is : -13901.807078828466
Feature f3 score is : 9221.654056354826
For SVM
Feature f1 score is : 3370.3580271414594
Feature f2 score is : -12120.653905895764
Feature f3 score is : 10724.13991642756
Observation
- In both Logistic Regression and SVM, feature f2 gets the coefficient with the largest magnitude (with a negative sign), even though f2 is barely related to y. So, due to the different variances of the features, we are getting uninterpretable feature importances.
- Feature f3 should get the highest feature importance, since it is the feature most correlated with the target y (correlation 0.839060) compared to the other two features (f1 = 0.067172 and f2 = -0.017944).
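The scores quoted above are the models' learned weights. A minimal sketch of how such scores can be read off via `coef_` (the toy data and hyperparameters here are assumptions, not the repository's exact setup, so the printed numbers will differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Toy unscaled data in the spirit of the experiment:
# f3 has tiny variance but drives the label.
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([rng.normal(0, 1.0, n),
                     rng.normal(0, 100.0, n),
                     rng.normal(0, 0.01, n)])
y = (X[:, 2] + rng.normal(0, 0.001, n) > 0).astype(int)

# The "feature importance scores" are the fitted coefficients (coef_).
for model in (LogisticRegression(max_iter=10_000), LinearSVC(max_iter=10_000)):
    model.fit(X, y)
    for name, w in zip(("f1", "f2", "f3"), model.coef_[0]):
        print(f"{type(model).__name__}: feature {name} score is {w}")
```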
To tackle the problem of differing feature variances, data standardization can be used.
Data standardization is a statistical technique that centers each feature's mean at zero and scales it to unit variance.
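Concretely, each feature is transformed as (x − μ)/σ. A minimal NumPy sketch equivalent to what StandardScaler does (the sample matrix is illustrative):

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

mu = X.mean(axis=0)        # per-feature mean
sigma = X.std(axis=0)      # per-feature standard deviation
X_std = (X - mu) / sigma   # standardized features

# Each column now has mean 0 and standard deviation 1.
assert np.allclose(X_std.mean(axis=0), 0)
assert np.allclose(X_std.std(axis=0), 1)
```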
After standardizing the data with sklearn's StandardScaler, below are the feature importance scores for the Logistic Regression and SVM linear models.
For Logistic Regression
Feature f1 score is : 2.4215865651559607
Feature f2 score is : 4.513194368442512
Feature f3 score is : 13.365047741400916
For SVM
Feature f1 score is : -2.7506137522289436
Feature f2 score is : 1.4934906475265775
Feature f3 score is : 12.279421550485994
Observation
- After applying Logistic Regression and SVM to the standardized data (standardization applies (x − μ)/σ to each feature, giving mean = 0 and std-dev = 1), we can see that feature f3 gets the highest feature importance value (largest weight).
- This matches expectations: f3 is the feature most correlated with the target y (correlation 0.839060) compared to the other two features (f1 = 0.067172 and f2 = -0.017944), so it should carry the highest weight.
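Putting it all together, the standardize-then-fit workflow can be wrapped in a single sklearn Pipeline (a sketch under assumed toy data and default-ish hyperparameters, not the repository's exact experiment):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data: f3 has tiny variance but drives the label.
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([rng.normal(0, 1.0, n),
                     rng.normal(0, 100.0, n),
                     rng.normal(0, 0.01, n)])
y = (X[:, 2] + rng.normal(0, 0.001, n) > 0).astype(int)

# StandardScaler runs before the classifier on both fit and predict.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10_000))
pipe.fit(X, y)

coefs = pipe.named_steps["logisticregression"].coef_[0]
print(dict(zip(("f1", "f2", "f3"), coefs)))
```

Using a Pipeline also guarantees that the scaler's μ and σ are learned from the training data only, avoiding leakage when cross-validating.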