
effect-of-data-having-features-with-different-variances

This repository shows how linear models such as Support Vector Machines (SVM) and Logistic Regression (LR) behave on data whose features have different variances, and how this issue can be tackled.

What is Variance

Variance is a measure of how data points differ from the mean, i.e. how far a set of data points spreads out from its mean (average) value:

σ² = (1/N) Σᵢ (xᵢ − x̄)²

where σ² is the variance, the summation runs over all data points, xᵢ is a data point, x̄ is the mean of the data points, and N is the number of data points.
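As a quick check, the formula above can be computed directly with NumPy (the sample values are made up for illustration):

```python
import numpy as np

# Hypothetical sample of N = 8 data points
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Variance: mean of squared deviations from the mean
variance = np.mean((x - x.mean()) ** 2)
print(variance)  # 4.0

# np.var uses the same formula by default (ddof=0)
assert variance == np.var(x)
```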

What if data has features with different Variances

Consider three features f1, f2, f3 with different variances, where f3 is highly correlated with the output label y. Ideally, f3 should get the highest feature importance of the three, but that does not happen: because of the different variances, the linear models cannot assign interpretable feature importances.
After fitting the linear models Logistic Regression and Support Vector Machines, the feature importances below are obtained.

For Logistic Regression
Feature f1 score is : 11070.275564782807
Feature f2 score is : -13901.807078828466
Feature f3 score is : 9221.654056354826

For SVM
Feature f1 score is : 3370.3580271414594
Feature f2 score is : -12120.653905895764
Feature f3 score is : 10724.13991642756

Observation

  1. In both Logistic Regression and SVM, feature f2 has the weight with the largest absolute value, and its sign is negative. So, due to the different variances of the features, we get uninterpretable feature importances.
  2. Feature f3 should get the highest feature importance, since it is the feature most correlated with the target y (correlation 0.839060), compared to the other two features (f1 = 0.067172 and f2 = -0.017944).
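The repository's exact data lives in the notebook; as a standalone illustration of the effect, the sketch below builds two equally informative features at very different scales (all names and values are made up) and fits sklearn's LogisticRegression and LinearSVC. The raw coefficient of the large-scale feature comes out far smaller, so raw coefficient magnitudes are misleading as importances when variances differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

f1 = signal + rng.normal(size=n)              # informative, unit scale
f2 = 1000.0 * (signal + rng.normal(size=n))   # equally informative, variance ~1e6
X = np.column_stack([f1, f2])

for model in (LogisticRegression(max_iter=5000), LinearSVC(max_iter=100000)):
    w1, w2 = model.fit(X, y).coef_.ravel()
    # |w2| is far smaller than |w1| despite f2 carrying the same information,
    # because a large-variance feature needs only a tiny weight.
    print(type(model).__name__, w1, w2)
```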

Data Standardization to the Rescue

To tackle the problem of different feature variances, data standardization can be used.
Data standardization is a statistical technique that centers each feature's mean at zero and scales its variance to one.
After standardizing the data using sklearn's StandardScaler(), the feature importance scores below are obtained for the Logistic Regression and SVM linear models.

For Logistic Regression
Feature f1 score is : 2.4215865651559607
Feature f2 score is : 4.513194368442512
Feature f3 score is : 13.365047741400916

For SVM
Feature f1 score is : -2.7506137522289436
Feature f2 score is : 1.4934906475265775
Feature f3 score is : 12.279421550485994

Observation

  1. After applying Logistic Regression and SVM to standardized data (standardization performs mean centering and variance scaling, (x - μ)/σ, so that each feature has mean 0 and std-dev 1), feature f3 gets the highest feature importance value (largest weight).
  2. This matches the expectation: feature f3 is the feature most correlated with the target y (correlation 0.839060) compared to the other two features (f1 = 0.067172 and f2 = -0.017944), so it should get the highest feature importance.
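A minimal sketch of the standardization step with sklearn's StandardScaler (the features here are random data, made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three made-up features with very different means and variances
X = rng.normal(loc=[5.0, -2.0, 100.0], scale=[1.0, 50.0, 0.1], size=(500, 3))

X_std = StandardScaler().fit_transform(X)  # column-wise (x - mean) / std

print(X_std.mean(axis=0))  # ~[0, 0, 0]
print(X_std.std(axis=0))   # [1, 1, 1]
```

After this transform every feature contributes on the same scale, so the fitted weights become comparable across features.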
