Dont-Overfit

Final Submission

The Final Submission of Dont Overfit!, including all the code compiled in one Jupyter Notebook, the Project Report, and the Presentation, can be found in the Final Code, Presentation and Report directory.

Presentation Video Link

The Final Video Recording of the Presentation can be found here.

Link to Problem Statement:

https://www.kaggle.com/c/dont-overfit-ii/overview

Link to Datasets:

https://www.kaggle.com/c/dont-overfit-ii/data

The dataset for the project can also be found in the Datasets directory.

Process

1: Literature Survey and Exploratory Data Analysis:

a: Literature Survey

The Literature Survey of this project can be found in the Literature Survey directory.

b: Exploratory Data Analysis

  • Basic dataset analysis has been performed on the dataset in question. The mean and standard deviation of each feature have been computed and observed. There are no missing values in the dataset; it is essentially as clean as it can be (see the first sketch after this list).

  • A simple Logistic Regression classifier was fitted and tested on the dataset and achieved 100% accuracy on the training data. However, when evaluated on the test data, the same model produced an accuracy of 66.2%. This is a clear indicator of overfitting.

  • Some analysis was performed on the features of the data to determine the most important features with respect to the target output. A correlogram was plotted and observed. The distribution of each feature was examined and found to be Gaussian and unimodal in nature.

  • We know that the target value for each data point depends on the 300 features associated with it. However, since the origin of the data is unknown, there is a possibility that the data may be sequential, i.e., the current row's target value may depend on the target value of the previous row. This possibility cannot be ignored, as it would change the way we look at the data henceforth. To verify this, the ACF and PACF plots were plotted for the target output. The results state unequivocally that the rows do not exhibit autocorrelation. Hence we conclude that the data is not sequential in nature and the target values for each data point are independent of one another (see the second sketch after this list).
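
A minimal sketch of the basic checks and the baseline fit described above, assuming the standard Kaggle layout of `train.csv` (an `id` column, a `target` column, and 300 numeric features):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Assumes the Kaggle "Don't Overfit! II" training file with columns: id, target, 0..299.
train = pd.read_csv("train.csv")
feature_cols = [c for c in train.columns if c not in ("id", "target")]

# Per-feature mean and standard deviation, plus a missing-value check.
print(train[feature_cols].describe().loc[["mean", "std"]])
print("Missing values:", train[feature_cols].isnull().sum().sum())

# Baseline: plain logistic regression on all 300 features.
X, y = train[feature_cols], train["target"]
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Train accuracy:", clf.score(X, y))  # near 1.0, vs ~0.662 on the held-out test set
```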

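And a sketch of the autocorrelation check, here using `statsmodels` (the plotting code in the repository's notebooks may differ):

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

train = pd.read_csv("train.csv")  # same assumed file as above

# If the rows were sequential, the target series would show significant autocorrelation
# at non-zero lags; ACF/PACF values inside the confidence band suggest independence.
fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(train["target"], lags=40, ax=axes[0])
plot_pacf(train["target"], lags=40, ax=axes[1])
plt.tight_layout()
plt.show()
```
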
The detailed implementation and explanation of the Exploratory Data Analysis of this project can be found in the Exploratory Data Analysis directory.

2: Experimentation:

a: Selecting a Classification Model:

Let us try out a few Classification Models without any Dimensionality Reduction and select the best one for further experimentation (a condensed comparison sketch appears after the note below).

  1. Logistic Regression Classifier (Accuracy = 0.662)
  2. Support Vector Classifier (Accuracy = 0.663)
  3. Decision Tree Classifier (Accuracy = 0.568)
  4. Gaussian Naive Bayes Classifier (Accuracy = 0.568)
  5. Gaussian Process Classifier (Accuracy = 0.526)
  6. Random Forest Classifier (Accuracy = 0.542)
  7. AdaBoost Classifier (Accuracy = 0.542)
  8. K Nearest Neighbours Classifier (Accuracy = 0.560)
  9. Artificial Neural Network Classifier (Accuracy = 0.660)

Based on the accuracy results, implementation complexity, and training time of the above Classification Models, we decided to move ahead with the Logistic Regression Classifier for our further experimentation.

NOTE: Python Implementation for the above Classification Models can be found in the Classification Models directory.
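
As a rough illustration of how such a comparison might be run (scikit-learn defaults, with cross-validation accuracy as a stand-in for the Kaggle test score; the tuned hyper-parameters from the notebooks are not reproduced here):

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

train = pd.read_csv("train.csv")
X, y = train.drop(columns=["id", "target"]), train["target"]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Classifier": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Gaussian Process": GaussianProcessClassifier(),
    "Random Forest": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "K Nearest Neighbours": KNeighborsClassifier(),
    "ANN (MLP)": MLPClassifier(max_iter=1000),
}

# 5-fold cross-validation accuracy for each candidate model.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")
```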

b: Selecting a Dimensionality Reduction Technique:

Let us try out a few Dimensionality Reduction techniques to try to improve the accuracy of the result produced by the Logistic Regression Classifier.

  1. Principal Component Analysis (Accuracy = 0.649)
  2. Singular Value Decomposition (Accuracy = 0.724)
  3. Lasso Regression (Accuracy = 0.848)

Based on the accuracy results and implementation complexity, Lasso Regression appears to be the best choice, and the results obtained by using it are satisfactory (a minimal sketch of this approach appears after the note below).

NOTE: Python Implementation for the above Dimensionality Reduction Techniques can be found in the Dimensionality Reduction directory.
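
A minimal sketch of Lasso-based feature selection feeding the Logistic Regression Classifier; the `alpha` value here is illustrative, not the tuned value from the notebooks:

```python
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")
X, y = train.drop(columns=["id", "target"]), train["target"]

# Lasso as a feature selector: features whose L1-penalised coefficient shrinks
# to zero are dropped, and logistic regression is fitted on the survivors.
pipe = make_pipeline(
    StandardScaler(),
    SelectFromModel(Lasso(alpha=0.03)),  # alpha=0.03 is an assumed, illustrative value
    LogisticRegression(max_iter=1000),
)
pipe.fit(X, y)

n_kept = pipe.named_steps["selectfrommodel"].get_support().sum()
print(f"Features kept by Lasso: {n_kept} / {X.shape[1]}")
```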

Results

The Class Labels predicted by our various Machine Learning Models can be found in the Class Labels Generated by Various Models directory.

The Screenshots of the Kaggle Test results can be found in the Kaggle Test Result Screenshots directory.
