Dont-Overfit

Final Submission

The Final Submission of Dont Overfit!, including all the code compiled in one Jupyter Notebook, the Project Report, and the Presentation, can be found in the Final Code, Presentation and Report directory.

Presentation Video Link

The Final Video Recording of the Presentation can be found here.

Link to Problem Statement:

https://www.kaggle.com/c/dont-overfit-ii/overview

Link to Datasets:

https://www.kaggle.com/c/dont-overfit-ii/data

The dataset for the project can also be found in the Datasets directory.

Process

1: Literature Survey and Exploratory Data Analysis:

a: Literature Survey

The Literature Survey of this project can be found in the Literature Survey directory.

b: Exploratory Data Analysis

  • Basic dataset analysis has been performed on the dataset in question. The mean and standard deviation of each feature have been computed and observed. There are no missing values in the dataset; it is essentially as clean as it can be (see the first sketch after this list).

  • A simple Logistic Regression classifier was fitted and tested on the dataset and achieved 100% accuracy on the training data. However, when evaluated on the test data, the same model produced an accuracy of 66.2%. This is a clear indicator of overfitting.

  • Some analysis was performed on the features of the data to determine the most important features with respect to the target output. A correlogram was plotted and observed. The distribution of each feature was examined and found to be Gaussian and unimodal in nature.

  • We know that the target value for each data point depends on the 300 features associated with it. However, since the origin of the data is unknown, there is a possibility that the data may be sequential, i.e., the current row's target value may depend on the target value of the previous row. This possibility cannot be ignored, as it would change the way we look at the data henceforth. To verify this, the ACF and PACF plots were plotted for the target output. The results state unequivocally that the rows do not exhibit autocorrelation. Hence we conclude that the data is not sequential in nature and the target values for each data point are independent of one another (see the second sketch after this list).
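
A minimal sketch of the basic checks and the baseline fit described above, assuming the standard Kaggle layout of `train.csv` (an `id` column, a `target` column, and 300 numeric features):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Assumes the Kaggle "Don't Overfit! II" training file with columns: id, target, 0..299.
train = pd.read_csv("train.csv")
feature_cols = [c for c in train.columns if c not in ("id", "target")]

# Per-feature mean and standard deviation, plus a missing-value check.
print(train[feature_cols].describe().loc[["mean", "std"]])
print("Missing values:", train[feature_cols].isnull().sum().sum())

# Baseline: plain logistic regression on all 300 features.
X, y = train[feature_cols], train["target"]
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Train accuracy:", clf.score(X, y))  # near 1.0, vs ~0.662 on the held-out test set
```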

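And a sketch of the autocorrelation check, here using `statsmodels` (the plotting code in the repository's notebooks may differ):

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

train = pd.read_csv("train.csv")  # same assumed file as above

# If the rows were sequential, the target series would show significant autocorrelation
# at non-zero lags; ACF/PACF values inside the confidence band suggest independence.
fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(train["target"], lags=40, ax=axes[0])
plot_pacf(train["target"], lags=40, ax=axes[1])
plt.tight_layout()
plt.show()
```
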
The detailed implementation and explanation of the Exploratory Data Analysis of this project can be found in the Exploratory Data Analysis directory.

2: Experimentation:

a: Selecting a Classification Model:

Let us try out a few Classification Models without any Dimensionality Reduction and select the best one for further experimentation (a condensed comparison sketch appears after the note below).

  1. Logistic Regression Classifier (Accuracy = 0.662)
  2. Support Vector Classifier (Accuracy = 0.663)
  3. Decision Tree Classifier (Accuracy = 0.568)
  4. Gaussian Naive Bayes Classifier (Accuracy = 0.568)
  5. Gaussian Process Classifier (Accuracy = 0.526)
  6. Random Forest Classifier (Accuracy = 0.542)
  7. AdaBoost Classifier (Accuracy = 0.542)
  8. K Nearest Neighbours Classifier (Accuracy = 0.560)
  9. Artificial Neural Network Classifier (Accuracy = 0.660)

Based on the accuracy results, implementation complexity, and training time of the above Classification Models, we decided to move ahead with the Logistic Regression Classifier for our further experimentation.

NOTE: Python Implementation for the above Classification Models can be found in the Classification Models directory.
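
As a rough illustration of how such a comparison might be run (scikit-learn defaults, with cross-validation accuracy as a stand-in for the Kaggle test score; the tuned hyper-parameters from the notebooks are not reproduced here):

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

train = pd.read_csv("train.csv")
X, y = train.drop(columns=["id", "target"]), train["target"]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Classifier": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Gaussian Process": GaussianProcessClassifier(),
    "Random Forest": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "K Nearest Neighbours": KNeighborsClassifier(),
    "ANN (MLP)": MLPClassifier(max_iter=1000),
}

# 5-fold cross-validation accuracy for each candidate model.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")
```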

b: Selecting a Dimensionality Reduction Technique:

Let us try out a few Dimensionality Reduction techniques to try to improve the accuracy of the result produced by the Logistic Regression Classifier.

  1. Principal Component Analysis (Accuracy = 0.649)
  2. Singular Value Decomposition (Accuracy = 0.724)
  3. Lasso Regression (Accuracy = 0.848)

Based on the accuracy results and implementation complexity, Lasso Regression appears to be the best choice, and the results obtained by using it are satisfactory (a minimal sketch of this approach appears after the note below).

NOTE: Python Implementation for the above Dimensionality Reduction Techniques can be found in the Dimensionality Reduction directory.
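
A minimal sketch of Lasso-based feature selection feeding the Logistic Regression Classifier; the `alpha` value here is illustrative, not the tuned value from the notebooks:

```python
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")
X, y = train.drop(columns=["id", "target"]), train["target"]

# Lasso as a feature selector: features whose L1-penalised coefficient shrinks
# to zero are dropped, and logistic regression is fitted on the survivors.
pipe = make_pipeline(
    StandardScaler(),
    SelectFromModel(Lasso(alpha=0.03)),  # alpha=0.03 is an assumed, illustrative value
    LogisticRegression(max_iter=1000),
)
pipe.fit(X, y)

n_kept = pipe.named_steps["selectfrommodel"].get_support().sum()
print(f"Features kept by Lasso: {n_kept} / {X.shape[1]}")
```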

Results

The Class Labels predicted by our various Machine Learning Models can be found in the Class Labels Generated by Various Models directory.

The Screenshots of the Kaggle Test results can be found in the Kaggle Test Result Screenshots directory.
