The machine_learning-challenge's intro from areeratk

Exoplanet Exploration

Preprocessing
- Clean data
  - Drop the null columns where all values are null.
  - Drop the null rows.
- Create a train/test split
  - Label (Y): koi_disposition.
  - Features (X): all other variables in other remaining columns.
- Scale the data using the MinMaxScaler to scale features to fit X_train and tranfrom X-train and X_test. Why?
Tune-Model-Parameters

There are two classifiers in this exploration:
- Support Vector Machine Linear Classifier
  - Training Data Score: 0.847919426654467
  - Testing Data Score: 0.862946020128088
  - SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto_deprecated', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
- Hyperparameter Tuning (GridSearch CV model)
  - Best parameter: C=10, gamma=0.0001
  - Best score: 0.8690149435803599

Evaluate-Model-Performance

SVM classification report

               precision    recall  f1-score   support

     CANDIDATE       0.37      0.24      0.29       528
     CONFIRMED       0.19      0.15      0.17       568
FALSE POSITIVE       0.67      0.86      0.75      1090

      accuracy                           0.53      2186
     macro avg       0.41      0.42      0.40      2186
  weighted avg       0.47      0.53      0.49      2186

Grid Search classification report

                precision    recall  f1-score   support

     CANDIDATE       0.32      0.24      0.28       528
     CONFIRMED       0.23      0.23      0.23       568
FALSE POSITIVE       0.70      0.77      0.73      1090

      accuracy                           0.50      2186
     macro avg       0.41      0.41      0.41      2186
  weighted avg       0.48      0.50      0.49      2186

SVM model provided 3% slightly better accuracy over the Grid Search model. The accuracy rate, however is quite low within the low range of the 50th percentile. False positive are also very high, 67% for the SVM model and 70% for the Grid Search model. Again the SVM does slightly better. In addition, the best parameters for the Grid Search model are quite constrained with high C = 10 (versus 1 for the SVM model) and low gamma = .0001. On the other hand, the SVM gamma is unspecified. This may be an issue given that its function is polynomial, not linear (degree = 3).

In summary, these models need revisions of features to improve the accuracy. Meanwhile, other classifiers of machine learning may be explored as well.

areeratk / machine_learning-challenge Goto Github PK

machine_learning-challenge's Introduction

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent