Phase 3 Review - Predictive Classification Workflow

Students Will Be Able To

  • Understand the overall process to solve a predictive classification problem
  • Understand and implement multiple classification algorithms
  • Implement cross-validation techniques
  • Handle class imbalance using SMOTE
  • Perform GridSearch to determine optimal hyperparameter combinations
  • Create Pipelines to streamline the modeling process

Business and Data Understanding

This dataset was downloaded from Kaggle and contains information on adult incomes. We are trying to predict whether an individual's yearly salary is greater than or equal to $50,000 (binary classification). The salary column will be either 0 (less than $50,000) or 1 (greater than or equal to $50,000). The metric we will be using is accuracy.

Tasks

Data Preparation

Train-Test Split

We will be using cross-validation for the duration of this notebook. Please perform two train-test splits: first split the entire dataframe into train and test sets, then split the train data into training and validation sets. Use random_state=2021 and test_size=.15 in both splits for reproducibility. We will be using the train and validation sets for the majority of this notebook. The test set should be left alone until the very end.
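
A minimal sketch of the two splits, assuming the Kaggle file has been loaded into a dataframe named df with the target column salary (the file path and variable names below are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Kaggle adult income data (file path is an assumption)
df = pd.read_csv('data/adult_income.csv')

# Separate features and target (the target column is named salary per the README)
X = df.drop(columns=['salary'])
y = df['salary']

# First split: hold out the final test set
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, random_state=2021, test_size=.15)

# Second split: carve a validation set out of the remaining training data
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, random_state=2021, test_size=.15)
```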

Preprocessing

Please perform the standard data preprocessing steps on the training and validation data (one possible approach is sketched after the list):

  • Check for missing data and impute if necessary (or drop)
  • Scale numerical data
  • OneHotEncode categorical data
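
A sketch of these steps, assuming X_train and X_val from the splits above; the imputation strategies and the dtype-based column selection are assumptions, not requirements:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Split columns by dtype (assumes categorical features are stored as object/string)
num_cols = X_train.select_dtypes(include='number').columns
cat_cols = X_train.select_dtypes(exclude='number').columns

# Impute then scale numeric columns; fit on the training data only
num_imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()
X_train_num = scaler.fit_transform(num_imputer.fit_transform(X_train[num_cols]))
X_val_num = scaler.transform(num_imputer.transform(X_val[num_cols]))

# Impute then one-hot encode categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)  # use sparse=False on older scikit-learn
X_train_cat = ohe.fit_transform(cat_imputer.fit_transform(X_train[cat_cols]))
X_val_cat = ohe.transform(cat_imputer.transform(X_val[cat_cols]))

# Recombine into single feature arrays
X_train_processed = np.hstack([X_train_num, X_train_cat])
X_val_processed = np.hstack([X_val_num, X_val_cat])
```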

Modeling

Baseline Logistic Regression

Create a LogisticRegression model and fit it on the preprocessed training data. Check the performance of the model on the training and validation data.

Please plot a confusion matrix of the model's predictions and compare it to the previous performance metric. What might be causing the accuracy score to be misleading? (HINT: Check the value counts of your target variable)
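
A sketch of the baseline model and the confusion matrix, assuming the preprocessed arrays from the previous step (variable names are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay

# Fit the baseline model on the preprocessed training data
baseline_lr = LogisticRegression(max_iter=1000)
baseline_lr.fit(X_train_processed, y_train)

print('Train accuracy:     ', baseline_lr.score(X_train_processed, y_train))
print('Validation accuracy:', baseline_lr.score(X_val_processed, y_val))

# Confusion matrix on the validation set
ConfusionMatrixDisplay.from_estimator(baseline_lr, X_val_processed, y_val)
plt.show()

# Check the class balance of the target (the hint from the prompt)
print(y_train.value_counts(normalize=True))
```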

Second Logistic Regression

Please use SMOTE (documentation here) to adjust the imbalance of target classes. You can use your preprocessed training data at this step. Once you have resampled your training data, please fit another Logistic Regression model and check its performance using the training and validation data. Plot another confusion matrix and explain whether or not resampling helped improve the performance of your model.
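
A sketch using imbalanced-learn's SMOTE, assuming the preprocessed arrays from earlier; note that only the training data is resampled:

```python
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay

# Oversample the minority class in the training data only
sm = SMOTE(random_state=2021)
X_train_res, y_train_res = sm.fit_resample(X_train_processed, y_train)

second_lr = LogisticRegression(max_iter=1000)
second_lr.fit(X_train_res, y_train_res)

print('Train accuracy:     ', second_lr.score(X_train_res, y_train_res))
print('Validation accuracy:', second_lr.score(X_val_processed, y_val))

# Confusion matrix on the (untouched) validation set
ConfusionMatrixDisplay.from_estimator(second_lr, X_val_processed, y_val)
```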

Inspect the coefficients of this model and report the 5 features with the largest coefficients and the 5 features with the lowest coefficients. (documentation here)
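
One way to line coefficients up with feature names, assuming the column lists and the fitted encoder from the preprocessing step:

```python
import pandas as pd

# Feature names: numeric columns followed by the one-hot encoded columns
# (get_feature_names_out requires scikit-learn >= 1.0)
feature_names = list(num_cols) + list(ohe.get_feature_names_out(cat_cols))

coefs = pd.Series(second_lr.coef_[0], index=feature_names).sort_values()
print('5 lowest coefficients:')
print(coefs.head())
print('5 largest coefficients:')
print(coefs.tail())
```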

Third Logistic Regression

Please create a third and final LogisticRegression model and adjust at least one hyperparameter related to the regularization of the model. Fit the model on the preprocessed, resampled training data. Check the performance on the training and validation data.
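
A sketch of one regularized variant; the choice of C=0.01 with an L2 penalty is only an example:

```python
from sklearn.linear_model import LogisticRegression

# Smaller C means stronger regularization (C is the inverse regularization strength)
third_lr = LogisticRegression(C=0.01, penalty='l2', max_iter=1000)
third_lr.fit(X_train_res, y_train_res)

print('Train accuracy:     ', third_lr.score(X_train_res, y_train_res))
print('Validation accuracy:', third_lr.score(X_val_processed, y_val))
```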

Once again, inspect the coefficients of this model and report the 5 features with the largest coefficients and the 5 features with the lowest coefficients. (documentation here). How have the coefficients changed as a result of regularization?

DecisionTreeClassifier

Create a DecisionTreeClassifier using hyperparameters of your choosing. Please fit the model on the training data and check its performance on the train and validation sets.
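
A sketch with example hyperparameters; the values chosen here and the use of the resampled training data are assumptions:

```python
from sklearn.tree import DecisionTreeClassifier

# Example hyperparameters; tune to taste
dt = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, random_state=2021)
dt.fit(X_train_res, y_train_res)

print('Train accuracy:     ', dt.score(X_train_res, y_train_res))
print('Validation accuracy:', dt.score(X_val_processed, y_val))
```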

GridSearch RandomForest

For your final model, please use GridSearchCV (documentation here) to determine the optimal hyperparameter combination for a RandomForestClassifier. Assign the GridSearch's best_estimator_ to a variable and check its performance on the train and validation sets.
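
A sketch of the grid search; the parameter grid below is only an example and can be expanded or trimmed as run time allows:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Example grid; each combination is evaluated with 5-fold cross-validation
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 10, None],
    'min_samples_leaf': [1, 5],
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=2021),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
)
grid_search.fit(X_train_res, y_train_res)

best_rf = grid_search.best_estimator_
print('Best parameters:    ', grid_search.best_params_)
print('Train accuracy:     ', best_rf.score(X_train_res, y_train_res))
print('Validation accuracy:', best_rf.score(X_val_processed, y_val))
```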

Model Evaluation

Of the five models you created, which performs best? Please assign this model to the variable best_model.

Transform your X_test using the same preprocessing tools fitted on your X_train. Calculate performance metrics on this final test set. How did the model perform on the real test set?
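
A sketch of the final evaluation, assuming best_model holds the winning estimator and that the imputers, scaler, and encoder are the ones fitted on the training data earlier:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Transform X_test with the transformers fitted on X_train (no refitting)
X_test_num = scaler.transform(num_imputer.transform(X_test[num_cols]))
X_test_cat = ohe.transform(cat_imputer.transform(X_test[cat_cols]))
X_test_processed = np.hstack([X_test_num, X_test_cat])

test_preds = best_model.predict(X_test_processed)
print('Test accuracy:', accuracy_score(y_test, test_preds))
```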

Pipelines

Using your best performing model, please create a Pipeline (documentation here) to perform the entire modeling process. The Pipeline should impute missing data using SimpleImputer, scale numerical data using StandardScaler, and one-hot encode categorical data with OneHotEncoder.

For this pipeline, you will need to make use of make_column_selector (documentation here) and make_column_transformer (documentation here).

You will need to perform another train-test split. Perform just a single split (you will not need a validation set) using random_state=2021 and test_size=.25.
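
A sketch of the full pipeline and the single split, assuming X and y from the start of the notebook; the RandomForestClassifier here is a placeholder for whichever model performed best:

```python
import numpy as np
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric columns: impute then scale; categorical columns: impute then one-hot encode
numeric_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
categorical_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])

preprocessor = make_column_transformer(
    (numeric_pipe, make_column_selector(dtype_include=np.number)),
    (categorical_pipe, make_column_selector(dtype_exclude=np.number)),
)

model_pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('model', RandomForestClassifier(random_state=2021)),  # swap in your best model
])

# Single train-test split for this section
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=2021, test_size=.25)
```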

Once your pipeline has been created, pass it into cross_val_score along with your training data to calculate the 5-fold cross-validation accuracy score. How does the average accuracy score of these 5 splits compare to your best performing model's accuracy score from the previous section?
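
A sketch of the cross-validation step, assuming model_pipeline and the split above:

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the entire pipeline on the training data
cv_scores = cross_val_score(model_pipeline, X_train, y_train, cv=5, scoring='accuracy')
print('Fold scores:     ', cv_scores)
print('Mean CV accuracy:', cv_scores.mean())
```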

Use your pipeline to make predictions on your test set. How does the accuracy score for this test set compare to the score from the previous section?
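
Finally, a sketch of fitting the pipeline and scoring the held-out test set:

```python
# Fit the pipeline on the training data, then score the held-out test set
model_pipeline.fit(X_train, y_train)
print('Pipeline test accuracy:', model_pipeline.score(X_test, y_test))
```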
