
American Express - Default Prediction


  • Built a classification model to predict the probability that a customer does not pay back their credit card balance (defaults) based on their monthly customer statements using the data provided by American Express.

  • The data was particularly challenging to deal with as it had 5.5 million records and 191 anonymized features. 122 features had more than 10% missing values. The target variable had severe class imbalance.

  • Engineered new features by taking different aggregations over time, which helped increase model accuracy by 12%.

  • Optimized XGBoost and LightGBM classifiers using RandomizedSearchCV to find the best model.

  • A Soft-Voting Ensemble of the best-performing XGBoost and LightGBM models was used to make the final predictions, which yielded an Accuracy of 94.48%, an F1-Score of 96.71% and an ROC-AUC Score of 96.40%.

Data

Credit default prediction is central to managing risk in a consumer lending business. It allows lenders to optimize lending decisions, which leads to a better customer experience and sound business economics.

The dataset contains profile features for each customer at each statement date. Features are anonymized and normalized, and fall into the following general categories:

D_* = Delinquency variables
S_* = Spend variables
P_* = Payment variables
B_* = Balance variables
R_* = Risk variables

with the following features being categorical:

'B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68'

The dataset can be downloaded from here.
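
As a minimal sketch of working with this schema, the statements table can be loaded with pandas and the categorical columns cast explicitly (the file name train_data.csv is an assumption; use the name of the downloaded file):

    import pandas as pd

    # File name is an assumption; substitute the downloaded statements file.
    df = pd.read_csv("train_data.csv")

    # Categorical features listed above; the remaining features are numeric.
    cat_features = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117',
                    'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']
    df[cat_features] = df[cat_features].astype("category")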

Analysis

The complete analysis can be viewed here.

Target Distribution

  • In the data, we observe that 25.9% of records have defaulted on their credit card payments, whereas 74.1% have paid their bills on time.

  • This distribution shows that there is severe class imbalance.

[Figure: Target distribution]



Distribution of Number of Defaults per Day for the First Month:

The proportion of customers that default is consistent across each day in the data, with a slight weekly seasonal trend influenced by the day on which customers receive their statements.

[Figure: Defaults per day for the first month]

Frequency of Customer Statements for the First Month:


[Figure: Statements received per day for the first month]


  • There is a weekly seasonal pattern in the number of statements received per day.
  • As seen above, this trend does not significantly affect the proportion of defaults.

Distribution of Values of Payment Variables:


[Figure: Distribution of payment variables]


  • We notice that Payment 2 is heavily negatively skewed (left skewed).
  • Even though Payment 4 has continuous values between 0 and 1, most of its density is clustered around 0 and 1.
  • This suggests that some Gaussian noise may be present. The noise can be removed by converting the variable into a binary one, as sketched below.
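
A minimal sketch of this de-noising, assuming the column is named P_4 and using an assumed 0.5 cutoff rather than a tuned threshold:

    # Snap noisy values to the nearest mode (0 or 1).
    # The 0.5 cutoff is an assumed midpoint, not a tuned threshold;
    # note that missing values map to 0 under this comparison.
    df["P_4_binary"] = (df["P_4"] > 0.5).astype(int)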

Correlation of Features with Target Variable:

  • Payment 2 is negatively correlated with the target with a correlation of -0.67.
  • Delinquency 48 is positively correlated with the target with a correlation of 0.61.

Correlation of Payment Variables with Target


[Figure: Correlation of payment variables with target]


  • We observe that Payment 2 and the target are highly negatively correlated.
  • This is probably because customers who pay their bills are less likely to default.
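
These correlations can be reproduced with a one-liner, assuming the training labels have been joined onto the feature table as a target column (the column names P_2 and D_48 follow the naming scheme above):

    # Correlation of every numeric feature with the binary target.
    corrs = df.corr(numeric_only=True)["target"].drop("target")
    print(corrs.loc[["P_2", "D_48"]])  # roughly -0.67 and +0.61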

Experiments:

  • There is a substantial number of missing values in the data. These cannot be imputed since the features are anonymized and there is no clear rationale behind imputation. This constraint forces us to choose models that can handle missing values.

  • The data is high-dimensional, i.e. it has 191 features. The presence of missing values restricts the usage of traditional dimensionality reduction techniques like PCA as well as feature selection methods like RFE.

  • Instead, we engineer new features using aggregations over the time dimension (a sketch follows after this list). Since the aggregations ignore missing values, the engineered features are dense and can be used for modelling.

  • Some prominent classification models that accept inputs with missing values are XGBoost, LightGBM, and CatBoost. Rather than imputing, they handle missing values natively; for example, XGBoost and LightGBM learn a default split direction for missing values at each tree node, choosing whichever direction delivers the greatest performance benefit.

  • A baseline was created using an XGBoost model with default hyperparameters, which yielded an Accuracy of 78.84%, an F1-Score of 54.64% and an ROC-AUC Score of 65.72%.

  • The LightGBM model with default hyperparameters was tried next, improving Accuracy by 1%, F1-Score by 12% and ROC-AUC Score by 6%.

  • A randomized search (RandomizedSearchCV) with 5 cross-validation folds was carried out to fine-tune the XGBoost and LightGBM models (see the tuning sketch after this list).

  • Hyperparameters of the XGBoost model such as n_estimators, max_depth and learning_rate were tuned to improve Accuracy by 9%, F1-Score by 18% and ROC-AUC Score by 3%.

  • Hyperparameters of the LightGBM model such as n_estimators, feature_fraction and learning_rate were tuned to improve Accuracy by 0.1%, F1-Score by 6% and ROC-AUC Score by 10%.
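
A minimal sketch of the time aggregations referenced above, grouping each customer's statements by customer_ID (the exact aggregation set is illustrative, not the project's full feature list; cat_features is the categorical list from the Data section, and S_2 is the statement date column):

    # Aggregate each customer's statement history over time.
    num_cols = [c for c in df.columns
                if c not in cat_features + ["customer_ID", "S_2", "target"]]

    num_agg = df.groupby("customer_ID")[num_cols].agg(
        ["mean", "std", "min", "max", "last"])
    num_agg.columns = ["_".join(col) for col in num_agg.columns]

    cat_agg = df.groupby("customer_ID")[cat_features].agg(["last", "nunique"])
    cat_agg.columns = ["_".join(col) for col in cat_agg.columns]

    # One dense row per customer; the aggregations skip missing values.
    features = num_agg.join(cat_agg)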
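
And a minimal sketch of the randomized search for the XGBoost model (the search ranges and n_iter are assumptions; the same pattern applies to LightGBM with its own parameter space):

    from scipy.stats import randint, uniform
    from sklearn.model_selection import RandomizedSearchCV
    from xgboost import XGBClassifier

    # Ranges are illustrative, covering the hyperparameters named above.
    param_dist = {
        "n_estimators": randint(100, 1000),
        "max_depth": randint(3, 10),
        "learning_rate": uniform(0.01, 0.3),
    }

    search = RandomizedSearchCV(
        XGBClassifier(tree_method="hist"),
        param_distributions=param_dist,
        n_iter=50,           # number of sampled configurations (assumed)
        cv=5,                # 5 cross-validation folds, as described above
        scoring="roc_auc",
        n_jobs=-1,
    )
    # X_train / y_train: the engineered per-customer features and the
    # joined labels from the previous steps (names assumed).
    search.fit(X_train, y_train)
    best_xgb = search.best_estimator_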

Results:

A Soft Voting Classifier was used to create an ensemble of both models and to generate the final predictions. It achieved an Accuracy of 94.48%, an F1-Score of 96.71% and an ROC-AUC Score of 96.40%.
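
A minimal sketch of the ensemble, assuming best_xgb and best_lgbm are the fine-tuned estimators from the tuning step and X_test holds the engineered test features:

    from sklearn.ensemble import VotingClassifier

    # Soft voting averages the predicted class probabilities of both models.
    ensemble = VotingClassifier(
        estimators=[("xgb", best_xgb), ("lgbm", best_lgbm)],
        voting="soft",
    )
    ensemble.fit(X_train, y_train)
    final_preds = ensemble.predict_proba(X_test)[:, 1]  # probability of default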

The results from all the models have been summarized below:

Model                            Accuracy (%)   F1-Score (%)   ROC-AUC (%)
XGBoost (default)                78.84          54.64          65.72
LightGBM (default)               79.84          62.92          71.86
XGBoost (fine-tuned)             88.61          80.74          74.96
LightGBM (fine-tuned)            88.72          86.42          84.22
Voting Classifier (XGB + LGBM)   94.48          96.72          96.40

Run Locally

  1. Install required libraries:
      pip install -r requirements.txt
  2. Generate features:
      python amex-feature-engg.py
  3. Fine-tune models:
      python amex-fine-tuning.py
  4. Generate predictions:
      python amex-final-prediction.py

License

MIT License

Author: @awinml

Feedback

If you have any feedback, please reach out to me on LinkedIn.

