hc_ds Short Report

Overview

The data covers loan applications from 2014-06-20 to 2015-05-31. Prediction target: whether the applicant is able to repay the loan (values: 0 and 1).

The number of applications is quite stable over time.

Data Exploration

Data Distribution

Categorical features are distributed quite evenly.

Most numerical features are approximately normally distributed.

However, NUMERICAL_10 and NUMERICAL_40 are heavily skewed toward 0 (after filling nulls with 0).

Missing Data

The June 2014 data is incomplete.

NUMERICAL_10 is null for a long period of time. Filling these nulls with any constant might introduce a large bias in predictions for this period.
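A gap like this can be spotted by computing the share of null values per month. The sketch below is a minimal illustration with made-up rows; the function name `null_rate_by_month` and the `(date, value)` row layout are assumptions, not part of the original analysis.

```python
from collections import defaultdict
from datetime import date


def null_rate_by_month(rows):
    """Return {"YYYY-MM": fraction of null values} for (date, value) rows."""
    counts = defaultdict(lambda: [0, 0])  # month -> [null count, total count]
    for day, value in rows:
        key = day.strftime("%Y-%m")
        counts[key][1] += 1
        if value is None:
            counts[key][0] += 1
    return {month: nulls / total for month, (nulls, total) in sorted(counts.items())}


# Toy rows (hypothetical values, not taken from the real data set).
rows = [
    (date(2014, 7, 1), 1.0),
    (date(2014, 7, 15), None),
    (date(2014, 8, 3), None),
    (date(2014, 8, 20), None),
    (date(2014, 9, 5), 3.0),
]
rates = null_rate_by_month(rows)
```

A month whose rate stays close to 1.0 is a candidate for dropping the feature instead of filling it.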

Data Exploration Conclusion

  • Remove NUMERICAL_10 from feature list.
  • Fill nulls in NUMERICAL_40 with the mean value of the last 30 days (another time range could work, but time for testing was limited).
  • June 2014 is not too important, but it can still be used for the mean and standard deviation aggregations.

Feature Engineering

Filling null

Numerical features: Fill nulls with the mean value of the last 30 days relative to the application date.

Categorical features: Fill null with "unknown".
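The two filling rules can be sketched as follows. This is only an illustration: the helper names are hypothetical, and the window definition (observed values from the 30 days strictly before the application date) is an assumption, since the report does not state the exact boundaries.

```python
from datetime import date, timedelta


def fill_with_trailing_mean(rows, window_days=30):
    """Fill None values in (date, value) rows with the mean of the observed
    values from the previous `window_days` days (assumed window definition).
    If the window holds no observed value, the entry stays None."""
    filled = []
    for current_date, value in rows:
        if value is None:
            start = current_date - timedelta(days=window_days)
            window = [v for d, v in rows if v is not None and start <= d < current_date]
            value = sum(window) / len(window) if window else None
        filled.append((current_date, value))
    return filled


def fill_categorical(values, placeholder="unknown"):
    """Replace None in a list of categorical values with a placeholder label."""
    return [placeholder if v is None else v for v in values]


# Toy rows (hypothetical values).
rows = [
    (date(2014, 7, 1), 10.0),
    (date(2014, 7, 10), 20.0),
    (date(2014, 7, 20), None),  # two observed values in the previous 30 days
    (date(2014, 9, 1), None),   # nothing observed in the previous 30 days
]
filled = fill_with_trailing_mean(rows)
categories = fill_categorical(["a", None, "b"])
```

Only originally observed values enter the mean here, so filled values never feed into later fills.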

Date Features

Extract these features from TIME column:

  • Months Columns: month_1, month_2...
  • Dates Columns: date_1, date_2...
  • DOW Columns: dow_0, dow_1...
  • Hour: hour_1, hour_2...
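A minimal sketch of this extraction for a single timestamp is shown below. The exact column ranges are assumptions: `dow_0` is taken to be Monday (Python's `weekday()` convention), `date_N` is the day of the month, and hours are numbered 0 to 23.

```python
from datetime import datetime


def date_features(ts):
    """Return 0/1 indicator columns for month, day of month, weekday and hour."""
    features = {}
    for m in range(1, 13):
        features[f"month_{m}"] = int(ts.month == m)
    for d in range(1, 32):
        features[f"date_{d}"] = int(ts.day == d)
    for w in range(7):  # assumed convention: 0 = Monday
        features[f"dow_{w}"] = int(ts.weekday() == w)
    for h in range(24):  # assumed convention: hours 0..23
        features[f"hour_{h}"] = int(ts.hour == h)
    return features


feats = date_features(datetime(2014, 6, 20, 13, 45))
active = [name for name, flag in feats.items() if flag]
```

Each timestamp activates exactly one column per group, so the encoding adds many mostly-zero columns to the feature set.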

Standard Deviation Features

Calculate the standard deviation of each numerical feature, using its mean over the last 30 days relative to the application date.
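The report does not spell out the exact formula. One possible reading is a trailing-window standard deviation, sketched below; the function name, the inclusive window, and the use of the population standard deviation are all assumptions.

```python
from datetime import date, timedelta
from statistics import pstdev


def trailing_std(rows, window_days=30):
    """For each (date, value) row, return the population standard deviation of
    the values within the trailing window that ends at that row (inclusive).
    Windows with a single value yield 0.0."""
    result = []
    for current_date, _ in rows:
        start = current_date - timedelta(days=window_days)
        window = [v for d, v in rows if start < d <= current_date]
        result.append(pstdev(window) if len(window) > 1 else 0.0)
    return result


# Toy rows (hypothetical values).
rows = [
    (date(2014, 7, 1), 2.0),
    (date(2014, 7, 2), 4.0),
    (date(2014, 7, 3), 6.0),
    (date(2014, 9, 1), 8.0),  # isolated row: nothing else in its window
]
std_last_30_days = trailing_std(rows)
```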

Machine Learning Model

Two algorithms were chosen: LogisticRegression and RandomForestClassifier.

LogisticRegression

LogisticRegression is a statistical technique that models the relationship between the input features and a categorical outcome. It then uses this relationship to predict the outcome from the features. The prediction has a finite number of outcomes, like yes or no.

This algorithm is fast in training and prediction because its complexity is low. However, its accuracy is usually not as high as that of other algorithms on complex input data (a large number of features).
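The low complexity comes from the prediction rule itself: a weighted sum of the features passed through a sigmoid. The sketch below shows that rule with hypothetical weights; it is not the trained model from this report.

```python
import math


def predict_proba(features, weights, bias):
    """Probability of class 1: sigmoid of the weighted feature sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias, threshold=0.5):
    """Class label (0 or 1) obtained by thresholding the probability."""
    return int(predict_proba(features, weights, bias) >= threshold)


# Hypothetical parameters for two features.
weights = [0.8, -1.2]
bias = 0.1
label_a = predict([2.0, 0.0], weights, bias)  # weighted sum is positive
label_b = predict([0.0, 2.0], weights, bias)  # weighted sum is negative
```

Prediction costs one multiplication per feature, which is why both speed and model size stay small.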

RandomForestClassifier

RandomForestClassifier is an ensemble of binary decision trees that classifies by taking the majority vote of the trees.

Training and prediction are quite slow, depending on the complexity of the algorithm settings. The more trees or the deeper the trees, the slower the algorithm runs. In exchange for its speed, the accuracy is normally higher than that of LogisticRegression on high complexity data.
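The voting idea can be sketched with three hand-written stand-in "trees". The thresholds and rules below are hypothetical and only borrow feature names from this report; real trees are learned from data.

```python
from collections import Counter


def majority_vote(trees, sample):
    """Return the class predicted by most trees for one sample."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical one-rule "trees"; each maps a sample to class 0 or 1.
trees = [
    lambda s: int(s["NUMERICAL_4"] > 0.5),
    lambda s: int(s["NUMERICAL_7"] > 1.0),
    lambda s: int(s["CATEGORICAL_9"] == "a"),
]

sample = {"NUMERICAL_4": 0.9, "NUMERICAL_7": 0.2, "CATEGORICAL_9": "a"}
prediction = majority_vote(trees, sample)  # votes are 1, 0, 1
```

Every tree has to be evaluated for each prediction, which explains the speed cost. Note that scikit-learn's RandomForestClassifier, according to its documentation, combines the trees' probability estimates instead of counting hard votes, so this sketch shows the concept only.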

Comparison

| | LogisticRegression | RandomForestClassifier |
|---|---|---|
| Speed | Fast | Slow |
| Size on Disk | Small | Big |
| Accuracy Rate | Normally low for high complexity data | Normally high for high complexity data |
| Data Complexity | Normally handles low complexity data well | Normally handles high complexity data well |

Model Performance

ROC_AUC Score

The ROC_AUC scores of both models are quite high in training.

The ROC_AUC scores of both models are low on the test set, so both models are heavily overfitted.
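The train versus test comparison can be reproduced in outline with scikit-learn. The sketch below uses synthetic data from `make_classification` as a stand-in, because the report's data set and model settings are not reproduced here; all parameters shown are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the loan data from this report).
X, y = make_classification(
    n_samples=1000, n_features=40, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Assumed settings; the report does not list its hyper-parameters.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # ROC_AUC needs scores for the positive class, not hard labels.
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    scores[name] = {"train": train_auc, "test": test_auc}
    print(f"{name}: train={train_auc:.3f} test={test_auc:.3f} gap={train_auc - test_auc:.3f}")
```

A large gap between the train and test scores is the overfitting signal. For time-ordered data like loan applications, splitting by date instead of randomly may give a more realistic test score.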

Tuning the hyper-parameters of the models will not reduce the overfitting considerably.

Reducing data complexity might be more beneficial.

Features vs TARGET Relationship

Since both models have similar ROC_AUC scores, the feature importance of both models can be considered when reducing overfitting in the future.

LogisticRegression depends more on numerical than categorical features. Highlights:

  • NUMERICAL_4, NUMERICAL_7, NUMERICAL_20_std_dev_last_30_days and NUMERICAL_18_std_dev_last_30_days have a strong negative effect on TARGET.
  • NUMERICAL_4_std_dev_last_30_days, NUMERICAL_7_std_dev_last_30_days, NUMERICAL_20 and NUMERICAL_18 have a strong positive effect on TARGET.
  • Original numerical features and their standard deviation features affect TARGET in opposite directions.
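Highlights like these come from reading the sign and size of the model coefficients. The sketch below ranks a coefficient mapping by sign; the names and values are made up for illustration and are not the coefficients behind this report.

```python
def split_by_sign(coefficients, top_n=2):
    """Return the names of the `top_n` most negative and most positive coefficients."""
    ranked = sorted(coefficients.items(), key=lambda kv: kv[1])
    negative = [name for name, c in ranked[:top_n] if c < 0]
    positive = [name for name, c in reversed(ranked[-top_n:]) if c > 0]
    return negative, positive


# Hypothetical coefficients (toy values only).
coefficients = {
    "x1": -1.5,
    "x1_std": 1.2,
    "x2": 0.9,
    "x2_std": -0.7,
    "x3": 0.05,
}
negative, positive = split_by_sign(coefficients)
```

Coefficient sizes are only comparable across features when the features are on a similar scale, so this reading assumes scaled inputs.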

RandomForestClassifier depends more on categorical than numerical features. Highlights:

  • CATEGORICAL_9, CATEGORICAL_7 and CATEGORICAL_1 strongly affect TARGET. Around 60% of the predictions were based on these features.
  • If original numerical features strongly affect TARGET, their standard deviation features will affect TARGET as well.
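A share like the 60% above can be obtained by summing the importance values of the top-ranked features. The sketch below uses placeholder feature names and toy importance values, not the report's results.

```python
def top_share(importances, top_n):
    """Return the `top_n` most important feature names and their combined share."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(importances.values())
    top = ranked[:top_n]
    return [name for name, _ in top], sum(v for _, v in top) / total


# Hypothetical importance values (toy numbers only).
importances = {
    "feature_a": 0.30,
    "feature_b": 0.20,
    "feature_c": 0.10,
    "feature_d": 0.25,
    "feature_e": 0.15,
}
names, share = top_share(importances, 3)
```

With a fitted scikit-learn forest, the input mapping could be built from the `feature_importances_` attribute and the training column names.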

These highly influential features might be considered when reducing data complexity. Depending on their real meaning, they can be excluded from training.

Conclusion

  • Input data is quite clean.
  • The newly added features did not show a significant difference compared to the original features.
  • Even though the training ROC_AUC scores are high, both models are heavily overfitted.
  • Reducing data complexity might be the best way to reduce overfitting.
  • The following features can be considered for feature reduction:
    • NUMERICAL_4
    • NUMERICAL_7
    • NUMERICAL_20
    • NUMERICAL_18
    • CATEGORICAL_9
    • CATEGORICAL_7
    • CATEGORICAL_1

Contributors

duongtruongtrong, duong-tekos
