hc_ds Short Report

Overview

Data includes loan applications from 2014-06-20 to 2015-05-31. Target of prediction: Ability to repay the loan back of the applications (values: 0 and 1).

Number of application is quite stable over time.

Data Exploration

Data Distribution

Categorical features are distributed quite evenly.

Most of numerical features have normal distribution.

However, NUMERICAL_10 and NUMERICAL_40 are bias on 0 values (after fill null with 0).

Missing Data

June 2014 data is not completed.

NUMERICAL_10 has null value in a long period of time. If filling null with any constaent, it might lead to big bias for this period of time in prediction.

Data Exploration Conclusion

Remove NUMERICAL_10 from feature list.
NUMERICAL_40 fill null with mean value of the last 30 days (could be another time range, but with limited time for testing).
June 2014 is not too importance, can be use for mean and standard deviation aggregation.

Feature Engineering

Filling null

Numerical features: Fill null with mean value of the last 30 days since application date.

Categorical features: Fill null with "unknown".

Date Features

Extract these features from TIME column:

Months Columns: month_1, month_2...
Dates Columns: date_1, date_2...
DOW Columns: dow_0, dow_1...
Hour: hour_1, hour_2...

Standard Deviation Features

Calculate standard deviation of numerical features using mean of the last 30 days since application date.

Machine Learning Model

Chosen 2 algorithms: LogisticRegression and RandomForestClassifier

LogisticRegression

LogisticRegression is a data analysis technique that uses mathematics to find the relationships between two data factors. It then uses this relationship to predict the value of one of those factors based on the other. The prediction usually has a finite number of outcomes, like yes or no.

This algorithm is fast in training and prediction because its complexity is low. However, accuracy rate of LogisticRegression is not normally high comparing to other algorithms for complex input data (bit amount of features).

RandomForestClassifier

RandomForestClassifier is a set of binary trees taking their majority vote for classification.

Training and prediction speed is quite slow, depending on the complexity of algorithm settings. The more trees or the more depth, the slower the algorithm runs. Exchange of its speed, the accuracy rate is noemally higher than LogisticRegression for high complexity data.

Comparison

	LogisticRegression	RandomForestClassifier
Speed	Fast	Slow
Size on Disk	Small	Big
Accuracy Rate	Normally low for high complexity data	Normally high for high complexity data
Complexity Data	Normally handle well low complexity data	Normally handle well high complexity data

Model Performance

ROC_AUC Score

ROC_AUC score of 2 models are quite high in training.

ROC_AUC score of 2 models are low in test -> 2 models are heavily overfitted.

Tunning hyper-parameters of the models will not reduce considerably overfit.

Reducing data complexity might be benificial more.

Features vs TARGET Relationship

Since both models have the similar ROC_AUC score, considering feature importance of both models are ok for reducing overfit in the future.

LogisticRegression depends more on numberical than categorical features. Highlights:

NUMERICAL_4, NUMERICAL_7, NUMERICAL_20_std_dev_last_30_days and NUMERICAL_18_std_dev_last_30_days affect TARGET highly negatively.
NUMERICAL_4_std_dev_last_30_days, NUMERICAL_7_std_dev_last_30_days, NUMERICAL_20 and NUMERICAL_18 affect TARGET highly positively.
Original numberical features and their standard deviation features affect TARGET oppositely.

RandomForestClassifier depends more on categorical than numberical features. Highlights:

CATEGORICAL_9, CATEGORICAL_7 and CATEGORICAL_1 highly affect TARGET. Around 60% of the predictions were based on these features.
If original numberical features highly affect TARGET, their standard deviation features will affect TARGET as well.

Those highly affected features might be considered in reduce data complexity, basing on their real meaning, those features can be excluded in training.

Conclusion

Input data is quite clean.
New features added did not show significant difference comparing to original features.
Even though training ROC_AUC scores are high, but both models are very overfitted.
Reducing data complexity might be the best way to reduce overfit.
The following features can be considered for feature reduction:
- NUMERICAL_4
- NUMERICAL_7
- NUMERICAL_20
- NUMERICAL_18
- CATEGORICAL_9
- CATEGORICAL_7
- CATEGORICAL_1

duongtruongtrong / hc_ds Goto Github PK

hc_ds's Introduction