Credit Card Fraud Detection is a binary classification ML project based on a Kaggle competition. The motivation behind the project is that credit card companies must be able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.
The dataset contains transactions made with credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, Kaggle did not provide the original features or more background information about the data. Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; it can be used for example-dependent cost-sensitive learning. The feature 'Class' is the response variable and takes the value 1 in case of fraud and 0 otherwise.
Identify fraudulent credit card transactions.
1. Data Reading & Understanding
2. Data Preparation
3. Data Visualization
4. Feature Normalization
5. Data Selection
6. Model Selection
7. Summarize Models with their results
First, I read the "creditcard.csv" file into a pandas DataFrame and then started understanding the data by applying basic pandas statistical methods to it.
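A minimal sketch of this step. Since the real CSV is not bundled here, the seeded random frame below is a synthetic stand-in with the same column layout; in the project itself the first line would simply be `pd.read_csv("creditcard.csv")`:

```python
import numpy as np
import pandas as pd

# In the real project this would be:
#   df = pd.read_csv("creditcard.csv")
# Here a tiny synthetic frame with the same column layout stands in.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Time": np.arange(10, dtype=float),
    "V1": rng.normal(size=10),
    "Amount": rng.uniform(1, 100, size=10),
    "Class": [0] * 9 + [1],
})

print(df.shape)        # rows x columns
print(df.head())       # first few records
stats = df.describe()  # count, mean, std, min, quartiles, max per column
print(stats)
```

The mean of the 'Class' column in `describe()` directly reveals the fraction of fraudulent rows, which is how the imbalance shows up at a glance.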
The credit card data was highly imbalanced: 99.83% of the transactions in the dataset were not fraudulent, while only 0.17% were fraudulent. Using the original dataset as-is would not be a good idea, for a very simple reason: since over 99% of our transactions are non-fraudulent, an algorithm that always predicts that a transaction is non-fraudulent would achieve an accuracy higher than 99%. Nevertheless, that is the opposite of what we want. We do not want a 99% accuracy that is achieved by never labeling a transaction as fraudulent; we want to detect fraudulent transactions and label them as such.
To create our balanced training data set, I took all of the fraudulent transactions in our data set and counted them. Then, I randomly selected the same number of non-fraudulent transactions and concatenated the two. After shuffling this newly created data set, I decided to output the class distributions once more to visualize the difference.
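The undersampling steps above can be sketched as follows. `make_balanced` is a hypothetical helper name, and the small frame stands in for the real 284,807-row dataset:

```python
import pandas as pd

def make_balanced(df: pd.DataFrame, target: str = "Class", seed: int = 42) -> pd.DataFrame:
    # Take all fraud rows, randomly sample the same number of non-fraud
    # rows, concatenate the two, then shuffle the result.
    fraud = df[df[target] == 1]
    non_fraud = df[df[target] == 0].sample(n=len(fraud), random_state=seed)
    balanced = pd.concat([fraud, non_fraud]).sample(frac=1, random_state=seed)
    return balanced.reset_index(drop=True)

# Tiny stand-in frame: 5 fraud rows among 100 transactions.
df = pd.DataFrame({"Amount": range(100), "Class": [1] * 5 + [0] * 95})
balanced = make_balanced(df)
print(balanced["Class"].value_counts())  # class distribution after balancing
```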
Note: the dataset we created wasn't completely balanced; it contained 62.5% non-fraud transactions and 37.5% fraud transactions, but that is good enough for building a classification model.
Then I split the data and created train.csv and test.csv.
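A sketch of that split, assuming scikit-learn's `train_test_split`. The 75/25 ratio is illustrative since the project does not state the actual split; the toy frame stands in for the balanced data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the balanced dataset (10 rows per class).
df = pd.DataFrame({"V1": range(20), "Class": [0, 1] * 10})

# Stratify on the target so both files keep the same class ratio.
train, test = train_test_split(df, test_size=0.25, random_state=42,
                               stratify=df["Class"])
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
print(len(train), len(test))
```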
After that, I visualized the data distribution between the two classes with the help of different charts. I also made charts to visualize the correlation of all features with the target variable (Class).
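A sketch of both kinds of chart with matplotlib (the non-interactive `Agg` backend and the synthetic frame are stand-ins; the original charts were drawn on the real data):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: save figures to files
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "V1": rng.normal(size=200),
    "Amount": rng.uniform(1, 100, size=200),
    "Class": rng.integers(0, 2, size=200),
})

# Bar chart of the class distribution.
df["Class"].value_counts().plot(kind="bar", title="Class distribution")
plt.savefig("class_distribution.png")
plt.close()

# Correlation of every feature with the target variable.
corr = df.corr()["Class"].drop("Class").sort_values()
corr.plot(kind="barh", title="Correlation with Class")
plt.savefig("correlation.png")
plt.close()
print(corr)
```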
Feature normalization makes the values of each feature in the data have zero mean (by subtracting the mean) and unit variance. In this dataset, only one feature, 'normalizedAmount' (the transaction amount), had values greater than 1; all the other features already lie in a small range around 0. So I applied feature normalization to 'normalizedAmount'.
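This step can be sketched with scikit-learn's `StandardScaler`, which implements exactly the zero-mean, unit-variance transform described above (the toy amounts are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"Amount": [1.0, 50.0, 100.0, 250.0]})

# (x - mean) / std for each value in the column.
scaler = StandardScaler()
df["normalizedAmount"] = scaler.fit_transform(df[["Amount"]]).ravel()
print(df)
```

After the transform, the column has mean 0 and (population) standard deviation 1.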
Now I made a subset of the features that have a high impact on the target variable (Class).
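One way to sketch this correlation-based selection; the 0.1 threshold is an assumed illustrative value, not one stated in the project, and the frame is synthetic (V1 is built to correlate with Class, V2 is pure noise):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
cls = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "V1": cls + rng.normal(scale=0.5, size=n),  # correlated with Class
    "V2": rng.normal(size=n),                   # uninformative noise
    "Class": cls,
})

# Keep features whose absolute correlation with the target clears a threshold.
corr = df.corr()["Class"].drop("Class").abs()
selected = corr[corr > 0.1].index.tolist()
print(selected)
```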
From experience, I have learned that among classification algorithms, Random Forest and XGBoost are among the best in terms of efficiency and accuracy, so I used both of them to build models and then selected the better of the two.
Now I was able to build the ML model for classification. First of all, I chose the Random Forest model. It is a widely used supervised learning algorithm, considered highly accurate and robust because of the number of decision trees participating in the process. The process of creating, training, and testing the model is given below:
First I applied Grid Search on Random Forest to find the best hyperparameters.
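A sketch of that hyperparameter search with scikit-learn's `GridSearchCV`; the parameter grid and the synthetic data are illustrative, since the project does not list the exact grid used:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the balanced fraud dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Illustrative grid: tree count and tree depth.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="f1")
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```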
After finding the best hyperparameters, I applied cross-validation on the Random Forest to find the average accuracy score, F1-score, ROC-AUC score, and log loss.
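All four metrics can be collected in one pass with scikit-learn's `cross_validate` (synthetic data again stands in for the real set; scikit-learn reports log loss as its negative, `neg_log_loss`, so higher is better):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation over all four metrics at once.
scores = cross_validate(clf, X, y, cv=5,
                        scoring=["accuracy", "f1", "roc_auc", "neg_log_loss"])
for metric in ["test_accuracy", "test_f1", "test_roc_auc", "test_neg_log_loss"]:
    print(metric, scores[metric].mean())
```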
Then I extracted features and made a subset of the best features by computing feature importance from the Random Forest algorithm.
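A sketch of importance-based selection using the fitted forest's `feature_importances_` attribute; keeping the top 5 mirrors the X11-X15 subset mentioned below, though the cutoff here is illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank features by impurity-based importance, highest first, keep the top 5.
order = np.argsort(clf.feature_importances_)[::-1]
top5 = order[:5]
print(top5)
```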
However, the features extracted from the Random Forest algorithm (X11-X15) were not giving results as good as the features extracted earlier by correlation (X1-X10), so I selected the correlation-based features for training and testing.
Then I trained and tested the model on the training and testing data. It gave the following results.
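A sketch of how those test-set numbers (accuracy, F1, ROC-AUC, log loss) are computed, on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)             # hard labels for accuracy / F1
proba = clf.predict_proba(X_te)[:, 1]  # probabilities for ROC-AUC / log loss

print("accuracy", accuracy_score(y_te, pred))
print("f1", f1_score(y_te, pred))
print("roc_auc", roc_auc_score(y_te, proba))
print("log_loss", log_loss(y_te, proba))
```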
After that, I finalized the best Random Forest model, which is given below:
XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data. XGBoost is an implementation of gradient-boosted decision trees designed for speed and performance. So after Random Forest, I created the XGBoost model to train on our data, following the same process used for Random Forest.
First I applied Grid Search on XGB to find the best hyperparameters.
After that, I tried to apply cross-validation on XGBoost, but it was taking too much time. So I first reduced the feature subset: I extracted the most relevant features from the XGBoost algorithm, which reduced the subset to X11-X16.
Then I applied Grid Search on the new subset of features; the candidates all gave an accuracy of 0.94, so I selected the subset with the fewest features (X13-X16) and applied cross-validation on it. The result of the cross-validation is shown below:
Then I selected the XGBoost hyperparameters with the best result.
Then I trained and tested the model on the training and testing data. It gave the following results.
I finalized two XGBoost models that gave the best results.
In the end, I finalized three models that give the best results: one Random Forest model and two XGBoost models. They are given below.
So in this binary classification project, "Credit Card Fraud Detection", the best model is the Random Forest, which achieves:
Accuracy Score: 0.96
F1-Score: 0.94
ROC-AUC Score: 0.95
Log Loss: 1.45
with n_estimators = 100.