Credit Card Fraud Detection (Kaggle Project)

Context

Credit Card Fraud Detection is a binary classification ML project hosted as a Kaggle competition. The context behind the project is that credit card companies must be able to recognize fraudulent credit card transactions so that customers are not charged for items they did not purchase.

Content

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, Kaggle did not provide the original features or more background information about the data. Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.

Inspiration

Identify fraudulent credit card transactions.

Table of Contents:

1. Data Reading & Understanding
2. Data Preparation
3. Data Visualization
4. Feature Normalization
5. Data Selection
6. Model Selection

6.1. Random Forest Model
6.2. XGBoost

7. Summary of Models and Their Results

Conclusion

1. Data Reading & Understanding:

First, I read the "creditcard.csv" file into a pandas DataFrame, and then I started understanding the data by applying basic pandas statistical methods to it.
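A minimal sketch of this step, assuming only that "creditcard.csv" is in the working directory (the file name comes from this README; the exact calls are illustrative):

```python
import pandas as pd

# Load the Kaggle dataset into a DataFrame.
df = pd.read_csv("creditcard.csv")

print(df.shape)                                   # expected: (284807, 31)
print(df.head())                                  # Time, V1..V28, Amount, Class
print(df.describe())                              # basic per-column statistics
print(df["Class"].value_counts(normalize=True))   # shows the heavy class imbalance
```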




[Screenshot_63]


Go Back

2. Data Preparation:

The credit card data was highly imbalanced: 99.83% of the transactions in the dataset were non-fraudulent, while only 0.17% were fraudulent. Using the original dataset as-is would not be a good idea for a very simple reason: since over 99% of the transactions are non-fraudulent, an algorithm that always predicts "non-fraudulent" would achieve an accuracy higher than 99%. Nevertheless, that is the opposite of what we want. We do not want 99% accuracy achieved by never labeling a transaction as fraudulent; we want to detect fraudulent transactions and label them as such.

To create a balanced training dataset, I took all of the fraudulent transactions in the dataset and counted them. Then I randomly selected the same number of non-fraudulent transactions and concatenated the two. After shuffling this newly created dataset, I output the class distributions once more to visualize the difference.

Note: The dataset we created wasn't perfectly balanced; it contained 62.5% non-fraud transactions and 37.5% fraud transactions, but that is good enough for building a classification model.




[Screenshot_64]

[Screenshot_65]

[Screenshot_66]


Then I split the data and created train.csv and test.csv.
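A sketch of the balancing and splitting described above, assuming pandas and scikit-learn and reusing the df loaded earlier. The non-fraud sample size is chosen to reproduce the roughly 62.5%/37.5% split reported in the note; the exact ratio, random seeds, and 70/30 test fraction are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

fraud = df[df["Class"] == 1]            # all 492 fraudulent transactions
# Sampling 5/3 as many non-fraud rows yields the ~62.5/37.5 split noted above.
non_fraud = df[df["Class"] == 0].sample(n=len(fraud) * 5 // 3, random_state=42)

# Concatenate the two classes and shuffle the combined dataset.
balanced = pd.concat([fraud, non_fraud]).sample(frac=1, random_state=42)
print(balanced["Class"].value_counts(normalize=True))

# Split and save the train/test files.
train, test = train_test_split(balanced, test_size=0.3,
                               stratify=balanced["Class"], random_state=42)
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
```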

Go Back

3. Data Visualization:

After that, I visualized the data distribution across both classes with the help of different charts. I also made charts to visualize the correlation of every feature with the target variable (Class).
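A sketch of charts along these lines, assuming matplotlib and seaborn (the actual charts are in the screenshots below):

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))

# Class distribution of the balanced dataset.
sns.countplot(x="Class", data=balanced, ax=ax1)
ax1.set_title("Class distribution (0 = non-fraud, 1 = fraud)")

# Correlation of every feature with the target variable.
balanced.corr()["Class"].drop("Class").sort_values().plot(kind="bar", ax=ax2)
ax2.set_title("Correlation of each feature with Class")

plt.tight_layout()
plt.show()
```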




[Screenshot_64]

[Screenshot_65]

[Screenshot_66]

[Screenshot_67]

[Screenshot_68]


Go Back

4. Feature Normalization:

Feature normalization makes the values of each feature have zero mean (by subtracting the mean) and unit variance. In this dataset, only the feature named "normalizedAmount" has values greater than 1; all the other features already take small values centered around 0. So I applied feature normalization to "normalizedAmount".
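A sketch of that normalization with scikit-learn's StandardScaler, keeping the column name "normalizedAmount" used above:

```python
from sklearn.preprocessing import StandardScaler

# Rescale the amount column to zero mean and unit variance, in place.
scaler = StandardScaler()
balanced["normalizedAmount"] = scaler.fit_transform(
    balanced[["normalizedAmount"]]).ravel()
```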




[Screenshot_69]


Go Back

5. Data Selection:

Next, I made a subset of the features that have a high impact on the target variable (Class).
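A sketch of correlation-based selection; the 0.5 cutoff is a hypothetical threshold for illustration, since the README does not state the exact cutoff used:

```python
# Rank features by the absolute value of their correlation with Class.
target_corr = balanced.corr()["Class"].drop("Class").abs()

# Keep only the features above a (hypothetical) correlation threshold.
selected = target_corr[target_corr > 0.5].index.tolist()
print(selected)

X = balanced[selected]
y = balanced["Class"]
```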




[Screenshot_70]


Go Back

6. Model Selection:

From experience, I came to know that among classification algorithms, two of the best in terms of efficiency and accuracy are Random Forest and XGBoost, so I used both of them to build a model and then selected the better of the two.

6.1. Random Forest Model:

Now I was able to build the ML model for classification. First of all, I chose the Random Forest model. It is a widely used supervised learning algorithm, considered highly accurate and robust because of the number of decision trees participating in the process. The process of creating, training, and testing the model is given below.

First I applied Grid Search on Random Forest to find the best hyperparameters.
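A sketch of that search with scikit-learn's GridSearchCV, reusing X and y from the selection step; the parameter grid here is an assumption (the actual grid is in the screenshot):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Search a small, illustrative grid of forest hyperparameters.
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [4, 8, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```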




[Screenshot_71]


After finding the best hyperparameters, I applied cross-validation on the Random Forest to find the average accuracy score, F1-score, ROC-AUC score, and log loss.
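A sketch of scoring the tuned forest on those four metrics with cross_validate; scikit-learn reports log loss as "neg_log_loss", so it is negated back for printing:

```python
from sklearn.model_selection import cross_validate

scoring = ["accuracy", "f1", "roc_auc", "neg_log_loss"]
scores = cross_validate(grid.best_estimator_, X, y, cv=5, scoring=scoring)

print("accuracy:", scores["test_accuracy"].mean())
print("f1-score:", scores["test_f1"].mean())
print("roc-auc: ", scores["test_roc_auc"].mean())
print("log-loss:", -scores["test_neg_log_loss"].mean())
```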




[Screenshot_72]


Then I extracted features and made a subset of the best ones by computing feature importance with the Random Forest algorithm.
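A sketch of that ranking using the forest's impurity-based feature importances, reusing the fitted estimator from the grid search above:

```python
import pandas as pd

# Sort features from most to least important according to the forest.
importances = pd.Series(grid.best_estimator_.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances.head(10))   # candidate subset of the most important features
```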




[Screenshot_73]

[Screenshot_74]


But the features I extracted from the Random Forest algorithm (X11-X15) weren't giving results as good as the features extracted earlier by correlation (X1-X10), so I selected the correlation-based features for training and testing purposes.




[Screenshot_75]


Then I trained and tested the model on the training and testing data, which gave the following result.
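A sketch of that final training and evaluation run; n_estimators=100 matches the conclusion below, while the split parameters are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             log_loss, confusion_matrix)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)                   # hard labels for accuracy/F1
proba = model.predict_proba(X_test)[:, 1]      # probabilities for AUC/log loss

print("accuracy:", accuracy_score(y_test, pred))
print("f1-score:", f1_score(y_test, pred))
print("roc-auc: ", roc_auc_score(y_test, proba))
print("log-loss:", log_loss(y_test, proba))
print(confusion_matrix(y_test, pred))
```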




[Screenshot_76]


After that, I finalized the best Random Forest model, which is given below:




[Screenshot_77]


Confusion Matrix of Best RF Model:




[Screenshot_78]


Go Back

6.2. XGBoost:

XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data. It is an implementation of gradient-boosted decision trees designed for speed and performance. So after Random Forest, I created an XGBoost model to train on the data, following the same process as for Random Forest.

First I applied Grid Search on XGB to find the best hyperparameters.

After that, I tried to apply cross-validation on XGB, but it was taking too much time. So I first reduced the size of the feature subset by extracting the most relevant features with the XGBoost algorithm.
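A sketch of that reduction followed by a grid search on the smaller subset, assuming the xgboost package and reusing the train split from above; the grid and the subset size are assumptions:

```python
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

# Fit a baseline XGBoost model and rank features by importance.
xgb = XGBClassifier(eval_metric="logloss", random_state=42)
xgb.fit(X_train, y_train)
xgb_rank = pd.Series(xgb.feature_importances_,
                     index=X_train.columns).sort_values(ascending=False)

# Keep a small (hypothetical) number of top features to speed up CV.
top_features = xgb_rank.head(6).index.tolist()

# Grid-search the reduced problem; this grid is illustrative only.
grid_xgb = GridSearchCV(XGBClassifier(eval_metric="logloss", random_state=42),
                        {"max_depth": [3, 5], "n_estimators": [100, 200]},
                        cv=5, scoring="accuracy")
grid_xgb.fit(X_train[top_features], y_train)
print(grid_xgb.best_params_, grid_xgb.best_score_)
```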




[Screenshot_79]


The above process reduced the size of the feature subset; the new subset was X11-X16.


[Screenshot_80]


Then I applied Grid Search on the new subsets of features, and they were all giving an accuracy of 0.94. So I selected the subset with the minimum number of features (X13-X16) and applied cross-validation to it. The result of the cross-validation is shown below:


[Screenshot_81]


Then I selected the XGB hyperparameters with the best result.




[Screenshot_82]


Then I trained and tested the model on the training and testing data, which gave the following result.




[Screenshot_83]


I finalized 2 XGB models that were giving the best results.




[Screenshot_84]


Confusion Matrix of Best XGB Model:




[Screenshot_85]

[Screenshot_86]


Go Back

7. Summary of Models and Their Results:

In the end, I finalized 3 models that give the best results: 1 from Random Forest and the other 2 from XGBoost. They are given below.




[Screenshot_87]

[Screenshot_88]

Go Back

Conclusion

So in this binary classification project, "Credit Card Fraud Detection", the best model is the Random Forest, which gives:

Accuracy score: 0.96
F1-score: 0.94
ROC-AUC score: 0.95
Log loss: 1.45

with n_estimators = 100.

Go Back
