
Fraud Detection

This project analyzes a Kaggle dataset of transaction instances with anonymized feature data, each labeled as fraud or not fraud. The intent of the project was to use machine learning to build a model that predicts fraud as accurately as possible.

An important aspect of fraud detection is understanding which metric you want to optimize in order to have a well-performing model. In many fraud-detection settings, it is critical that actual occurrences of fraud are not missed. This means one wants a model that is more sensitive to classifying a transaction as fraud; the tradeoff is that this also increases the chance of flagging a transaction as fraud when it is actually legitimate.
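This sensitivity tradeoff is exactly the recall-versus-precision distinction. A tiny sketch with made-up confusion-matrix counts (the numbers are purely illustrative, not from this project):

```python
# Recall vs. precision from raw confusion-matrix counts.
# All numbers here are invented purely for illustration.
tp, fn = 90, 10      # fraud caught vs. fraud missed
fp, tn = 500, 9400   # normal flagged as fraud vs. correctly passed

recall = tp / (tp + fn)        # fraction of actual fraud we catch
precision = tp / (tp + fp)     # fraction of fraud flags that are real

print(f"recall:    {recall:.2f}")     # high: few frauds slip through
print(f"precision: {precision:.2f}")  # low: many false alarms
```

A model tuned to miss less fraud pushes recall up and, as the text notes, typically pushes precision down.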

EDA

Feature Engineering

Because the features were anonymized, it was difficult to decide intuitively which features would be more or less important. The approach taken here is to plot the distributions of the fraud and not-fraud instances on the same graph for each feature. The idea is that if the two distributions are essentially the same, we can deem that feature unimportant for the machine learning model we will implement. Shown below is an instance of a feature that would be deemed unimportant for classification purposes:

[image: a feature whose fraud and not-fraud distributions overlap almost entirely]
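The overlap check described above can be sketched as follows. This assumes a pandas DataFrame shaped like the Kaggle credit-card set, with anonymized columns (V1, V2, ...) and a binary "Class" column where 1 means fraud; the function name is illustrative:

```python
# Overlay the class-conditional distributions of one feature so we can
# judge by eye whether fraud and not-fraud behave differently on it.
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

def plot_feature_overlap(df: pd.DataFrame, feature: str) -> plt.Figure:
    fig, ax = plt.subplots()
    for label, name in [(0, "not fraud"), (1, "fraud")]:
        ax.hist(df.loc[df["Class"] == label, feature],
                bins=50, density=True, alpha=0.5, label=name)
    ax.set_title(f"{feature}: class-conditional distributions")
    ax.legend()
    return fig
```

Density-normalized histograms matter here: with so few fraud rows, raw counts would make the fraud distribution invisible next to the normal one.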

In the image below, we can see significant differences between the fraud and not-fraud distributions. This implies differences in behavior that are important for classification. Note that if we restrict to a certain range of this feature (for example, values of V12 less than -5), we can separate the classes much more effectively. We can then engineer features that more easily distinguish the fraud and not-fraud groups in order to aid classification.

[image: feature V12, whose fraud and not-fraud distributions differ noticeably]
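The V12 thresholding idea can be turned into an engineered feature with a one-liner. The cutoff of -5 comes from the text above; the column name `V12_low` is an illustrative choice:

```python
# Add a binary indicator for the "suspicious" range of V12 observed in EDA.
import pandas as pd

def add_v12_indicator(df: pd.DataFrame, cutoff: float = -5.0) -> pd.DataFrame:
    out = df.copy()  # leave the original frame untouched
    out["V12_low"] = (out["V12"] < cutoff).astype(int)  # 1 = suspicious range
    return out
```

The same pattern generalizes to any feature whose class-conditional distributions separate over some interval.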

Machine Learning Model

In this instance, a logistic regression model was used to predict whether an instance is fraud or not fraud. The important thing to note about this dataset, and most likely about most other fraud datasets, is the scarcity of fraud instances among the hundreds of thousands of normal transactions. The problem with training a model on the entire dataset is that the algorithm quickly discovers that classifying every instance as not fraud still yields a considerably high accuracy score. Because of this, undersampling was used here to avoid that failure mode.
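A minimal random-undersampling sketch, assuming the same DataFrame layout as above (binary "Class" column, 1 = fraud): keep every fraud row and draw an equal-sized random sample of normal rows, so the classifier cannot win by always predicting "not fraud".

```python
# Balance the classes by randomly discarding most of the normal rows.
import pandas as pd

def undersample(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    fraud = df[df["Class"] == 1]
    normal = df[df["Class"] == 0].sample(n=len(fraud), random_state=seed)
    # shuffle so the two classes are interleaved
    return pd.concat([fraud, normal]).sample(frac=1, random_state=seed)
```

Libraries such as imbalanced-learn offer more sophisticated resamplers, but this captures the approach described in the text.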

After training a model on the undersampled data, I was able to see how it performed on that subsample.

[image: model results on the undersampled test set]

The classification model trained on the subgroup of data achieved a recall score above 95% when predicting on the test subgroup. Although an excellent score was achieved on this subsample, below we can see the performance on a test drawn from the entire dataset.
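The train-and-score loop can be sketched with scikit-learn. Synthetic data stands in for the real (undersampled) features here, so the printed recall is illustrative only, not the project's reported score:

```python
# Fit a logistic regression and score recall on a held-out split.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))                                   # toy features
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 1).astype(int)  # toy labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("recall:", recall_score(y_te, clf.predict(X_te)))
```

With the real pipeline, the model is fit on the balanced subsample and then scored both on that subsample's test split and on a test set from the full, imbalanced data.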

[image: model results on the full test dataset]

Here we see that our model, even when predicting on the entire test dataset, achieves a recall score of ~94.6%.

Important Considerations

For this specific example, we focused on maximizing the recall score because I was specifically interested in decreasing false negative classifications. That is, I wanted to classify as many fraudulent incidents as possible without worrying about false positives (normal transactions falsely predicted as fraudulent). In many circumstances, one would have to evaluate the costs and benefits in order to find the optimal threshold for positive or negative classification.
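In practice, this sensitivity knob is the probability threshold. Instead of scikit-learn's default 0.5 cutoff, a lower threshold flags more cases as fraud (fewer false negatives, more false positives). A sketch, where `clf` is any fitted classifier exposing `predict_proba` and the 0.2 default is an arbitrary illustrative value:

```python
# Classify as fraud whenever P(fraud) clears a tunable threshold.
import numpy as np

def predict_with_threshold(clf, X, threshold: float = 0.2) -> np.ndarray:
    proba = clf.predict_proba(X)[:, 1]       # P(fraud) for each row
    return (proba >= threshold).astype(int)  # lower threshold -> more flags
```

Sweeping the threshold and recomputing recall/precision at each value is the usual way to locate the cost-optimal operating point discussed next.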

For example, the average cost of not catching a fraudulent incident (a false negative classification) may be $1,000. However, there may also be a cost associated with predicting a fraudulent incident when it was actually a normal transaction; let's say this is $10. In my example, with 11,041 false positives and 8 false negatives, the false positives cost $110,410 while the false negatives cost only $8,000. So there is a tradeoff in how sensitive you want your model to be on either side of the spectrum.
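The cost arithmetic from the example above, spelled out (the per-error costs are the hypothetical figures from the text):

```python
# Expected-cost comparison for the two error types.
COST_FN = 1_000  # dollars per missed fraud (false negative)
COST_FP = 10     # dollars per normal transaction wrongly flagged

false_positives = 11_041
false_negatives = 8

fp_cost = false_positives * COST_FP
fn_cost = false_negatives * COST_FN
print(f"false-positive cost: ${fp_cost:,}")  # $110,410
print(f"false-negative cost: ${fn_cost:,}")  # $8,000
```

Minimizing `fp_cost + fn_cost` over candidate thresholds is one concrete way to pick the operating point.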

Contributors

terran6

