202112-31-Credit-Card-Fraud-detection-via-Cluster-based-Scoring-and-Anomaly-Detection

Team Members: Vedant Kumar (vrk2109), Siddharth Nijhawan (sn2951), Sushant Tiwari (st3425)

Description

The repository contains four Jupyter notebooks, each an end-to-end pipeline that applies an iterative or clustering-based anomaly detection algorithm to the Credit Card Fraud Detection dataset.

Dataset is available here: https://www.kaggle.com/mlg-ulb/creditcardfraud

  1. data_analysis.ipynb - performs initial data analysis by computing summary statistics for each feature (mean, std, min, max, etc.). The notebook also generates a histogram for each feature and plots a correlation heatmap.

  2. kmeans.ipynb - runs K-means clustering on the dataset to generate consistency scores using the following methodology:

  • The K-means algorithm is run 10 times.
  • Each run draws a bootstrapped sample, and the features are normalised to the range [0, 1].
  • K is varied between 0 and 20; for each K, the cluster indices, cluster centroids, and number of data points per cluster are recorded.
  • Finally, a weighted score is computed for each data point, for each combination of its assigned clusters, from dot products of the C centroids.
  • Precision-recall curves, ROC curves, AUPRC, AUROC, and scatter plots are generated.
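The bootstrapped, normalised K-means loop described above could look roughly like the sketch below. It is not the notebook's code: the function names `kmeans_runs` and `standin_score` are hypothetical, K starts at 2 here because scikit-learn's `KMeans` does not accept 0 clusters, and the score is a labelled stand-in (distance to the assigned centroid, weighted up for small clusters) rather than the dot-product score the notebook computes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler


def kmeans_runs(X, n_runs=10, k_values=range(2, 21), seed=0):
    """Cluster bootstrapped, min-max scaled samples for several values of K."""
    rng = np.random.default_rng(seed)
    results = []
    for run in range(n_runs):
        # Bootstrap: draw len(X) row indices with replacement.
        idx = rng.integers(0, len(X), size=len(X))
        # Scale every feature of the sample to [0, 1].
        sample = MinMaxScaler().fit_transform(X[idx])
        for k in k_values:
            km = KMeans(n_clusters=k, n_init=10, random_state=seed + run).fit(sample)
            results.append({
                "run": run,
                "k": k,
                "sample": sample,
                "labels": km.labels_,               # cluster index per point
                "centroids": km.cluster_centers_,   # shape (k, n_features)
                "counts": np.bincount(km.labels_, minlength=k),  # points per cluster
            })
    return results


def standin_score(result):
    """ASSUMPTION: a stand-in score, not the notebook's formula.

    Distance from each point to its assigned centroid, multiplied by
    (1 - share of points in that cluster) so small clusters score higher.
    """
    labels = result["labels"]
    dist = np.linalg.norm(result["sample"] - result["centroids"][labels], axis=1)
    weight = 1.0 - result["counts"][labels] / len(labels)
    return dist * weight
```

Each entry in the returned list holds the cluster indices, centroids, and cluster sizes for one (run, K) pair, so a different scoring rule can be swapped in without rerunning the clustering.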
  3. isolation_forest.ipynb - runs the Isolation Forest algorithm on the dataset to generate anomaly scores using the following methodology:
  • The Isolation Forest algorithm is run 10 times.
  • Each run draws a bootstrapped sample and uses 100 trees.
  • scikit-learn's built-in IsolationForest class is used to build isolation trees on the dataset.
  • decision_function() and predict() return the scores and predicted labels, respectively.
  • The outlier fraction (ratio of fraudulent to non-fraudulent transactions) is passed to the IsolationForest class.
  • Precision-recall curves, ROC curves, AUPRC, AUROC, and scatter plots are generated.
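A minimal sketch of the Isolation Forest steps above, assuming the features are in a NumPy array `X` and the labels in `y` with 1 for fraud and 0 otherwise. The function name `isolation_forest_runs` is hypothetical, and passing the outlier fraction through scikit-learn's `contamination` parameter is an assumption about how the notebook does it.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, roc_auc_score


def isolation_forest_runs(X, y, n_runs=10, n_trees=100, seed=0):
    """Fit an Isolation Forest on bootstrapped samples and score each run."""
    rng = np.random.default_rng(seed)
    # Outlier fraction as worded in the README: fraudulent / non-fraudulent.
    # ASSUMPTION: passed as `contamination`, which must lie in (0, 0.5].
    outlier_fraction = float((y == 1).sum() / (y == 0).sum())
    runs = []
    for run in range(n_runs):
        # Bootstrap: draw len(X) row indices with replacement.
        idx = rng.integers(0, len(X), size=len(X))
        Xb, yb = X[idx], y[idx]
        forest = IsolationForest(
            n_estimators=n_trees,
            contamination=outlier_fraction,
            random_state=seed + run,
        ).fit(Xb)
        scores = forest.decision_function(Xb)  # lower means more anomalous
        labels = forest.predict(Xb)            # -1 = outlier, +1 = inlier
        anomaly = -scores                      # flip so higher means more anomalous
        runs.append({
            "labels": labels,
            "auprc": average_precision_score(yb, anomaly),
            "auroc": roc_auc_score(yb, anomaly),
        })
    return runs
```

The sign flip matters: `decision_function()` returns lower values for more anomalous points, while the metric functions expect higher scores for the positive (fraud) class.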
  4. local_outlier_factor.ipynb - runs the Local Outlier Factor (LOF) algorithm on the dataset to generate anomaly scores using the following methodology:
  • The Local Outlier Factor algorithm is run 10 times.
  • Computes LOF(X) = (average LRD of X’s neighbors) / LRD(X)
  • LRD(X) = Local Reachability Density (X) = 1 / (average reachability distance of X from its neighbors)
  • Scores and predictions are generated using the negative_outlier_factor_ attribute and the fit_predict() method of the LOF class.
  • “Minkowski” distance is used as the distance metric, with the number of neighbors = 20.
  • Precision-Recall curves, histogram plots of score distribution, and ROC curves are plotted.
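To make the LOF formulas above concrete, the sketch below pairs a call to scikit-learn's `LocalOutlierFactor` (20 neighbors, Minkowski metric) with a brute-force NumPy version of the same ratio. The function names `lof_scores` and `lof_manual` are hypothetical. The manual version assumes the default Minkowski order p = 2 (Euclidean distance) and no duplicate points, and is only practical for small arrays.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor


def lof_scores(X, n_neighbors=20):
    """LOF scores and labels from scikit-learn."""
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, metric="minkowski")
    labels = lof.fit_predict(X)              # -1 = outlier, +1 = inlier
    scores = -lof.negative_outlier_factor_   # about 1 for inliers, larger for outliers
    return scores, labels


def lof_manual(X, k=20):
    """Brute-force LOF(X) = mean(LRD of neighbors) / LRD(X), Euclidean distance."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                    # a point is not its own neighbor
    nbrs = np.argsort(d, axis=1)[:, :k]            # indices of the k nearest neighbors
    nbr_dist = np.take_along_axis(d, nbrs, axis=1)
    k_dist = nbr_dist[:, -1]                       # distance to the k-th neighbor
    # Reachability distance of X from neighbor o: max(k_dist(o), d(X, o)).
    reach = np.maximum(nbr_dist, k_dist[nbrs])
    lrd = 1.0 / reach.mean(axis=1)                 # local reachability density
    return lrd[nbrs].mean(axis=1) / lrd
```

On small inputs without tied distances the two functions should agree to within floating-point tolerance, which is a quick way to check the reading of the formulas.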

