
sanjaykrishnagouda / anomaly-detection


An easy-to-understand classification problem on a highly skewed Kaggle dataset, solved using logistic regression and SVM. The code is inspired by a top contributor.

Home Page: https://www.kaggle.com/dalpozz/creditcardfraud

Python 100.00%
machine-learning svm-classifier svm-model svm-training logistic-regression scikit-learn scikitlearn-machine-learning kaggle kaggle-dataset anomaly-detection

anomaly-detection's Introduction

Credit Card Fraud Detection

Anonymized credit card transactions labeled as fraudulent or genuine


Using Logistic Regression and SVM (with RBF kernel) on a highly skewed data set.

Download the data set from here:
https://www.kaggle.com/dalpozz/creditcardfraud

Ideas:
______

-->Dropped the time column from the original dataset and also normalized the amount column. (Done)

-->Can further analyse each column (feature) to see if we can discard other features. 

-->Can do PCA to project the complete data in a meaningful way. (Work in progress)

-->Use neural networks (with more than one hidden layer) for classification.

-->Show a comparative analysis of different parameters and kernel types in the SVM function calls.
	This may not yield very different results.
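
The first idea (dropping the time column and normalizing the amount column) could look roughly like the sketch below. This is not the code from cc.py. The column names Time, Amount and Class are those of the Kaggle CSV; the helper name preprocess and the tiny sample DataFrame are made up for illustration, and StandardScaler is only one possible way to normalize.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(df):
    """Drop the Time column and standardize Amount (hypothetical helper)."""
    out = df.drop(columns=["Time"])
    # StandardScaler expects a 2-D input, hence the double brackets.
    scaled = StandardScaler().fit_transform(out[["Amount"]])
    out["Amount"] = scaled.ravel()
    return out


# Tiny stand-in for creditcard.csv; the real file also has columns V1..V28.
sample = pd.DataFrame({
    "Time": [0, 1, 2, 3],
    "V1": [0.1, -1.2, 0.3, 2.0],
    "Amount": [10.0, 20.0, 30.0, 40.0],
    "Class": [0, 0, 1, 0],
})
clean = preprocess(sample)
print(clean)
```

After this step the Amount column has zero mean and unit variance, so it is on a scale comparable to the other features.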


Dependencies:
_____________

python 3.5
scikit-learn
pandas
numpy (optional)
tensorflow (comment out, works fine without. Idea is to use tensorflow in future.)
matplotlib

How to run:

Call python cc.py (or python3, depending on how Python 3 is configured in your environment).
You can run logistic regression and SVM separately using cc_2.py by commenting out the appropriate lines.
Use the function printing_Kfold_cross for logistic regression and
diff_kernels() for SVM with different kernels (all kernels will run).
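
The "Best Mean ... with inverse regularization strength" lines in the results suggest a search over C scored by mean cross-validated recall. The sketch below shows one way such a search could work. It is not the body of printing_Kfold_cross: the function name best_c_by_recall, the grid of C values, the use of StratifiedKFold and the synthetic data are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def best_c_by_recall(X, y, c_values=(0.01, 0.1, 1, 10, 100), folds=5):
    """Return (best_C, best_mean_recall) over a small grid of C values."""
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    best_c, best_mean = None, -1.0
    for c in c_values:
        # scikit-learn's default L2 regularization is left in place.
        model = LogisticRegression(C=c, max_iter=1000)
        mean_recall = cross_val_score(model, X, y, cv=cv, scoring="recall").mean()
        if mean_recall > best_mean:
            best_c, best_mean = c, mean_recall
    return best_c, best_mean


# Synthetic, imbalanced stand-in for the credit card data.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
best_c, best_mean = best_c_by_recall(X, y)
print("Best Mean: %f with inverse regularization strength %f" % (best_mean, best_c))
```

Stratified folds keep the fraud ratio roughly equal in every fold, which matters when the positive class is this rare.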

Pass the complete dataset (temp[0], temp[1]) or the undersampled one (temp[2], temp[3]) as parameters to the function.
SVM runs for a really long time on the complete dataset. Adding the graphs for this run.
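
How the undersampled pair (temp[2], temp[3]) is built is not shown here. A common approach, sketched below under that caveat, keeps every fraud row and draws an equal number of genuine rows at random. The function name undersample, the 1:1 ratio and the synthetic arrays are assumptions.

```python
import numpy as np


def undersample(X, y, seed=0):
    """Keep every fraud row and an equal number of random genuine rows."""
    rng = np.random.default_rng(seed)
    fraud_idx = np.flatnonzero(y == 1)
    genuine_idx = np.flatnonzero(y == 0)
    # Sample without replacement so no genuine row is repeated.
    keep_genuine = rng.choice(genuine_idx, size=len(fraud_idx), replace=False)
    idx = np.concatenate([fraud_idx, keep_genuine])
    rng.shuffle(idx)
    return X[idx], y[idx]


# Synthetic stand-in: 1000 rows, 10 of them labeled as fraud.
X = np.random.default_rng(1).normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:10] = 1
X_under, y_under = undersample(X, y)
print(X_under.shape, y_under.mean())
```

The undersampled set is much smaller than the original, which is why SVM training on it finishes far sooner than on the complete dataset.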

Note:
Please ignore the function using_SVM; it needs proper structuring and further analysis.
Including graphs of recall on the complete data and the undersampled data, comparing the performance of SVM and logistic regression.

Questions to explore:
The use of the L1/L2 loss function needs further analysis. (Used L2 here; maybe bfgs is faster.)
Does SVM perform worse on skewed data?
Is recall a good metric to analyse the performance? It really depends on the data:
	whether false positives are much less tolerable than false negatives (you do not have cancer but it predicts that you do),
	or the other way around.
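
To make the recall question concrete, the sketch below computes precision and recall from confusion-matrix counts. The counts are made up for illustration and are not results from this project; the function name precision_recall is hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up counts for illustration only.
# Model A flags aggressively: few missed frauds, many false alarms.
p_a, r_a = precision_recall(tp=90, fp=910, fn=10)
# Model B flags cautiously: more missed frauds, few false alarms.
p_b, r_b = precision_recall(tp=60, fp=15, fn=40)
print("A: precision=%.2f recall=%.2f" % (p_a, r_a))
print("B: precision=%.2f recall=%.2f" % (p_b, r_b))
```

Model A wins on recall alone, but 91% of its alerts are false alarms. Which model is better depends on whether a missed fraud or a false alarm costs more.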



Some results :


(C:\ProgramData\Anaconda3) C:\Users\gsk69>m:

(C:\ProgramData\Anaconda3) M:\>cd "Course stuff\Machine Learning\projects\Data Sets\credit card"

(C:\ProgramData\Anaconda3) M:\Course stuff\Machine Learning\projects\Data Sets\credit card>python cc_2.py
USING Logistic Regression::
Best Mean: 0.612260 with inverse regularization strength 10.000000
USING SVM :
Best Mean: 0.684470 with inverse regularization strength 10.0000 using rbf kernel
USING SVM :
Best Mean: 0.754652 with inverse regularization strength 1.0000 using poly kernel
USING SVM :
Best Mean: 0.275240 with inverse regularization strength 1.0000 using sigmoid kernel

(C:\ProgramData\Anaconda3) M:\Course stuff\Machine Learning\projects\Data Sets\credit card>
_______________________________________________________________________________________________________________
_______________________________________________________________________________________________________________
currently running  rbf kernel
USING SVM :
Best Mean: 0.607521 with inverse regularization strength 1.2800 using rbf kernel
SVM with rbf took 6244.27 seconds for completion.
currently running  linear kernel
USING SVM :
Best Mean: 0.782996 with inverse regularization strength 0.0200 using linear kernel
SVM with linear took 1137.45 seconds for completion.
currently running  sigmoid kernel
USING SVM :
Best Mean: 0.274755 with inverse regularization strength 0.1600 using sigmoid kernel
SVM with sigmoid took 1676.50 seconds for completion.
currently running  poly kernel
USING SVM :
Best Mean: 0.711231 with inverse regularization strength 0.1600 using poly kernel
SVM with poly took 1000.95 seconds for completion.


____________________________________________________________________________________

Run details : 
Used BaggingClassifier with default values and base classifier = SVM (SVC/LinearSVC)
All kernels with max_iter set to 1000
Used LinearSVC for linear kernel
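
The run details above could be set up roughly as in the sketch below: a BaggingClassifier with default values wrapping SVC, or LinearSVC for the linear kernel, with max_iter set to 1000. This is not the project's code. The helper name bagged_svm, the fixed C value and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC


def bagged_svm(kernel, C=1.0):
    """Wrap an SVM in a BaggingClassifier with default bagging settings."""
    if kernel == "linear":
        base = LinearSVC(C=C, max_iter=1000)
    else:
        base = SVC(kernel=kernel, C=C, max_iter=1000)
    # Passed positionally because the keyword name differs across scikit-learn versions.
    return BaggingClassifier(base)


# Synthetic, imbalanced stand-in for the credit card data.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
for kernel in ("rbf", "linear", "sigmoid", "poly"):
    model = bagged_svm(kernel).fit(X_tr, y_tr)
    print(kernel, recall_score(y_te, model.predict(X_te)))
```

With max_iter capped at 1000 the solvers may stop before converging, and scikit-learn then emits a ConvergenceWarning.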


currently running  rbf kernel
USING SVM :
Best Mean: 0.604330 with inverse regularization strength 1.2800 using rbf kernel
rbf took 3615.35 seconds for completion.
currently running  linear kernel
USING SVM :
Best Mean: 0.727653 with inverse regularization strength 1.2800 using linear kernel
linear took 6100.82 seconds for completion.
currently running  sigmoid kernel
USING SVM :
Best Mean: 0.274755 with inverse regularization strength 0.3200 using sigmoid kernel
sigmoid took 1713.92 seconds for completion.
currently running  poly kernel
USING SVM :
Best Mean: 0.703166 with inverse regularization strength 0.3200 using poly kernel
poly took 1635.25 seconds for completion.

Total run time = 13065.34s = 3h 37m
