

Spam-Page-Detection

Abstract

In this work we build a classification model that identifies spam pages based on their URL. The ISCXURL2016 dataset contains 79 features extracted from various benign and spam URLs. However, a large number of features may lead to overfitting, and computing unnecessary features costs extra time. For that reason, we performed a feature exploration and kept only the most "meaningful" features. To evaluate our selection, we ran cross-validation with various classifiers. Finally, we implemented a simple Flask app, where a user can enter a URL to find out whether it is spam.

Datasets

ISCXURL2016 Final Spam dataset: Contains the features extracted from various spam & benign URLs. This dataset has 80 columns in total: 79 feature columns and 1 label column.

Spam URL dataset: Contains spam URLs.

Benign URL dataset: Contains benign URLs.

Spam+Benign URL dataset: A concatenation of the Spam & Benign URL datasets.

Feature selection

To keep only the most important features, we built two sets of features to discard. The first contains the features that are highly correlated (correlation threshold = 0.85). The second contains the low-scoring features according to the k-best algorithm (k = 35). The union of these two sets contains 62 features, which we removed, leaving 17 of the original 79.
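The procedure above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the function name `features_to_drop`, the use of absolute Pearson correlation, the choice of `f_classif` as the k-best score function, and the toy data are all assumptions. Only the 0.85 threshold and k = 35 come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif


def features_to_drop(X: pd.DataFrame, y, corr_threshold: float = 0.85, k: int = 35) -> set:
    """Return the union of highly correlated and low-scoring feature names."""
    # Set 1: for every pair whose |correlation| exceeds the threshold,
    # mark the later column of the pair as redundant.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    correlated = {col for col in upper.columns if (upper[col] > corr_threshold).any()}

    # Set 2: features that are not among the k best.
    # Assumption: scores are ANOVA F-values (f_classif).
    selector = SelectKBest(score_func=f_classif, k=min(k, X.shape[1]))
    selector.fit(X, y)
    low_score = set(X.columns[~selector.get_support()])

    return correlated | low_score


# Toy data (hypothetical): "a_copy" nearly duplicates "a", "noise" is uninformative.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + 0.01 * rng.normal(size=200),
    "noise": rng.normal(size=200),
})
labels = (a > 0).astype(int)
dropped = features_to_drop(toy, labels, k=2)
print(sorted(dropped))
```

On the real dataset, the remaining columns would be `X.drop(columns=list(dropped))`.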

The selected features are the following:

- Token count in domain
- Average token length in domain
- Digit Letter Digit count in URL
- Domain length
- Argument-URL ratio
- Number of dots in URL
- Character continuity rate in URL
- Length of URL query variable
- Delimiter count in path
- Number rate in URL
- Number rate in filename
- Number rate in extension
- Number rate in afterpath
- Symbol count in URL
- Symbol count in filename
- Alphabet Entropy of domain
- Alphabet Entropy of extension

After selecting these 17 features, we modified our two main datasets as follows:

- ISCXURL2016 Final Spam dataset: Only these 17 features were kept.
- Spam+Benign URL dataset: The 17 features were extracted from the URLs. The URL column was then dropped.
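To give an idea of what extracting such features from a raw URL involves, here is a small sketch covering a handful of them. The function name `extract_basic_features`, the dictionary keys, and the exact feature definitions are assumptions made for illustration; the definitions used to build ISCXURL2016 may differ.

```python
from urllib.parse import urlparse


def extract_basic_features(url: str) -> dict:
    """Compute a few lexical URL features (assumed definitions)."""
    parsed = urlparse(url)
    # hostname is None when the URL has no scheme, so fall back to "".
    domain = parsed.hostname or ""
    # Assumption: domain tokens are the non-empty parts between dots.
    tokens = [t for t in domain.split(".") if t]
    return {
        "domain_length": len(domain),
        "domain_token_count": len(tokens),
        "avg_domain_token_len": sum(map(len, tokens)) / len(tokens) if tokens else 0.0,
        "dots_in_url": url.count("."),
        # Assumption: "number rate" is the fraction of characters that are digits.
        "number_rate_url": sum(c.isdigit() for c in url) / len(url) if url else 0.0,
    }


features = extract_basic_features("http://www.example.com/a/b1.html?id=42")
print(features)
```

Applying such a function to every row of the URL column and then dropping that column yields a purely numeric table that a classifier can consume.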

Classifier selection

We've tested the reliability of this selection by running 10-fold cross-validation with various classifiers.
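A comparison of this kind could look like the sketch below. It runs on synthetic data from `make_classification` as a stand-in for the 17-feature dataset, and the classifier variants and hyperparameters (for example `GaussianNB` and 50 trees) are assumptions, so the scores it prints are not comparable to the results reported in this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 17-feature dataset (assumption, not the real data).
X, y = make_classification(n_samples=300, n_features=17, n_informative=8, random_state=0)

classifiers = {
    "decision-tree": DecisionTreeClassifier(random_state=0),
    "naive-bayes": GaussianNB(),
    "random-forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

mean_f1 = {}
for name, clf in classifiers.items():
    # cv=10 gives stratified 10-fold splits for classifiers;
    # scoring="f1" is the F1 score of the positive class.
    scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
    mean_f1[name] = scores.mean()
    print(f"{name:15s} {mean_f1[name]:.3f}")
```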


Results of ISCXURL2016 Final Spam dataset

| Classifier        | Mean F1 score |
|-------------------|---------------|
| bagging-dtree     | 0.997         |
| decision-tree     | 0.995         |
| knn               | 0.994         |
| logisticreg       | 0.986         |
| naive-bayes       | 0.941         |
| random-forest     | 0.998         |
| linear-svc        | 0.988         |
| voting-classifier | 0.988         |

Results of Spam+Benign URL dataset

| Classifier        | Mean F1 score |
|-------------------|---------------|
| bagging-dtree     | 0.999         |
| decision-tree     | 0.998         |
| knn               | 0.948         |
| logisticreg       | 0.972         |
| naive-bayes       | 0.901         |
| random-forest     | 0.999         |
| linear-svc        | 0.948         |
| voting-classifier | 0.978         |

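We've selected Random Forest as the most suitable classifier for this problem, as it achieved the highest mean F1 score on both datasets (0.998 on ISCXURL2016 and 0.999 on the Spam+Benign URL dataset, the latter tied with bagging-dtree).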

Code

Feature Exploration, Classification & Feature Selection are implemented in Spam_Page_Detection.ipynb.

Application

Run

$ export FLASK_APP=flaskapp/__init__.py
$ python -m flask run

Implementation Details

Given a trained model, our application receives a URL and reports whether it is spam. The model is trained on the ISCXURL2016 Final Spam dataset using Random Forest.
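The general shape of such an application can be sketched as below. This is not the code in flaskapp/: the `create_app` factory, the `/predict` route, the `url` form field, the JSON response, and the stub model are all hypothetical, chosen only to show how a fitted model and a feature extractor plug into a Flask view.

```python
from flask import Flask, jsonify, request


def create_app(model, extract_features):
    """Build a Flask app around a fitted model and a feature extractor."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        url = request.form.get("url", "").strip()
        if not url:
            return jsonify(error="missing 'url' form field"), 400
        # Assumption: the model returns 1 for spam and 0 for benign.
        label = model.predict([extract_features(url)])[0]
        return jsonify(url=url, spam=bool(label))

    return app


class AlwaysSpam:
    """Stub standing in for the trained Random Forest (hypothetical)."""

    def predict(self, rows):
        return [1 for _ in rows]


# Exercise the endpoint with Flask's built-in test client.
app = create_app(AlwaysSpam(), lambda url: [len(url)])
client = app.test_client()
response = client.post("/predict", data={"url": "http://example.com"})
print(response.get_json())
```

In a real deployment the stub would be replaced by the fitted classifier, and the lambda by the function that computes the 17 selected features.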

Code

The implementation of our application is spread across several files in the main folder. Each file has a descriptive name that indicates its purpose.

Presentation

You can find a short presentation of our project in the pres/ folder.

Authors

- George Panagiotopoulos (cs2190012)
- Maria Despoina Siampou (cs220017)
