Spam-Filter-Classifier

Task: Our team receives too much email spam. We only want legitimate emails in our inbox. All incoming spam should be filtered out.

We will take the raw emails and pre-process the text data, then train a machine learning model (a Naive Bayes classifier) that classifies each email as either spam or not-spam. Finally, we will test the model's performance.
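The pre-processing step could look roughly like the sketch below. The exact steps used in this project are not listed here, so the choices made (stripping HTML tags, lowercasing, keeping alphabetic words only) and the helper name `tokenize` are assumptions for illustration.

```python
import re
from html import unescape


def tokenize(raw_text: str) -> list[str]:
    """Drop HTML tags, lowercase the text, and split it into word tokens."""
    no_tags = re.sub(r"<[^>]+>", " ", raw_text)  # replace HTML tags with spaces
    text = unescape(no_tags).lower()              # decode entities such as &amp;
    return re.findall(r"[a-z]+", text)            # keep alphabetic words only


tokens = tokenize("<p>Get FREE Viagra now &amp; save!</p>")
print(tokens)  # ['get', 'free', 'viagra', 'now', 'save']
```

The resulting token lists are what a word-count based classifier would consume.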

Spam Email Data Source

Corpus: A large and structured set of texts.

Document: A particular item in the corpus.

Naive Bayes Algorithm

Naive Bayes compares two probabilities: the probability that an email is spam and the probability that it is not. If the probability of being spam is higher, the email is classified as spam.

Probability Graph

Basic Probability

$P(\text{Spam}) = \frac{\text{Nr. Spam Emails}}{\text{Total Nr. Emails}}$
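As a small worked example, the formula can be evaluated with the message counts listed in the Dataset section of this README. The variable names are illustrative only.

```python
# Message counts taken from the Dataset section of this README.
spam_count = 500 + 1397         # spam + spam_2
ham_count = 2500 + 250 + 1400   # easy_ham + hard_ham + easy_ham_2

total = spam_count + ham_count
p_spam = spam_count / total

print(total)             # 6047
print(round(p_spam, 3))  # 0.314
```

This matches the "about 31%" spam ratio stated for the dataset.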

Joint Probability

$P(\text{Heads} \cap \text{Heads}) = P(\text{Heads}) \times P(\text{Heads})$

That's how we get the probability of getting heads two times in a row. The two tosses are independent events.
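A quick way to check the product rule for a fair coin is to compare it with a direct enumeration of all two-toss outcomes. This is only an illustrative sketch and assumes a fair coin.

```python
from fractions import Fraction
from itertools import product

# Product rule for two independent tosses of a fair coin.
p_heads = Fraction(1, 2)
p_two_heads = p_heads * p_heads

# Cross-check by listing every outcome of two tosses: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))
p_enumerated = Fraction(outcomes.count(("H", "H")), len(outcomes))

print(p_two_heads, p_enumerated)  # 1/4 1/4
```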

Conditional Probability

If an e-mail contains the word "Viagra", what is the probability that this e-mail is spam? Here the two events are dependent.

$P(\text{Spam} | \text{Viagra}) = \frac{P(\text{Spam} \cap \text{Viagra})}{P(\text{Viagra})}$
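The conditional probability can be computed by counting. The ten labelled emails below are hypothetical toy data, not taken from the project's corpus.

```python
from fractions import Fraction

# Hypothetical toy data: (is_spam, contains_viagra) for ten emails.
emails = [
    (True, True), (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
    (False, False), (False, False),
]

total = len(emails)
p_viagra = Fraction(sum(1 for _, has_word in emails if has_word), total)
p_spam_and_viagra = Fraction(
    sum(1 for is_spam, has_word in emails if is_spam and has_word), total
)

# P(Spam | Viagra) = P(Spam and Viagra) / P(Viagra)
p_spam_given_viagra = p_spam_and_viagra / p_viagra
print(p_spam_given_viagra)  # 3/4
```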

Bayes Theorem

Bayes' theorem makes this calculation easier.

$P(\text{Spam} | \text{Viagra}) = \frac{P(\text{Viagra} | \text{Spam}) \cdot P(\text{Spam})}{P(\text{Viagra})}$

$P(\text{Viagra} | \text{Spam}) = \frac{\text{Nr. Viagra in Spam Emails}}{\text{Nr. Total Words in Spam Emails}}$
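The two formulas above can be combined in a short sketch. All counts below are hypothetical, and $P(\text{Viagra})$ is obtained here through the law of total probability, which is an assumption of this example rather than something the project states.

```python
from fractions import Fraction

# Hypothetical counts, for illustration only.
p_spam = Fraction(1, 5)                   # share of emails that are spam
p_ham = 1 - p_spam
p_word_given_spam = Fraction(20, 1000)    # "viagra" occurrences / words in spam
p_word_given_ham = Fraction(1, 4000)      # "viagra" occurrences / words in ham

# Law of total probability gives the denominator P(Viagra).
p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham

# Bayes' theorem: P(Spam | Viagra) = P(Viagra | Spam) * P(Spam) / P(Viagra)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)  # 20/21
```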

Naive Bayes

We need the conditional probability of each word in the email. When an email contains both "Viagra" and "Free", we use the joint probability, multiplying the per-word probabilities, because the two events are treated as independent. The algorithm is called naive because it assumes independence between the words in the email.
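Written out for the two-word case, and assuming the words are independent given the class, the quantities being compared are:

$P(\text{Spam} | \text{Viagra}, \text{Free}) \propto P(\text{Viagra} | \text{Spam}) \cdot P(\text{Free} | \text{Spam}) \cdot P(\text{Spam})$

$P(\text{Ham} | \text{Viagra}, \text{Free}) \propto P(\text{Viagra} | \text{Ham}) \cdot P(\text{Free} | \text{Ham}) \cdot P(\text{Ham})$

Both expressions share the same denominator, so it can be dropped when we only need to know which of the two is larger. "Ham" is used here as shorthand for a non-spam email.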

Using each word, we can calculate the probability of the email being spam. We then compare this number with the probability that the email is a normal (non-spam) email and pick the larger one. This is called the Bag of Words approach to classifying documents. Each word becomes a feature.
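A minimal sketch of this comparison is shown below. It is not the project's implementation: the training sentences are made up, the function names are hypothetical, and two common additions are assumed that the text does not mention, namely Laplace (add-one) smoothing and summing log probabilities instead of multiplying raw ones.

```python
import math
from collections import Counter


def train(docs):
    """Count words per class. docs is a list of (tokens, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for tokens, label in docs:
        word_counts[label].update(tokens)  # bag of words: order is ignored
        doc_counts[label] += 1
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def log_score(tokens, label, word_counts, doc_counts, vocab):
    """Return log P(label) + sum of log P(word | label) over known words."""
    total_docs = sum(doc_counts.values())
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / total_docs)
    for word in tokens:
        if word not in vocab:
            continue  # skip words never seen during training
        # Laplace smoothing (an assumption here) avoids zero probabilities.
        smoothed = word_counts[label][word] + 1
        score += math.log(smoothed / (total_words + len(vocab)))
    return score


def classify(tokens, model):
    """Pick the class with the higher score."""
    spam_score = log_score(tokens, "spam", *model)
    ham_score = log_score(tokens, "ham", *model)
    return "spam" if spam_score > ham_score else "ham"


# Made-up training data, for illustration only.
training = [
    ("free viagra offer".split(), "spam"),
    ("win free money now".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lunch on monday".split(), "ham"),
]
model = train(training)

print(classify("free viagra".split(), model))     # spam
print(classify("monday meeting".split(), model))  # ham
```

Working in log space keeps the product of many small probabilities from underflowing on long emails.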

Evaluation Metrics

Metric       Score (%)
Accuracy     96.98
Recall       91.20
Precision    97.86
F1 Score     95.47
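For reference, these four metrics are usually derived from the confusion-matrix counts, with spam as the positive class. The counts in the sketch below are hypothetical and are not the ones behind the scores reported above. The function name `evaluate` is also illustrative.

```python
def evaluate(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute standard classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of correct predictions
    recall = tp / (tp + fn)                     # share of spam that was caught
    precision = tp / (tp + fp)                  # share of flagged mail that is spam
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts: 8 true positives, 2 false positives,
# 2 false negatives, 88 true negatives.
scores = evaluate(tp=8, fp=2, fn=2, tn=88)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

For a spam filter, precision matters in particular because a false positive means a legitimate email is filtered out.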

Graph with Decision Boundary

Additional Resources

Dataset

  • spam: 500 spam messages, all received from non-spam-trap sources.

  • easy_ham: 2500 non-spam messages. These are typically quite easy to differentiate from spam, since they frequently do not contain any spammish signatures (like HTML etc).

  • hard_ham: 250 non-spam messages which are closer in many respects to typical spam: use of HTML, unusual HTML markup, coloured text, "spammish-sounding" phrases etc.

  • easy_ham_2: 1400 non-spam messages. A more recent addition to the set.

  • spam_2: 1397 spam messages. Again, more recent.

Total count: 6047 messages, with about a 31% spam ratio.

Requirements

  • pyenv with Python 3.11.3

    pyenv local 3.11.3
    python -m venv .venv
    source .venv/bin/activate
    pip install --upgrade pip
    pip install -r requirements.txt
