Git Product home page Git Product logo

ml-spam-sms-classification's Introduction

Image

Which one does it catch whole* SPAM SMS?

Problem Data Methods Libs Link
NLP Text Naive Bayesian, SVM, Random Forest Classifier, Deep Learning - LSTM, Word2Vec Sklearn, Keras, Gensim, Pandas, Seaborn https://github.com/erdiolmezogullari/ml-spam-sms-classification

If you want to see the further ML projects, you may visit my main repo: https://github.com/erdiolmezogullari/ml-projects

In this project, We applied supervised learning (classification) algorithms and deep learning (LSTM).

We used a public SMS Spam dataset, which is not a purely clean dataset. The data consists of two different columns (features), such as context, and class. The column context is referring to SMS. The column class may take a value that can be either spam or ham corresponding to related SMS context.

Before applying any supervised learning methods, we applied a bunch of data cleansing operations to get rid of messy and dirty data since it has some broken and messy context.

After obtaining the cleaned dataset, we created tokens and lemmas of SMS corpus separately by using Spacy, and then, we generated bag-of-word and TF-IDF of SMS corpus, respectively. In addition to these data transformations, we also performed SVD, SVC, PCA to reduce dimension of dataset.

To manage data transformation in the training and testing phase effectively and avoid data leakage, we used Sklearn's Pipeline class. So, we added each data transformation step (e.g. bag-of-word, TF-IDF, SVC) and classifier (e.g. Naive Bayesian, SVM, Random Forest Classifier) into an instance of class Pipeline.

After applying those supervised learning methods, we also performed deep learning. The deep learning architecture we used is based on LSTM. To perform LSTM approaching in Keras (Tensorflow), we needed to create an embedding matrix of our corpus. So, we used Gensim's Word2Vec approach to obtain embedding matrix, rather than TF-IDF.

At the end of each processing by using a different classifier, we plotted confusion matrix to compare which one the best classifier for filtering SPAM SMS.

ml-spam-sms-classification's People

Contributors

30lm32 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.