The ml-spam-sms-classification from ankitchandola

Which classifier catches all* spam SMS?

| Problem | Data | Methods | Libs | Link |
| --- | --- | --- | --- | --- |
| `NLP` | Text | `Naive Bayesian`, `SVM`, `Random Forest Classifier`, `Deep Learning - LSTM`, `Word2Vec` | `Sklearn`, `Keras`, `Gensim`, `Pandas`, `Seaborn` | https://github.com/erdiolmezogullari/ml-spam-sms-classification |

For more ML projects, see my main repo: https://github.com/erdiolmezogullari/ml-projects

In this project, we applied supervised learning (classification) algorithms and deep learning (LSTM) to separate spam SMS from ham.

We used a public SMS spam dataset, which is not entirely clean. The data consists of two columns (features): context and class. The context column holds the SMS text. The class column labels the corresponding SMS as either spam or ham.

Before applying any supervised learning method, we ran several data cleansing operations, because some of the SMS text in the dataset is broken or messy.
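The exact cleansing rules are not listed here, so the following is only a minimal sketch of what such a step could look like. The functions `clean_sms` and `clean_rows` and the rules they apply (HTML entity unescaping, Unicode normalization, whitespace collapsing, dropping empty, unlabeled, or duplicate rows) are assumptions for illustration, not the project's actual code.

```python
import html
import re
import unicodedata


def clean_sms(text: str) -> str:
    """Normalize one raw SMS string (hypothetical cleaning rules)."""
    text = html.unescape(text)                    # "&amp;" -> "&"
    text = unicodedata.normalize("NFKC", text)    # fold compatibility characters
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)  # control characters -> space
    text = re.sub(r"\s+", " ", text)              # collapse runs of whitespace
    return text.strip()


def clean_rows(rows):
    """Keep rows with a known label and non-empty text; drop exact duplicates."""
    seen = set()
    cleaned = []
    for label, text in rows:
        label = (label or "").strip().lower()
        text = clean_sms(text or "")
        if label not in {"spam", "ham"} or not text:
            continue  # unknown label or empty message
        if (label, text) in seen:
            continue  # duplicate after normalization
        seen.add((label, text))
        cleaned.append((label, text))
    return cleaned


# Toy rows in (class, context) order; the values are made up.
rows = [
    ("ham", "Hi there"),
    ("HAM ", "Hi   there"),
    ("spam", ""),
    ("unknown", "x"),
    ("spam", "WIN &amp; claim\tnow"),
]
cleaned = clean_rows(rows)
```

Normalizing before deduplicating matters here: the first two toy rows only become identical once the label case and the extra spaces are fixed.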

After obtaining the cleaned dataset, we created tokens and lemmas of the SMS corpus separately using spaCy, and then generated bag-of-words and TF-IDF representations of the corpus. In addition to these transformations, we applied SVD and PCA to reduce the dimensionality of the dataset.
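As a rough illustration of the bag-of-words, TF-IDF, and SVD steps, here is a minimal sketch using scikit-learn on a made-up four-message corpus. It skips the spaCy lemmatization and relies on `CountVectorizer`'s default tokenizer instead, and the choice of `TruncatedSVD` with two components is an assumption for the example rather than the project's setting.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (invented); the real input would be the cleaned SMS text.
corpus = [
    "free entry win a prize now",
    "are we still meeting for lunch today",
    "win cash now call this number",
    "see you at lunch",
]

# Bag-of-words: one row per message, one column per vocabulary term.
bow = CountVectorizer()
counts = bow.fit_transform(corpus)

# TF-IDF: reweight the raw counts so very common terms matter less.
tfidf = TfidfTransformer().fit_transform(counts)

# Truncated SVD works directly on the sparse TF-IDF matrix and
# projects each message onto a small number of dense components.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print(counts.shape, tfidf.shape, reduced.shape)
```

`TruncatedSVD` is used in the sketch because it accepts sparse input; scikit-learn's `PCA` centers the data first and is generally applied to dense arrays.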

To manage the data transformations consistently across the training and testing phases and to avoid data leakage, we used Sklearn's Pipeline class. We added each data transformation step (e.g. bag-of-words, TF-IDF, SVD) and a classifier (e.g. Naive Bayesian, SVM, Random Forest Classifier) to an instance of Pipeline, so that every transformation is fitted on the training data only and then reused unchanged on the test data.
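The Pipeline idea can be sketched as follows. This is a minimal example on invented messages, assuming a `CountVectorizer`, `TfidfTransformer`, and `MultinomialNB` chain; the step names (`"bow"`, `"tfidf"`, `"clf"`) and the split parameters are choices made for the illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (invented) standing in for the context and class columns.
texts = [
    "win a free prize now",
    "free cash call now",
    "claim your free entry today",
    "urgent win cash prize",
    "are we meeting for lunch",
    "see you at home tonight",
    "can you call me later",
    "thanks for the lunch today",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

pipe = Pipeline([
    ("bow", CountVectorizer()),      # vocabulary learned from X_train only
    ("tfidf", TfidfTransformer()),   # IDF weights learned from X_train only
    ("clf", MultinomialNB()),
])

# fit() fits every step on the training split; predict() only applies
# the already-fitted transformations to the test split.
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
```

Swapping the classifier means replacing only the last step, for example with `sklearn.svm.SVC` or `sklearn.ensemble.RandomForestClassifier`, while the transformation steps stay the same.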

After applying those supervised learning methods, we also tried deep learning, with an architecture based on LSTM. To train an LSTM in Keras (TensorFlow), we needed an embedding matrix for our corpus, so we used Gensim's Word2Vec to obtain it instead of the TF-IDF features.
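To make the embedding-matrix step concrete, here is a minimal sketch using only NumPy. The helper names `build_embedding_matrix` and `texts_to_padded_ids`, the toy vectors, and the convention that index 0 is reserved for padding are all assumptions for the example. A plain dict stands in for the trained word vectors; the `wv` attribute of a trained Gensim `Word2Vec` model supports the same `in` and `[]` lookups.

```python
import numpy as np


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with id i; row 0 stays zero (padding)."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, idx in word_index.items():
        if word in vectors:          # words without a vector keep a zero row
            matrix[idx] = vectors[word]
    return matrix


def texts_to_padded_ids(token_lists, word_index, max_len):
    """Map tokens to ids, drop unknown tokens, truncate, and right-pad with 0."""
    out = np.zeros((len(token_lists), max_len), dtype="int64")
    for row, tokens in enumerate(token_lists):
        ids = [word_index[t] for t in tokens if t in word_index][:max_len]
        out[row, :len(ids)] = ids
    return out


# Toy vocabulary and 2-dimensional vectors (invented values).
word_index = {"free": 1, "prize": 2, "lunch": 3}
toy_vectors = {
    "free": np.array([0.1, 0.2]),
    "prize": np.array([0.3, 0.4]),
}

embedding_matrix = build_embedding_matrix(word_index, toy_vectors, dim=2)
padded = texts_to_padded_ids(
    [["free", "prize", "now"], ["lunch"]], word_index, max_len=4
)
```

A matrix of this shape, (vocabulary size + 1, embedding dimension), is what a Keras `Embedding` layer placed before an LSTM layer would be initialized with, and `padded` is the kind of integer input that layer consumes.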

After training each classifier, we plotted its confusion matrix to compare the classifiers and find the best one for filtering spam SMS.
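As an illustration of how a confusion matrix supports that comparison, the sketch below computes one with scikit-learn for made-up true and predicted labels and derives spam precision and recall from it. The label values are invented and are not results from this project.

```python
from sklearn.metrics import confusion_matrix

# Invented labels for six messages; not results from the project.
y_true = ["ham", "ham", "ham", "spam", "spam", "ham"]
y_pred = ["ham", "ham", "spam", "spam", "ham", "ham"]

# Rows are true classes, columns are predicted classes, in the given order.
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])

# With "spam" as the positive class:
tn, fp, fn, tp = cm.ravel()
spam_precision = tp / (tp + fp)  # share of flagged messages that are spam
spam_recall = tp / (tp + fn)     # share of spam messages that were caught

print(cm)
print(spam_precision, spam_recall)
```

The resulting array can be passed to a plotting call such as Seaborn's `heatmap` to draw the matrix. For a spam filter, the false-positive cell (ham flagged as spam) and the false-negative cell (spam that got through) are usually the two numbers worth comparing across classifiers.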
