SpamClassification

This project classifies emails and SMS texts as spam or ham using deep learning, with a primary focus on applying MLOps principles to the task. Because the data is text, we use the Hugging Face transformers library. It provides pretrained tokenizer modules that eliminate the need to develop our own text preprocessing, letting us focus on the MLOps aspects of the implementation.

Data is collected from the Kaggle [Spam Text Message Classification](https://www.kaggle.com/team-ai/spam-text-message-classification) dataset. The data is a collection of personal text messages and includes many informal words.

We use an LSTM network as our classifier, as this type of model can be very good at handling sequential data because of its recurrent structure.
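As a rough illustration of such a classifier, the sketch below wires an embedding layer, an LSTM and a linear head together in plain PyTorch. The class name, layer sizes, vocabulary size and the use of token id 0 for padding are all assumptions made for the example and are not taken from the project's source.

```python
import torch
from torch import nn


class LSTMSpamClassifier(nn.Module):
    """Embedding -> LSTM -> linear head for binary spam/ham classification."""

    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        # padding_idx=0 assumes token id 0 is the padding token (an assumption).
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)      # h_n: (1, batch, hidden_dim)
        return self.head(h_n[-1]).squeeze(-1)  # one logit per message


model = LSTMSpamClassifier(vocab_size=1000)
batch = torch.randint(1, 1000, (4, 20))  # 4 fake messages of 20 token ids each
logits = model(batch)
probs = torch.sigmoid(logits)            # spam probability per message
```

The final hidden state summarizes the whole message, which is why only `h_n[-1]` is passed to the linear head.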

Group members:

  • Simon Jacobsen, s152655
  • Jakob Vexø, s152830
  • Morten Thomsen, s164501
  • Gustav Hartz, s174315

Major Frameworks and principles applied

To version our configuration and keep results reproducible, we apply the Hydra framework to our PyTorch Lightning code.

OPTUNA: An open-source hyperparameter optimization framework that automates hyperparameter search. We use it for Bayesian hyperparameter search with evolutionary algorithms. This is configured using the config_hydra_optuna file.

HYDRA: Hydra is an open-source Python framework that simplifies the development of research and other complex applications. The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line. The name Hydra comes from its ability to run multiple similar jobs - much like a Hydra with multiple heads.
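The compose-then-override idea can be illustrated in plain Python. The sketch below is not Hydra's API. It only mimics the behaviour described above: several configuration layers are merged, then `key.subkey=value` overrides of the kind passed on the command line are applied. All function names, keys and values are hypothetical.

```python
import copy


def compose(*configs: dict) -> dict:
    """Deep-merge configs left to right; later values win."""
    merged: dict = {}
    for cfg in configs:
        for key, value in cfg.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = compose(merged[key], value)
            else:
                merged[key] = copy.deepcopy(value)
    return merged


def parse_value(raw: str):
    """Interpret an override value as int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_overrides(cfg: dict, overrides: list) -> dict:
    """Apply command-line style `a.b=value` overrides to a copy of cfg."""
    result = copy.deepcopy(cfg)
    for item in overrides:
        dotted_key, raw = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = parse_value(raw)
    return result


base = {"model": {"hidden_dim": 128, "dropout": 0.2}, "trainer": {"max_epochs": 5}}
experiment = {"model": {"hidden_dim": 256}}
cfg = apply_overrides(
    compose(base, experiment),
    ["trainer.max_epochs=10", "model.dropout=0.5"],
)
```

In Hydra the layers come from YAML files in config groups and the overrides from the command line, but the precedence is the same: later layers and explicit overrides win.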

PyTorch Lightning: The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate. Some of the advantages include:

  • Models become hardware agnostic
  • Code is clear to read because engineering code is abstracted away
  • Easier to reproduce
  • Make fewer mistakes because lightning handles the tricky engineering

Weights & Biases: Used for visualizing training and implemented as the logger in PyTorch Lightning. Its primary purpose is tracking the progress of model training. WandB can also run hyperparameter sweeps, but we decided to focus on HYDRA and OPTUNA.

Figure: reports/figures/WandB.png

Data Drifting: Data drift can be identified using the TorchDrift framework. First, the classification network is set up as a feature extractor, using only the embedding and LSTM layers. This creates a feature representation that is used to define the distribution of the data. New data can then be compared to this distribution with a significance test to detect whether the data has drifted.
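The significance test can be sketched without TorchDrift. The example below is a stand-in written in NumPy, not TorchDrift's API: it compares two sets of feature vectors with a kernel maximum mean discrepancy (MMD) statistic and obtains a p-value by permutation. The function names, the median-heuristic bandwidth and the random "features" are assumptions for illustration. In the project the features would come from the embedding and LSTM layers.

```python
import numpy as np


def rbf_kernel_matrix(points: np.ndarray) -> np.ndarray:
    """RBF kernel between all pairs of rows, bandwidth from the median heuristic."""
    sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    bandwidth_sq = np.median(sq_dists[sq_dists > 0]) / 2
    return np.exp(-sq_dists / (2 * bandwidth_sq))


def mmd2(kernel: np.ndarray, idx_a: np.ndarray, idx_b: np.ndarray) -> float:
    """Biased squared-MMD estimate between the rows indexed by idx_a and idx_b."""
    k_aa = kernel[np.ix_(idx_a, idx_a)].mean()
    k_bb = kernel[np.ix_(idx_b, idx_b)].mean()
    k_ab = kernel[np.ix_(idx_a, idx_b)].mean()
    return float(k_aa + k_bb - 2 * k_ab)


def drift_p_value(reference, new, n_permutations: int = 500, seed: int = 0) -> float:
    """Permutation p-value for 'reference and new come from the same distribution'."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([reference, new])
    kernel = rbf_kernel_matrix(pooled)
    n_ref = len(reference)
    indices = np.arange(len(pooled))
    observed = mmd2(kernel, indices[:n_ref], indices[n_ref:])
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(indices)  # random relabelling of the pooled samples
        exceed += mmd2(kernel, perm[:n_ref], perm[n_ref:]) >= observed
    return (exceed + 1) / (n_permutations + 1)


rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, size=(50, 8))  # features of the training data
same = rng.normal(0.0, 1.0, size=(50, 8))       # new data, same distribution
shifted = rng.normal(1.5, 1.0, size=(50, 8))    # new data with a mean shift
p_same = drift_p_value(reference, same)
p_shifted = drift_p_value(reference, shifted)
```

A small p-value, for example below 0.05, is evidence that the new batch does not follow the reference distribution, which is the signal used to flag drift.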

The extracted high-dimensional features can also be projected to a 2D space with scikit-learn's Isomap (sklearn.manifold.Isomap), where the visual representation gives an intuitive illustration of the drift.
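A minimal sketch of that projection is shown below. The feature arrays are random placeholders with an artificial mean shift, and the feature dimension and `n_neighbors` value are assumptions. Only the `Isomap` call reflects the technique named above.

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
# Placeholder features; in practice these come from the feature extractor.
reference_features = rng.normal(0.0, 1.0, size=(100, 32))
drifted_features = rng.normal(0.8, 1.0, size=(100, 32))

# Fit one embedding on both sets so they share the same 2D coordinate system.
all_features = np.concatenate([reference_features, drifted_features])
embedding_2d = Isomap(n_components=2, n_neighbors=10).fit_transform(all_features)

reference_2d = embedding_2d[:100]
drifted_2d = embedding_2d[100:]
```

Plotting `reference_2d` and `drifted_2d` as two scatter series in different colours then shows whether the new data occupies a different region from the reference data.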

CI/CD: Pytest is run on the entire test directory "./tests". We also have actions that check that commits conform to the PEP8 standard, using Flake8 and isort.
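For readers new to pytest, the sketch below shows the shape of a test file that such a CI run would collect. The helper `encode_label` and the file name are hypothetical and are defined inline so the example is self-contained. They do not correspond to functions in the project's src package.

```python
# tests/test_labels.py (hypothetical file name)


def encode_label(label: str) -> int:
    """Hypothetical helper: map 'ham' -> 0 and 'spam' -> 1."""
    mapping = {"ham": 0, "spam": 1}
    try:
        return mapping[label.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown label: {label!r}") from None


def test_encode_label_known_values():
    assert encode_label("ham") == 0
    assert encode_label(" Spam ") == 1


def test_encode_label_rejects_unknown():
    try:
        encode_label("maybe")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for an unknown label")
```

Pytest collects any function whose name starts with `test_` in files named `test_*.py`, so running `pytest tests` from the repository root would pick up both functions. In a real test file, `pytest.raises(ValueError)` is the more usual way to write the second check.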

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── deployment         <- Scripts for deploying the model as an Azure endpoint
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
│
├── tests              <- pytests using the suggested src layout from pytest documentation
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
