
Disaster Tweet Classification

This project aims to create a Deep Learning model designed to tackle the Natural Language Processing with Disaster Tweets competition proposed by Kaggle. The focus is not on training and evaluating different algorithms to see which ranks best on Kaggle's leaderboard, but on building an application with Flask and Docker that trains a BERT model on the training dataset and then uses it to make predictions on the test dataset.

Table of Contents

  • Installation
  • Getting Started
  • Running the Code
  • Contributing
  • License
  • Contact

Installation

To install this package, first clone the repository to the directory of your choice using the following command:

git clone https://github.com/rafaelgreca/disaster-tweet-classification.git

Then, create a conda environment (or a virtual environment) and install the requirements. This can be done using the following commands:

conda create --name disaster-classification python=3.11
conda activate disaster-classification
pip install -r requirements.txt

Getting Started

Download the Dataset

Before continuing, note that for the code to work properly you need to download the dataset from the correct source; if you get it from elsewhere, the code might not work. Download the dataset using Kaggle's link. After downloading it, create a data folder in the root of the repository and put the train.csv and test.csv files inside it.
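To sanity-check that the files landed in the right place, you can load them with pandas. Treat this snippet as an illustrative check, not part of the repository's code; the column list in the comment follows the competition's published schema:

import pandas as pd

# Quick sanity check that the Kaggle files ended up in the expected place.
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

print(train.shape, test.shape)
# The competition's train split ships with these columns:
# ['id', 'keyword', 'location', 'text', 'target'] (test.csv has no 'target').
print(train.columns.tolist())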

Directory Structure

./
├── data/
│   ├── train.csv
│   └── test.csv
├── src/
│   ├── __init__.py
│   ├── bert.py
│   ├── dataset.py
│   ├── preprocessing.py
│   └── utils.py
├── __init__.py
├── LICENSE
├── README.md
├── requirements.txt
├── Dockerfile
└── api.py

A brief explanation of the main folders and files:

  • requirements.txt: the main libraries used to develop the project;
  • src: where the core functions are implemented, such as the text preprocessing steps (preprocessing.py), the BERT model definition (bert.py), the dataset creation (dataset.py), and input/output file operations (utils.py) (see the sketch after this list);
  • api.py: the main file, responsible for creating the API's endpoints (training and inference); it also contains the helper functions used by both endpoints.
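For reference, here is a minimal sketch of what a BERT-ready dataset class along the lines of dataset.py could look like. The class name, the "bert-base-uncased" checkpoint, and the max_length value below are illustrative assumptions, not the repository's actual code:

import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer


class DisasterTweetDataset(Dataset):
    """Wraps cleaned tweets so a DataLoader can feed them to BERT."""

    def __init__(self, texts, labels=None, max_length=128):
        # The checkpoint is an assumption; the repo may use another one.
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        self.texts = texts
        self.labels = labels
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenize one tweet into fixed-length input ids and an attention mask.
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )
        item = {key: value.squeeze(0) for key, value in encoding.items()}
        if self.labels is not None:
            item["label"] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item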

Running the Code

Building the Docker image:

sudo docker build -f Dockerfile -t disaster-tweet . --no-cache

Running the Docker container:

sudo docker run -d -p 8000:5000 --name disaster disaster-tweet

Training the BERT model (we use cross-validation with 5 folds, so 5 BERT models will be created and trained, then saved in a folder called models located in the root folder):

curl -X GET http://127.0.0.1:8000/train

An example of what the API will return after the BERT models are trained:

{
  "0": {
    "train_f1": "0.8549920922517874",
    "train_loss": "0.12686705062205486",
    "valid_f1": "0.8109330292567293",
    "valid_loss": "0.14864667398311818"
  },
  "1": {
    "train_f1": "0.8335076047278006",
    "train_loss": "0.14481979079465282",
    "valid_f1": "0.8793904647160197",
    "valid_loss": "0.09332963859196752"
  },
  "2": {
    "train_f1": "0.8962081752840937",
    "train_loss": "0.10410643358060971",
    "valid_f1": "0.9332780567691313",
    "valid_loss": "0.05860341369384514"
  },
  "3": {
    "train_f1": "0.9299154303900699",
    "train_loss": "0.07443954636580608",
    "valid_f1": "0.942967428795232",
    "valid_loss": "0.053271645408434175"
  },
  "4": {
    "train_f1": "0.9423206835786697",
    "train_loss": "0.06326480431759611",
    "valid_f1": "0.9519999548124859",
    "valid_loss": "0.042643080713655"
  }
}
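The endpoint can also be called from Python. Here is a minimal sketch (using the requests package, which is an assumption on my part and not necessarily in requirements.txt) that triggers training and picks the fold with the highest validation F1 from a response like the one above:

import requests

# Trigger training; with 5 folds this can take a while, so no timeout is set.
response = requests.get("http://127.0.0.1:8000/train")
metrics = response.json()

# The API returns the metrics as strings, so cast them before comparing.
best_fold = max(metrics, key=lambda fold: float(metrics[fold]["valid_f1"]))
print(f"Best model: bert_fold{best_fold} "
      f"(validation F1 = {metrics[best_fold]['valid_f1']})")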

Using the trained model on the test dataset (the model's name can be bert_fold0, bert_fold1, bert_fold2, bert_fold3 or bert_fold4):

curl -X POST http://127.0.0.1:8000/inference \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -d '{"model_name": "bert_fold0"}'

A small sample of what the API will return after the inference:

[
  {
    "cleaned_text": "death toll suicide car bombing pg position village rajman eastern province hasaka risen",
    "id": 10858,
    "prediction": 1,
    "text": "The death toll in a #IS-suicide car bombing on a #YPG position in the Village of Rajman in the eastern province of Hasaka has risen to 9"
  },
  {
    "cleaned_text": "earthquake safety los angeles uo safety fasteners xrwn",
    "id": 10861,
    "prediction": 1,
    "text": "EARTHQUAKE SAFETY LOS ANGELES \u0089\u00db\u00d2 SAFETY FASTENERS XrWn"
  },
  {
    "cleaned_text": "storm ri worse last hurricane city amp others hardest hit yard looks like bombed around still without power",
    "id": 10865,
    "prediction": 1,
    "text": "Storm in RI worse than last hurricane. My city&3others hardest hit. My yard looks like it was bombed. Around 20000K still without power"
  },
  {
    "cleaned_text": "green line derailment chicago",
    "id": 10868,
    "prediction": 0,
    "text": "Green Line derailment in Chicago http://t.co/UtbXLcBIuY"
  },
  {
    "cleaned_text": "meg issues hazardous weather outlook hwo",
    "id": 10874,
    "prediction": 1,
    "text": "MEG issues Hazardous Weather Outlook (HWO) http://t.co/3X6RBQJHn3"
  },
  {
    "cleaned_text": "city calgary activated municipal emergency plan yy storm",
    "id": 10875,
    "prediction": 1,
    "text": "#CityofCalgary has activated its Municipal Emergency Plan. #yycstorm"
  }
]
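If you want to turn this output into a submission file for the competition (which expects an id,target CSV), a small post-processing script along these lines should work. Again, the requests package and the output file name are assumptions for illustration:

import csv

import requests

# Query the inference endpoint for predictions on the test set.
response = requests.post(
    "http://127.0.0.1:8000/inference",
    json={"model_name": "bert_fold0"},
    headers={"Accept": "application/json"},
)
predictions = response.json()

# Write the "id,target" file expected by the Kaggle competition.
with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "target"])
    for row in predictions:
        writer.writerow([row["id"], row["prediction"]])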

Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. Don't forget to give the project a star! Thanks again!

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Author: Rafael Greca Vieira - GitHub - LinkedIn - [email protected]
