pmdl_assignment1's Introduction

Practical Machine Learning and Deep Learning - Assignment 1 - Text De-toxification

Nikita Sergeev BS20-AI [email protected]

Project structure

text-detoxification
├── README.md # The top-level README
│
├── data 
│   ├── external # Data from third-party sources
│   ├── interim  # Intermediate data that has been transformed.
│   └── raw      # The original, immutable data
│
├── models       # Trained and serialized models, final checkpoints
│
├── notebooks    # Jupyter notebooks. Naming convention is a number (for ordering),
│                   and a short delimited description, e.g.
│                   "1.0-initial-data-exploration.ipynb"
│ 
├── references   # Data dictionaries, manuals, and all other explanatory materials.
│
├── reports      # Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures  # Generated graphics and figures to be used in reporting
│
├── requirements.txt # The requirements file for reproducing the analysis environment, e.g.
│                      generated with `pip freeze > requirements.txt`
└── src                 # Source code for use in this assignment
    │                 
    ├── data            # Scripts to download or generate data
    │   └── make_dataset.py
    │
    ├── models          # Scripts to train models and then use trained models to make predictions
    │   ├── predict_model.py
    │   └── train_model.py
    │   
    └── visualization   # Scripts to create exploratory and results oriented visualizations
        └── visualize.py

The top-level README.md should contain your name, email and group number, plus the basic commands for using the repository: how to transform the data, train a model and make predictions.

Data preparation

To download the training data and transform it into the required format, run the following shell script:

sh download_data.bash

Run this script from the src/data folder so that all relative paths resolve correctly. Afterwards, all the preprocessed data needed for training the models is stored in data/interim/hf_dataset.
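Because the script depends on being started from src/data, a small wrapper can pin the working directory so it can be launched from anywhere. The following is a minimal sketch, not part of the repository: the helper names `data_script_invocation` and `run_data_preparation` are hypothetical, and it assumes `sh` is available on the PATH.

```python
import subprocess
from pathlib import Path


def data_script_invocation(repo_root):
    """Return the command and the working directory for the data script.

    Hypothetical helper: the script name and folder come from the README,
    the function itself is only an illustration.
    """
    script_dir = Path(repo_root) / "src" / "data"
    return ["sh", "download_data.bash"], script_dir


def run_data_preparation(repo_root="."):
    """Run download_data.bash from src/data and return the output folder."""
    command, script_dir = data_script_invocation(repo_root)
    # cwd=script_dir has the same effect as `cd src/data` before the call,
    # so the relative paths inside the script resolve as intended.
    subprocess.run(command, cwd=script_dir, check=True)
    return Path(repo_root) / "data" / "interim" / "hf_dataset"
```

Calling `run_data_preparation("path/to/text-detoxification")` would then behave like running the command above from inside src/data.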

Model training

Execute the following Python script:

python T5_train.py

This trains the T5 model on the filtered ParaNMT dataset and stores the trained model weights in the models/T5_paraphraser/ directory.

Model evaluation

There are three options for model evaluation:

  • Pretrained BART that detoxifies the given text - s-nlp/bart-base-detox
  • Manually fine-tuned T5 that paraphrases toxic texts
  • Masking of the toxic words in the given text with s-nlp/roberta_toxicity_classifier
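The third option, masking toxic words, can be illustrated with a short sketch. It is not the code from toxic_words_masking.py: the word-by-word scoring, the 0.5 threshold, the `<mask>` token and the names `mask_toxic_words` and `toy_score` are all assumptions made for illustration. The toy lexicon scorer stands in for the s-nlp/roberta_toxicity_classifier model, which would be used in its place.

```python
import re


def mask_toxic_words(text, score_word, threshold=0.5, mask_token="<mask>"):
    """Replace every word whose toxicity score exceeds the threshold.

    score_word is any callable mapping a word to a score in [0, 1];
    in the real pipeline this would wrap the toxicity classifier.
    """
    def replace(match):
        word = match.group(0)
        return mask_token if score_word(word) > threshold else word

    # Only alphabetic tokens are scored, so punctuation and spacing survive.
    return re.sub(r"[A-Za-z']+", replace, text)


# Toy stand-in for the classifier (assumption, for demonstration only).
TOY_LEXICON = {"stupid", "idiot"}


def toy_score(word):
    return 1.0 if word.lower() in TOY_LEXICON else 0.0


print(mask_toxic_words("You are a stupid idiot, honestly.", toy_score))
# prints: You are a <mask> <mask>, honestly.
```

Keeping the scorer as a parameter makes it possible to test the masking logic without downloading a model.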

Evaluation results on the test set (s-nlp/paradetox) are available in notebooks/test_set_results.

Model Inference

There are three options for model inference:

  • Manually finetuned T5 - python T5_inference.py
  • Pretrained BART - python BART_inference.py
  • Toxic words masking - python toxic_words_masking.py

Each of these scripts asks you to enter some text and then detoxifies it.
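For orientation, inference with the pretrained BART option might look roughly like the sketch below. It is an assumption about how such a script could be written with the Hugging Face transformers library, not a copy of BART_inference.py. The function names `load_bart`, `detoxify` and `interactive_loop` and the `max_new_tokens=64` limit are hypothetical.

```python
def load_bart(model_name="s-nlp/bart-base-detox"):
    """Load the tokenizer and the seq2seq model from the Hugging Face Hub."""
    # Imported lazily so the rest of the module works without transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    return tokenizer, model


def detoxify(text, tokenizer, model, max_new_tokens=64):
    """Generate a detoxified paraphrase for a single input string."""
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # generate returns a batch; the first row is the only sequence here.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def interactive_loop():
    """Ask for text until an empty line is entered, printing each result."""
    tokenizer, model = load_bart()
    while True:
        text = input("Enter text (empty line to quit): ")
        if not text:
            break
        print(detoxify(text, tokenizer, model))
```

Calling `interactive_loop()` downloads the model on first use and then mimics the prompt-and-answer behaviour described above.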

Visualization

This script shows some visual and textual information about the initial dataset. It also shows, as word clouds, how the T5 detoxification method changes the structure of the texts in the train set. To run it, use the following commands:

cd src/visualization
python visualization.py
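A word cloud sizes each word by its frequency, so the before/after comparison boils down to comparing word counts. The sketch below shows that idea with the standard library only. It is an illustration, not the repository's visualization code, and the names `word_frequencies` and `vanished_words` are hypothetical. With the wordcloud package installed, the resulting counts could be passed to `WordCloud().generate_from_frequencies(...)` to draw the actual image.

```python
import re
from collections import Counter


def word_frequencies(texts):
    """Count lowercase word occurrences over a list of sentences."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def vanished_words(before, after, top_n=10):
    """Words that occur more often before detoxification than after it."""
    # Counter subtraction keeps only the positive differences.
    removed = word_frequencies(before) - word_frequencies(after)
    return removed.most_common(top_n)


toxic = ["this is damn bad", "damn it"]
detoxified = ["this is bad", "forget it"]
print(vanished_words(toxic, detoxified))
# prints: [('damn', 2)]
```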
