dipruQuan

dipruQuan explores compression methods (distillation, pruning, and quantization) for conversational AI models. As of 13/09/2021, it contains code for the online and offline distillation of DialoGPT.

Setup

This project has only been tested on Linux.

Create a conda environment from LSP-linux.yml.
Install any PyTorch 1.2 build compatible with the version of the CUDA Toolkit on your machine.
For mixed-precision training, install Apex. See Apex: Quickstart.
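
Once the environment is set up, a quick sanity check can confirm that the installed PyTorch matches the 1.2 requirement above. This is a minimal sketch: the helper name `is_pytorch_1_2` is hypothetical and not part of this repository, and the check only runs if `torch` is importable.

```python
import importlib.util


def is_pytorch_1_2(version: str) -> bool:
    """Return True for version strings such as '1.2.0' or '1.2.0+cu92'."""
    # Drop any local build suffix (e.g. '+cu92') before comparing major.minor.
    parts = version.split("+")[0].split(".")
    return parts[:2] == ["1", "2"]


if __name__ == "__main__":
    if importlib.util.find_spec("torch") is not None:
        import torch

        print("PyTorch", torch.__version__, "is 1.2.x:", is_pytorch_1_2(torch.__version__))
        print("CUDA available:", torch.cuda.is_available())
        # Apex is only needed for mixed-precision training.
        print("Apex installed:", importlib.util.find_spec("apex") is not None)
    else:
        print("PyTorch is not installed in this environment")
```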

Data

The training dataset (Reddit small) should be around 140MB, while the raw validation dataset should be about 4.7GB. To generate both datasets and the 6K Multi-Reference Dataset, run:

cd src/msft/reddit_extractor
make -j 4

This operation may require up to 500GB of local disk space and will take significant time to complete.
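
Because the extraction can use up to 500GB, it may be worth checking free disk space before starting. The following is a minimal sketch using only the standard library; the helper name `has_free_space` and the default threshold are illustrative assumptions, not part of this repository.

```python
import shutil


def has_free_space(path: str, required_gb: float = 500.0) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if __name__ == "__main__":
    # Check the filesystem of the current directory before running `make`.
    if not has_free_space("."):
        print("Warning: less than 500GB free; the extraction may run out of space.")
```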

Training and Validation Set:

The make command should create both data/train_raw.tsv and data/validation_raw.tsv. Both datasets have to be compressed into lazy-loading database files for use in training. The steps below are shown for the training set:

Convert the file to the right format

cd data
awk -F '\t' '{print "0.0 "$1"\t1.0 "$2}' train_raw.tsv > train.tsv; cd ..
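
To make the target format concrete, the awk one-liner above can be mirrored in Python. This is a sketch of the same transformation, not code from this repository: the function name `convert_line` is hypothetical. It prefixes the first tab-separated field with `0.0` and the second with `1.0`, and ignores any further fields, as the awk command does.

```python
def convert_line(line: str) -> str:
    """Mirror: awk -F '\\t' '{print "0.0 "$1"\\t1.0 "$2}'"""
    fields = line.rstrip("\n").split("\t")
    first = fields[0]
    # awk yields an empty string for a missing second field.
    second = fields[1] if len(fields) > 1 else ""
    return f"0.0 {first}\t1.0 {second}"


if __name__ == "__main__":
    print(convert_line("how are you?\tfine, thanks"))
```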

Compress into database file

python src/msft/prepro.py --corpus data/train.tsv

6K Multi-reference test set:

After running the make command, the 6K multi-reference test set will be located at data/test.refs.txt. You need to create a test.source file containing the prompts for which the model is to generate responses.

cd data
cat test.refs.txt | cut -f 1 > test.source
mv test.source ../src/eval/data
cat test.refs.txt | cut -f 2- | rev | cut -f 2- | rev > test.refs.tmp.txt
paste keys.6k.txt test.refs.tmp.txt > test.refs.txt
mv test.refs.txt ../src/eval/data

See #48 and #63
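
The cut/rev/paste pipeline above can be easier to follow when written out per line. The sketch below mirrors it in Python under the assumption that every line of test.refs.txt has at least three tab-separated fields; the function names are hypothetical and not part of this repository. The first field becomes the source prompt, the last field is dropped, and the remaining fields are kept as references and joined to a key.

```python
from typing import Tuple


def split_refs_line(line: str) -> Tuple[str, str]:
    """Split one line of test.refs.txt into (source prompt, tab-joined references)."""
    fields = line.rstrip("\n").split("\t")
    source = fields[0]      # cut -f 1
    refs = fields[1:-1]     # cut -f 2- | rev | cut -f 2- | rev (drops first and last field)
    return source, "\t".join(refs)


def attach_key(key: str, refs: str) -> str:
    """Mirror `paste keys.6k.txt test.refs.tmp.txt` for a single pair of lines."""
    return f"{key}\t{refs}"


if __name__ == "__main__":
    src, refs = split_refs_line("prompt\tref one\tref two\tlast field\n")
    print(src)
    print(attach_key("key-1", refs))
```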

Distillation

Evaluation

You are going to need a few things:

meteval14.pl
meteor-1.5 to compute METEOR. It requires Java.

See 3rdparty for more information about this.

To evaluate any model or checkpoint, use model_eval.py

python src/eval/model_eval.py \
  --model-name microsoft/DialoGPT-medium \
  --from-hf \
  --context-file src/eval/data/test.source \
  --output-file outputs/dialogpt-medium.6k.resp.txt \
  --force \
  --batch-size 64 \
  --tokenizer-max-len 128 \
  --model-max-len 256 \
  --beam 10 \
  --refs src/eval/data/test.refs.txt \
  --keys src/eval/data/keys.6k.txt \
  --vshuman -1

This generates responses for the 6K multi-reference test set and evaluates them. Evaluation results are written to the same directory as the specified output file.
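
When evaluating several models in a row, it can help to assemble the command programmatically. The sketch below only builds the argument list from the flags shown above; the helper name `build_eval_command` is hypothetical, and only the model name and output file are parameterized. Running the command is left to the caller.

```python
from typing import List


def build_eval_command(model_name: str, output_file: str) -> List[str]:
    """Assemble the model_eval.py invocation using the flags from the example above."""
    return [
        "python", "src/eval/model_eval.py",
        "--model-name", model_name,
        "--from-hf",
        "--context-file", "src/eval/data/test.source",
        "--output-file", output_file,
        "--force",
        "--batch-size", "64",
        "--tokenizer-max-len", "128",
        "--model-max-len", "256",
        "--beam", "10",
        "--refs", "src/eval/data/test.refs.txt",
        "--keys", "src/eval/data/keys.6k.txt",
        "--vshuman", "-1",
    ]


if __name__ == "__main__":
    cmd = build_eval_command(
        "microsoft/DialoGPT-medium", "outputs/dialogpt-medium.6k.resp.txt"
    )
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(" ".join(cmd))
```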

