

RuTaBERT

RuTaBERT is a model for solving the problem of Column Type Annotation with a pre-trained BERT, trained on a Russian corpus.

Table of contents

  • Project structure
  • Configuration
  • Dataset files
  • Training
  • Testing
  • Inference

Project structure

📦RuTaBERT
 ┣ 📂checkpoints
 ┃ ┗ Saved PyTorch models `.pt` 
 ┣ 📂data
 ┃ ┣ 📂inference
 ┃ ┃ ┗ Tables for inference `.csv`
 ┃ ┣ 📂test
 ┃ ┃ ┗ Test dataset files `.csv`
 ┃ ┣ 📂train
 ┃ ┃ ┗ Train dataset files `.csv`
 ┃ ┗  Directory for storing dataset files.
 ┣ 📂dataset
 ┃ ┗  Dataset wrapper classes, dataloaders
 ┣ 📂logs
 ┃ ┗ Log files (train / test / error)
 ┣ 📂model
 ┃ ┗ Model and metrics
 ┣ 📂trainer
 ┃ ┗ Trainer
 ┣ 📂utils
 ┃ ┗ Helper functions
 ┗ Entry points (train.py, test.py, inference.py), configuration, build files.

Configuration

The model configuration can be found in the file config.json.

The configuration parameters are listed below:

| argument | description |
| --- | --- |
| num_labels | Number of labels used for classification |
| num_gpu | Number of GPUs to use |
| save_period_in_epochs | How often a checkpoint is saved, in epochs |
| metrics | Classification metrics to compute |
| pretrained_model_name | BERT shortcut name from HuggingFace |
| table_serialization_type | Method of serializing a table into a sequence |
| batch_size | Batch size |
| num_epochs | Number of training epochs |
| random_seed | Random seed |
| logs_dir | Directory for logging |
| train_log_filename | File name for train logging |
| test_log_filename | File name for test logging |
| start_from_checkpoint | Flag to start training from a checkpoint |
| checkpoint_dir | Directory for storing model checkpoints |
| checkpoint_name | File name of a checkpoint (model state) |
| inference_model_name | File name of the model used for inference |
| inference_dir | Directory for storing inference tables .csv |
| dataloader.valid_split | Size of the validation split |
| dataloader.num_workers | Number of dataloader workers |
| dataset.num_rows | Number of rows to read from the dataset files; if null, read all rows |
| dataset.data_dir | Directory for storing train/test/inference files |
| dataset.train_path | Directory for storing train dataset files .csv |
| dataset.test_path | Directory for storing test dataset files .csv |

We recommend changing ONLY these parameters (a short config sketch follows this list):

  • num_gpu - Any non-negative integer. 0 stands for training / testing on CPU.
  • save_period_in_epochs - Any positive integer, measured in epochs.
  • table_serialization_type - "column_wise" or "table_wise".
  • pretrained_model_name - A BERT shortcut name from the HuggingFace pretrained models.
  • batch_size - Any positive integer.
  • num_epochs - Any positive integer.
  • random_seed - Any integer.
  • start_from_checkpoint - "true" or "false".
  • checkpoint_name - Name of any model saved in the checkpoint directory.
  • inference_model_name - Name of any model saved in the checkpoint directory. We recommend using the best models: [model_best_f1_weighted.pt, model_best_f1_macro.pt, model_best_f1_micro.pt].
  • dataloader.valid_split - A real number in the range [0.0, 1.0] (0.0 stands for 0 % of the train subset, 0.5 for 50 %), or a positive integer denoting a fixed number of validation samples.
  • dataset.num_rows - "null" stands for reading all lines in the dataset files; a positive integer is the number of lines to read from each dataset file.
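For orientation, here is a minimal Python sketch of editing config.json before a run. It assumes the dotted names above (dataloader.valid_split, dataset.num_rows) correspond to nested JSON objects, and every value shown is an illustrative assumption rather than a repository default:

import json

# Load the existing configuration, change only the recommended fields, and write it back.
# All values below are illustrative assumptions, not the repository defaults.
with open("config.json") as f:
    config = json.load(f)

config["num_gpu"] = 1                                 # 0 would mean training / testing on CPU
config["save_period_in_epochs"] = 1                   # save a checkpoint every epoch
config["table_serialization_type"] = "column_wise"    # or "table_wise"
config["pretrained_model_name"] = "bert-base-multilingual-cased"  # assumed HuggingFace BERT shortcut name
config["batch_size"] = 32
config["num_epochs"] = 10
config["random_seed"] = 42
config["start_from_checkpoint"] = False
config["inference_model_name"] = "model_best_f1_weighted.pt"
config["dataloader"]["valid_split"] = 0.1             # 10 % of the train subset used for validation
config["dataset"]["num_rows"] = None                  # None serializes to null: read all rows

with open("config.json", "w") as f:
    json.dump(config, f, indent=4, ensure_ascii=False)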

Dataset files

Before training / testing the model you need to:

  1. Download the dataset repository into the same directory as RuTaBERT; example directory structure:
├── src
│  ├── RuTaBERT
│  ├── RuTaBERT-Dataset
│  │  ├── move_dataset.sh
  2. Run the move_dataset.sh script from the dataset repository to move the dataset files into the RuTaBERT data directory:
RuTaBERT-Dataset$ ./move_dataset.sh
  3. Configure the config.json file before training.

Training

RuTaBERT supports training / testing locally and inside a Docker container. It also supports the Slurm workload manager.

Locally

  1. Create virtual environment:
RuTaBERT$ virtualenv venv

or

RuTaBERT$ python -m virtualenv venv
  2. Install the requirements and run training and testing:
RuTaBERT$ source venv/bin/activate &&\
    pip install -r requirements.txt &&\
    python3 train.py 2> logs/error_train.log &&\
    python3 test.py 2> logs/error_test.log
  3. Models will be saved in the checkpoints directory.
  4. Output will be in the logs/ directory (training_results.csv, train.log, test.log, error_train.log, error_test.log).

Docker

Requirements:

  1. Make sure all dependencies are installed.
  2. Build the image:
RuTaBERT$ sudo docker build -t rutabert .
  3. Run the image:
RuTaBERT$ sudo docker run -d --runtime=nvidia --gpus=all \
    --mount source=rutabert_logs,target=/app/rutabert/logs \
    --mount source=rutabert_checkpoints,target=/app/rutabert/checkpoints \
    rutabert
  4. Move the models and logs out of the container after training / testing:
RuTaBERT$ sudo cp -r /var/lib/docker/volumes/rutabert_checkpoints/_data ./checkpoints
RuTaBERT$ sudo cp -r /var/lib/docker/volumes/rutabert_logs/_data ./logs
  5. Don't forget to remove the volumes after training! Docker won't do it for you.
  6. Models will be saved in the checkpoints directory.
  7. Output will be in the logs/ directory (training_results.csv, train.log, test.log, error_train.log, error_test.log).

Slurm

  1. Create virtual environment:
RuTaBERT$ virtualenv venv

or

RuTaBERT$ python -m virtualenv venv
  2. Run the Slurm script:
RuTaBERT$ sbatch run.slurm
  3. Check the job status:
RuTaBERT$ squeue
  4. Models will be saved in the checkpoints directory.
  5. Output will be in the logs/ directory (train.log, test.log, error_train.log, error_test.log).

Testing

  1. Make sure the data is placed in the data/test directory.
  2. (Optional) Download pre-trained models:
RuTaBERT$ ./download.sh table_wise

or

RuTaBERT$ ./download.sh column_wise
  3. Configure which model to test in config.json.
  4. Run:
RuTaBERT$ source venv/bin/activate &&\
    pip install -r requirements.txt &&\
    python3 test.py 2> logs/error_test.log
  5. Output will be in the logs/ directory (test.log, error_test.log).

Inference

  1. Make sure the data is placed in the data/inference directory.
  2. (Optional) Download pre-trained models:
RuTaBERT$ ./download.sh table_wise

or

RuTaBERT$ ./download.sh column_wise
  3. Configure which model to use for inference in config.json.
  4. Run:
RuTaBERT$ source venv/bin/activate &&\
    pip install -r requirements.txt &&\
    python3 inference.py
  5. Predicted labels will be written to data/inference/result.csv.


Open issues

Tokenization max length

What if a column is almost empty? Then, with the current formula max_length = 512 // num_cols, we still reduce max_length for the other columns.

Maybe this parameter should be calculated dynamically? A possible approach is sketched below.
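A minimal sketch of what a dynamic budget could look like, assuming the 512-token limit from the formula above and per-column token counts as input. The function and variable names here are hypothetical, not code from the repository:

# Hypothetical sketch: redistribute the 512-token budget across columns
# so that short (almost empty) columns give their unused share to longer ones.
def dynamic_max_lengths(column_token_counts, total_budget=512):
    budgets = {}
    remaining_budget = total_budget
    remaining_cols = dict(column_token_counts)
    # Process columns from shortest to longest: a short column takes only what
    # it needs, and the leftover budget is split among the remaining columns.
    for col, n_tokens in sorted(remaining_cols.items(), key=lambda kv: kv[1]):
        fair_share = remaining_budget // len(remaining_cols)
        budgets[col] = min(n_tokens, fair_share)
        remaining_budget -= budgets[col]
        del remaining_cols[col]
    return budgets

# Example: an almost-empty column no longer caps the others at 512 // 3 = 170.
print(dynamic_max_lengths({"id": 5, "title": 300, "description": 600}))
# {'id': 5, 'title': 253, 'description': 254}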
