dhlee347 / pytorchic-bert
Pytorch Implementation of Google BERT
License: Apache License 2.0
def load(self, model_file, pretrain_file):
    """ load saved model or pretrained transformer (a part of model) """
    if model_file:
        print('Loading the model from', model_file)
        self.model.load_state_dict(torch.load(model_file))
    elif pretrain_file: # use pretrained transformer
        print('Loading the pretrained model from', pretrain_file)
        if pretrain_file.endswith('.ckpt'): # checkpoint file in tensorflow
            checkpoint.load_model(self.model.transformer, pretrain_file)
        elif pretrain_file.endswith('.pt'): # pretrain model file in pytorch
            self.model.transformer.load_state_dict(
                {key[12:]: value
                 for key, value in torch.load(pretrain_file).items()
                 if key.startswith('transformer')}
            ) # load only transformer parts
Could I kindly ask what key[12:]: value means when you load a pretrained model? Is it just keeping the last layers? Thanks, I hope for your reply.
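For context, the slice looks like it simply strips the 'transformer.' prefix (12 characters) from the full model's state-dict keys so they match the sub-module's own keys; a minimal sketch with hypothetical keys:

# Hypothetical state-dict keys, just to show what key[12:] does.
full_state = {
    'transformer.blocks.0.attn.proj_q.weight': '<tensor>',
    'classifier.weight': '<tensor>',   # non-transformer key, dropped by the filter
}
prefix = 'transformer.'                # len(prefix) == 12, hence key[12:]
sub_state = {key[len(prefix):]: value
             for key, value in full_state.items()
             if key.startswith(prefix)}
print(sub_state)                       # {'blocks.0.attn.proj_q.weight': '<tensor>'}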
Hello,
Thanks for the excellent work in compressing the HF code in a single repo for BERT.
Just a couple of questions:
a) Is it possible to load pretrained BERT weights and then fine-tune on top of them on my own dataset?
b) Does this support multi-GPU training?
Thanks
Abhishek
Hi,
Have you checked the total number of parameters in the model? I found it to be 220 million, which is more than the 110 million reported for the original BERT-base model. Hope for your reply! Thanks!
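For what it's worth, a hedged sketch of how the count can be double-checked (assuming model is the pretraining model built from the BERT-base config):

import torch

def count_parameters(model: torch.nn.Module) -> int:
    # parameters() yields each unique Parameter once, so genuinely tied
    # weights are not double-counted, but separate (untied) copies are.
    return sum(p.numel() for p in model.parameters())

# e.g. print(f'{count_parameters(model) / 1e6:.1f}M parameters')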
I am using a Chinese corpus to pre-train a BERT with this project.
I find that my loss almost stops decreasing once it reaches about 4.0. I have never trained an English BERT. Are there any more training logs for English BERT? I just want to know the final token-level MLM loss for English BERT pre-training. Thanks in advance.
Thank you for your great code. I'm a student and a beginner in data analysis.
I want to execute your code, but I have some questions. They may be silly questions, but can you give me some details about the files?
We need a $DATA_FILE as a training set, but what is vocab.txt? I can get the vocab.txt file from Google's GitHub. Should I just use it, or can I customize it? (Because I want to make a BERT with fewer parameters than BERT-BASE.)
Also, is the output file model_steps_xxxx.pt compatible with BERT in Google's GitHub?
Sorry, I am not an expert, so maybe my questions are silly. Thank you.
How can we use it on the test dataset for the CoLA task?
Can I fine-tune this model to make it run on SQuAD?
Hi, I want to pretrain on Chinese data as the data file. The format is like this:
今天 天气 好
And can I use my own vocab.txt?
Thanks a lot.
Hey, if I want to restore a checkpointed model and continue pretraining, how would I do that?
Hi,
First of all, thank you so much for the great work you guys have done in your scripts.
I would like to visualize the attention weights obtained after training. I tried to use this visualization tool: https://github.com/jessevig/bertviz#attention-head-view
However, it seems to only work for the pre-trained BERT model, not the fine-tuned one we make ourselves via your script.
What would you recommend for visualization?
Thank you!
I like it! You may want to check the work NVIDIA did to incorporate FP16 training in our repo. It really speeds up the model on recent GPUs (a 4x speed-up on a V100!).
You basically just have to change the LayerNorm module in the model and tweak the training a bit to use NVIDIA's apex.
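For anyone trying this, a minimal sketch of the apex mixed-precision pattern (assuming model and optimizer are already constructed; get_loss here is a stand-in for this repo's task loss, and the LayerNorm swap would use apex's FusedLayerNorm):

# Minimal apex AMP sketch (not this repo's code): wrap the model/optimizer once,
# then scale the loss so FP16 gradients do not underflow.
from apex import amp

model, optimizer = amp.initialize(model, optimizer, opt_level='O2')

for batch in data_iter:
    optimizer.zero_grad()
    loss = get_loss(model, batch)                 # stand-in for the task-specific loss
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()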
Is GEGLU innovative, or is it derived from a certain paper?
Hey,
I am having difficulty running the pretraining; any help would be appreciated.
So I've prepared corpus.txt (quite small, about 1000 lines) that looks like this:
document 1 line 1...
document 1 line 2...
document 1 line 3...
document 2 line 1...
document 2 line 2...
document 2 line 3...
I ran pretrain.py but got an error in train.py, on this line:
print('Epoch %d/%d : Average Loss %5.3f'%(e+1, self.cfg.n_epochs, loss_sum/(i+1)))
So for the time being I commented that line out.
After running it again, here is what I got:
Iter (loss=X.XXX): 0it [00:00, ?it/s]
Iter (loss=X.XXX): 0it [00:00, ?it/s]
Iter (loss=X.XXX): 0it [00:00, ?it/s]
Iter (loss=X.XXX): 0it [00:00, ?it/s]
Iter (loss=X.XXX): 0it [00:00, ?it/s]
Iter (loss=X.XXX): 0it [00:00, ?it/s]
Iter (loss=X.XXX): 0it [00:00, ?it/s]
Iter (loss=X.XXX): 0it [00:00, ?it/s]
Iter (loss=X.XXX): 0it [00:00, ?it/s]
....
Could you please point out where I might have made a mistake?
Thanks!
P.S. I have commented out some parts of the code in train.py (the part where it loads the checkpoint, because I have not installed TensorFlow for a reason). What I want to do for now is train a pretrained BERT model on my own data. I am not sure if that is causing the error above.
If I pretrain BERT with the masking strategy, then in principle I should be able to predict a masked word given a source sentence and a masked target sentence.
Can anyone tell me how to do that?
Thank you.
Can you tell me a data set that can replace 'books_large_all.txt'?
Thank you.
https://github.com/dhlee347/pytorchic-bert/blob/master/classify.py#L137
When fine-tuning, are '[SEP]' and '[CLS]' not needed?
In the google-research code, they add '[CLS]' and '[SEP]' during fine-tuning.
https://github.com/santhoshkolloju/Abstractive-Summarization-With-Transfer-Learning/blob/master/preprocess.py#L197
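For reference, the google-research convention packs every fine-tuning example the same way as in pre-training; a small sketch of that packing (token strings only, before conversion to ids):

# Standard BERT packing for a sentence pair; single-sentence tasks simply omit tokens_b.
tokens_a = ['how', 'are', 'you']
tokens_b = ['i', 'am', 'fine']

tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
assert len(tokens) == len(segment_ids)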
In the pretrain get_loss function, loss_lm is reduced with a plain mean.
Because of this, all the zero values in loss_lm (the non-masked, padded positions) are counted as if they were correct answers and dilute the loss.
So I think we need to change the mean to a numerator / denominator form, like in the TensorFlow code:
loss_lm = (loss_lm * masked_weights.float()).mean()
to
loss_lm_numerator = (loss_lm*masked_weights.float()).sum()
loss_lm_denominator = masked_weights.sum() + 1e-5
loss_lm = loss_lm_numerator / loss_lm_denominator
Is it correct?
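For illustration, a small self-contained sketch of the difference between the two reductions (made-up numbers; loss_lm is the per-position LM loss before reduction and masked_weights is 1 at real masked positions, 0 at padding):

import torch

loss_lm = torch.tensor([2.0, 3.0, 0.0, 0.0])          # per-position LM loss (last two are padding)
masked_weights = torch.tensor([1.0, 1.0, 0.0, 0.0])   # 1 = real masked token, 0 = padding

plain_mean = (loss_lm * masked_weights).mean()                                      # 1.25, diluted by padding
weighted_mean = (loss_lm * masked_weights).sum() / (masked_weights.sum() + 1e-5)    # ~2.5
print(plain_mean.item(), weighted_mean.item())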
On this code line,
the pad index 0 is the same as the first segment index.
So it may not convey the segment information exactly.
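A tiny illustration of the collision (hypothetical ids): padded positions and first-segment positions both get segment id 0, so the segment embedding alone cannot tell them apart:

import torch

# segment ids: 0 = first sentence, 1 = second sentence, and 0 is also reused for padding
segment_ids = torch.tensor([0, 0, 0, 1, 1, 0, 0])   # last two positions are padding
seg_embed = torch.nn.Embedding(2, 8)

# padding positions receive the same embedding as first-sentence positions,
# so only the attention mask distinguishes them.
out = seg_embed(segment_ids)
print(torch.allclose(out[0], out[-1]))  # True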
Hi, thank you very much for the implementation!
I'm trying to compare your implementation head-to-head with the official TF BERT on the Gutenberg dataset (since the BookCorpus dataset is no longer available).
I assume that the text input file format is the same as in huggingface's implementation. Is that correct? A direct clarification of the text dataset format would be great for new users.
There might be a corner case in seek_random_offset() when using a UTF-8 text dataset (like the above) for pre-training. When doing f.seek(randint(0, max_offset), 0), if the seek happens to land in the middle of a multi-byte UTF-8 character such as ’ (i.e. cutting \xe2\x80\x99 down to something like \x99), pretrain.py will raise an error like the following:
File "/home/tkdrlf9202/PycharmProjects/pytorchic-bert/pretrain.py", line 88, in __iter__
seek_random_offset(self.f_neg)
File "/home/tkdrlf9202/PycharmProjects/pytorchic-bert/pretrain.py", line 41, in seek_random_offset
f.readline() # throw away an incomplete sentence
File "/home/tkdrlf9202/anaconda3/envs/p36/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 0: invalid start byte
The error can be mitigated by using
self.f_pos = open(file, "r", encoding='utf-8', errors='ignore')
self.f_neg = open(file, "r", encoding='utf-8', errors='ignore')
instead of self.f_pos = open(file, 'r') in SentPairDataLoader, but half-silently removing some characters might lead to reproducibility issues (I guess the chances are minimal, since the f.readline() right after f.seek(randint(0, max_offset), 0) is there to discard the incomplete sentence anyway).
I'd like to hear your opinions, and thanks again for the contribution!
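For what it's worth, a minimal alternative sketch that avoids the decode error entirely by seeking in binary mode and only decoding whole lines (not the repo's code, just one possible variant of seek_random_offset):

from random import randint

def seek_random_offset_binary(f, back_margin=2000):
    """Seek a binary file handle to a random offset and discard the partial line."""
    f.seek(0, 2)                              # jump to end of file to measure its size
    max_offset = max(f.tell() - back_margin, 0)
    f.seek(randint(0, max_offset), 0)
    f.readline()                              # throw away the possibly mid-character partial line

with open('corpus.txt', 'rb') as f:           # binary mode: no decoding happens during the seek
    seek_random_offset_binary(f)
    line = f.readline().decode('utf-8')       # whole lines always start on a character boundary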