pytorchic-bert's People

Contributors

dhlee347, kiddj

pytorchic-bert's Issues

questions for loading the pretrained_model

    def load(self, model_file, pretrain_file):
        """ load saved model or pretrained transformer (a part of model) """
        if model_file:
            print('Loading the model from', model_file)
            self.model.load_state_dict(torch.load(model_file))

        elif pretrain_file: # use pretrained transformer
            print('Loading the pretrained model from', pretrain_file)
            if pretrain_file.endswith('.ckpt'): # checkpoint file in tensorflow
                checkpoint.load_model(self.model.transformer, pretrain_file)
            elif pretrain_file.endswith('.pt'): # pretrain model file in pytorch
                self.model.transformer.load_state_dict(
                    {key[12:]: value
                        for key, value in torch.load(pretrain_file).items()
                        if key.startswith('transformer')}
                ) # load only transformer parts

Could I kindly ask what key[12:]: value means when you load a pretrained model? Is it just to keep the last layer? Thanks, I hope for your reply.
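
For context, a minimal sketch of what that slice does to a state-dict key (the key name below is invented for illustration): 'transformer.' is 12 characters long, so key[12:] strips that prefix rather than keeping only the last layer, leaving a name that matches the parameters of self.model.transformer.

key = 'transformer.blocks.0.attn.proj_q.weight'  # hypothetical key in the saved state dict
print(len('transformer.'))  # 12
print(key[12:])             # blocks.0.attn.proj_q.weight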

Does this support multi GPU training?

Hello,

Thanks for the excellent work in compressing the HF code into a single repo for BERT.

Just a couple of questions:

a) Is it possible to load pretrained BERT weights and then fine-tune on top of them on my own dataset?
b) Does this support multi GPU training?

Thanks
Abhishek
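
On (b), it is not confirmed here whether the repo's trainer handles this out of the box, but as a generic PyTorch sketch, data parallelism over multiple GPUs only needs the model wrapped before training (the model below is a stand-in, not the repo's BERT):

import torch
import torch.nn as nn

model = nn.Linear(768, 2)  # placeholder for the BERT classifier
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # splits each input batch across the visible GPUs
model.to(device)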

Can you give me some details about files?

Thank you for your great code. I'm a student and a beginner in data analysis.
I want to execute your code, but I have some questions. They may be silly, but can you give me some details about the files?


python pretrain.py \
    --train_cfg config/pretrain.json \
    --model_cfg config/bert_base.json \
    --data_file $DATA_FILE \
    --vocab $BERT_PRETRAIN/vocab.txt \
    --save_dir $SAVE_DIR \
    --max_len 512 \
    --max_pred 20 \
    --mask_prob 0.15

  1. config/pretrain.json
  2. config/bert_base.json
  3. $DATA_FILE
  4. $BERT_PRETRAIN/vocab.txt

We need $DATA_FILE as a training set, but what is vocab.txt? I can get the vocab.txt file from Google's GitHub. Should I just use it, or can I customize it? (I want to make a BERT with fewer parameters than BERT-Base.)
Also, is the output file model_steps_xxxx.pt compatible with the BERT in Google's GitHub?
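
For reference on the vocab question: vocab.txt is the WordPiece vocabulary, one token per line, where a token's line number is its id. A minimal sketch of reading it (the loading code is illustrative, not the repo's tokenizer):

def load_vocab(path):
    with open(path, encoding='utf-8') as f:
        return {token.rstrip('\n'): idx for idx, token in enumerate(f)}

vocab = load_vocab('vocab.txt')
print(len(vocab))  # 30522 for Google's uncased BERT-Base vocabulary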

Sorry, I am not an expert, so my questions may be silly. Thank you.

pretrain for chinese text

Hi, I want to pretrain on Chinese data as the data file. The format is like this:
今天 天气 好 (whitespace-tokenized; "the weather is good today")
And can I use my own vocab.txt?
Thanks a lot.

Visualizing the attention weights

Hi,

First of all, thank you so much for the great work you guys have done in your scripts.
I would like to visualize the attention weights obtained after training. I tried to use this visualization tool: https://github.com/jessevig/bertviz#attention-head-view

However, it seems to only work for the pre-trained BERT model, not the fine-tuned one we make ourselves via your script.

What would you recommend for visualization?

Thank you!
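
One hedged sketch for pulling the raw weights out of this codebase: if each transformer block's self-attention module keeps its last softmax score matrix in an attribute (the module and attribute names below are assumptions to check against models.py), they can be collected after a forward pass and handed to any plotting tool:

import torch

def collect_attention_maps(model, input_ids, segment_ids, input_mask):
    # Assumed layout (check models.py): each block keeps its last softmax scores in
    # block.attn.scores with shape (batch, heads, seq_len, seq_len).
    model.eval()
    with torch.no_grad():
        model(input_ids, segment_ids, input_mask)
    return [block.attn.scores.cpu() for block in model.transformer.blocks]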

Nice work!

I like it! You may want to check the work NVIDIA did to incorporate FP16 training in our repo. It really speeds up the model on recent GPUs (4x speed-up on a V100!).
You basically just have to change the LayerNorm module in the model and tweak the training a bit to use NVIDIA's apex.
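
A minimal sketch of what that looks like with apex's amp API, assuming apex is installed and a GPU is available; the model, optimizer, and loss here are placeholders rather than this repo's trainer:

import torch
from apex import amp  # https://github.com/NVIDIA/apex

model = torch.nn.Linear(768, 2).cuda()        # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')  # mixed precision

for _ in range(10):                            # placeholder training loop
    loss = model(torch.randn(8, 768).cuda()).mean()
    optimizer.zero_grad()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()                 # scales the loss to avoid fp16 underflow
    optimizer.step()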

Question about running the pretrain.py

Hey,

I'm having difficulty running the pretraining; any help would be appreciated.
I've prepared corpus.txt (quite small, about 1,000 lines) that looks like this:

document 1 line 1...
document 1 line 2...
document 1 line 3...

document 2 line 1...
document 2 line 2...
document 2 line 3...

When I ran pretrain.py I got an error in train.py, on this line:
print('Epoch %d/%d : Average Loss %5.3f'%(e+1, self.cfg.n_epochs, loss_sum/(i+1)))
So for the time being I commented that line out.

After running again, here is what I got:

Iter (loss=X.XXX): 0it [00:00, ?it/s]
Iter (loss=X.XXX): 0it [00:00, ?it/s]
Iter (loss=X.XXX): 0it [00:00, ?it/s]
Iter (loss=X.XXX): 0it [00:00, ?it/s]
Iter (loss=X.XXX): 0it [00:00, ?it/s]
Iter (loss=X.XXX): 0it [00:00, ?it/s]
Iter (loss=X.XXX): 0it [00:00, ?it/s]
Iter (loss=X.XXX): 0it [00:00, ?it/s]
Iter (loss=X.XXX): 0it [00:00, ?it/s]
....

Could you please point out where I might have made a mistake?
Thanks!

P.S. I have commented out part of the code in train.py (the part where it loads the checkpoint, because I haven't installed TensorFlow for a reason). What I want to do for now is train a pretrained BERT model on my own data. I am not sure whether that is causing the error above.
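
A hedged reading of the failure (the traceback itself is not shown): the "0it" tqdm lines mean the training loop body never ran, so the per-epoch print divides by zero or uses an unset loop counter. A guard like the sketch below avoids the crash, but the real question is why the data loader yields no sentence pairs for this corpus:

loss_sum, n_batches = 0.0, 0
for i, loss in enumerate([]):       # stands in for the empty batch iterator
    loss_sum += loss
    n_batches = i + 1

if n_batches:
    print('Average Loss %5.3f' % (loss_sum / n_batches))
else:
    print('The data loader produced no batches')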

some confusions

If I pretrain BERT with the masking strategy, then in principle I should be able to predict the masked words given a source sentence and a masked target sentence.
Can anyone tell me how to do that?
Thank you.

Usage

Hello, I have the following questions about the usage part:

1. How do I get the following two files?
[image]

2. For the Toronto Book Corpus, should I download it and manually adjust the format?

Looking forward to your reply, and thank you very much!

Masked subword prediction problem

In the pretrain get_loss function, loss_lm is calculated with mean().

Because of this, all zero values in loss_lm are treated as correct answers.

So I think we need to change mean() to a numerator / denominator form, like the TensorFlow implementation.

loss_lm = (loss_lm * masked_weights.float()).mean()
to
loss_lm_numerator = (loss_lm*masked_weights.float()).sum()
loss_lm_denominator = masked_weights.sum() + 1e-5
loss_lm = loss_lm_numerator / loss_lm_denominator

Is it correct?
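
A tiny numeric check of the point (the values are made up): with mean() the zero-weighted padding positions dilute the average, while the numerator/denominator form averages over the real masked positions only.

import torch

loss_lm = torch.tensor([2.0, 4.0, 0.0, 0.0])     # per-position LM loss
masked_weights = torch.tensor([1., 1., 0., 0.])  # 1 = real masked token, 0 = padding

print((loss_lm * masked_weights).mean())         # 1.5, diluted by the padding slots
num = (loss_lm * masked_weights).sum()
den = masked_weights.sum() + 1e-5
print(num / den)                                 # ~3.0, averaged over the real masks only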

Pretraining data format and possible corner case of seek_random_offset()

Hi, thank you very much for the implementation!

I'm trying to compare your implementation with the official TF BERT head-to-head with the Gutenberg dataset (since the BookCorpus dataset is no longer available now).

  1. I assume that the text input file format is the same as huggingface's implementation. Is that correct? A direct clarification of the text dataset format would be great for new users.

  2. There might be a corner case in seek_random_offset() when using a UTF-8 text dataset (like the above) for pre-training. When doing f.seek(randint(0, max_offset), 0), if the seek happens to land inside the multi-byte UTF-8 character ' (i.e. cutting \xe2\x80\x99 down to something like \x99), pretrain.py raises an error like the following:

  File "/home/tkdrlf9202/PycharmProjects/pytorchic-bert/pretrain.py", line 88, in __iter__
    seek_random_offset(self.f_neg)
  File "/home/tkdrlf9202/PycharmProjects/pytorchic-bert/pretrain.py", line 41, in seek_random_offset
    f.readline() # throw away an incomplete sentence
  File "/home/tkdrlf9202/anaconda3/envs/p36/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 0: invalid start byte

The error could be mitigated if we use

self.f_pos = open(file, "r", encoding='utf-8', errors='ignore')
self.f_neg = open(file, "r", encoding='utf-8', errors='ignore')

instead of self.f_pos = open(file, 'r') in SentPairDataLoader, but silently dropping some characters might lead to reproducibility issues (though I guess the chances are minimal, since the f.readline() right after f.seek(randint(0, max_offset), 0) is there to ditch the incomplete sequence).
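
One alternative sketch that sidesteps the decode error without dropping characters: do the random seek on a binary-mode handle and decode complete lines afterwards, so only the discarded partial line can ever sit on a broken character boundary (names and the offset value below are for illustration, not the repo's code):

from random import randint

def seek_random_offset(f_bin, max_offset):
    """Seek a binary handle to a random byte offset, then discard the partial line."""
    f_bin.seek(randint(0, max_offset), 0)
    f_bin.readline()  # throw away the (possibly mid-character) incomplete line

f_pos = open('corpus.txt', 'rb')          # binary mode: seeking anywhere is safe
seek_random_offset(f_pos, max_offset=1000000)
line = f_pos.readline().decode('utf-8')   # complete lines decode cleanly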

I'd like to hear your opinions and thanks again for the contribution!
