Comments (13)

andrelmfarias commented on May 23, 2024

@fmikaelian

The command I ran was the following:

python run_squad.py \
  --bert_model models/bert_qa_squad_v1.1 \
  --do_train \
  --fp16 \
  --do_lower_case \
  --train_file samples/dev-v1.1.json \
  --train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir logs

Yes, I looked at the BertTokenizer class. I understand that the argument fed to BertTokenizer is either the path to the saved model or the name of one of the models in PRETRAINED_VOCAB_ARCHIVE_MAP (args.bert_model at line 883 in run_squad.py).
If args.bert_model is not one of the models in PRETRAINED_VOCAB_ARCHIVE_MAP, the variable vocab_file in the from_pretrained function (line 126 in BertTokenizer) will be the path we fed to run_squad.py, in our case models/bert_qa_squad_v1.1.

If there is no vocab.txt file (variable VOCAB_NAME in tokenization.py) in the directory where we saved the model, we are going to get an error. Yes, another solution would be to drop the vocab file there, but we would be constrained to do this every time we need to retrain. Yes, we could automate this with another script, but in the end we would have to either use more scripts or change the one we have (as I decided to do).
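For reference, a minimal sketch of the resolution logic as I read it in tokenization.py (PRETRAINED_VOCAB_ARCHIVE_MAP and VOCAB_NAME are the library's names; the helper itself is hypothetical):

import os

from pytorch_pretrained_bert.tokenization import PRETRAINED_VOCAB_ARCHIVE_MAP, VOCAB_NAME

# Hypothetical helper mirroring what BertTokenizer.from_pretrained does with its argument.
def resolve_vocab_file(bert_model):
    if bert_model in PRETRAINED_VOCAB_ARCHIVE_MAP:
        # Known model name: the vocab is fetched from the mapped S3 URL.
        return PRETRAINED_VOCAB_ARCHIVE_MAP[bert_model]
    # Otherwise the argument is treated as a local directory, and vocab.txt
    # (VOCAB_NAME) must exist there, e.g. models/bert_qa_squad_v1.1/vocab.txt.
    return os.path.join(bert_model, VOCAB_NAME)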

I am committing the new script in a new branch so that we can take a look at it. There are no major changes and it is easy to understand.

andrelmfarias commented on May 23, 2024

@fmikaelian I understood we would use this model to do a second fine-tuning on the BNP dataset, as after this fine-tuning on the dev set the model should be able to generalise better (having seen more samples). I imagine however that the performance might not increase that much in comparison to the model trained on SQuAD train.

We can discuss it.

fmikaelian commented on May 23, 2024

It would be a good thing to report for each model training:

  • the commands we used
  • the data we used
  • the training time
  • ...

Also, using MLflow Tracking might make it easier for us to track the different models.
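For instance, a minimal sketch of what a tracked run could look like (the parameter names and values are just illustrative):

import mlflow

# Log the training configuration we would otherwise report by hand.
with mlflow.start_run():
    mlflow.log_param("bert_model", "models/bert_qa_squad_v1.1")
    mlflow.log_param("train_file", "samples/dev-v1.1.json")
    mlflow.log_param("learning_rate", 3e-5)
    mlflow.log_param("num_train_epochs", 2.0)
    # Training time and evaluation scores can be logged as metrics.
    mlflow.log_metric("training_time_seconds", 0.0)  # placeholder value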

fmikaelian commented on May 23, 2024

I have fetched one of the URLs for --bert_model to see what is inside:

wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz
tar xvzf bert-base-uncased.tar.gz

It's just a weights file and a model config file:

./pytorch_model.bin
./bert_config.json

I will adapt download.py to save what @andrelmfarias released under the /models folder with the same structure.
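A minimal sketch of what that adaptation could look like (the release asset URL and target directory are assumptions):

import os
import tarfile
import urllib.request

# Hypothetical download.py addition: fetch a released model archive and unpack it
# under models/ with the same pytorch_model.bin + bert_config.json structure.
def download_model(url, target_dir):
    os.makedirs(target_dir, exist_ok=True)
    archive_path, _ = urllib.request.urlretrieve(url)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target_dir)

download_model(
    "https://github.com/fmikaelian/cdQA/releases/download/bert_qa_squad_v1.1/bert_qa_squad_v1.1.tar.gz",  # assumed URL
    "models/bert_qa_squad_v1.1",
)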

Then, it seems run_squad.py just loads the model specified by --bert_model when not in training mode (e.g. --do_predict):

    if args.do_train:
        # Save a trained model and the associated configuration
        # Load a trained model and config that you have fine-tuned
        ...
    else:
        model = BertForQuestionAnswering.from_pretrained(args.bert_model)

The function from_pretrained can take as argument:

  • a path or URL to a pretrained model archive containing:
    ◦ bert_config.json: a configuration file for the model
    ◦ pytorch_model.bin: a PyTorch dump of a BertForPreTraining instance
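So both of these should work (a quick sketch; the local path assumes the archive was unpacked under models/):

from pytorch_pretrained_bert import BertForQuestionAnswering

# Named model: resolved through the pretrained archive map and downloaded.
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
# Local directory containing bert_config.json and pytorch_model.bin.
model = BertForQuestionAnswering.from_pretrained("models/bert_qa_squad_v1.1")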

To predict with the model fine-tuned on SQuAD v1.1, we need to do:

python run_squad.py \
  --bert_model models/bert_qa_squad_v1.1 \
  --do_predict \
  --fp16 \
  --do_lower_case \
  --predict_file samples/custom-sample-v2.0.json \
  --predict_batch_size 128 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir logs

Here samples/custom-sample-v2.0.json is the file to predict on, in SQuAD format.

fmikaelian commented on May 23, 2024

@andrelmfarias Should I try this and let you know if it works?

andrelmfarias commented on May 23, 2024

@fmikaelian I think if you do not select the option --do_train, it is only going to predict on samples/custom-sample-v2.0.json and is not going to run a second fine-tuning.

Don't we need to change the script in order to do the fine-tuning on SQuAD dev?

fmikaelian commented on May 23, 2024

@andrelmfarias Yes, you are right, the code snippet above is only for prediction.

I can try to generate predictions on the sample samples/custom-sample-v2.0.json with the first model you released (bert_qa_squad_v1.1).

Can you try to do the second fine-tuning on SQuAD dev to validate the workflow? Don't forget to report your actions and commands ✍️.

fmikaelian commented on May 23, 2024

Ideas for second fine-tuning:

if args.do_train and args.bert_model != 'models/bert_qa_squad_v1.1':
    # Save a trained model and the associated configuration
    model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model itself
    output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
    torch.save(model_to_save.state_dict(), output_model_file)
    output_config_file = os.path.join(args.output_dir, CONFIG_NAME)
    with open(output_config_file, 'w') as f:
        f.write(model_to_save.config.to_json_string())
    # Load a trained model and config that you have fine-tuned
    config = BertConfig(output_config_file)
    model = BertForQuestionAnswering(config)
    model.load_state_dict(torch.load(output_model_file))
else:
    model = BertForQuestionAnswering.from_pretrained(args.bert_model)

See: https://github.com/huggingface/pytorch-pretrained-BERT/blob/833774075447b5eaef92b9da92ee4ce2decf89fb/examples/run_squad.py#L1011

andrelmfarias commented on May 23, 2024

Some errors when re-training with the saved model 'models/bert_qa_squad_v1.1' using the run_squad.py script:

02/21/2019 17:14:43 - ERROR - pytorch_pretrained_bert.tokenization -   Model name './output_bert/squad_1.1_train' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed './output_bert/squad_1.1_train/vocab.txt' was a path or url but couldn't find any file associated to this path or url.
02/21/2019 17:14:44 - INFO - pytorch_pretrained_bert.modeling -   loading archive file ./output_bert/squad_1.1_train
02/21/2019 17:14:44 - INFO - pytorch_pretrained_bert.modeling -   Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

Traceback (most recent call last):
  File "run_squad.py", line 945, in main
    with open(cached_train_features_file, "rb") as reader:
FileNotFoundError: [Errno 2] No such file or directory: 'squad_data/dev_mod.json_squad_1.1_train_384_128_64'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_squad.py", line 1077, in <module>
    main()
  File "run_squad.py", line 954, in main
    is_training=True)
  File "run_squad.py", line 211, in convert_examples_to_features
    query_tokens = tokenizer.tokenize(example.question_text)
AttributeError: 'NoneType' object has no attribute 'tokenize'

It seems to be an error related to the absence of a tokenizer vocab file alongside the saved model.

I solved the problem with a custom script, run_squad_fine-tunned.py, which will be committed to the repo.

The usage of run_squad_fine-tunned.py to retrain a saved model is as below:

python run_squad_fine-tunned.py \
  --bert_model bert-base-uncased \
  --do_retrain \
  --do_predict \
  --do_lower_case \
  --train_file <path-to-train-file> \
  --predict_file squad_data/dev-v1.1.json \
  --train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir output_bert/squad_1.1_dev \
  --fine_tunned_weights <path-to-model.bin-file>
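For context, a minimal sketch of the core idea of the script (--do_retrain and --fine_tunned_weights are its new arguments; the exact loading code in the script may differ):

import torch
from pytorch_pretrained_bert import BertForQuestionAnswering

# Build the model from the stock bert-base-uncased name (so the tokenizer vocab
# resolves through PRETRAINED_VOCAB_ARCHIVE_MAP), then overwrite the weights with
# the previously fine-tuned state dict passed via --fine_tunned_weights.
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
state_dict = torch.load("models/bert_qa_squad_v1.1/pytorch_model.bin", map_location="cpu")  # example path
model.load_state_dict(state_dict)
# Training then continues as in run_squad.py.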

fmikaelian commented on May 23, 2024

@andrelmfarias

The error traceback for FileNotFoundError is weird: has the squad_data path been declared somewhere? What command did you use to get this error?

For the tokenizer error, did you look at the BertTokenizer class in the tokenization.py script? In the from_pretrained method, the tokenizer loads a vocab_file object located in PRETRAINED_VOCAB_ARCHIVE_MAP.

I'd like to see what you changed in the script. Changing the script is a strategy that we should debate.

Maybe we just need to drop the vocab file of the bert-base-uncased tokenizer under the models/bert_qa_squad_v1.1 folder?
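Something like this could do it (a one-off sketch; the vocab URL is the bert-base-uncased entry from PRETRAINED_VOCAB_ARCHIVE_MAP):

import os
import urllib.request

# Download the bert-base-uncased vocab next to the saved weights so that
# BertTokenizer.from_pretrained('models/bert_qa_squad_v1.1') finds a vocab.txt.
vocab_url = "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt"
urllib.request.urlretrieve(vocab_url, os.path.join("models/bert_qa_squad_v1.1", "vocab.txt"))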

fmikaelian commented on May 23, 2024

@andrelmfarias You released the model bert_qa_squad_v1.1_dev fine-tuned on SQuAD dev, but do you think we will use this model?

fmikaelian commented on May 23, 2024

I've just released the model trained with the sklearn wrapper: https://github.com/fmikaelian/cdQA/releases/tag/bert_qa_squad_v1.1_sklearn

This means you can load it and predict directly. You might need to reset some parameters like model.device manually (I think issue #68 was present when I trained it).
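For example, loading it could look like this (a sketch; the joblib filename is an assumption based on the release name):

import torch
from sklearn.externals import joblib  # joblib was still bundled with scikit-learn at the time

# Load the released sklearn-wrapped model and reset its device parameter manually.
model = joblib.load("bert_qa_squad_v1.1_sklearn.joblib")  # assumed filename
model.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")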

fmikaelian commented on May 23, 2024

I could make predictions but didn't evaluate them yet (see #70).
