Comments (13)
The command I ran was the following:
python run_squad.py \
--bert_model models/bert_qa_squad_v1.1 \
--do_train \
--fp16 \
--do_lower_case \
--train_file samples/dev-v1.1.json \
--train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir logs
Yes, I looked at the BertTokenizer class. I understand that the argument passed to BertTokenizer is either the path to the saved model or the name of one of the models in PRETRAINED_VOCAB_ARCHIVE_MAP (args.bert_model at line 883 in run_squad.py).
If args.bert_model is not one of the models in PRETRAINED_VOCAB_ARCHIVE_MAP, the variable vocab_file in the function from_pretrained (line 126 in BertTokenizer) is going to be the path we passed to run_squad.py, in our case models/bert_qa_squad_v1.1.
If there is no vocab.txt file (variable VOCAB_NAME in tokenization.py) in the folder where we saved the model, we are going to get an error. Yes, another solution would be to drop the vocab file there, but we would be forced to do this every time we need to retrain. Yes, we can automate this with another script, but in the end we will have to either use more scripts or change the one we have (as I decided to do).
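The lookup described above can be sketched roughly as follows. This is a simplified stand-in and not the library's code: resolve_vocab_file is a hypothetical helper, and the map holds a single placeholder entry with a made-up URL. Only the names PRETRAINED_VOCAB_ARCHIVE_MAP and VOCAB_NAME come from the discussion. Returning None on a miss is an assumption, chosen because it matches the 'NoneType' object error reported in this thread.

```python
import os

# Names mirror tokenization.py; the single entry and its URL are placeholders.
PRETRAINED_VOCAB_ARCHIVE_MAP = {
    "bert-base-uncased": "https://example.invalid/bert-base-uncased-vocab.txt",
}
VOCAB_NAME = "vocab.txt"


def resolve_vocab_file(name_or_path):
    """Return the vocab location, or None when nothing usable is found."""
    if name_or_path in PRETRAINED_VOCAB_ARCHIVE_MAP:
        # Known model name: the vocab location comes from the archive map.
        return PRETRAINED_VOCAB_ARCHIVE_MAP[name_or_path]
    vocab_file = name_or_path
    if os.path.isdir(vocab_file):
        # A saved-model directory is expected to contain vocab.txt.
        vocab_file = os.path.join(vocab_file, VOCAB_NAME)
    if not os.path.isfile(vocab_file):
        # Assumption: a miss yields None rather than raising an exception.
        return None
    return vocab_file
```

With a model folder that only holds the weights and config, resolve_vocab_file("models/bert_qa_squad_v1.1") would return None in this sketch, which is the situation discussed here.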
I am committing the new script in a new branch so that we can take a look at it. There are no major changes and it is easy to understand.
from cdqa.
@fmikaelian I understood we would use this model to do a second fine-tuning on the BNP dataset, since after this fine-tuning on the dev set the model should generalise better (it will have seen more samples). I imagine, however, that the performance might not increase much compared to the model trained only on the SQuAD train set.
We can discuss it.
from cdqa.
It would be a good thing to report for each model training:
- the commands we used
- the data we used
- the training time
- ...
Also, using MLflow Tracking might make it easier for us to track the different models.
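As a lightweight alternative while we decide on a tracking tool, the report could be a small JSON file written next to the model. The sketch below uses only the standard library; write_run_report, the field names and the run_report.json file name are all hypothetical, not something that exists in the repo.

```python
import json
import time
from pathlib import Path


def write_run_report(output_dir, command, data_files, train_seconds):
    """Write a small JSON report into output_dir and return its path."""
    report = {
        "command": command,                      # the exact command line used
        "data_files": list(data_files),          # the data the model was trained on
        "training_time_seconds": round(train_seconds, 1),
        "finished_at": time.strftime("%Y-%m-%d %H:%M:%S"),
    }
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    report_path = out / "run_report.json"
    report_path.write_text(json.dumps(report, indent=2))
    return report_path
```

It could be called at the end of training, for example with " ".join(sys.argv) as the command and the elapsed wall-clock time as train_seconds.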
from cdqa.
I fetched one of the URLs for --bert_model to see what is inside:
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz
tar xvzf bert-base-uncased.tar.gz
It's just a weights file and a model config file:
./pytorch_model.bin
./bert_config.json
I will adapt download.py to save what @andrelmfarias released under the /models folder with the same structure.
Then, it seems run_squad.py just loads the model specified by --bert_model when not in training mode (e.g. --do_predict).
if args.do_train:
    # Save a trained model and the associated configuration
    # Load a trained model and config that you have fine-tuned
else:
    model = BertForQuestionAnswering.from_pretrained(args.bert_model)
The function from_pretrained can take as argument:
- a path or URL to a pretrained model archive containing:
  - bert_config.json: a configuration file for the model
  - pytorch_model.bin: a PyTorch dump of a BertForPreTraining instance
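Before pointing --bert_model at a local folder, it may help to check that the folder has the expected layout. The sketch below is only an illustration: missing_files is a hypothetical helper, the two model file names come from the list above, and vocab.txt is included because the tokenizer discussion in this thread suggests it is looked up in the same folder.

```python
import os

# File names from the list above; vocab.txt is what the tokenizer looks for.
MODEL_FILES = ("bert_config.json", "pytorch_model.bin")
TOKENIZER_FILES = ("vocab.txt",)


def missing_files(model_dir, need_tokenizer=True):
    """Return the expected file names that are absent from model_dir."""
    expected = MODEL_FILES + (TOKENIZER_FILES if need_tokenizer else ())
    return [name for name in expected
            if not os.path.isfile(os.path.join(model_dir, name))]
```

An empty list means the folder looks complete; anything else names the files to add before running run_squad.py.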
To predict with the model fine-tuned on SQuAD v1.1, we need to run:
python run_squad.py \
--bert_model models/bert_qa_squad_v1.1 \
--do_predict \
--predict_fp16 \
--do_lower_case \
--predict_file samples/custom-sample-v2.0.json \
--predict_batch_size 128 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir logs
Here samples/custom-sample-v2.0.json is the file to predict on, in SQuAD format.
from cdqa.
@andrelmfarias Should I try this and let you know if it works?
from cdqa.
@fmikaelian I think that if you do not select the option --do_train, it is only going to predict on samples/custom-sample-v2.0.json and is not going to run a second fine-tuning.
Don't we need to change the script in order to do the fine-tuning on squad-dev?
from cdqa.
@andrelmfarias Yes, you are right, the code snippet above is only for prediction.
I can try to generate predictions on the sample samples/custom-sample-v2.0.json with the first model you released (bert_qa_squad_v1.1).
Can you try to do the second fine-tuning on squad-dev to validate the workflow? Don't forget to report your actions and commands ✍️.
from cdqa.
Ideas for second fine-tuning:
if args.do_train and args.bert_model != 'models/bert_qa_squad_v1.1':
    # Save a trained model and the associated configuration
    model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model itself
    output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
    torch.save(model_to_save.state_dict(), output_model_file)
    output_config_file = os.path.join(args.output_dir, CONFIG_NAME)
    with open(output_config_file, 'w') as f:
        f.write(model_to_save.config.to_json_string())

    # Load a trained model and config that you have fine-tuned
    config = BertConfig(output_config_file)
    model = BertForQuestionAnswering(config)
    model.load_state_dict(torch.load(output_model_file))
else:
    model = BertForQuestionAnswering.from_pretrained(args.bert_model)
from cdqa.
Some errors when re-training with saved model 'models/bert_qa_squad_v1.1' using run_squad.py script:
02/21/2019 17:14:43 - ERROR - pytorch_pretrained_bert.tokenization - Model name './output_bert/squad_1.1_train' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed './output_bert/squad_1.1_train/vocab.txt' was a path or url but couldn't find any file associated to this path or url.
02/21/2019 17:14:44 - INFO - pytorch_pretrained_bert.modeling - loading archive file ./output_bert/squad_1.1_train
02/21/2019 17:14:44 - INFO - pytorch_pretrained_bert.modeling - Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 30522
}
Traceback (most recent call last):
  File "run_squad.py", line 945, in main
    with open(cached_train_features_file, "rb") as reader:
FileNotFoundError: [Errno 2] No such file or directory: 'squad_data/dev_mod.json_squad_1.1_train_384_128_64'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "run_squad.py", line 1077, in <module>
    main()
  File "run_squad.py", line 954, in main
    is_training=True)
  File "run_squad.py", line 211, in convert_examples_to_features
    query_tokens = tokenizer.tokenize(example.question_text)
AttributeError: 'NoneType' object has no attribute 'tokenize'
The error seems to be related to the absence of a tokenizer alongside the saved model.
I solved the problem by running a custom script, run_squad_fine-tunned.py, which will be committed to the repo.
The usage of run_squad_fine-tunned.py to retrain a saved model is as below:
python run_squad_fine-tunned.py \
--bert_model bert-base-uncased \
--do_retrain \
--do_predict \
--do_lower_case \
--train_file <path-to-train-file> \
--predict_file squad_data/dev-v1.1.json \
--train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir output_bert/squad_1.1_dev \
--fine_tunned_weights <path-to-model.bin-file>
from cdqa.
The error traceback for FileNotFoundError is weird: has the squad_data path been declared somewhere? What command did you use to get this error?
For the tokenizer error, did you look at the BertTokenizer class in the tokenization.py script? The tokenizer's from_pretrained method loads a vocab_file looked up in PRETRAINED_VOCAB_ARCHIVE_MAP.
I'd like to see what you changed in the script. Changing the script is a strategy that we should debate.
Maybe we just need to drop the vocab file of the bert-base-uncased tokenizer under the models/bert_qa_squad_v1.1 folder?
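The "drop the vocab file in the folder" option could be automated with a few lines. This is a minimal sketch under stated assumptions: ensure_vocab is a hypothetical helper, and vocab_source is assumed to be a local copy of the bert-base-uncased vocab file that has already been downloaded.

```python
import os
import shutil


def ensure_vocab(model_dir, vocab_source, vocab_name="vocab.txt"):
    """Copy vocab_source into model_dir unless a vocab file is already there."""
    target = os.path.join(model_dir, vocab_name)
    if not os.path.isfile(target):
        os.makedirs(model_dir, exist_ok=True)
        shutil.copyfile(vocab_source, target)
    return target
```

Calling it once per saved model, before retraining, would leave run_squad.py unchanged; an existing vocab file in the folder is never overwritten.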
from cdqa.
@andrelmfarias You released the model bert_qa_squad_v1.1_dev fine-tuned on the SQuAD dev set, but do you think we will use this model?
from cdqa.
I've just released the model trained with the sklearn wrapper: https://github.com/fmikaelian/cdQA/releases/tag/bert_qa_squad_v1.1_sklearn
This means you can load it and predict directly. You might need to reset some parameters like model.device manually (I think issue #68 was still open when I trained it).
from cdqa.
I could make predictions but haven't evaluated it yet (see #70).
from cdqa.