nagakiran1 / extending-google-bert-as-question-and-answering-model-and-chatbot

The BERT question-and-answering system works well only on short text, on the order of one to two paragraphs; it cannot answer well from an understanding of more than about 10 pages of data. We can extend the BERT question-and-answering model to work as a chatbot on large text. To handle more than 10 pages of data, we use a specific approach for picking the relevant data.
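As a rough illustration of the "picking the data" idea (the repository's actual similarity scoring is not shown here, so this is a hypothetical sketch using simple word overlap in its place): split the large document into paragraphs, score each against the question, and hand only the best match to the BERT QA model.

```python
def pick_passage(question: str, document: str) -> str:
    """Return the paragraph of `document` most relevant to `question`.

    Word overlap stands in for the repo's real similarity scoring,
    purely to illustrate the retrieve-then-answer pattern.
    """
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    q_words = set(question.lower().split())

    def overlap(paragraph: str) -> int:
        return len(q_words & set(paragraph.lower().split()))

    return max(paragraphs, key=overlap)


doc = "BERT is a language model.\n\nParis is the capital of France."
print(pick_passage("What is the capital of France?", doc))
```

Only the selected paragraph (rather than all 10+ pages) is then passed to BERT as the answering context, which keeps the input within the model's length limit.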

License: Apache License 2.0

Jupyter Notebook 37.77% Python 62.23%

extending-google-bert-as-question-and-answering-model-and-chatbot's Introduction

extending-google-bert-as-question-and-answering-model-and-chatbot's People

Contributors

nagakiran1

extending-google-bert-as-question-and-answering-model-and-chatbot's Issues

Preparing the dataset to fine-tune the model

Hi Naga,
I'm not sure I understand how to prepare a custom dataset to fine-tune BERT.
When you say "The data wants to for the Questioning the BERT has to be copied in simple text file, delimiting with three '\n'.", what does "delimiting with three '\n'" mean?
As I see it, your data.txt file has 2 empty lines between paragraphs. Is that all you need to properly separate them?
Thank you for your repo and great articles. I'm very new to all this and you are the one who explained it very clearly, though my lack of expertise in the field keeps me from understanding a couple of details.
Thanks again.
Vincenzo

Unknown command line flag 'f'

Hi Naga,
I'm trying to make a Colab notebook from your repo to try it out.
I updated optimization.py to be compatible with TF 2.0.
After this I solved the flag-not-defined errors by using the absl library for both app and flags,
but now when I run it I get a fatal error with exit code 1:

FATAL Flags parsing error: Unknown command line flag 'f'
Pass --helpshort or --helpfull to see help on flags.
An exception has occurred, use %tb to see the full traceback.

SystemExit: 1

Thank you again
Vincenzo
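A common workaround for this particular error in Colab/Jupyter (assuming the script parses its arguments with absl flags, as the standard BERT scripts do): Jupyter launches the kernel with a `-f <connection-file>` argument, and absl aborts on any flag it does not recognize, so registering a dummy `f` flag before parsing absorbs it.

```python
from absl import flags

# Jupyter/Colab passes "-f <kernel-connection-file>" to the process;
# absl exits with "Unknown command line flag 'f'" unless the flag is
# defined, so register a throwaway string flag to absorb it.
flags.DEFINE_string('f', '', 'dummy flag to absorb the Jupyter kernel -f argument')
```

After this, running the notebook cell that invokes the script's `main` should no longer trigger `SystemExit: 1` on the `-f` argument.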

input_file not defined & other issues

Hello,
your work is very interesting to me but I'm finding some issues while running the code.
I fixed a couple of minor bugs that I faced running your code:

  • at line 281, input_file is not defined --> I just commented out lines 281, 282 and 283 and it works (they are not necessary for the goal of your app)
  • I added import time at the beginning, otherwise time.time() is not defined at line 1293

After that, I managed to run your code and also to replicate the result you show for the question reported in your post.
However, with any other question, the answer is wrong or of very poor quality.

For example, if you ask "What is The Strength of Bidirectionality", which is almost the exact question, the answer is ").,The".

Or if I ask "what did we open source this week?" the answer is "words in the sentence DocumentData.txt The pre-trained model can then be fine-tuned on small-data NLP tasks like"

Is there maybe an issue with the current release? Could you please try the same questions and check the answers?

Have you tried to formulate other questions on your Context sentences or maybe with different data? I'm interested in this approach and I'm curious if it may be robust enough.

Thank you in advance!
Emanuele

Instructions: how to retrain on a different dataset

Hi Naga,

Hope you are doing well.
Is it possible for you to share instructions on how to retrain your model on a different dataset? I am very interested in testing your model on a new dataset.

Thanks in advance.
Luke

Different output for same input

The idea is really great. I tried the interactive mode in run_squad.py and found that for the same question and input data, the output differs every time. Do you already know about this? Or is this something that has to be fixed?

Thanks!
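One common cause of run-to-run differences is unseeded randomness. A generic sketch of pinning the seeds (the helper name is illustrative; whether this fully fixes the repo's nondeterminism is an assumption to verify):

```python
import random

import numpy as np


def seed_everything(seed: int = 0) -> None:
    """Pin the Python and NumPy RNGs so repeated runs draw the same
    random numbers. For TensorFlow 1.x you would additionally call
    tf.set_random_seed(seed) (tf.random.set_seed in TF 2.x)."""
    random.seed(seed)
    np.random.seed(seed)
```

Note that even with seeds pinned, some TensorFlow ops are nondeterministic on GPU, so identical outputs are not always guaranteed.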

No module named _bz2

Hi Naga Kiran,

I am trying to execute the code and I am getting an error: "No module named _bz2".
I tried to fix it, but I still get the error. Can you help me with this?
bert_error

Random outputs on the demo questions

I have tried to reproduce the demo questions in interactive mode. For example, on the question 'what are bidirectional usage of BERT' I got the following:

what are bidirectional usage of BERT
I1113 21:57:44.327840 140736219419520 keyedvectors.py:727] Removed 4 and 0 OOV words from document 1 and 2 (respectively).
I1113 21:57:44.328052 140736219419520 dictionary.py:209] adding document #0 to Dictionary(0 unique tokens: [])
I1113 21:57:44.328258 140736219419520 dictionary.py:216] built Dictionary(19 unique tokens: ['accuracy', 'answering', 'compared', 'datasets', 'fine-tuned']...) from 2 documents (total 19 corpus positions)
I1113 21:57:44.337434 140736219419520 keyedvectors.py:727] Removed 4 and 0 OOV words from document 1 and 2 (respectively).
I1113 21:57:44.337589 140736219419520 dictionary.py:209] adding document #0 to Dictionary(0 unique tokens: [])
I1113 21:57:44.337722 140736219419520 dictionary.py:216] built Dictionary(12 unique tokens: ['bidirectional', 'called', 'encoder', 'new', 'nlp']...) from 2 documents (total 13 corpus positions)
I1113 21:57:44.338706 140736219419520 keyedvectors.py:727] Removed 5 and 0 OOV words from document 1 and 2 (respectively).
I1113 21:57:44.338789 140736219419520 dictionary.py:209] adding document #0 to Dictionary(0 unique tokens: [])
I1113 21:57:44.338906 140736219419520 dictionary.py:216] built Dictionary(15 unique tokens: ['30', 'answering', 'cloud', 'hours', 'minutes']...) from 2 documents (total 16 corpus positions)
I1113 21:57:44.340019 140736219419520 keyedvectors.py:727] Removed 1 and 0 OOV words from document 1 and 2 (respectively).
I1113 21:57:44.340101 140736219419520 dictionary.py:209] adding document #0 to Dictionary(0 unique tokens: [])
I1113 21:57:44.340200 140736219419520 dictionary.py:216] built Dictionary(5 unique tokens: ['bert', 'makes', 'what', 'bidirectional', 'usage']) from 2 documents (total 6 corpus positions)
I1113 21:57:44.340631 140736219419520 keyedvectors.py:727] Removed 6 and 0 OOV words from document 1 and 2 (respectively).
I1113 21:57:44.340713 140736219419520 dictionary.py:209] adding document #0 to Dictionary(0 unique tokens: [])
I1113 21:57:44.340828 140736219419520 dictionary.py:216] built Dictionary(11 unique tokens: ['bert', 'builds', 'contextual', 'generative', 'including']...) from 2 documents (total 12 corpus positions)
I1113 21:57:44.341884 140736219419520 keyedvectors.py:727] Removed 8 and 0 OOV words from document 1 and 2 (respectively).
I1113 21:57:44.342021 140736219419520 dictionary.py:209] adding document #0 to Dictionary(0 unique tokens: [])
I1113 21:57:44.342170 140736219419520 dictionary.py:216] built Dictionary(11 unique tokens: ['bert', 'corpus', 'deeply', 'language', 'plain']...) from 2 documents (total 12 corpus positions)
I1113 21:57:44.343141 140736219419520 keyedvectors.py:727] Removed 4 and 0 OOV words from document 1 and 2 (respectively).
I1113 21:57:44.343286 140736219419520 dictionary.py:209] adding document #0 to Dictionary(0 unique tokens: [])
I1113 21:57:44.343441 140736219419520 dictionary.py:216] built Dictionary(6 unique tokens: ['if', 'strength', 'the', 'bert', 'bidirectional']...) from 2 documents (total 6 corpus positions)
I1113 21:57:44.344067 140736219419520 keyedvectors.py:727] Removed 1 and 0 OOV words from document 1 and 2 (respectively).
I1113 21:57:44.344172 140736219419520 dictionary.py:209] adding document #0 to Dictionary(0 unique tokens: [])
I1113 21:57:44.344314 140736219419520 dictionary.py:216] built Dictionary(16 unique tokens: ['conditioned', 'consider', 'efficiently', 'models', 'predicting']...) from 2 documents (total 16 corpus positions)
coder Representations from Transformers, or BERT.,With this release, anyone in the world can train their own state-of

It looks like we get a partial response:
'coder Representations from Transformers, or BERT.,With this release, anyone in the world can train their own state-of'

Am I missing something in the configuration?

Error in running run_squad.py for Q&A

I ran the command below and received the error shown. Please help with what additional argument to pass.

export BERT_BASE_DIR=~/models/cased_L-12_H-768_A-12
python3 run_squad.py \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --do_predict=True \
  --do_lower_case=False \
  --output_dir=../../output_chatbot_data \
  --question="Which NFL team represented the AFC at Super Bowl 50" \
  --context="Super Bowl 50 was an American football game to determine the champion of the National Football League for the 2015 season."

Returns Error:
File "run_squad.py", line 1131, in main
validate_flags_or_throw(bert_config)
File "run_squad.py", line 1112, in validate_flags_or_throw
"If do_predict is True, then predict_file must be specified.")
ValueError: If do_predict is True, then predict_file must be specified.
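Going by the error message itself, `run_squad.py` requires a `--predict_file` whenever `--do_predict=True`, and that file is expected in SQuAD JSON format. A sketch (the helper name and output filename are my own; the field names follow the SQuAD v1.1 schema) that wraps a single question/context pair into such a file, which could then be passed as `--predict_file=predict.json`:

```python
import json


def make_predict_file(question: str, context: str, path: str = "predict.json") -> str:
    """Write a minimal SQuAD v1.1-style prediction file:
    data -> paragraphs -> qas, with an id and question per entry."""
    squad = {
        "version": "1.1",
        "data": [{
            "title": "adhoc",
            "paragraphs": [{
                "context": context,
                "qas": [{"id": "q1", "question": question}],
            }],
        }],
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(squad, f)
    return path


make_predict_file(
    "Which NFL team represented the AFC at Super Bowl 50?",
    "Super Bowl 50 was an American football game to determine the champion "
    "of the National Football League for the 2015 season.",
)
```

Whether this fork's custom `--question`/`--context` flags are meant to replace `--predict_file` is unclear from the error; the standard BERT script only validates `predict_file`.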

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 680: invalid start byte

When doing:

python run_squad.py \
  --vocab_file=BERT_BASE_DIR/multi_cased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=BERT_BASE_DIR/multi_cased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=BERT_BASE_DIR/multi_cased_L-12_H-768_A-12/bert_model.ckpt \
  --do_predict=True \
  --interact=True \
  --context=data.txt \
  --output_dir=Data

I get this error:

/Extending-Google-BERT-as-Question-and-Answering-model-and-Chatbot/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 680: invalid start byte

Any idea?
I am using Python 3.6.5 and TensorFlow 1.11.0.
