ncbi-nlp / bluebert
BlueBERT, pre-trained on PubMed abstracts and clinical notes (MIMIC-III).
Home Page: https://arxiv.org/abs/1906.05474
License: Other
Hi,
Pre-training with BERT:
When you pre-trained BERT on PubMed abstracts, did you feed it individual sentences or entire abstracts?
After running relation extraction on the ChemProt data, I get one TensorFlow record file and one TSV file. The TSV contents are as follows:
0.15294948 0.14690137 0.14512947 0.17709649 0.15383159 0.22409162
0.17051792 0.11182242 0.13569161 0.20062451 0.17994083 0.20140274
0.15868504 0.12534 0.13230026 0.16700414 0.17487246 0.24179816
0.15179233 0.15275687 0.1445464 0.18002915 0.15485005 0.21602511
0.1666515 0.112129934 0.12779407 0.16605248 0.17233992 0.25503212
0.17778598 0.1046794 0.120703295 0.16985807 0.1723527 0.25462052
0.16187875 0.11779226 0.13078773 0.16470376 0.17073172 0.25410575
0.16386983 0.11533519 0.13250493 0.16242501 0.17533559 0.25052953
0.16774821 0.11089808 0.13004696 0.16742839 0.17313573 0.25074264
0.17160876 0.10854226 0.12566698 0.16681188 0.17577308 0.251597
0.16189644 0.11701936 0.1346444 0.16580313 0.17096885 0.24966782
0.16152743 0.11757539 0.13271543 0.16380313 0.17377672 0.25060192
0.16656019 0.111814655 0.12732682 0.16468228 0.17510706 0.25450897
0.16650678 0.11021362 0.12661286 0.16736448 0.17467242 0.25462982
0.16048956 0.1193142 0.13183114 0.16296855 0.17130655 0.25409
0.1627501 0.112531364 0.12947361 0.16601284 0.17675503 0.25247702
0.15960562 0.11962788 0.1316634 0.1611976 0.17152742 0.2563781
0.16427976 0.112791955 0.12961149 0.16529398 0.17306624 0.2549565
0.15768206 0.12637071 0.13386849 0.16196251 0.16626643 0.25384983
0.15626168 0.12293934 0.13550161 0.16331881 0.16819012 0.2537883
0.1596326 0.121414356 0.13156259 0.16039251 0.1704628 0.25653514
0.16387884 0.11341449 0.12954234 0.16422358 0.17295827 0.25598237
0.15796445 0.127367 0.13383277 0.16219005 0.16565523 0.25299048
0.15737453 0.12230895 0.13421234 0.16285665 0.16942641 0.25382107
0.16121775 0.11749335 0.13174683 0.1615969 0.17064542 0.25729972
0.16171923 0.115740985 0.13092318 0.16363151 0.17172872 0.25625637
0.15772855 0.12557307 0.13428524 0.16082649 0.16551289 0.25607374
0.16506064 0.11239521 0.12870485 0.16525626 0.17356135 0.2550217
0.16727306 0.11162333 0.12815087 0.16378869 0.17512858 0.25403544
0.16821328 0.10925984 0.12688008 0.16718858 0.1744711 0.25398713
0.16123983 0.119652875 0.13235311 0.16205864 0.17105964 0.25363597
0.16371804 0.11169157 0.12968682 0.16627903 0.17704971 0.2515748
0.16859773 0.11070727 0.127372 0.16412714 0.17533523 0.25386068
How do I interpret these results?
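One way to read this output, as a sketch: each row appears to hold one probability per relation class for a single test example (the six values in a row sum to roughly 1), so the predicted class is the column with the largest value. The label order below is an assumption; check `get_labels()` of the ChemProt processor in `run_bluebert.py` for the order actually used.

```python
# Assumed label order; verify against the processor's get_labels() before relying on it.
ASSUMED_LABELS = ["CPR:3", "CPR:4", "CPR:5", "CPR:6", "CPR:9", "false"]


def predict_labels(rows, labels=ASSUMED_LABELS):
    """Map each row of class probabilities to (label, probability) of the top class."""
    results = []
    for row in rows:
        probs = [float(x) for x in row.split()]
        if len(probs) != len(labels):
            raise ValueError(f"expected {len(labels)} columns, got {len(probs)}")
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((labels[best], probs[best]))
    return results


# First two rows of the output shown above.
sample_rows = [
    "0.15294948 0.14690137 0.14512947 0.17709649 0.15383159 0.22409162",
    "0.17051792 0.11182242 0.13569161 0.20062451 0.17994083 0.20140274",
]
for label, prob in predict_labels(sample_rows):
    print(label, round(prob, 4))
```

In every row shown, the largest value sits in the last column and is only around 0.22 to 0.25, so the model is barely more confident than a uniform guess. That may mean fine-tuning did not converge, though the output alone cannot confirm it.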
Hi. I'm referring to line #295 in run_bluebert.py. I noticed that if the setting is test, then the labels are all neutral. What is the reason for this? Judging by NLI data processors elsewhere, such as those on HuggingFace, it seems more appropriate to set the labels to None for the test set. Please let me know if I'm mistaken. Thanks!
bluebert/bluebert/run_bluebert.py
Line 295 in ccc828c
While running the inference task on MedNLI with the command below, I encountered the following error. I downloaded the BlueBERT-Base, Uncased, PubMed+MIMIC-III model for this inference. I would appreciate advice on this error and on how to run inference on a custom NLI example (i.e., from a .txt file).
Command
python bluebert/run_bluebert.py \
  --do_train=true \
  --do_eval=false \
  --do_predict=true \
  --task_name="mednli" \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --num_train_epochs=10.0 \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR \
  --do_lower_case=true
Logs
Traceback (most recent call last):
File "bluebert/run_bluebert.py", line 912, in <module>
tf.app.run()
File "/home/dingsihan/.local/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/dingsihan/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/dingsihan/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "bluebert/run_bluebert.py", line 762, in main
per_host_input_for_training=is_per_host))
File "/home/dingsihan/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_config.py", line 233, in __init__
super(RunConfig, self).__init__(**kwargs)
File "/home/dingsihan/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/run_config.py", line 538, in __init__
compat_internal.path_to_str(model_dir))
File "/home/dingsihan/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/run_config.py", line 944, in _get_model_dir
raise ValueError('model_dir should be non-empty.')
ValueError: model_dir should be non-empty.
Environment
Platform: Ubuntu 18.04
Tensorflow Version: 1.15.2
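The ValueError is raised while the run configuration is built from the output directory, so a likely cause (an assumption, since the shell setup is not shown) is that `$OUTPUT_DIR` was unset or empty when the command ran, leaving `--output_dir=` with an empty string. A small pre-flight check along these lines can catch that. The variable names mirror the command above, and `missing_env_vars` is a hypothetical helper, not part of the repository.

```python
import os

# Environment variables referenced by the command above.
REQUIRED_VARS = ("BlueBERT_DIR", "DATASET_DIR", "OUTPUT_DIR")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name, "").strip()]


missing = missing_env_vars()
if missing:
    print("Set these variables before running:", ", ".join(missing))
else:
    print("All required variables are set.")
```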
Hi, thank you for sharing the code!
I am trying to run the Named Entity Recognition task, but I can't find "train.tsv" or "devel.tsv" in the BC5CDR dataset. Instead, the train/devel/test data are in ".txt" format. If I simply rename the ".txt" files to ".tsv" and run, I get KeyError: 'clonidine.'
Could you tell me exactly what input format the NER task needs? It would be great if you could share the preprocessing code that takes a title and abstract as input.
Thank you in advance.
My repo has a dependency on the bluebert repo and I'm trying to specify the dependency using dependency_links in my setup.cfg file and using install_command in my tox.ini file. The install_command fails because the bluebert repo does not have a setup.py file. Would it be possible for you to create one? If not, would you know how I can clone the repo in my tox.ini file without error?
Hi
How can I use this model to predict labels on a new dataset?
Here is the command I am using:
python run_bluebert_ner.py --do_prepare=false --do_train=false --do_eval=false --do_predict=true --task_name="bc5cdr" --vocab_file=vocab.txt --bert_config_file=bert_config.json --init_checkpoint=model.ckpt-1516 --num_train_epochs=1.0 --do_lower_case=False --data_dir=no_label_dataset --output_dir=no_label_output/
I set only "do_predict" to true because I want to see the labels that the fine-tuned model "model.ckpt-1516" predicts.
Here is the sample data without tags:
oxacalcitriol
suppresses
secondary
hyperparathyroidism
without
inducing
low
bone
turnover
in
dogs
with
renal
failure
Is it necessary to have labels in the new dataset as well?
Thanks
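If the NER reader expects a label on every token line (an assumption; the exact column layout should be checked against run_bluebert_ner.py), one workaround is to attach a dummy `O` tag to each unlabeled token so the file parses, and then ignore the gold column in the prediction output. A minimal sketch, where `add_placeholder_labels` is a hypothetical helper and tab-separated token/label lines with a blank line between sentences are assumed:

```python
def add_placeholder_labels(lines, placeholder="O"):
    """Append a dummy tag to every token line; keep blank lines as sentence breaks."""
    out = []
    for line in lines:
        token = line.strip()
        out.append(f"{token}\t{placeholder}" if token else "")
    return out


# A few tokens from the sample above, with an empty string marking a sentence break.
tokens = ["oxacalcitriol", "suppresses", "secondary", "hyperparathyroidism", "", "renal", "failure"]
for row in add_placeholder_labels(tokens):
    print(row)
```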
Google recently released two new BERT models trained with the Whole Word Masking strategy (BERT-Large(Base), Uncased (Whole Word Masking)). Do you plan to pre-train new NCBI models based on this release?
Hi,
A download link for the BERT model is missing from the README file. I've copied the line where the link is missing below.
"The pre-trained NCBI BERT weights, vocab, and config files is ."
Thank you for your help.
Best
When I run the ELMo code multiple times on the same data, the results vary significantly and exceed those reported in the literature. What am I doing wrong?
The script I'm running:
python3 elmoft.py \
--task bc5cdr-chem \
--seq2vec boe \
--options_path /path/to/options.json \
--weights_path /path/to/weights.hdf5 \
--maxlen 128 \
--fchdim 500 \
--lr 0.001 \
--pdrop 0.5 \
--do_norm \
--norm_type batch \
--do_lastdrop \
--initln \
--earlystop \
--epochs 20 \
--bsize 64 \
--data_dir /path/to/data
The pre-trained model files weights.hdf5 and options.json were downloaded from:
ELMo PubMed AllenNLP
The code outputs the following F1 scores for task bc5cdr-chem (the literature reports around 91.5% for ELMo):
accuracy: 0.9943132108
macro avg: 0.9489234576
weighted avg: 0.9941723561
The code outputs the following F1 scores for task bc5cdr-dz (the literature reports around 83.9% for ELMo):
accuracy: 0.988988989
macro avg: 0.909805591
weighted avg: 0.9888870565
The datasets were downloaded from:
bert_data.zip
And two additional columns were added, so that the labels are in the column that the code expects.
Am I doing something wrong? Or is it a bug in the implementation?
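One possible explanation, offered as an assumption rather than a diagnosis: the accuracy, macro avg, and weighted avg lines look like token-level classification scores, while NER results in the literature are usually entity-level F1, where a predicted mention counts only if its boundaries and type match exactly. Token-level scores are dominated by the frequent `O` tag and can be much higher. The toy example below (made-up tags, not benchmark data) illustrates the gap.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence."""
    spans = set()
    start, kind = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # trailing "O" closes an open span
        inside = tag.startswith("I-") and start is not None and tag[2:] == kind
        if not inside:
            if start is not None:
                spans.add((start, i, kind))
                start, kind = None, None
            if tag.startswith("B-") or tag.startswith("I-"):
                start, kind = i, tag[2:]
    return spans


def entity_f1(gold, pred):
    """Exact-match entity-level F1 between two BIO tag sequences."""
    g, p = extract_spans(gold), extract_spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def token_accuracy(gold, pred):
    """Fraction of tokens whose tag matches exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


gold = ["O", "B-Chem", "I-Chem", "O", "O", "O", "B-Chem", "O", "O", "O"]
pred = ["O", "B-Chem", "O", "O", "O", "O", "B-Chem", "O", "O", "O"]
print("token accuracy:", token_accuracy(gold, pred))
print("entity F1:", entity_f1(gold, pred))
```

A single wrong token here costs one of ten tokens but one of two entities, which is why the two kinds of score should not be compared directly.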
Hi,
I have an issue evaluating my pre-trained model on the MedNLI dataset.
The provided MedNLI dataset is in JSONL format, but MedNLIProcessor in run_bluebert.py reads TSV files.
Could you tell me how to resolve this?
Best,
Thanks
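A minimal conversion sketch, under two assumptions that should be checked before use: that each JSONL record carries `gold_label`, `pairID`, `sentence1`, and `sentence2` keys, and that the processor reads a header row followed by tab-separated columns in that order. Adjust `COLUMNS` to match whatever MedNLIProcessor actually indexes. The sample record is invented for illustration.

```python
import json

# Assumed column order; adjust to match the indices used by the processor.
COLUMNS = ("gold_label", "pairID", "sentence1", "sentence2")


def jsonl_to_tsv(jsonl_text, columns=COLUMNS):
    """Convert JSONL text to TSV text with a header row."""
    rows = ["\t".join(columns)]
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Collapse whitespace so embedded tabs or newlines cannot break a row.
        fields = [" ".join(str(record.get(col, "")).split()) for col in columns]
        rows.append("\t".join(fields))
    return "\n".join(rows) + "\n"


sample = (
    '{"sentence1": "No fever.", "sentence2": "Patient is afebrile.", '
    '"gold_label": "entailment", "pairID": "x1"}\n'
)
print(jsonl_to_tsv(sample), end="")
```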
Hello,
Is there code for running the tasks from saved PyTorch checkpoints, rather than the TF checkpoints used in the original BlueBERT evaluations? I initially thought of converting a saved PyTorch checkpoint to ONNX and then to TF, but I believe that yields a .pb file that doesn't include the weights needed for loading. If there is no such code, do you have suggestions?
Thanks for your time.
Hello,
I would like to get sentence similarity and Named Entity Recognition results for a couple of documents. How can I use the pre-trained models to run on new data and get outputs?
Thanks in advance, and sorry for any inconvenience.
When running run_bluebert_multi_labels.py, I get the following error after the data is written to the TFRecord file.
ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((28996, 768)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 768]) from checkpoint reader.
Here is the traceback:
INFO:tensorflow:***** Running training *****
INFO:tensorflow: Num examples = 20137
INFO:tensorflow: Batch size = 32
INFO:tensorflow: Num steps = 1887
WARNING:tensorflow:From bluebert/run_bluebert_multi_labels.py:389: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.
WARNING:tensorflow:From /home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
WARNING:tensorflow:From bluebert/run_bluebert_multi_labels.py:425: map_and_batch (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
WARNING:tensorflow:From /home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow/contrib/data/python/ops/batching.py:273: map_and_batch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map(map_func, num_parallel_calls)` followed by `tf.data.Dataset.batch(batch_size, drop_remainder)`. Static tf.data optimizations will take care of using the fused implementation.
WARNING:tensorflow:From bluebert/run_bluebert_multi_labels.py:398: The name tf.parse_single_example is deprecated. Please use tf.io.parse_single_example instead.
WARNING:tensorflow:From bluebert/run_bluebert_multi_labels.py:405: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Running train on CPU
INFO:tensorflow:*** Features ***
INFO:tensorflow: name = input_ids, shape = (32, 128)
INFO:tensorflow: name = input_mask, shape = (32, 128)
INFO:tensorflow: name = is_real_example, shape = (32,)
INFO:tensorflow: name = label_ids, shape = (32, 19)
INFO:tensorflow: name = segment_ids, shape = (32, 128)
WARNING:tensorflow:From /home/aditya/Projects/RD/ncbi_bluebert/bert/modeling.py:172: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
WARNING:tensorflow:From /home/aditya/Projects/RD/ncbi_bluebert/bert/modeling.py:411: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.
WARNING:tensorflow:From /home/aditya/Projects/RD/ncbi_bluebert/bert/modeling.py:492: The name tf.assert_less_equal is deprecated. Please use tf.compat.v1.assert_less_equal instead.
WARNING:tensorflow:From /home/aditya/Projects/RD/ncbi_bluebert/bert/modeling.py:359: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /home/aditya/Projects/RD/ncbi_bluebert/bert/modeling.py:673: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
WARNING:tensorflow:From /home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow/python/ops/nn_impl.py:180: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From bluebert/run_bluebert_multi_labels.py:546: The name tf.train.init_from_checkpoint is deprecated. Please use tf.compat.v1.train.init_from_checkpoint instead.
ERROR:tensorflow:Error recorded from training_loop: Shape of variable bert/embeddings/word_embeddings:0 ((28996, 768)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 768]) from checkpoint reader.
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
File "bluebert/run_bluebert_multi_labels.py", line 876, in <module>
main(args)
File "bluebert/run_bluebert_multi_labels.py", line 781, in main
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2876, in train
rendezvous.raise_errors()
File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 131, in raise_errors
six.reraise(typ, value, traceback)
File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/six.py", line 696, in reraise
raise value
File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2871, in train
saving_listeners=saving_listeners)
File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
features, labels, ModeKeys.TRAIN, self.config)
File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2709, in _call_model_fn
config)
File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2967, in _model_fn
features, labels, is_export_mode=is_export_mode)
File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1549, in call_without_tpu
return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1867, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "bluebert/run_bluebert_multi_labels.py", line 546, in model_fn
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow/python/training/checkpoint_utils.py", line 291, in init_from_checkpoint
init_from_checkpoint_fn)
File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1684, in merge_call
return self._merge_call(merge_fn, args, kwargs)
File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1691, in _merge_call
return merge_fn(self._strategy, *args, **kwargs)
File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow/python/training/checkpoint_utils.py", line 286, in <lambda>
ckpt_dir_or_file, assignment_map)
File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow/python/training/checkpoint_utils.py", line 329, in _init_from_checkpoint
tensor_name_in_ckpt, str(variable_map[tensor_name_in_ckpt])
ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((28996, 768)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 768]) from checkpoint reader.
I think this is a vocabulary size mismatch: my pre-trained model has a vocab size of 28996, while 30522 appears to be the vocab size of the BERT base model.
How can this error be resolved?
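The two sizes in the error match the standard BERT-Base vocabularies (28,996 entries for cased, 30,522 for uncased), so one plausible cause is that the bert_config.json or vocab.txt in use comes from a cased model while the checkpoint is uncased, or the reverse. A quick consistency check is sketched below; `check_vocab_consistency` is a hypothetical helper, and it assumes the config stores the size under a `vocab_size` key, as BERT configs do.

```python
import json


def check_vocab_consistency(vocab_path, config_path):
    """Compare the number of lines in vocab.txt with vocab_size in the BERT config.

    Returns (vocab_entries, config_size, sizes_match).
    """
    with open(vocab_path, encoding="utf-8") as f:
        vocab_entries = sum(1 for _ in f)
    with open(config_path, encoding="utf-8") as f:
        config_size = json.load(f)["vocab_size"]
    return vocab_entries, config_size, vocab_entries == config_size


# Example call (paths are placeholders for the files passed to the script):
# print(check_vocab_consistency("vocab.txt", "bert_config.json"))
```

If both numbers agree with each other but not with the checkpoint, the checkpoint passed via --init_checkpoint probably belongs to a different model than the config and vocabulary.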
Hi,
I would like to try mt-bluebert-biomedical, but the download link provides only a .pt file; the vocab and config files are missing. Could you point me to where I can get these two?
Thanks
In run_bluebert_multi_labels.py, how can I print metrics per step (at least the loss) instead of global_step/sec and example/sec?
Hi there,
I found that the number of phrases detected differs from the paper when I use conlleval.py to assess NER predictions on ShARe/CLEFE.
For example:
from conlleval import evaluate, metrics, report_notprint
# Test.tsv#L112-L118 in ShARe/CLEFE
text = """\
The O O
left B B
atrium I I
is O O
moderately O O
dilated I O
. O O
"""
seq = text.split('\n')
count = evaluate(seq)
print(metrics(count)[0])
print(''.join(report_notprint(count)))
The output is:
Metrics(tp=1, fp=0, fn=1, prec=1.0, rec=0.5, fscore=0.6666666666666666)
processed 7 tokens with 2 phrases; found: 1 phrases; correct: 1.
accuracy: 85.71%; precision: 100.00%; recall: 50.00%; FB1: 66.67
: precision: 100.00%; recall: 50.00%; FB1: 66.67 1
It seems that you treat "left atrium dilated" as one phrase in the paper, but conlleval.py counts it differently.
Is there a good way to handle this?
Thanks!
Hello,
After running the command for the sentence similarity task, I received an error saying "No module named bert", even though the bert folder is in the correct place. I could not figure out why the error is raised on line 23 of "run_bluebert_sts.py". I tried installing bert with "pip install bert", which then failed with "modeling.py: No such file or directory". I then installed the TensorFlow version with "pip install bert-tensorflow". That resolved the problem but produced a new one:
"tensorflow.python.framework.errors_impl.NotFoundError: /vocab.txt; No such file or directory"
Could you please update "requirements.txt"? I couldn't find the TensorFlow version it specifies, i.e., 1.12.1.
Thanks in advance. Sorry for the inconvenience.
Hi, I just checked out this repo. It seems that there is no cased version. Do you plan to release one as well? Thanks!
In elmoft.py, line 114, the variable seqlen is used but never defined.
class ELMoClfHead(BaseClfHead):
    def __init__(...):
        ...
        if task_type == 'nmt':
            ...
            self.norm = NORM_TYPE_MAP[norm_type](seqlen)
So the code crashes at line 114:
self.norm = NORM_TYPE_MAP[norm_type](seqlen)
First of all, thank you @yfpeng and team for sharing your work.
I have a question about using your scripts for the NER task on sample data. As mentioned in issue #14, I set the train and eval options to false. The model processes the data but generates no labels in the output file.
Could you show, with one example, what the input file should look like and what the expected output is?
Note: for testing purposes I even picked an example from the training set, but still observed no output.
Thanks :)
Hi,
I am trying to fine-tune BlueBERT to classify a set of clinical notes in a binary task. I have set up my train.tsv and dev.tsv files as follows:
1 1 a Assessment and Plan... <more notes here>
I was not sure whether this is the right format for BlueBERT, but for BERT, according to the following article: https://blog.insightdatascience.com/using-bert-for-state-of-the-art-pre-training-for-natural-language-processing-1d87142c29e7, the TSV input data has this format:
Column 1: An ID for the row (can be just a count, or even just the same number or letter for every row, if you don’t care to keep track of each individual example).
Column 2: A label for the row as an int. These are the classification labels that your classifier aims to predict.
Column 3: A column of all the same letter — this is a throw-away column that you need to include because the BERT model expects it.
Column 4: The text examples you want to classify.
However, when I run the following command:
python ../bluebert/bluebert/run_bluebert_multi_labels.py \
--task_name="hoc" \
--do_train=true \
--do_eval=true \
--do_predict=true \
--vocab_file=$BlueBERT_DIR/vocab.txt \
--bert_config_file=$BlueBERT_DIR/bert_config.json \
--init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=4 \
--learning_rate=2e-5 \
--num_train_epochs=3 \
--num_classes=2 \
--num_aspects=2 \
--data_dir=$DATASET_DIR \
--output_dir=$OUTPUT_DIR \
--aspect_value_list="0,1"
I get the following error:
/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x124253730>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Using config: {'_model_dir': '/Users/sambamamba/Documents/SCPD/CS_230/Project/sywang/lowva_bluebert', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x1245052e8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, computation_shape=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
INFO:tensorflow:Writing example 0 of 4957
Traceback (most recent call last):
File "../bluebert/bluebert/run_bluebert_multi_labels.py", line 920, in <module>
tf.app.run()
File "/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "../bluebert/bluebert/run_bluebert_multi_labels.py", line 811, in main
train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
File "../bluebert/bluebert/run_bluebert_multi_labels.py", line 400, in file_based_convert_examples_to_features
max_seq_length, tokenizer)
File "../bluebert/bluebert/run_bluebert_multi_labels.py", line 366, in convert_single_example
label_id = label_map[example.label]
KeyError: '2'
Looking into run_bluebert_multi_labels.py, it seems that the label_map variable is populated from the num_aspects and aspect_value_list flags. On line 233 of this file, the get_labels method creates the label_list, which is then fed into label_map:
def get_labels(self):
    """See base class."""
    label_list = []
    # num_aspect=FLAGS.num_aspects
    aspect_value_list = FLAGS.aspect_value_list  # [-2,-1,0,1]
    for i in range(FLAGS.num_aspects):
        for value in aspect_value_list:
            label_list.append(str(i) + "_" + str(value))
    return label_list  # [ {'0_-2': 0, '0_-1': 1, '0_0': 2, '0_1': 3,....'19_-2': 76, '19_-1': 77, '19_0': 78, '19_1': 79}]
which is fed into line 277:
label_map = {}
for (i, label) in enumerate(label_list):
    label_map[label] = i
The label_map keys would then be, for example, 0_-2, 0_-1, etc. I printed both values right before the line that raises the error (line 365) and saw that
label_map is {'0_0': 0, '0_1': 1, '1_0': 2, '1_1': 3}
example.label is 2
So when label_id = label_map[example.label] runs, we get a KeyError. Why is example.label being looked up against these underscored keys? Am I missing something here?
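The lookup failure can be reproduced outside TensorFlow. The sketch below rebuilds the label map the same way get_labels does, with the flag values from the command above passed as plain arguments (an assumption about how the flags are parsed; `build_label_map` is a hypothetical helper). It shows that only keys of the form aspect_value exist, so a bare label such as 2 cannot be found.

```python
def build_label_map(num_aspects, aspect_values):
    """Mirror get_labels(): one '<aspect>_<value>' key per aspect/value pair."""
    label_list = [f"{i}_{v}" for i in range(num_aspects) for v in aspect_values]
    return {label: idx for idx, label in enumerate(label_list)}


# --num_aspects=2 and --aspect_value_list="0,1" from the command above.
label_map = build_label_map(num_aspects=2, aspect_values=["0", "1"])
print(label_map)
print("2" in label_map, "1_0" in label_map)
```

This suggests the label column is expected to hold values in the aspect_value form rather than a bare class index. That is an inference from the map alone, so the sample data for the hoc task is the place to confirm the exact layout.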
I failed to load the PyTorch model downloaded from Hugging Face with torch.load:
https://huggingface.co/bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12
The error message is:
ValueError: invalid literal for int() with base 8: 'rebuild_'
Has anyone else run into this problem?
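A sketch of loading the checkpoint through the `transformers` library instead of calling torch.load on a downloaded file directly. It assumes `transformers` and `torch` are installed and that the model ID taken from the URL above resolves on the Hugging Face Hub. The error itself can have several causes, such as an older PyTorch version or an incomplete download; none of these is confirmed here.

```python
# Model ID copied from the URL in the question.
MODEL_ID = "bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12"


def load_bluebert(model_id=MODEL_ID):
    """Load tokenizer and encoder; files are downloaded from the Hub on first use."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model
```

Calling load_bluebert() returns the tokenizer and model; encoding a sentence with the tokenizer (return_tensors="pt") and passing the result to the model yields the hidden states.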
Do I need to perform the Named Entity Recognition and then prepare the data for relation extraction and then perform relation extraction to get results?
The title is the question. Do I have to run code in https://github.com/ncbi-nlp/BLUE_Benchmark?
Hi,
The load_sts function requires that the number of blocks be larger than 8. Why is that? Can we just provide one premise, one hypothesis, and one score (three blocks)?
I'm doing binary sentence similarity where the labels are 0 and 1. The prediction output is as follows:
MSE = 0.12034694
global_step = 0
label_ids = [0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0.
1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
1. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0.
1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0.
0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 1. 0. 0.
1. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 1.]
loss = 0.12034694
pearson = 0.6646034
pred = [ 0.9968272 -0.03098304 0.8957741 -0.07417393 -0.09186892 -0.0707963
0.74726754 -0.05655669 -0.09204277 -0.03543881 -0.07745712 -0.05300986
1.1416211 -0.08503725 -0.09837753 1.004629 -0.07550014 0.00657438
-0.08694188 -0.02797052 -0.08749591 -0.08376077 0.00488611 -0.07453079
-0.04001749 0.9580759 0.61144847 -0.02444758 -0.06606578 1.0700202
-0.08189891 -0.09389639 -0.04471161 -0.07932074 -0.08922736 1.072242
1.1144217 -0.0899142 -0.05957434 -0.01848362 -0.08165789 0.56197506
-0.10288966 0.9589868 -0.08966823 -0.07423133 0.9501468 -0.08691276
1.0427929 -0.07580899 -0.08085545 0.05613842 -0.06668296 0.67278963
0.04689811 -0.08730051 -0.09488467 0.7494736 0.59106404 -0.05784546
-0.0580256 0.5586943 0.82942 -0.08266563 -0.08970116 -0.07241884
-0.08084895 -0.0888657 0.16364944 -0.08838011 -0.08021087 -0.07139066
0.98460495 -0.09568951 -0.08403315 -0.03191408 0.84516126 -0.07047645
1.04264 1.1047416 0.9219344 0.93681306 0.00817167 -0.08582229
-0.09332561 -0.05327708 -0.08006877 -0.06815267 0.08796047 -0.10083354
-0.08134227 -0.0519708 -0.07535361 0.02822088 0.8645804 -0.08838581
0.05759583 -0.09652802 -0.0544436 0.8467474 1.011137 -0.0152052
-0.09230338 -0.08920024 0.9547418 -0.09625152 -0.07814157 -0.05981593
-0.06737825 -0.0525138 -0.07601891 0.00535123 -0.09302492 -0.05335039
0.57089394 0.9735016 -0.07029892 0.9383386 0.17835245 0.07288147
-0.05812666 -0.09008455 0.16482374 -0.06855011 -0.07975283 -0.0688867
0.16806357 -0.08691715 0.8265008 -0.05552685 -0.04530346 0.9801875
0.9665445 -0.10243599 -0.09238719 -0.08140092 -0.07281174 -0.09341179
-0.08653723 -0.04425526 -0.04663768 -0.07175027 -0.05161241 -0.07474666
-0.08247717 -0.07625985 0.05558392 -0.09737069 -0.08582785 -0.08285176
-0.09085771 -0.08242864 -0.06997188 -0.09492967 0.87413186 0.00221197
-0.09681983 1.1069126 -0.07090654 1.0427476 0.97657245 -0.05734477
-0.06612358 0.17080042 0.04073562 0.8623907 -0.06221616 -0.07726647
-0.08040509 0.35656622 0.88446796 0.01673024 -0.09752481 -0.09414034
-0.06563986 -0.05257557 -0.08664538 -0.03824814 0.99862784 0.9537769
-0.0507925 1.0611311 0.26432222 0.02389601 -0.08002971 0.24677996
-0.04190464 -0.07924199 0.44772255 0.16013458 1.1142675 -0.06626779
0.11091595 1.0015993 0.98124903 -0.08817458 -0.0803092 -0.00456336
1.0019325 -0.09834503 -0.07607836 0.9602315 -0.050502 -0.09498988
0.93423295 -0.08353204 0.95852834 -0.08302109 -0.03645961 -0.0837692
-0.04907575 -0.08840061 -0.04175755 0.05482076 0.98270017 -0.05114298
-0.07228722 0.81660086 -0.07696462 -0.08263256 1.0464804 -0.08961527
0.01591448 0.03492247 -0.03415895 -0.07692334 0.7936482 0.98901486
1.0336974 -0.01263706 0.64612895 -0.07319017 -0.08374722 0.98839957
-0.0816884 -0.08701541 0.9753411 0.38509053 -0.08011929 -0.08158413
-0.08267076 -0.07939766 -0.0851294 -0.10770355 -0.04284238 -0.09182031
1.0836056 -0.07639952 -0.09889527 -0.01996168 -0.09211037 -0.07140023
-0.07940755 -0.08331279 -0.06124184 -0.08752528 -0.07155015 1.06396
-0.09301544 -0.07780191 0.18636224 1.0234824 -0.06206534 -0.10370414
0.20406811 -0.09179069 -0.08385491 -0.07036848 -0.08004359 1.04012
-0.08071671 0.8393969 0.0629826 -0.05980002 -0.09884399 -0.04910354
-0.06946485 -0.09015001 1.0906504 0.986099 -0.05425195 0.5622222
0.935292 -0.08033577 1.0642971 1.0911734 -0.08062124 0.7644436
0.87184227 -0.07042552 -0.08266561 0.9998966 -0.03840258 -0.08939464
1.009424 0.25307548 -0.09172264 -0.08039551 -0.07240216 1.0881265
0.0290037 0.9582196 1.0014933 -0.00588964 0.08343956 -0.10145007
1.1023728 1.0932642 -0.09266437 -0.09243488 -0.08602741 0.18427256
-0.08351617 0.9532236 1.0550426 -0.09006116 -0.08440115 0.9653421
-0.07703653 0.9551673 ]
spearman = 0.31011906
label_ids seems to be the true labels from the test file, but I don't really know how to interpret the pred list.
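Since the reported loss equals the MSE, the script appears to train a regression head, so pred would hold one continuous similarity score per test pair, in the same order as label_ids. For binary labels, a simple way to turn scores into classes is to threshold them; the cutoff of 0.5 below is an assumption and could be tuned on the dev set. The values are the first six entries of each list above.

```python
def binarize(scores, threshold=0.5):
    """Map each regression score to 1 if it reaches the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]


def accuracy(labels, predictions):
    """Fraction of predictions that equal the (float-encoded) labels."""
    return sum(int(l) == p for l, p in zip(labels, predictions)) / len(labels)


label_ids = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
pred = [0.9968272, -0.03098304, 0.8957741, -0.07417393, -0.09186892, -0.0707963]
classes = binarize(pred)
print(classes, accuracy(label_ids, classes))
```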
@yfpeng
I have preprocessed MIMIC notes along with their corresponding labels (a multi-label classification task) in a pandas DataFrame. What does the sample data look like? How do I convert my data into the format required by the script bluebert/run_bluebert_multi_labels.py?
Also, what are the aspect_value_list and num_aspect parameters?
In your manuscript, you describe it like this:
"We initialized BERT with pre-trained BERT provided by (Devlin et al., 2019). We then continue to pre-train the model, using the listed corpora".
Did you use the BERT code to pre-train from scratch on the NCBI abstract corpora? Or did you start from the initial BERT model and WordPiece strategy, as BioBERT does?
There is no link to the paper in the README.md file.
BTW, awesome work!
tensorflow.python.framework.errors_impl.NotFoundError: /ChemProt_Corpus/chemprot_test_gs/test.tsv; No such file or directory
Line 675 of elmoft.py crashes when the list is empty:
def eval(...):
    ...
    elif task_type == 'nmt':
        tkns_tnsr, lb_tnsr = zip(*[(sx.split(SC), list(map(int, sy.split(SC)))) for sx, sy in zip(tkns_tnsr, lb_tnsr) if ((type(sx) is str and sx != '') or len(sx) > 0) and ((type(sy) is str and sy != '') or len(sy) > 0)])
If the list is empty, i.e. tkns_tnsr, lb_tnsr = zip(*[]), the code crashes.
A sanity check (like the one on line 676) should be added before line 675.
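A guard of the kind suggested could look like the sketch below. `unzip_pairs` is a hypothetical helper, not part of elmoft.py; it returns two empty tuples instead of raising when nothing survives the filter.

```python
def unzip_pairs(pairs):
    """Split an iterable of 2-tuples into two tuples, tolerating an empty input."""
    pairs = list(pairs)
    if not pairs:
        # zip(*[]) yields nothing, so unpacking it into two names would raise ValueError.
        return (), ()
    first, second = zip(*pairs)
    return first, second


print(unzip_pairs([]))
print(unzip_pairs([(["a", "b"], [0, 1]), (["c"], [2])]))
```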
Hi,
Is there a pre-trained model for the NER task?