ncbi-nlp / bluebert

BlueBERT, pre-trained on PubMed abstracts and clinical notes (MIMIC-III).

Home Page: https://arxiv.org/abs/1906.05474

License: Other

Python 99.28% Shell 0.72%

bert bert-model natural-language-processing pubmed pubmed-abstracts language-model mimic-iii

bluebert's Issues

Pre-training BERT on PUBMED

Hi,

Pre-training with BERT:
When you pre-trained BERT on PUBMED abstracts, did you feed BERT with sentences or did you feed it with the entire abstract section?

How to Interpret the results of Relation Extraction?

After finishing relation extraction on the ChemProt data, I get one TensorFlow record file and one TSV file. The TSV file looks as follows:

0.15294948 0.14690137 0.14512947 0.17709649 0.15383159 0.22409162
0.17051792 0.11182242 0.13569161 0.20062451 0.17994083 0.20140274
0.15868504 0.12534 0.13230026 0.16700414 0.17487246 0.24179816
0.15179233 0.15275687 0.1445464 0.18002915 0.15485005 0.21602511
0.1666515 0.112129934 0.12779407 0.16605248 0.17233992 0.25503212
0.17778598 0.1046794 0.120703295 0.16985807 0.1723527 0.25462052
0.16187875 0.11779226 0.13078773 0.16470376 0.17073172 0.25410575
0.16386983 0.11533519 0.13250493 0.16242501 0.17533559 0.25052953
0.16774821 0.11089808 0.13004696 0.16742839 0.17313573 0.25074264
0.17160876 0.10854226 0.12566698 0.16681188 0.17577308 0.251597
0.16189644 0.11701936 0.1346444 0.16580313 0.17096885 0.24966782
0.16152743 0.11757539 0.13271543 0.16380313 0.17377672 0.25060192
0.16656019 0.111814655 0.12732682 0.16468228 0.17510706 0.25450897
0.16650678 0.11021362 0.12661286 0.16736448 0.17467242 0.25462982
0.16048956 0.1193142 0.13183114 0.16296855 0.17130655 0.25409
0.1627501 0.112531364 0.12947361 0.16601284 0.17675503 0.25247702
0.15960562 0.11962788 0.1316634 0.1611976 0.17152742 0.2563781
0.16427976 0.112791955 0.12961149 0.16529398 0.17306624 0.2549565
0.15768206 0.12637071 0.13386849 0.16196251 0.16626643 0.25384983
0.15626168 0.12293934 0.13550161 0.16331881 0.16819012 0.2537883
0.1596326 0.121414356 0.13156259 0.16039251 0.1704628 0.25653514
0.16387884 0.11341449 0.12954234 0.16422358 0.17295827 0.25598237
0.15796445 0.127367 0.13383277 0.16219005 0.16565523 0.25299048
0.15737453 0.12230895 0.13421234 0.16285665 0.16942641 0.25382107
0.16121775 0.11749335 0.13174683 0.1615969 0.17064542 0.25729972
0.16171923 0.115740985 0.13092318 0.16363151 0.17172872 0.25625637
0.15772855 0.12557307 0.13428524 0.16082649 0.16551289 0.25607374
0.16506064 0.11239521 0.12870485 0.16525626 0.17356135 0.2550217
0.16727306 0.11162333 0.12815087 0.16378869 0.17512858 0.25403544
0.16821328 0.10925984 0.12688008 0.16718858 0.1744711 0.25398713
0.16123983 0.119652875 0.13235311 0.16205864 0.17105964 0.25363597
0.16371804 0.11169157 0.12968682 0.16627903 0.17704971 0.2515748
0.16859773 0.11070727 0.127372 0.16412714 0.17533523 0.25386068

How do I interpret these results?
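
One hedged reading of the TSV above: each row sums to roughly 1 across six columns, so it looks like a softmax distribution over six classes, and the predicted relation for an example would be the column with the highest value. The sketch below picks that column per row. The `LABELS` order is an assumption and should be checked against `get_labels()` of the ChemProt processor in `run_bluebert.py`; `predict_labels` is a hypothetical helper, not part of the repository.

```python
# Assumed label order; verify against get_labels() of the ChemProt processor.
LABELS = ["CPR:3", "CPR:4", "CPR:5", "CPR:6", "CPR:9", "false"]


def predict_labels(tsv_text, labels=LABELS):
    """Return (label, probability) for the highest-scoring column of each row."""
    results = []
    for line in tsv_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        probs = [float(x) for x in fields]
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((labels[best], probs[best]))
    return results


# First row of the output shown above.
sample = "0.15294948\t0.14690137\t0.14512947\t0.17709649\t0.15383159\t0.22409162\n"
print(predict_labels(sample))
```

In the rows shown, the last column is the largest every time, but only at around 0.20 to 0.26, so the distribution is not far from uniform over the six classes.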

Why are test labels always `'neutral'`?

Hi. I'm referring to line #295 in run_bluebert.py. I noticed that if the setting is test, then the labels are all neutral. What is the reason for this? According to the NLI data processors on places like HuggingFace, it seems it would be appropriate to set the labels to None if it's test. Please let me know if I'm mistaken. Thanks!

bluebert/bluebert/run_bluebert.py

Line 295 in ccc828c

label = self.get_labels()[-1]

ValueError: model_dir should be non-empty.

While running inference on MedNLI with the command below, I encountered the error shown in the logs. I have downloaded the BlueBERT-Base, Uncased, PubMed+MIMIC-III model for this. I would appreciate advice on this error and on how to run inference on a custom NLI example (i.e., from a .txt file).

Command
python bluebert/run_bluebert.py \
  --do_train=true \
  --do_eval=false \
  --do_predict=true \
  --task_name="mednli" \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --num_train_epochs=10.0 \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR \
  --do_lower_case=true

Logs
Traceback (most recent call last):
File "bluebert/run_bluebert.py", line 912, in <module>
tf.app.run()
File "/home/dingsihan/.local/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/dingsihan/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/dingsihan/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "bluebert/run_bluebert.py", line 762, in main
per_host_input_for_training=is_per_host))
File "/home/dingsihan/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_config.py", line 233, in __init__
super(RunConfig, self).__init__(**kwargs)
File "/home/dingsihan/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/run_config.py", line 538, in __init__
compat_internal.path_to_str(model_dir))
File "/home/dingsihan/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/run_config.py", line 944, in _get_model_dir
raise ValueError('model_dir should be non-empty.')
ValueError: model_dir should be non-empty.

Environment
Platform: Ubuntu 18.04
Tensorflow Version: 1.15.2
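
The traceback ends in `RunConfig` rejecting an empty `model_dir`, and the command passes `--output_dir=$OUTPUT_DIR`, so one likely cause (an assumption, not confirmed in this thread) is that `$OUTPUT_DIR` is unset or empty in the shell. A minimal sketch to check the three variables the command relies on before launching; `missing_env_vars` is a hypothetical helper:

```python
import os

# Shell variables referenced by the command above.
REQUIRED_VARS = ("BlueBERT_DIR", "DATASET_DIR", "OUTPUT_DIR")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Unset or empty:", ", ".join(missing))
else:
    print("All variables are set.")
```

If `OUTPUT_DIR` shows up as missing, exporting it to a writable directory before rerunning the command would be the first thing to try.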

Input data format for Named Entity Recognition

Hi, thank you for sharing the code!
I am trying to run the Named Entity Recognition task, but I could not find "train.tsv" or "devel.tsv" in the BC5CDR dataset. Instead, the train/devel/test data are in ".txt" format. If I simply rename the ".txt" files to ".tsv" and run, I get KeyError: 'clonidine.'

Could you tell me exactly what input format the NER task needs? It would be great if you could share the preprocessing code that starts from the title and abstract.
Thank you in advance

Need setup.py file for specifying dependency to bluebert repo

My repo has a dependency on the bluebert repo and I'm trying to specify the dependency using dependency_links in my setup.cfg file and using install_command in my tox.ini file. The install_command fails because the bluebert repo does not have a setup.py file. Would it be possible for you to create one? If not, would you know how I can clone the repo in my tox.ini file without error?

prediction on new dataset

Hi
How to use this model for the prediction of labels on a new dataset?

Here is the command I am using :

python run_bluebert_ner.py --do_prepare=false --do_train=false --do_eval=false --do_predict=true --task_name="bc5cdr" --vocab_file=vocab.txt --bert_config_file=bert_config.json --init_checkpoint=model.ckpt-1516 --num_train_epochs=1.0 --do_lower_case=False --data_dir=no_label_dataset --output_dir=no_label_output/

I am setting only "do_predict" to true because I want to see the labels that the model "model.ckpt-1516" predicts after fine-tuning.

Here is the sample data without tags:

oxacalcitriol
suppresses
secondary
hyperparathyroidism
without
inducing
low
bone
turnover
in
dogs
with
renal
failure

Is it necessary to have labels on the new dataset as well?

Thanks

pretrain NCBI BERT based on new released WWM BERT model

Google recently released two new BERT models with Whole Word Masking strategy (BERT-Large(Base), Uncased (Whole Word Masking)). Do you have a plan to pre-train new NCBI models based on this new release?

link for downloading pretrained NCBI BERT model is missing

Hi,

A download link for the BERT model is missing from the README file. I've copied the line where the link is missing below.

"The pre-trained NCBI BERT weights, vocab, and config files is ."

Thank you for your help.

Best

Elmo results inconsistent

When I run the Elmo code multiple times on the same data, results vary significantly and surpass the results reported in the literature. What am I doing wrong?

The script I'm running:

python3 elmoft.py \
  --task bc5cdr-chem \
  --seq2vec boe \
  --options_path /path/to/options.json \
  --weights_path /path/to/weights.hdf5 \
  --maxlen 128 \
  --fchdim 500 \
  --lr 0.001 \
  --pdrop 0.5 \
  --do_norm \
  --norm_type batch \
  --do_lastdrop \
  --initln \
  --earlystop \
  --epochs 20 \
  --bsize 64 \
  --data_dir /path/to/data

Pre-trained model weights.hdf5 and options.json were downloaded from:
ELMo PubMed AllenNLP

The code outputs the following F1 scores for task bc5cdr-chem (the literature reports around 91.5% for ELMo):

accuracy: 0.9943132108
macro avg: 0.9489234576
weighted avg: 0.9941723561

The code outputs the following F1 scores for task bc5cdr-dz (the literature reports around 83.9% for ELMo):

accuracy: 0.988988989
macro avg: 0.909805591
weighted avg: 0.9888870565

The datasets were downloaded from:
bert_data.zip
And two additional columns were added, so that the labels are in the column that the code expects.

Am I doing something wrong? Or is it a bug in the implementation?

Evaluation on MedNLI dataset

Hi,

I have one issue with evaluating my pretrained model on the MedNLI dataset.

The provided MedNLI dataset is in jsonl format, but MedNLIProcessor in run_bluebert.py reads a tsv-format dataset.

Could you tell me how to resolve this?

Best,
Thanks
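
A minimal sketch of one way to bridge the two formats: convert each jsonl record to a tab-separated row. The field names (`pairID`, `sentence1`, `sentence2`, `gold_label`), the column order, and the presence of a header row are all assumptions; the layout has to be matched to what MedNLIProcessor in run_bluebert.py actually reads. `jsonl_to_tsv` is a hypothetical helper and the example record is made up.

```python
import json


def _clean(value):
    """Collapse tabs and newlines so each field stays in one TSV cell."""
    return " ".join(str(value).split())


def jsonl_to_tsv(jsonl_text,
                 columns=("pairID", "sentence1", "sentence2", "gold_label"),
                 header=True):
    """Convert jsonl text to TSV text using the given (assumed) column order."""
    lines = []
    if header:
        lines.append("\t".join(columns))
    for raw in jsonl_text.splitlines():
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        record = json.loads(raw)
        lines.append("\t".join(_clean(record[name]) for name in columns))
    return "\n".join(lines) + "\n"


# Made-up record for illustration only.
example = ('{"pairID": "p1", "sentence1": "No fever.", '
           '"sentence2": "Patient is afebrile.", "gold_label": "entailment"}')
print(jsonl_to_tsv(example), end="")
```

Passing a different `columns` tuple or `header=False` adapts the output once the expected layout is known.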

running tasks using saved PYTORCH model checkpoints as opposed to BLUEBERT TF checkpoints

Hello,
Is there code for running the tasks from saved PyTorch model checkpoints, as opposed to the TF checkpoints used by the original BlueBERT evaluations? I was initially thinking of converting a saved PyTorch checkpoint to ONNX and then to TF, but I believe that would produce a .pb file that does not include the weights needed for loading. If no such code exists, do you have any suggestions?
Thanks for your time.

How to use on NER and STS for the New Data?

Hello,

I would like to get sentence similarity and Named Entity Recognition results for a couple of documents of my own. How can I use the pretrained models to run on new data and obtain outputs?

Thanks in advance, and sorry for any inconvenience.

ValueError:Shape of Embedding tensor not matching bert/embedding/word_embedding

When running run_bluebert_multi_labels.py, I get the following error after the data has been written to the TFRecord file.

ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((28996, 768)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 768]) from checkpoint reader.

Here is the traceback:

INFO:tensorflow:***** Running training *****
INFO:tensorflow:  Num examples = 20137
INFO:tensorflow:  Batch size = 32
INFO:tensorflow:  Num steps = 1887
WARNING:tensorflow:From bluebert/run_bluebert_multi_labels.py:389: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

WARNING:tensorflow:From /home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
WARNING:tensorflow:From bluebert/run_bluebert_multi_labels.py:425: map_and_batch (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
WARNING:tensorflow:From /home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow/contrib/data/python/ops/batching.py:273: map_and_batch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map(map_func, num_parallel_calls)` followed by `tf.data.Dataset.batch(batch_size, drop_remainder)`. Static tf.data optimizations will take care of using the fused implementation.
WARNING:tensorflow:From bluebert/run_bluebert_multi_labels.py:398: The name tf.parse_single_example is deprecated. Please use tf.io.parse_single_example instead.

WARNING:tensorflow:From bluebert/run_bluebert_multi_labels.py:405: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Running train on CPU
INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = input_ids, shape = (32, 128)
INFO:tensorflow:  name = input_mask, shape = (32, 128)
INFO:tensorflow:  name = is_real_example, shape = (32,)
INFO:tensorflow:  name = label_ids, shape = (32, 19)
INFO:tensorflow:  name = segment_ids, shape = (32, 128)
WARNING:tensorflow:From /home/aditya/Projects/RD/ncbi_bluebert/bert/modeling.py:172: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /home/aditya/Projects/RD/ncbi_bluebert/bert/modeling.py:411: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /home/aditya/Projects/RD/ncbi_bluebert/bert/modeling.py:492: The name tf.assert_less_equal is deprecated. Please use tf.compat.v1.assert_less_equal instead.

WARNING:tensorflow:From /home/aditya/Projects/RD/ncbi_bluebert/bert/modeling.py:359: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /home/aditya/Projects/RD/ncbi_bluebert/bert/modeling.py:673: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
WARNING:tensorflow:From /home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow/python/ops/nn_impl.py:180: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From bluebert/run_bluebert_multi_labels.py:546: The name tf.train.init_from_checkpoint is deprecated. Please use tf.compat.v1.train.init_from_checkpoint instead.

ERROR:tensorflow:Error recorded from training_loop: Shape of variable bert/embeddings/word_embeddings:0 ((28996, 768)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 768]) from checkpoint reader.
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
  File "bluebert/run_bluebert_multi_labels.py", line 876, in <module>
    main(args)
  File "bluebert/run_bluebert_multi_labels.py", line 781, in main
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
  File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2876, in train
    rendezvous.raise_errors()
  File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 131, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/six.py", line 696, in reraise
    raise value
  File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2871, in train
    saving_listeners=saving_listeners)
  File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2709, in _call_model_fn
    config)
  File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2967, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1549, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1867, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "bluebert/run_bluebert_multi_labels.py", line 546, in model_fn
    tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
  File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow/python/training/checkpoint_utils.py", line 291, in init_from_checkpoint
    init_from_checkpoint_fn)
  File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1684, in merge_call
    return self._merge_call(merge_fn, args, kwargs)
  File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1691, in _merge_call
    return merge_fn(self._strategy, *args, **kwargs)
  File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow/python/training/checkpoint_utils.py", line 286, in <lambda>
    ckpt_dir_or_file, assignment_map)
  File "/home/aditya/anaconda3/envs/RD/lib/python3.7/site-packages/tensorflow/python/training/checkpoint_utils.py", line 329, in _init_from_checkpoint
    tensor_name_in_ckpt, str(variable_map[tensor_name_in_ckpt])
ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((28996, 768)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 768]) from checkpoint reader.

I think there is a vocabulary size mismatch: my pretrained model has a vocab size of 28996, while 30522 seems to be the vocab size of the BERT base model.

How can this error be resolved?
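
For context, 28996 and 30522 are the vocabulary sizes of the original cased and uncased BERT-Base releases. The error says the graph variable has 28996 rows while the checkpoint tensor has 30522, which would be consistent with `bert_config.json`, `vocab.txt`, and the checkpoint not all coming from the same model (an assumption, since the flags used are not shown). A minimal sketch that compares the line count of a vocab file with `vocab_size` in a config file; `check_vocab_size` is a hypothetical helper:

```python
import json


def check_vocab_size(vocab_path, config_path):
    """Return (vocab_lines, config_vocab_size, sizes_match)."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab_lines = sum(1 for _ in f)  # one token per line
    with open(config_path, encoding="utf-8") as f:
        config_size = json.load(f)["vocab_size"]
    return vocab_lines, config_size, vocab_lines == config_size
```

Calling `check_vocab_size("vocab.txt", "bert_config.json")` on the files passed to `--vocab_file` and `--bert_config_file` shows whether those two agree; the checkpoint then has to match the same size.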

Is it possible to use HuggingFace's BlueBERT for mt-bluebert?

missing the vocab and config in pretrained model

I would like to try mt-bluebert-biomedical, but the download link only provides the .pt file; the vocab and config files are missing. Could you point me to where I can get these two?
Thanks

Printing metrics (at least loss) per step in logging

In run_bluebert_multi_labels.py, how can I print metrics per step (at least the loss) instead of global_step/sec and examples/sec?

Is the count of phrases different from that stated in your paper on ShARe/CLEFE?

Hi there,
I found that the number of phrases detected was different from the paper when I used conlleval.py to assess NER predictions on ShARe/CLEFE.

for example:

from conlleval import evaluate, metrics, report_notprint

# Test.tsv#L112-L118 in ShARe/CLEFE
text = """\
The O O
left B B
atrium I I
is O O
moderately O O
dilated I O
. O O
"""
seq = text.split('\n')
count = evaluate(seq)
print(metrics(count)[0])
print(''.join(report_notprint(count)))

The output is:

Metrics(tp=1, fp=0, fn=1, prec=1.0, rec=0.5, fscore=0.6666666666666666)
processed 7 tokens with 2 phrases; found: 1 phrases; correct: 1.
accuracy: 85.71%; precision: 100.00%; recall: 50.00%; FB1: 66.67
: precision: 100.00%; recall: 50.00%; FB1: 66.67 1

It seems to me that the paper treats "left atrium dilated" as one phrase, but conlleval.py counts it differently.
Is there a good way to handle this?

Thanks!
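
The count of 2 phrases can be reproduced with a small sketch of contiguous-span BIO chunking, in which an `I` tag that follows an `O` opens a new chunk. This is a simplified stand-in for illustration, not the conlleval.py implementation, and `bio_chunks` is a hypothetical helper. Under this rule the discontiguous mention "left atrium ... dilated" splits into two gold chunks.

```python
def bio_chunks(tags):
    """Return (start, end) spans, end exclusive, for contiguous B/I runs."""
    chunks = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            # A B tag, or an I tag right after O, opens a new chunk.
            if start is not None:
                chunks.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            chunks.append((start, i))
            start = None
    if start is not None:
        chunks.append((start, len(tags)))
    return chunks


# Gold and predicted columns from the example above.
gold = ["O", "B", "I", "O", "O", "I", "O"]
pred = ["O", "B", "I", "O", "O", "O", "O"]
gold_chunks = bio_chunks(gold)
pred_chunks = bio_chunks(pred)
correct = set(gold_chunks) & set(pred_chunks)
print(len(gold_chunks), len(pred_chunks), len(correct))  # prints: 2 1 1
```

Counting a discontiguous mention as one phrase would need an evaluation that links the separated spans, which plain BIO tags do not encode.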

No module named "bert"

Hello,
After running the command for the Sentence Similarity task, I received an error saying
"No module named bert". However, the bert folder is in the correct place, and I was unable to figure out why I get the error on line 23 of "run_bluebert_sts.py". I tried installing bert using "pip install bert", which then throws an error saying "modeling.py: No such file or directory". Then I installed the TensorFlow version of bert using "pip install bert-tensorflow". This clears that problem, but a new one appears:
"tensorflow.python.framework.errors_impl.NotFoundError: /vocab.txt; No such file or directory"

Can you please update "requirements.txt"? I couldn't find the TensorFlow version it mentions, i.e., 1.12.1.

Thanks in advance, and sorry for the inconvenience.
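
One possible cause of "No module named bert" (an assumption, since the working directory and environment are not shown) is that the directory containing the bert folder is not on Python's import path when the script starts. A minimal sketch of putting it there from Python, which has the same effect as adding the directory to PYTHONPATH; `add_repo_to_path` is a hypothetical helper:

```python
import os
import sys


def add_repo_to_path(repo_root):
    """Put the directory that contains the bert folder first on sys.path."""
    root = os.path.abspath(repo_root)
    if root not in sys.path:
        sys.path.insert(0, root)
    return root


# Assumes the current directory is the repository root.
print(add_repo_to_path("."))
```

The second error mentions "/vocab.txt", which looks like an empty directory variable in front of the file name, so the shell variable holding the model directory may also be unset.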

Is there any cased version?

Hi, I just checked out this repo. It seems that there is no cased version. Do you plan to release one as well? Thanks!

Finetuning ELMO. Variable seqlen is not defined

In the file elmoft.py line 114, the variable seqlen is used but it is not previously defined.

class ELMoClfHead(BaseClfHead):
    def __init__(...):
        ...
        if task_type == 'nmt':
            ...
            self.norm = NORM_TYPE_MAP[norm_type](seqlen)

so the code crashes at line 114:
self.norm = NORM_TYPE_MAP[norm_type](seqlen)

NER Task on sample data

First of all, thank you @yfpeng and team for sharing your work.
I have a question about using your scripts for the NER task on sample data. As mentioned in issue #14, I set the train and eval options to false. The model processes the data but generates no labels in the output file.
Could you show, with one example, what the input file should look like and what the expected output would be?

Note: For testing purposes I even picked one example from the train set itself, but still no output was produced.

Thanks :)

KeyError when running run_bluebert_multi_labels.py

Hi,

I am trying to fine-tune BlueBERT to classify a set of clinical notes in a binary task. I have set up my train.tsv and dev.tsv files as follows:

1	1	a	Assessment and Plan... <more notes here>

I was not sure whether this is the right format for BlueBERT, but for BERT, according to the following article: https://blog.insightdatascience.com/using-bert-for-state-of-the-art-pre-training-for-natural-language-processing-1d87142c29e7, the tsv input data uses this format:

Column 1: An ID for the row (can be just a count, or even just the same number or letter for every row, if you don’t care to keep track of each individual example).
Column 2: A label for the row as an int. These are the classification labels that your classifier aims to predict.
Column 3: A column of all the same letter — this is a throw-away column that you need to include because the BERT model expects it.
Column 4: The text examples you want to classify.

However, when I run the following code:

python ../bluebert/bluebert/run_bluebert_multi_labels.py \
  --task_name="hoc" \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=4 \
  --learning_rate=2e-5 \
  --num_train_epochs=3 \
  --num_classes=2 \
  --num_aspects=2 \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR \
  --aspect_value_list="0,1"

I get the following error:

/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x124253730>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Using config: {'_model_dir': '/Users/sambamamba/Documents/SCPD/CS_230/Project/sywang/lowva_bluebert', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x1245052e8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, computation_shape=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
INFO:tensorflow:Writing example 0 of 4957
Traceback (most recent call last):
  File "../bluebert/bluebert/run_bluebert_multi_labels.py", line 920, in <module>
    tf.app.run()
  File "/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "../bluebert/bluebert/run_bluebert_multi_labels.py", line 811, in main
    train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
  File "../bluebert/bluebert/run_bluebert_multi_labels.py", line 400, in file_based_convert_examples_to_features
    max_seq_length, tokenizer)
  File "../bluebert/bluebert/run_bluebert_multi_labels.py", line 366, in convert_single_example
    label_id = label_map[example.label]
KeyError: '2'

Looking into run_bluebert_multi_labels.py, it seems that the label_map variable is populated based on the entry to the num_aspects and aspect_value_list flag arguments. On line 233 of this Python file, we see that get_labels method is used to create the label_list which is then fed into label_map:

def get_labels(self):
        """See base class."""
        label_list = []
        # num_aspect=FLAGS.num_aspects
        aspect_value_list = FLAGS.aspect_value_list  # [-2,-1,0,1]
        for i in range(FLAGS.num_aspects):
            for value in aspect_value_list:
                label_list.append(str(i) + "_" + str(value))
        return label_list  # [ {'0_-2': 0, '0_-1': 1, '0_0': 2, '0_1': 3,....'19_-2': 76, '19_-1': 77, '19_0': 78, '19_1': 79}]

which is fed into line 277:

label_map = {}
for (i, label) in enumerate(label_list):
    label_map[label] = i

The label_map keys would then be, for example, 0_-2, 0_-1, etc. I printed both variables right before the line that raises the error (line 365) and saw that

label_map is {'0_0': 0, '0_1': 1, '1_0': 2, '1_1': 3}
example.label is 2

So when we run label_id = label_map[example.label], we get a KeyError. Why is label_map keyed by these underscored labels while example.label is a plain value? Am I missing something here?
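
The mismatch can be reproduced outside TensorFlow with a minimal sketch that mirrors the quoted `get_labels` and `label_map` logic for `--num_aspects=2` and `--aspect_value_list="0,1"`. `build_label_map` is a hypothetical helper, and treating the flag value as the list `["0", "1"]` is an assumption based on the printed `label_map`.

```python
def build_label_map(num_aspects, aspect_values):
    """Build the aspect_value -> index map the same way the quoted code does."""
    label_list = [
        str(i) + "_" + str(value)
        for i in range(num_aspects)
        for value in aspect_values
    ]
    return {label: idx for idx, label in enumerate(label_list)}


label_map = build_label_map(2, ["0", "1"])
print(label_map)         # {'0_0': 0, '0_1': 1, '1_0': 2, '1_1': 3}
print("2" in label_map)  # False, hence the KeyError
```

Every key has the form aspect_value, so a bare label such as 2 in the TSV can never match. This suggests, without confirmation from the source, that the script expects the label column to be encoded per aspect rather than as a plain class id.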

ERROR when torch.load("pytorch_model.bin") downloaded from huggingface

I was unable to load the PyTorch model downloaded from HuggingFace with torch.load:
https://huggingface.co/bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12

The error message is:
ValueError: invalid literal for int() with base 8: 'rebuild_'

Has anyone else run into this problem?

How to get Relation extraction from a plain text?

Do I need to perform the Named Entity Recognition and then prepare the data for relation extraction and then perform relation extraction to get results?

Where do I get the `train.tsv` file for the BC5CDR NER task?

The title is the question. Do I have to run code in https://github.com/ncbi-nlp/BLUE_Benchmark?

STS training

Hi,
The load_sts function requires the number of blocks to be larger than 8. Why is that? Can we provide just three blocks: one premise, one hypothesis, and one score?

How to interpret STS output.

I'm doing binary sentence similarity, where the labels are 0 and 1. The prediction output is as follows:

MSE = 0.12034694
global_step = 0
label_ids = [0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 1. 0. 0.
 1. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 1.]
loss = 0.12034694
pearson = 0.6646034
pred = [ 0.9968272  -0.03098304  0.8957741  -0.07417393 -0.09186892 -0.0707963
  0.74726754 -0.05655669 -0.09204277 -0.03543881 -0.07745712 -0.05300986
  1.1416211  -0.08503725 -0.09837753  1.004629   -0.07550014  0.00657438
 -0.08694188 -0.02797052 -0.08749591 -0.08376077  0.00488611 -0.07453079
 -0.04001749  0.9580759   0.61144847 -0.02444758 -0.06606578  1.0700202
 -0.08189891 -0.09389639 -0.04471161 -0.07932074 -0.08922736  1.072242
  1.1144217  -0.0899142  -0.05957434 -0.01848362 -0.08165789  0.56197506
 -0.10288966  0.9589868  -0.08966823 -0.07423133  0.9501468  -0.08691276
  1.0427929  -0.07580899 -0.08085545  0.05613842 -0.06668296  0.67278963
  0.04689811 -0.08730051 -0.09488467  0.7494736   0.59106404 -0.05784546
 -0.0580256   0.5586943   0.82942    -0.08266563 -0.08970116 -0.07241884
 -0.08084895 -0.0888657   0.16364944 -0.08838011 -0.08021087 -0.07139066
  0.98460495 -0.09568951 -0.08403315 -0.03191408  0.84516126 -0.07047645
  1.04264     1.1047416   0.9219344   0.93681306  0.00817167 -0.08582229
 -0.09332561 -0.05327708 -0.08006877 -0.06815267  0.08796047 -0.10083354
 -0.08134227 -0.0519708  -0.07535361  0.02822088  0.8645804  -0.08838581
  0.05759583 -0.09652802 -0.0544436   0.8467474   1.011137   -0.0152052
 -0.09230338 -0.08920024  0.9547418  -0.09625152 -0.07814157 -0.05981593
 -0.06737825 -0.0525138  -0.07601891  0.00535123 -0.09302492 -0.05335039
  0.57089394  0.9735016  -0.07029892  0.9383386   0.17835245  0.07288147
 -0.05812666 -0.09008455  0.16482374 -0.06855011 -0.07975283 -0.0688867
  0.16806357 -0.08691715  0.8265008  -0.05552685 -0.04530346  0.9801875
  0.9665445  -0.10243599 -0.09238719 -0.08140092 -0.07281174 -0.09341179
 -0.08653723 -0.04425526 -0.04663768 -0.07175027 -0.05161241 -0.07474666
 -0.08247717 -0.07625985  0.05558392 -0.09737069 -0.08582785 -0.08285176
 -0.09085771 -0.08242864 -0.06997188 -0.09492967  0.87413186  0.00221197
 -0.09681983  1.1069126  -0.07090654  1.0427476   0.97657245 -0.05734477
 -0.06612358  0.17080042  0.04073562  0.8623907  -0.06221616 -0.07726647
 -0.08040509  0.35656622  0.88446796  0.01673024 -0.09752481 -0.09414034
 -0.06563986 -0.05257557 -0.08664538 -0.03824814  0.99862784  0.9537769
 -0.0507925   1.0611311   0.26432222  0.02389601 -0.08002971  0.24677996
 -0.04190464 -0.07924199  0.44772255  0.16013458  1.1142675  -0.06626779
  0.11091595  1.0015993   0.98124903 -0.08817458 -0.0803092  -0.00456336
  1.0019325  -0.09834503 -0.07607836  0.9602315  -0.050502   -0.09498988
  0.93423295 -0.08353204  0.95852834 -0.08302109 -0.03645961 -0.0837692
 -0.04907575 -0.08840061 -0.04175755  0.05482076  0.98270017 -0.05114298
 -0.07228722  0.81660086 -0.07696462 -0.08263256  1.0464804  -0.08961527
  0.01591448  0.03492247 -0.03415895 -0.07692334  0.7936482   0.98901486
  1.0336974  -0.01263706  0.64612895 -0.07319017 -0.08374722  0.98839957
 -0.0816884  -0.08701541  0.9753411   0.38509053 -0.08011929 -0.08158413
 -0.08267076 -0.07939766 -0.0851294  -0.10770355 -0.04284238 -0.09182031
  1.0836056  -0.07639952 -0.09889527 -0.01996168 -0.09211037 -0.07140023
 -0.07940755 -0.08331279 -0.06124184 -0.08752528 -0.07155015  1.06396
 -0.09301544 -0.07780191  0.18636224  1.0234824  -0.06206534 -0.10370414
  0.20406811 -0.09179069 -0.08385491 -0.07036848 -0.08004359  1.04012
 -0.08071671  0.8393969   0.0629826  -0.05980002 -0.09884399 -0.04910354
 -0.06946485 -0.09015001  1.0906504   0.986099   -0.05425195  0.5622222
  0.935292   -0.08033577  1.0642971   1.0911734  -0.08062124  0.7644436
  0.87184227 -0.07042552 -0.08266561  0.9998966  -0.03840258 -0.08939464
  1.009424    0.25307548 -0.09172264 -0.08039551 -0.07240216  1.0881265
  0.0290037   0.9582196   1.0014933  -0.00588964  0.08343956 -0.10145007
  1.1023728   1.0932642  -0.09266437 -0.09243488 -0.08602741  0.18427256
 -0.08351617  0.9532236   1.0550426  -0.09006116 -0.08440115  0.9653421
 -0.07703653  0.9551673 ]
spearman = 0.31011906

label_ids seems to be the true labels from the test file, but I don't really know how to interpret the pred list.
@yfpeng
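
One possible reading, offered as a sketch and not as the maintainers' answer: the reported loss, pearson and spearman values suggest the task was run with a regression head, although the issue does not confirm this. Under that assumption, each entry of pred is a raw real-valued score for the corresponding entry of label_ids, so values slightly below 0 or above 1 are expected. The helper names (`pearson`, `mse`, `binarize`, `accuracy`), the 0.5 threshold and the sample values below are all hypothetical and are not taken from the run above.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mse(labels, preds):
    """Mean squared error, a common loss for regression outputs."""
    return mean((l - p) ** 2 for l, p in zip(labels, preds))


def binarize(preds, threshold=0.5):
    """Map raw scores to 0/1 decisions; the 0.5 threshold is an assumption."""
    return [1 if p >= threshold else 0 for p in preds]


def accuracy(labels, preds, threshold=0.5):
    """Fraction of thresholded predictions that match the 0/1 labels."""
    hits = sum(int(l) == b for l, b in zip(labels, binarize(preds, threshold)))
    return hits / len(labels)


# Illustrative values only; NOT copied from the output in this issue.
labels = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
preds = [0.99, -0.03, 0.90, -0.07, 0.56, 0.18]

print("mse      =", mse(labels, preds))
print("pearson  =", pearson(labels, preds))
print("decisions=", binarize(preds))
print("accuracy =", accuracy(labels, preds))
```

If the labels are really class ids and not scores, a classification head that outputs per-class probabilities may be a better fit than thresholding regression scores.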

Input data for Document classification

I have preprocessed MIMIC notes along with their corresponding labels (a multi-label classification task) in a pandas DataFrame. What does the sample data look like? How do I convert my data into the format required by the script bluebert/run_bluebert_multi_labels.py?
Also, what are the aspect_value_list and num_aspect parameters?

def eval(...):
        ...
        elif task_type == 'nmt':
               tkns_tnsr, lb_tnsr = zip(*[(sx.split(SC), list(map(int, sy.split(SC)))) for sx, sy in zip(tkns_tnsr, lb_tnsr) if ((type(sx) is str and sx != '') or len(sx) > 0) and ((type(sy) is str and sy != '') or len(sy) > 0)])

If the filtered list is empty, i.e. tkns_tnsr, lb_tnsr = zip(*[]), the code crashes.

A sanity check (like the one in line 676) should be added before line 675
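
A minimal sketch of such a guard, assuming both inputs are lists of strings. The function name `split_pairs` and the separator value assigned to `SC` are hypothetical (the issue does not show the real value of `SC`), and the filter is simplified to a truthiness check on each string.

```python
SC = ";;"  # hypothetical separator; the real value is not shown in the issue


def split_pairs(tokens, labels, sep=SC):
    """Split token/label strings on `sep`, skipping empty entries.

    Returns two empty lists instead of raising when nothing survives
    the filter.
    """
    pairs = [
        (sx.split(sep), list(map(int, sy.split(sep))))
        for sx, sy in zip(tokens, labels)
        if sx and sy  # drop empty strings on either side
    ]
    if not pairs:
        # zip(*[]) yields nothing, so unpacking it into two names would fail.
        return [], []
    tkns, lbs = zip(*pairs)
    return list(tkns), list(lbs)


# The unguarded pattern fails on empty input:
try:
    a, b = zip(*[])
except ValueError as err:
    print("unguarded unpacking failed:", err)

# The guarded version returns empty lists instead:
print(split_pairs([], []))
print(split_pairs(["a;;b", ""], ["1;;0", "1"]))
```

Returning empty lists keeps the caller's downstream code on a single path; raising a descriptive error at this point would be an equally reasonable choice.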

forward() got an unexpected keyword argument 'labels'

Hi! I am trying to use your model from Hugging Face for text classification, following their examples, and I get this error:
forward() got an unexpected keyword argument 'labels'

How am I supposed to train this model?
Here is what I was trying to do

Pre-trained model for NER

Hi,
Is there a pre-trained model for the NER task?
