
umlsbert's People

Contributors

gmichalo, michaelyxwang


umlsbert's Issues

How to use pretrained UmlsBERT to get embeddings for all the UMLS terms?

I want to load the pretrained UmlsBERT model to generate vector representations/embeddings of all the medical terms in UMLS. Which library should I use to load this model, since it is a modified version of BERT? Also, which layer or combination of layers provides the best vector representation?
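A note on one possible approach, sketched below: the checkpoint should load with the Hugging Face transformers library as a standard BERT encoder (the UMLS-specific changes affect pretraining, so some checkpoint weights may be reported as unused, as in the warning further down this page). There is no single best layer; mean-pooling the last hidden layer is a common starting point, and averaging the last four layers is another popular choice. The paths below are assumptions, not the repo's official API.

    # A minimal sketch, not the authors' official API. Assumes the downloaded
    # checkpoint lives at ./checkpoint/umlsbert and loads as a standard BERT.
    import torch
    from transformers import AutoModel, AutoTokenizer

    path = "./checkpoint/umlsbert"
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModel.from_pretrained(path)
    model.eval()

    terms = ["myocardial infarction", "diabetes mellitus"]
    batch = tokenizer(terms, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (batch, seq_len, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)       # zero out padding tokens
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling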

How to use the pretrained UMLSBERT for NER task prediction

This is not a real issue with this repo, but I would like to know how to use UmlsBERT for NER prediction on a batch of clinical text, without any fine-tuning or training of the pre-trained model. Is there a Jupyter notebook that shows how to use the pre-trained model on a batch of clinical text for NER prediction?
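A note on this: the pretrained checkpoint is a masked-language model with no NER head, so it cannot tag entities entirely without fine-tuned weights; what it can do out of the box is serve as the encoder for a token-classification model trained with the repo's NER script. A hedged sketch, where the checkpoint path is hypothetical:

    # A sketch, assuming a token-classification checkpoint fine-tuned from
    # UmlsBERT (e.g., one saved by the repo's NER script); the path is hypothetical.
    from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                              pipeline)

    ckpt = "./models/umlsbert-ner"  # hypothetical fine-tuned NER checkpoint
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForTokenClassification.from_pretrained(ckpt)
    ner = pipeline("ner", model=model, tokenizer=tokenizer,
                   aggregation_strategy="simple")
    print(ner("Patient denies chest pain but reports shortness of breath."))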

Is it possible to have access to the NER part directly?

Hello and thank you for your great work.

I would like to test UmlsBERT in a self-supervised model for learning sentence embeddings, and I'm looking to simply use your checkpoint to do it. But would it be possible for me to have access to the NER part, which, given a tokenized sentence, classifies each token into the 42 UMLS categories, so that I can then use the UMLS embeddings of BERT that you propose? I saw in the paper that you mention cTAKES for this, but I don't see where it appears in the repository.
The best option for me would be to simply download your checkpoint (which I did) and then incorporate part of your code into my other repo to test it. I looked at the NER Jupyter file, but I don't see the inference part for tokenized sentences with cTAKES.

I would be grateful for any advice on this.

Update: I have seen that vocab_updated.txt is in the repo and corresponds to the UMLS tags. So if I understand correctly, sentences don't go through a custom NER network (like an LSTM) to get the classes; rather, if a sentence contains tokens classified as a TUI, those tokens receive special embeddings, right?

Alexandre
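If that reading is right, a rough sketch of the lookup might look like the following; the format of voc/vocab_updated.txt (one tag per line, aligned with the tokenizer vocabulary) is an assumption here and should be verified against the file itself:

    # Hedged sketch: map each wordpiece to its UMLS (TUI) tag, assuming
    # vocab_updated.txt is line-aligned with the tokenizer's vocabulary.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("./checkpoint/umlsbert")
    with open("./voc/vocab_updated.txt") as f:
        tags = [line.strip() for line in f]

    tokens = tokenizer.tokenize("patient with myocardial infarction")
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print([(tok, tags[i]) for tok, i in zip(tokens, ids)])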

run_language_modeling.py

Hello, I am facing issues running the run_language_modeling.py script when following the example for pretraining Bio_ClinicalBERT with this command:
python3 run_language_modeling.py --output_dir ./models/clinicalBert-v1 --model_name_or_path emilyalsentzer/Bio_ClinicalBERT --mlm --do_train --learning_rate 5e-5 --max_steps 150000 --block_size 128 --save_steps 1000 --per_gpu_train_batch_size 32 --seed 42 --line_by_line --train_data_file mimic_string.txt --umls --config_name config.json --med_document ./voc/vocab_updated.txt

Issue 1 - the script said the tokenizer did not have an attribute called max_len. This was the error:
AttributeError: 'BertTokenizerFast' object has no attribute 'max_len'
Based on advice online, I updated 'tokenizer.max_len' to 'tokenizer.model_max_length', which seems to have resolved the issue.
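For reference, the rename amounts to this (block_size is illustrative; the script's actual variable may differ):

    # transformers < 4.0
    block_size = tokenizer.max_len
    # transformers >= 4.0: max_len was removed in favor of model_max_length
    block_size = tokenizer.model_max_length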

Issue 2 - the current error message I am getting is:
TypeError: __init__() got an unexpected keyword argument 'tui_ids'

When looking for answers online, I came across a comment on the Hugging Face transformers issue tracker at huggingface/transformers#8739.
It said: 'It is actually due to #8604, where we removed several deprecated arguments. The run_language_modeling.py script is deprecated in favor of language-modeling/run_{clm, plm, mlm}.py.'

Does this apply to the script for UmlsBERT as well? If so, how can I access the updated script? If not, how can I resolve the tui_ids issue?
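A plausible diagnosis, noted here rather than confirmed: tui_ids appears to be an argument added by this repo's modified dataset classes, while the stock LineByLineTextDataset in a pip-installed transformers takes no such parameter. The text-classification traceback further down this page shows the repo importing its own bundled copy of transformers from ../src/, so if a newer pip-installed transformers is the one being imported instead, this exact TypeError would result. A quick check:

    # Diagnostic sketch: which transformers is actually being imported?
    # If this prints a site-packages path rather than something under
    # UmlsBERT-master/src/, the repo's modified classes are not in use.
    import transformers
    print(transformers.__file__)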

I am running the scripts on Google Colab. This is the complete output I get when I run:
python3 run_language_modeling.py --output_dir ./models/clinicalBert-v1 --model_name_or_path emilyalsentzer/Bio_ClinicalBERT --mlm --do_train --learning_rate 5e-5 --max_steps 150000 --block_size 128 --save_steps 1000 --per_gpu_train_batch_size 32 --seed 42 --line_by_line --train_data_file mimic_string.txt --umls --config_name config.json --med_document ./voc/vocab_updated.txt

output:

2021-07-05 09:47:55.207129: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
07/05/2021 09:47:57 - WARNING - __main__ - Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
07/05/2021 09:47:57 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=500,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
greater_is_better=None,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1/runs/Jul05_09-47-57_d7624bb0fdc5,
logging_first_step=False,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=150000,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
output_dir=gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=clinicalBert-v1,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1,
save_on_each_node=False,
save_steps=1000,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
/usr/local/lib/python3.7/dist-packages/transformers/models/auto/modeling_auto.py:847: FutureWarning: The class AutoModelWithLMHead is deprecated and will be removed in a future version. Please use AutoModelForCausalLM for causal language models, AutoModelForMaskedLM for masked language models and AutoModelForSeq2SeqLM for encoder-decoder models.
FutureWarning,
Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']

  • This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Traceback (most recent call last):
  File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_language_modeling.py", line 355, in <module>
    main()
  File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_language_modeling.py", line 248, in main
    tui_ids=tui_ids) if training_args.do_train else None
  File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_language_modeling.py", line 136, in get_dataset
    tui_ids=tui_ids)
TypeError: __init__() got an unexpected keyword argument 'tui_ids'

P.S. I am not an expert programmer, so do let me know if I should provide any further information, as this is the first time I'm submitting an issue.

Thank you.

Best,
Jaya

error with tui_ids

Hello, I have been trying to run the run_language_modeling.py script (even though it is deprecated, I found a workaround through some suggestions on the transformers GitHub page). I get the following error:

Traceback (most recent call last):
  File "./language-modeling/run_language_modeling.py", line 361, in <module>
    main()
  File "./language-modeling/run_language_modeling.py", line 254, in main
    tui_ids=tui_ids) if training_args.do_train else None
  File "./language-modeling/run_language_modeling.py", line 142, in get_dataset
    tui_ids=tui_ids)
TypeError: __init__() got an unexpected keyword argument 'tui_ids'

I am at a loss as to how to fix this or what might be causing it. Has anyone else faced this and managed to fix it?

Thank you.

Jaya

Issue with pip install requirements.txt

I am getting the following error; my Python version is 3.7.10:
ERROR: Could not find a version that satisfies the requirement faiss
ERROR: No matching distribution found for faiss
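A likely cause, for reference: faiss is not published on PyPI under the bare name faiss; the official wheels are named faiss-cpu and faiss-gpu. Changing the faiss line in requirements.txt to faiss-cpu (or running pip install faiss-cpu first) should get past this error, assuming the CPU build suffices.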

Error with the input arguments for `HfArgumentParser`

Hi,

Thank you very much for providing the code.

But when I ran this command,
python3 run_glue.py --output_dir ./models/medicalBert-v1 --model_name_or_path ../checkpoint/umlsbert --data_dir dataset/mednli/mednli --num_train_epochs 3 --per_device_train_batch_size 32 --learning_rate 1e-4 --do_train --do_eval --do_predict --task_name mnli --umls --med_document ./voc/vocab_updated.txt

I got the following error.

  File "/home/UmlsBERT-master/text-classification/run_glue.py", line 300, in <module>
    main()
  File "/home/UmlsBERT-master/text-classification/run_glue.py", line 78, in main
    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
  File "/home/UmlsBERT-master/text-classification/../src/transformers/hf_argparser.py", line 40, in __init__
    self._add_dataclass_arguments(dtype)
  File "/home/UmlsBERT-master/text-classification/../src/transformers/hf_argparser.py", line 72, in _add_dataclass_arguments
    elif hasattr(field.type, "__origin__") and issubclass(field.type.__origin__, List):
  File "/home/lui/anaconda3/lib/python3.9/typing.py", line 852, in __subclasscheck__
    return issubclass(cls, self.__origin__)
TypeError: issubclass() arg 1 must be a class

Perhaps the objects ModelArguments, DataTrainingArguments, and TrainingArguments have not been initialized?

Thank you again!

Lui
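A note on this one: the traceback points less at uninitialized dataclasses than at a Python version mismatch. The failing issubclass() call happens inside typing.py under anaconda3/lib/python3.9, while the requirements issue above was filed against Python 3.7.10, suggesting that is the intended interpreter. Assuming the repo targets 3.7, a fresh environment on the older interpreter (e.g. conda create -n umlsbert python=3.7, then reinstall requirements.txt) is a plausible workaround.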

deprecated script

Hello, thank you for developing this model. I am very keen on using it in my project. However, I have noticed that the run_language_modeling.py script no longer runs, and the original script from Hugging Face Transformers has been deprecated. It looks like they have replaced the old script with new ones, but of course those are missing the UMLS component, which is key to running UmlsBERT. I wondered if you have created, or plan to create, a new version of this script. Thank you!
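Until such an update exists, a common workaround for deprecated example scripts, offered here only as an assumption about this repo's setup, is to install the exact transformers release the repository was developed against, or to make sure its bundled copy under src/ is the one on the import path (see the tui_ids note above), since run_language_modeling.py relies on pre-4.0 APIs such as tokenizer.max_len.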
