
biobert-pretrained's Introduction

BioBERT Pre-trained Weights

This repository provides the pre-trained weights of BioBERT, a language representation model for the biomedical domain, designed especially for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, and question answering. Please refer to our paper, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, for more details.

Downloading pre-trained weights

Go to the releases section of this repository or click the links below to download the pre-trained weights of BioBERT. We provide three combinations of pre-trained weights: BioBERT (+ PubMed), BioBERT (+ PMC), and BioBERT (+ PubMed + PMC). Pre-training was based on the original BERT code provided by Google, and training details are described in our paper. The currently available versions of the pre-trained weights are listed in the releases section.

Make sure to specify the version of the pre-trained weights used in your work. If you have difficulty choosing which one to use, we recommend BioBERT-Base v1.1 (+ PubMed 1M) or BioBERT-Large v1.1 (+ PubMed 1M), depending on your GPU resources. Note that for BioBERT-Base we use the WordPiece vocabulary (vocab.txt) provided by Google, as any new word in the biomedical corpus can be represented with subwords (for instance, Leukemia => Leu + ##ke + ##mia). More details are in the closed issue #1.
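
As an illustration of the subword behaviour described above (a hedged sketch, not code from this repository), the split can be inspected with the stock cased BERT WordPiece tokenizer, assuming the Hugging Face transformers library is available:

from transformers import BertTokenizer

# BioBERT-Base reuses the original cased BERT WordPiece vocabulary, so the
# stock tokenizer shows how an unseen biomedical word is broken into subwords.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
print(tokenizer.tokenize('Leukemia'))  # a subword split along the lines of Leu + ##ke + ##mia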

Pre-training corpus

We do not provide a pre-processed version of each corpus. However, each pre-training corpus can be found at the following links:

  • PubMed Abstracts1: ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
  • PubMed Abstracts2: ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/
  • PubMed Central Full Texts: ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/

The estimated size of each corpus is 4.5 billion words for PubMed Abstracts1 + PubMed Abstracts2 and 13.5 billion words for PubMed Central Full Texts.

Fine-tuning BioBERT

To fine-tune BioBERT on biomedical text mining tasks using the provided pre-trained weights, refer to the DMIS GitHub repository for BioBERT.

Citation

@article{10.1093/bioinformatics/btz682,
    author = {Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo},
    title = "{BioBERT: a pre-trained biomedical language representation model for biomedical text mining}",
    journal = {Bioinformatics},
    year = {2019},
    month = {09},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btz682},
    url = {https://doi.org/10.1093/bioinformatics/btz682},
}

Contact information

For help or issues using the pre-trained weights of BioBERT, please submit a GitHub issue. Please contact Jinhyuk Lee ([email protected]) or Sungdong Kim ([email protected]) for communication related to the pre-trained weights of BioBERT.

biobert-pretrained's Issues

answer_start index in train data

In the training data, I see that the 'answer_start' of the answer (in the context paragraph) always points to the first occurrence of the answer. For example, if the answer word 'MethPed' appears multiple times in the context paragraph, answer_start (a character index) points to the first occurrence of 'MethPed'. Would it be more appropriate to point answer_start to the particular instance of 'MethPed' whose sentence is closest to what was asked in the question? Please let me know.
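
For reference, a hedged sketch of the SQuAD-style convention being described (the context, question, and field layout below are made up for illustration and follow the simplified format used by common SQuAD loaders): answer_start is simply a character offset into the context, so pointing to a different occurrence just means storing a different offset.

context = "MethPed was proposed earlier. A later study evaluated MethPed on new cohorts."
answer = "MethPed"

first = context.find(answer)              # offset of the first occurrence
second = context.find(answer, first + 1)  # offset of a later occurrence

qa_example = {
    "question": "Which classifier was evaluated on new cohorts?",
    "context": context,
    "answers": {"text": [answer], "answer_start": [second]},  # pick whichever occurrence fits the question
}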

Biobert custom vocab

Hi,
I was wondering if you could help me with something. I understand the motivation for a custom vocabulary for scientific text, especially for some tasks. However, I have a couple of questions:

  1. When you compare NER/QA results, you compare the bert-large vs. bert-base architectures. So how can I be sure that the improved results are due to the vocabulary and not the architecture (bert-large is, well, larger: 340M parameters vs. 110M for the base model)?

  2. If you have a different vocabulary, I assume you first retrain the original BERT and then continue domain-specific pre-training on PubMed abstracts. My question is: do you only change the vocabulary, or also the tokenizer used in the original BERT? Also, is this vocabulary specifically designed for biomedical texts, or is it a more robust version of the original one (meaning it can still be applied to general texts)?

Domain Specific Pre-training Model

Hi,

I have run the run_pretraining.py script on my domain-specific data.

It seems like only checkpoints are saved; I got two files, 0000020.params and 0000020.states.

How can I save the model, or build a model from the .params and .states files in the checkpoint folder, so that I can use it to get contextual embeddings?

Can someone please help me with this?

Using HuggingFace transformers library

I am trying to fine-tune the BioBERT model using free-text laboratory data in a data safe haven.
The biobert-pretrained model was downloaded into a local directory.

I tried to load the model weights using the following code:
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained('/localdirectorypath')

I get an error message that no file named tf_model.h5 or pytorch_model.bin is found in the local directory.

Am I not able to use the HuggingFace transformers library with the biobert-pretrained model?
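
A hedged note on the likely cause (an assumption about the release layout, not a confirmed answer): the BioBERT releases ship TensorFlow 1.x checkpoint files (model.ckpt-*, bert_config.json, vocab.txt) rather than the tf_model.h5 or pytorch_model.bin files that transformers looks for, so the checkpoint has to be converted first. A minimal conversion sketch, assuming a recent transformers with PyTorch installed and using hypothetical local paths:

from transformers import BertConfig, BertForPreTraining, BertTokenizer, load_tf_weights_in_bert

config = BertConfig.from_json_file('/localdirectorypath/bert_config.json')
model = BertForPreTraining(config)
# Pass the checkpoint prefix (not the .data-00000-of-00001 shard).
load_tf_weights_in_bert(model, config, '/localdirectorypath/model.ckpt-1000000')
model.save_pretrained('/localdirectorypath/hf')  # writes pytorch_model.bin + config.json

tokenizer = BertTokenizer('/localdirectorypath/vocab.txt', do_lower_case=False)
tokenizer.save_pretrained('/localdirectorypath/hf')

After that, TFAutoModelForSequenceClassification.from_pretrained('/localdirectorypath/hf', from_pt=True) should be able to load the weights and add a fresh classification head.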

Regarding Relation Extraction (RE), does it mean it's classifying whether the two marked entities have the defined relations?

First of all, thank you for BioBERT!

I am very new to BioBERT.

When I first heard about the name 'relation extraction', I thought it would be something like this:

Input: Gene123 may be a predictor of disease123, but not disease456.
Output from biobert: gene123, disease123


Question 1: Is my understanding below right?

As I learned more about biobert, it seems to me that the idea is:

Input: ___ may be a predictor of ___, but not disease456.
Output: 1 (yes, the two masked tokens have the defined relation).

Or:

Input: ___ may be a predictor of disease123, but not ___
Output: 0 (no, the two masked tokens do not have the defined relation).


Question 2: If my understanding above is right, and I want to achieve the goal of inputting a sentence and outputting a gene-disease pair, does it mean I need the following structure?

step 1: Entity Extraction of the genes and disease entities.
Output: gene123, disease123, disease456.

step 2: Try all possible pairings of the extracted entities:
gene123-disease123,
gene123-disease456,
disease123-disease456

and classify if the pairs form the relation.

Output:
gene123-disease123: 1
gene123-disease456: 0
disease123-disease456: 0

Thank you for taking the time to answer my questions!
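
For what it's worth, here is a hedged, purely schematic sketch of the two-step structure described in Question 2 (predict_entities and classify_relation are hypothetical stand-ins for an NER model and a fine-tuned RE classifier, and the placeholder-masking style is only an illustration of the general approach):

from itertools import combinations

def predict_entities(sentence):
    # Step 1 placeholder: an NER model returning (mention, entity_type) pairs.
    return [("Gene123", "GENE"), ("disease123", "DISEASE"), ("disease456", "DISEASE")]

def classify_relation(masked_sentence):
    # Step 2 placeholder: a binary classifier (e.g. a fine-tuned RE head).
    return 0  # dummy value for the sketch

sentence = "Gene123 may be a predictor of disease123, but not disease456."

for (a, type_a), (b, type_b) in combinations(predict_entities(sentence), 2):
    # Mask the candidate pair and ask the classifier whether the relation holds.
    masked = sentence.replace(a, "@" + type_a + "$").replace(b, "@" + type_b + "$")
    print(a, b, classify_relation(masked))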

Problem with loading model

Hello. I have a problem with loading the model. I think the download may be corrupted (bad checksum). I downloaded the PubMed 1M pre-trained weights, unzipped them, and then ran the following command from the repo:

python run_ner.py --do_train=true --do_eval=true --vocab_file=$BIOBERT_DIR/vocab.txt --bert_config_file=$BIOBERT_DIR/bert_config.json --init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000.data-00000-of-00001 --num_train_epochs=10.0 --data_dir=$NER_DIR/ --output_dir=tmp/bioner/

Then I get the following error:

tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file /home/eurvanov/python/biobert/data/biobert_v1.1_pubmed/model.ckpt-1000000.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
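
A hedged guess at the cause rather than a confirmed fix: --init_checkpoint should point to the checkpoint prefix ($BIOBERT_DIR/model.ckpt-1000000), not to the model.ckpt-1000000.data-00000-of-00001 shard; TensorFlow's "not an sstable (bad magic number)" error is typically what it reports when it is handed a single shard file instead of the prefix.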

BIOBERT corpus

Congratulations on the BioBERT work. I am trying to train BioBERT from scratch with slight modifications.
The vocab.txt is the original one.
Could you please help me obtain the PubMed abstracts, the PMC full-text articles, and the original BERT corpus? I can't seem to find a way to get these files.
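
A hedged pointer rather than an official answer: the PubMed abstracts and PMC full texts can be pulled from the FTP links listed in the Pre-training corpus section above (the original BERT corpus, English Wikipedia plus BooksCorpus, is not distributed here). A minimal download sketch; the file names below are hypothetical examples, so list the FTP directory first to get the real ones:

import urllib.request

base = "ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/"
for name in ["pubmed20n0001.xml.gz", "pubmed20n0002.xml.gz"]:  # hypothetical file names
    urllib.request.urlretrieve(base + name, name)  # saves each archive to the working directory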

KeyError when running NER on pretrained BioBERT model

Hello,

I am trying to run NER on the pretrained BioBERT model. I have tried using both biobert_v1.1_pubmed and biobert_large.

I first created the PyTorch model using these steps.

I then created and tested an NER pipeline:

My code:

from transformers import BertModel, BertTokenizer, BertConfig, pipeline

model = BertModel.from_pretrained('../../biobert_v1.1_pubmed')
tokenizer = BertTokenizer.from_pretrained('../../biobert_v1.1_pubmed')
config = BertConfig.from_pretrained('../../biobert_v1.1_pubmed')

nlp = pipeline(task="ner", model=model, config=config, tokenizer=tokenizer, framework="pt")

sequence = "some sequence of words"
test = nlp(sequence)
print(test)

Error:

Traceback (most recent call last):
  File "load_model.py", line 8, in <module>
    test = nlp(sequence)
  File "/home/dev/.local/lib/python3.6/site-packages/transformers/pipelines.py", line 794, in __call__
    if self.model.config.id2label[label_idx] not in self.ignore_labels:
KeyError: 97
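
A hedged diagnosis and sketch rather than a confirmed fix: BertModel has no token-classification head, so the pipeline ends up mapping predicted indices through the config's default two-entry id2label, and index 97 has no name, hence the KeyError. The general shape of a working setup is shown below with a hypothetical three-label scheme; note that the released checkpoint has no fine-tuned NER head, so the head weights are randomly initialised and the predictions are meaningless until you fine-tune on an NER dataset.

from transformers import BertForTokenClassification, BertTokenizer, pipeline

labels = ["O", "B-Entity", "I-Entity"]  # hypothetical label scheme for illustration
model = BertForTokenClassification.from_pretrained(
    '../../biobert_v1.1_pubmed',
    num_labels=len(labels),
    id2label={i: label for i, label in enumerate(labels)},
    label2id={label: i for i, label in enumerate(labels)},
)
tokenizer = BertTokenizer.from_pretrained('../../biobert_v1.1_pubmed')

nlp = pipeline(task="ner", model=model, tokenizer=tokenizer, framework="pt")
print(nlp("some sequence of words"))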

License

Hi,

is it possible to add a license such as Apache 2.0 to the repo, so that we can use these checkpoints for commercial purposes?
Thanks!

Is the vocab.txt correct?

Just a general question, I guess, but after inspecting the vocab.txt it doesn't seem to be particularly biomedical (it looks like the original BERT one); is this correct?

I'm trying to use these pre-trained models in an experiment for NER, and I'd like to be able to acquire a distributional vector given a sequence of tokens (ideally bolting it into an existing Keras model, but I'm not set on that idea).

using pretrained biobert matrix

Hello,

I am attempting to use your pre-trained BioBERT weight matrix in my code (BioBERT v1.1), but my code takes as input a matrix like GloVe.6B (https://nlp.stanford.edu/projects/glove/), which is a text file with each word followed by its weights. I wonder if there is a way to convert your pre-trained weights into this format, or if you have this format already.

Thanks very much in advance!
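
Not an officially provided export, just a hedged sketch of a workaround: after converting the checkpoint to a transformers/PyTorch model (see the conversion note earlier on this page), the static WordPiece embedding matrix can be written out in a GloVe-like text format. Keep in mind that this throws away the contextual part of BERT, which is where most of its value lies, and the rows are WordPiece tokens rather than whole words. The path below is hypothetical.

from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('/path/to/converted_biobert')      # hypothetical converted checkpoint
tokenizer = BertTokenizer.from_pretrained('/path/to/converted_biobert')

weights = model.get_input_embeddings().weight.detach().numpy()       # shape: (vocab_size, hidden_size)
with open('biobert_wordpiece_vectors.txt', 'w', encoding='utf-8') as f:
    for token, idx in tokenizer.vocab.items():                       # WordPiece token -> row index
        f.write(token + ' ' + ' '.join('%.6f' % x for x in weights[idx]) + '\n')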

Load Biobert pre-trained weights into Bert model with Pytorch bert hugging face run_classifier.py code

These are the steps I followed to get BioBERT working with the existing Hugging Face PyTorch BERT code.

  1. I downloaded the pre-trained weights 'biobert_pubmed_pmc.tar.gz' from the Releases page.

  2. I ran this command to convert the TF checkpoint to a PyTorch model:

python pytorch-pretrained-BERT/pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py --tf_checkpoint_path="biobert/pubmed_pmc_470k/biobert_model.ckpt.index" --bert_config_file="biobert/pubmed_pmc_470k/bert_config.json" --pytorch_dump_path="biobert/pubmed_pmc_470k/Pytorch/biobert.model"

This created a file 'biobert.model' in the specified path.

  3. As mentioned in this link, I compressed the 'biobert.model' created above and 'biobert/pubmed_pmc_470k/bert_config.json' together into a biobert_model.tar.gz.

  4. I then ran the run_classifier.py of Hugging Face BERT with the following command, using the tar.gz created above.

python pytorch-pretrained-BERT/examples/run_classifier.py --data_dir="Data/" --bert_model="biobert_model.tar.gz" --task_name="qqp" --output_dir="OutputModels/Pretrained/" --do_train --do_eval --do_lower_case

I get the error

'UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte' 

in the line

tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)

Am I doing something wrong?

I just wanted to run the run_classifier.py code provided by Hugging Face with the BioBERT pre-trained weights, in the same way that we run BERT with it. Is there a way to do this?
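
A hedged guess at the cause: BertTokenizer.from_pretrained is handed the tar.gz, so it tries to read the gzip bytes as a UTF-8 vocab file, and 0x8b is the second byte of a gzip header. Two likely fixes, offered as assumptions about pytorch-pretrained-BERT's conventions rather than verified steps: name the weights file inside the archive pytorch_model.bin (not biobert.model), and load the tokenizer from the released vocab.txt rather than from the archive, for example:

from pytorch_pretrained_bert import BertTokenizer

# Load the vocabulary directly from the release; BioBERT-Base uses the cased BERT vocab.
tokenizer = BertTokenizer.from_pretrained(
    'biobert/pubmed_pmc_470k/vocab.txt',
    do_lower_case=False,
)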

Files for BioBERT tokenizer

In order to use the BioBERT tokenizer, the program requires BioBERT tokenizer files.

tokenizer = BertTokenizer.from_pretrained('BioBERT_DIR/BioBERT_tokenizer_files')

These are the files generated when one saves a tokenizer using the following command.

tokenizer.save_pretrained('./my_saved_biobert_model_directory/')

This should save files with the following names:

  1. added_token.json
  2. special_tokens_map.json
  3. tokenizer_config.json

However, I am not able to find these files in the pre-trained BioBERT weights directory.

From this post, I understand that this is linked to issue #1. Does this mean one needs to use the tokenizer from BERT and not BioBERT? Which BERT tokenizer is compatible with BioBERT?

I will be grateful for your response.
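
A hedged suggestion rather than an official answer: per issue #1 referenced above, BioBERT-Base reuses the original cased BERT WordPiece vocabulary, so a compatible tokenizer can be built either from the vocab.txt shipped in the release or from the stock bert-base-cased tokenizer, and then saved to generate the files listed above. The local path below is hypothetical.

from transformers import BertTokenizer

# Option 1: build the tokenizer from the vocab.txt in the pre-trained weights directory.
tokenizer = BertTokenizer('/path/to/biobert_v1.1_pubmed/vocab.txt', do_lower_case=False)

# Option 2: use the stock cased BERT tokenizer, which uses the same vocabulary.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Writes vocab.txt, special_tokens_map.json and tokenizer_config.json.
tokenizer.save_pretrained('./my_saved_biobert_model_directory/')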

Using pre-trained BioBERT like distilBERT

Hello.
First of all, thank you for making Koreans proud with such an excellent paper and results.
I am writing because I have something to ask.
Can the pre-trained BioBERT also be used like distilBERT?
As you are well aware, it is difficult to provision large machines and GPUs in a typical research environment.
So, is there any way for ordinary researchers to take advantage of BioBERT's excellent performance?
Thank you.

How do you pre-process the PMC articles?

Hi, I have a question: the number of PMC articles is huge and the pre-processing procedure requires sentence segmentation of the paragraphs, so how do you finish your sentence segmentation quickly?
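
This is not the authors' pipeline, just a hedged sketch of one common way to make sentence segmentation tractable at PMC scale: run an off-the-shelf sentence splitter (NLTK punkt here) over the paragraphs in parallel with a process pool.

import multiprocessing as mp

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt', quiet=True)  # one-off download of the punkt sentence-splitting model

def split_paragraph(paragraph):
    # Segment a single paragraph into sentences.
    return sent_tokenize(paragraph)

if __name__ == '__main__':
    # Stand-in for the real PMC paragraphs.
    paragraphs = ["First sentence. Second sentence.", "Another paragraph with one sentence."]
    with mp.Pool() as pool:
        sentences = pool.map(split_paragraph, paragraphs, chunksize=100)
    print(sentences)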

Failed to find any matching files for biobert-pretrained/biobert_v1.1_pubmed/biobert_model.ckpt

Hi BioBERT team,
I get an error that the file biobert_v1.1_pubmed/biobert_model.ckpt cannot be found when I run the command below, as written in the README.md.
python run_ner.py --do_train=true --do_eval=true --vocab_file=$BIOBERT_DIR/vocab.txt --bert_config_file=$BIOBERT_DIR/bert_config.json --init_checkpoint=$BIOBERT_DIR/biobert_model.ckpt --num_train_epochs=10.0 --data_dir=$NER_DIR/ --output_dir=/tmp/bioner/
Would you please provide me with the file biobert_model.ckpt or tell me how to fix the issue? Thanks.
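
A hedged note, based on an assumption about the release layout rather than a confirmed answer: in the v1.1 releases the checkpoint files are named model.ckpt-1000000.* instead of biobert_model.ckpt.*, so pointing the flag at that prefix (--init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000) should let TensorFlow find the checkpoint; the biobert_model.ckpt name in the README matches the v1.0 releases.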

total time required for training

I have been running run_ner.py for training for about 7 to 8 hours in Colab and the model does not show any progress.
[screenshot of the training output]
The training output looks like the screenshot above; is there any error?

Is there any plan to upload pretrained weight in a different format?

Thanks for uploading the pre-trained weights.
The .ckpt files appear to be a TensorFlow model.
Because I am a torch user, I tried to find out how to convert the TF weights into torch weights (something like the weight for nn.Embedding), but I couldn't figure it out.

Is there any plan to upload the weights in, say, the GloVe txt format, where each row consists of a word followed by its weights?

If this could be done, I would fine-tune the embeddings while training my model for downstream tasks.
Thanks.
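
Not an official export, only a hedged sketch under the same caveat as the GloVe question above (the static WordPiece embeddings discard BERT's contextual behaviour): after converting the TF checkpoint to a transformers/PyTorch model, the embedding matrix can be pulled out and wrapped in torch.nn.Embedding for use in your own model. The path below is hypothetical.

import torch
from transformers import BertModel

model = BertModel.from_pretrained('/path/to/converted_biobert')   # hypothetical converted checkpoint
weight = model.get_input_embeddings().weight.detach().clone()     # shape: (vocab_size, hidden_size)

# Trainable lookup table that can be fine-tuned inside your own torch model.
embedding = torch.nn.Embedding.from_pretrained(weight, freeze=False)
print(embedding(torch.tensor([101, 102])).shape)                  # vectors for two WordPiece ids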
