This repository provides pre-trained weights of BioBERT, a language representation model for the biomedical domain, designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, and question answering. Please refer to our paper, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, for more details.
Go to the releases section of this repository and download the pre-trained weights of BioBERT. We provide three pre-trained combinations: BERT + PubMed, BERT + PMC, and BERT + PubMed + PMC. Pre-training was based on the original BERT code provided by Google, and the details are described in our paper.
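Each release archive contains a BERT-style TensorFlow checkpoint together with its config and vocabulary files. As a quick sanity check after downloading, you can list the variables stored in the checkpoint. The sketch below is a minimal example, assuming the archive was extracted to `./biobert_pubmed/` with a checkpoint prefix of `model.ckpt`; match both names to the files in the release you downloaded.

```python
# Minimal checkpoint sanity check. The directory and checkpoint prefix are
# assumptions; match them to the extracted archive (e.g., model.ckpt.index,
# model.ckpt.data-*). tf.train.list_variables is TensorFlow's public
# checkpoint-inspection API.
import tensorflow as tf

ckpt_prefix = "./biobert_pubmed/model.ckpt"
for name, shape in tf.train.list_variables(ckpt_prefix):
    print(name, shape)
```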
We do not provide a pre-processed version of each corpus. However, each pre-training corpus can be found at the following links (a minimal download sketch follows the list):
PubMed Abstracts 1: ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
PubMed Abstracts 2: ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/
PubMed Central Full Texts: ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/
The estimated size of each corpus is 4.5 billion words for PubMed Abstracts 1 + PubMed Abstracts 2 combined, and 13.5 billion words for PubMed Central Full Texts.
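The FTP links above can be fetched with any FTP client. Below is a minimal sketch using Python's standard ftplib, pointed at the PubMed baseline directory from the first link. Since the baseline file names change each year, the sketch lists the directory instead of hard-coding a file name; downloading the first file is just an illustration.

```python
# Minimal FTP download sketch using Python's standard library only.
# The host and directory match the PubMed baseline link above.
from ftplib import FTP

ftp = FTP("ftp.ncbi.nlm.nih.gov")
ftp.login()  # anonymous login
ftp.cwd("pubmed/baseline")

# Baseline archives are gzipped XML; list them rather than guessing a name.
names = [n for n in ftp.nlst() if n.endswith(".xml.gz")]

# Download the first archive as an example (each file is tens of MB).
with open(names[0], "wb") as f:
    ftp.retrbinary("RETR " + names[0], f.write)
ftp.quit()
```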
To fine-tune BioBERT on biomedical text mining tasks using the provided pre-trained weights, refer to the DMIS GitHub repository for BioBERT.
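One step shared by all downstream tasks is tokenizing input text with BioBERT's vocabulary, which is the cased BERT WordPiece vocabulary shipped in each release. The sketch below uses the tokenization module from the original BERT code (google-research/bert); the vocab.txt path is an assumption, so point it at your extracted release archive.

```python
# Minimal tokenization sketch using tokenization.py from the original BERT
# code. The vocab path is an assumption; use the vocab.txt in your release.
import tokenization  # tokenization.py from google-research/bert

tokenizer = tokenization.FullTokenizer(
    vocab_file="./biobert_pubmed/vocab.txt",
    do_lower_case=False)  # BioBERT keeps the cased BERT vocabulary

print(tokenizer.tokenize("The BRCA1 gene is implicated in breast cancer."))
```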
For now, please cite the arXiv paper:
@article{lee2019biobert,
  title={BioBERT: a pre-trained biomedical language representation model for biomedical text mining},
  author={Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo},
  journal={arXiv preprint arXiv:1901.08746},
  year={2019}
}
For help or issues using the pre-trained weights of BioBERT, please submit a GitHub issue, or contact Jinhyuk Lee ([email protected]) or Sungdong Kim ([email protected]).