
BLUE, the Biomedical Language Understanding Evaluation benchmark

***** New Aug 13th, 2019: Change DDI metric from micro-F1 to macro-F1 *****

***** New July 11th, 2019: preprocessed PubMed texts *****

We uploaded the preprocessed PubMed texts that were used to pre-train the NCBI_BERT models.

***** New June 17th, 2019: data in BERT format *****

We uploaded some datasets that are ready to be used with the NCBI BlueBERT codes.

Introduction

The BLUE benchmark consists of five different biomedical text-mining tasks with ten corpora. Here, we rely on preexisting datasets because they have been widely used by the BioNLP community as shared tasks. These tasks cover a diverse range of text genres (biomedical literature and clinical notes), dataset sizes, and degrees of difficulty and, more importantly, highlight common biomedical text-mining challenges.

Tasks

| Corpus | Train | Dev | Test | Task | Metric | Domain |
|---|---|---|---|---|---|---|
| MedSTS | 675 | 75 | 318 | Sentence similarity | Pearson | Clinical |
| BIOSSES | 64 | 16 | 20 | Sentence similarity | Pearson | Biomedical |
| BC5CDR-disease | 4182 | 4244 | 4424 | NER | F1 | Biomedical |
| BC5CDR-chemical | 5203 | 5347 | 5385 | NER | F1 | Biomedical |
| ShARe/CLEFE | 4628 | 1075 | 5195 | NER | F1 | Clinical |
| DDI | 2937 | 1004 | 979 | Relation extraction | macro F1 | Biomedical |
| ChemProt | 4154 | 2416 | 3458 | Relation extraction | micro F1 | Biomedical |
| i2b2-2010 | 3110 | 11 | 6293 | Relation extraction | F1 | Clinical |
| HoC | 1108 | 157 | 315 | Document classification | F1 | Biomedical |
| MedNLI | 11232 | 1395 | 1422 | Inference | accuracy | Clinical |

Sentence similarity

BIOSSES is a corpus of sentence pairs selected from the Biomedical Summarization Track Training Dataset in the biomedical domain. Here, we randomly select 80% for training and 20% for testing because there are no standard splits in the released data.
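
As a rough illustration of such a split (not the exact script used for BLUE), an 80/20 random split with a fixed seed could be done as follows; the input file name and column layout here are assumptions made for the example:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical input: one sentence pair per row with a gold similarity score.
# The file name and column names are illustrative, not the actual BLUE layout.
pairs = pd.read_csv("biosses_pairs.tsv", sep="\t",
                    names=["sentence1", "sentence2", "score"])

# 80% train / 20% test, with a fixed seed so the split is reproducible.
train, test = train_test_split(pairs, test_size=0.2, random_state=42)
train.to_csv("train.tsv", sep="\t", index=False)
test.to_csv("test.tsv", sep="\t", index=False)
```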

MedSTS is a corpus of sentence pairs selected from Mayo Clinic's clinical data warehouse. Please visit the website to obtain a copy of the dataset. We use the standard training and testing sets from the shared task.

Named entity recognition

BC5CDR is a collection of 1,500 PubMed titles and abstracts selected from the CTD-Pfizer corpus and was used in the BioCreative V chemical-disease relation task. We use the standard training and test sets from the BC5CDR shared task.

The ShARe/CLEF eHealth Task 1 Corpus is a collection of 299 deidentified clinical free-text notes from the MIMIC II database. Please visit the website to obtain a copy of the dataset. We use the standard training and test sets from ShARe/CLEF eHealth Task 1.

Relation extraction

The DDI extraction 2013 corpus is a collection of 792 texts selected from the DrugBank database and 233 additional MEDLINE abstracts. In our benchmark, we use 624 train files and 191 test files to evaluate performance and report the macro-average F1-score over the four DDI types.
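
For reference, the macro-average F1 over the four positive DDI types (ignoring the negative "false" class) can be computed along these lines; the label names follow the usual DDIExtraction 2013 naming and the gold/predicted labels are illustrative only:

```python
from sklearn.metrics import f1_score

# Hypothetical gold and predicted labels; 'false' marks non-interacting pairs.
gold = ["DDI-effect", "false", "DDI-mechanism", "DDI-advise", "false"]
pred = ["DDI-effect", "DDI-int", "DDI-mechanism", "false", "false"]

ddi_types = ["DDI-advise", "DDI-effect", "DDI-int", "DDI-mechanism"]

# Macro-average F1 restricted to the four positive DDI types,
# so the negative 'false' class does not contribute to the score.
macro_f1 = f1_score(gold, pred, labels=ddi_types, average="macro")
print(f"macro F1 over the four DDI types: {macro_f1:.3f}")
```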

ChemProt consists of 1,820 PubMed abstracts with chemical-protein interactions and was used in the BioCreative VI text-mining chemical-protein interactions shared task. We use the standard training and test sets from the ChemProt shared task and evaluate the same five classes: CPR:3, CPR:4, CPR:5, CPR:6, and CPR:9.

The i2b2 2010 shared task collection consists of 170 documents for training and 256 documents for testing, which is a subset of the original dataset. The dataset was collected from three different hospitals and was annotated by medical practitioners for eight types of relations between problems and treatments.

Document multilabel classification

HoC (the Hallmarks of Cancer corpus) consists of 1,580 PubMed abstracts annotated with ten currently known hallmarks of cancer. We use 315 (~20%) abstracts for testing and the remaining abstracts for training. For the HoC task, we follow common practice and report the example-based F1-score at the abstract level.
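
Example-based (per-abstract) F1 compares the predicted and gold label sets for each abstract and averages the resulting F1 values over abstracts. A minimal sketch of this idea, assuming label sets are given as Python sets and treating two empty sets as a perfect match (a convention, not necessarily the one used by the official evaluation script); the hallmark label names below are illustrative:

```python
def example_f1(gold_labels, pred_labels):
    """Per-abstract F1 between a gold label set and a predicted label set."""
    if not gold_labels and not pred_labels:
        return 1.0  # assumed convention: both empty counts as a perfect match
    overlap = len(gold_labels & pred_labels)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_labels)
    recall = overlap / len(gold_labels)
    return 2 * precision * recall / (precision + recall)

# Hypothetical predictions for two abstracts, each labeled with hallmark classes.
gold = [{"sustaining proliferative signaling"},
        {"evading growth suppressors", "genomic instability"}]
pred = [{"sustaining proliferative signaling"},
        {"genomic instability"}]

score = sum(example_f1(g, p) for g, p in zip(gold, pred)) / len(gold)
print(f"example-based F1: {score:.3f}")
```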

Inference task

MedNLI is a collection of sentence pairs selected from MIMIC-III. We use the same training, development, and test sets as Romanov and Shivade.

Datasets

Some datasets can be downloaded at https://github.com/ncbi-nlp/BLUE_Benchmark/releases/tag/0.1

Baselines

| Corpus | Metric | SOTA* | ELMo | BioBERT | NCBI_BERT(base) (P) | NCBI_BERT(base) (P+M) | NCBI_BERT(large) (P) | NCBI_BERT(large) (P+M) |
|---|---|---|---|---|---|---|---|---|
| MedSTS | Pearson | 83.6 | 68.6 | 84.5 | 84.5 | 84.8 | 84.6 | 83.2 |
| BIOSSES | Pearson | 84.8 | 60.2 | 82.7 | 89.3 | 91.6 | 86.3 | 75.1 |
| BC5CDR-disease | F | 84.1 | 83.9 | 85.9 | 86.6 | 85.4 | 82.9 | 83.8 |
| BC5CDR-chemical | F | 93.3 | 91.5 | 93.0 | 93.5 | 92.4 | 91.7 | 91.1 |
| ShARe/CLEFE | F | 70.0 | 75.6 | 72.8 | 75.4 | 77.1 | 72.7 | 74.4 |
| DDI | F | 72.9 | 62.0 | 78.8 | 78.1 | 79.4 | 79.9 | 76.3 |
| ChemProt | F | 64.1 | 66.6 | 71.3 | 72.5 | 69.2 | 74.4 | 65.1 |
| i2b2 2010 | F | 73.7 | 71.2 | 72.2 | 74.4 | 76.4 | 73.3 | 73.9 |
| HoC | F | 81.5 | 80.0 | 82.9 | 85.3 | 83.1 | 87.3 | 85.3 |
| MedNLI | acc | 73.5 | 71.4 | 80.5 | 82.2 | 84.0 | 81.5 | 83.8 |

P: PubMed, P+M: PubMed + MIMIC-III

*SOTA: state-of-the-art as of April 2019, to the best of our knowledge

Fine-tuning with ELMo

We adopted the ELMo model pre-trained on PubMed abstracts for the BLUE tasks. The ELMo embedding of each token is used as input to the fine-tuning model. We retrieved the output states of both ELMo layers and concatenated them into one vector per word. We used a maximum sequence length of 128 with padding. The learning rate was set to 0.001 with the Adam optimizer. We trained for 20 epochs with a batch size of 64 and stopped early if the training loss did not decrease.
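
A minimal sketch of this setup in PyTorch, assuming the frozen ELMo embeddings (both layers concatenated) are precomputed and fed to a small task classifier. The classifier head, the 2048-d embedding size, and the dummy batch iterator are assumptions for illustration, not the original BLUE code:

```python
import torch
import torch.nn as nn

# Assumed dimensions: two 1024-d ELMo layers concatenated per token.
MAX_LEN, EMB_DIM, EPOCHS = 128, 2048, 20

class ELMoClassifier(nn.Module):
    """Illustrative classifier over fixed (frozen) ELMo token embeddings."""
    def __init__(self, num_labels):
        super().__init__()
        self.encoder = nn.LSTM(EMB_DIM, 256, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 256, num_labels)

    def forward(self, embeddings):  # embeddings: (batch, MAX_LEN, EMB_DIM)
        _, (hidden, _) = self.encoder(embeddings)
        pooled = torch.cat([hidden[0], hidden[1]], dim=-1)
        return self.head(pooled)

model = ELMoClassifier(num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam, lr 0.001
criterion = nn.CrossEntropyLoss()

# Dummy batch standing in for real ELMo-embedded, padded training data.
train_batches = [(torch.randn(8, MAX_LEN, EMB_DIM), torch.randint(0, 2, (8,)))]

best_loss = float("inf")
for epoch in range(EPOCHS):
    epoch_loss = 0.0
    for embeddings, labels in train_batches:
        optimizer.zero_grad()
        loss = criterion(model(embeddings), labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    if epoch_loss >= best_loss:  # stop early if the training loss stops decreasing
        break
    best_loss = epoch_loss
```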

Fine-tuning with BERT

Please see https://github.com/ncbi-nlp/ncbi_bluebert.

Citing BLUE

@InProceedings{peng2019transfer,
  author    = {Yifan Peng and Shankai Yan and Zhiyong Lu},
  title     = {Transfer Learning in Biomedical Natural Language Processing: 
               An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
  booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
  year      = {2019},
}

Acknowledgments

This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine and Clinical Center, and by the National Library of Medicine of the National Institutes of Health under award number K99LM013001-01.

We are also grateful to the authors of BERT and ELMo for making their data and code publicly available. We would like to thank Geeticka Chauhan for providing thoughtful comments.

Disclaimer

This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available.


Issues

Download MedSTS?

Hello, do you know where I can download the MedSTS dataset? I don't see anything about downloading the data on their actual website. I could really use the data as I'm trying to train a sentence similarity model for coronavirus searches.

Thanks

Is the i2b2-2010 dataset used?

Hi,

I'm working on recreating the datasets and I think there's a discrepancy between the NCBI BERT GitHub code and the benchmark code. The i2b2 processing code uses the 2010 dataset, while the README in the NCBI BERT dataset seems to use the 2012 i2b2 data. Looking at the code for running on the i2b2 data, there are no mentions of the labels used in the processing code, and the task seems to have changed from relation extraction to named entity recognition. The paper also discusses i2b2 as a relation extraction dataset; is there code available for modeling this task?

I'm also a bit confused about why the processing code replaces tokens in the input text with special tokens like @problem$. This could be part of the task, but it seems to me that keeping those tokens would provide important information.

Thank you for your help.

Best,
Oliver

create_chemprot_bert.py does not reproduce the train.tsv, dev.tsv and test.tsv files

My issue is that I could not reproduce the train.tsv, dev.tsv, and test.tsv files of ChemProt in bert_data.zip.

Let me take generating train.tsv as an example. I ran create_chemprot_bert.py to generate the file train.tsv with the following command line:

python blue/bert/create_chemprot_bert.py data/ChemProt/original data/ChemProt

(data/ChemProt/original is obtained from data_v0.2.zip at this link). However, the number of lines in the generated train.tsv is 19,019, which differs from the number of lines (19,461) in the train.tsv provided in bert_data.zip at the same link.

For example, "23580446.T4.T33" is found in my train.tsv file but not in the provided train.tsv file, and "23261590.T1.T21" is not found in my train.tsv but is found in the provided train.tsv. Notably, all cases that are mismatched in either file have the 'false' relation label.

How can I modify the datasets myself?

Hi there, thanks for this biomedical benchmark. I'm really new to the BERT model and trying a new project on the DDI-2013 corpus. I downloaded the preprocessed data from your repo and found @Drug$ in sentences. Should I change them to [MASK] for BERT?

I'm going to fine-tune it myself (I'm also learning PyTorch at the moment), but I'm not sure how to deal with the data content. Can you help me?

Unknown Library in create_clef_bert.py

Hi,

There's an unknown library in blue/bert/create_clef_bert.py called ppathlib.

I wasn't able to find an import for it, and I'm not sure where it comes from. Is this a bug in the code, or am I missing something?

Thank you for your help.

Best,
Oliver

if __name__ == '__main__':
    data_path = ppathlib.data() / 'bionlp2019/data/ShAReCLEFEHealthCorpus/Origin'

Why is ChemProt evaluated with micro F1?

Dear authors,

Happy new year! Thanks for sharing these datasets.

I am a bit confused about the ChemProt dataset, where the micro-average F1 is used for evaluation. However, in this dataset (bert_data/ChemProt/), each entity pair (row) in a sentence only carries one relation label, so it is a multi-class classification task that should be evaluated with macro-average or weighted-average F1. I also see that the original paper [1] indeed uses the macro-average F1 as the evaluation metric. Did you regard all relation labels of a sentence as a single instance during your evaluation in this benchmark? Why?

[1] Chem-Prot: Peng et al. 2018. Extracting chemical-protein relations with ensembles of SVM and deep learning models. Database: the journal of biological databases and curation, 2018.

Your prompt response will be highly appreciated.

Best,
Zaiqiao

BIOSSES URLs are dead

Dear maintainers,

The URLs for the train, dev, and test files of the BIOSSES dataset are dead. Do you still have the files? I would like to publish the dataset on Zenodo and the HuggingFace Hub to perpetuate the corpus.

Regards,

Email : [email protected]

Missing files in DDI corpus

Hello, I iterated over the provided DDI files (in the Original folder) and found 45 test set relations to be missing. All the missing relations start with 'DrugDDI' (index field). Could you provide the original files for these as well?

The missing indices are as follows:
'DrugDDI.d21928724.s0.p0', 'DrugDDI.d21928724.s0.p1', 'DrugDDI.d21928724.s0.p2', 'DrugDDI.d21928724.s2.p0', 'DrugDDI.d21928724.s2.p1', 'DrugDDI.d21928724.s2.p2', 'DrugDDI.d21928724.s3.p0', 'DrugDDI.d21928724.s3.p1', 'DrugDDI.d21928724.s3.p10', 'DrugDDI.d21928724.s3.p11', 'DrugDDI.d21928724.s3.p12', 'DrugDDI.d21928724.s3.p13', 'DrugDDI.d21928724.s3.p14', 'DrugDDI.d21928724.s3.p15', 'DrugDDI.d21928724.s3.p16', 'DrugDDI.d21928724.s3.p17', 'DrugDDI.d21928724.s3.p18', 'DrugDDI.d21928724.s3.p19', 'DrugDDI.d21928724.s3.p2', 'DrugDDI.d21928724.s3.p20', 'DrugDDI.d21928724.s3.p21', 'DrugDDI.d21928724.s3.p22', 'DrugDDI.d21928724.s3.p23', 'DrugDDI.d21928724.s3.p24', 'DrugDDI.d21928724.s3.p25', 'DrugDDI.d21928724.s3.p26', 'DrugDDI.d21928724.s3.p27', 'DrugDDI.d21928724.s3.p3', 'DrugDDI.d21928724.s3.p4', 'DrugDDI.d21928724.s3.p5', 'DrugDDI.d21928724.s3.p6', 'DrugDDI.d21928724.s3.p7', 'DrugDDI.d21928724.s3.p8', 'DrugDDI.d21928724.s3.p9', 'DrugDDI.d21928724.s6.p0', 'DrugDDI.d21928724.s6.p1', 'DrugDDI.d21928724.s6.p2', 'DrugDDI.d21928724.s6.p3', 'DrugDDI.d21928724.s6.p4', 'DrugDDI.d21928724.s6.p5', 'DrugDDI.d21928724.s6.p6', 'DrugDDI.d21928724.s6.p7', 'DrugDDI.d21928724.s6.p8', 'DrugDDI.d21928724.s6.p9', 'DrugDDI.d21928724.s8.p0'

How to split the MedSTS dataset into Train and Dev?

Hi there, thanks for your great biomedical benchmark.
I am now working on implementing this benchmark using transformers for my study.

I acquired the ClinicalSTS dataset with permission to use it from the 1st author.
The ClinicalSTSclinicalSTS.train.txt in the dataset appears to contain 750 cases.
I couldn't find any processing code for it in this repository.

How did you divide this into 675 for Train and 75 for Dev?
Could you tell me more about it?

Thank you and best regards,
Wada

Share the splits of HoC dataset

Hi,

Thanks for sharing the codes and releasing these benchmark datasets.

I have two questions about the HoC dataset, and I would highly appreciate it if you could answer them for me.

  • I notice that you use 315 (~20%) abstracts for testing and the remaining abstracts for training, but I could not find the data split in this repository. May I ask if you could share these data splits with me, so that I can make a fairer comparison?

  • I see that the original HoC dataset contains both document-level and sentence-level labels, so may I ask which one this work evaluates under?

Thank you for your great work. I am looking forward to hearing from you soon.

Zaiqiao

Workflow for creating train/dev/test datasets

Hi, I'm not sure if this is an issue so much as a workflow question, so apologies in advance if it doesn't fit here.

But what scripts should be run to create all the different train/dev/test datasets? I see a bash script for creating some test sets, but not for creating training sets; those do seem to have Python scripts, though. Is there code for unifying this workflow?

Best,
Oliver
