
drug-combo-extraction's Introduction

Drug Combination Extraction

Want to help researchers and clinicians plan treatments for acute medical conditions? Want to contribute to the health community by reducing drug research times? You came to the right place! This project provides a dataset of drug combinations reported to work well together in the biomedical literature (we've also created solid baseline models). To participate, train a model on the data that, given a new sentence, predicts which of the drugs mentioned in it are used in combination, and whether they combine in a positive/beneficial way. Then take a look at our Leaderboard.


Dependencies

Create a Conda environment:

conda create --name drug_combo python=3.8.5
conda activate drug_combo

In this environment, install all required dependencies via pip:

pip install -r requirements.txt
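
To quickly confirm that the environment is set up and a GPU is visible, you can run an optional sanity check like the one below (this assumes torch and transformers are among the pinned requirements):

# optional sanity check for the drug_combo environment
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())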

Dataset

Our dataset splits are data/final_train_set.jsonl and data/final_test_set.jsonl.
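
Both splits are JSON Lines files, with one example per line. A minimal loading sketch (the per-record field names are not documented here, so inspect a record before relying on a particular schema):

# minimal sketch: load the JSONL splits and count examples
import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

train = load_jsonl("data/final_train_set.jsonl")
test = load_jsonl("data/final_test_set.jsonl")
print(len(train), "training examples,", len(test), "test examples")
print(sorted(train[0].keys()))  # inspect the schema of one record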


Models and Code

Pretrained Baseline Model

You can find our strongest off-the-shelf model for this task on Huggingface.
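
As a sketch, a checkpoint hosted on the Hugging Face Hub can be pulled down with the transformers library. The identifier below is a placeholder rather than the released model's real name, and a task-specific classification head may require this repository's own loading code instead of a plain AutoModel:

# sketch: load a Hub checkpoint; "<released-model-id>" is a placeholder
from transformers import AutoModel, AutoTokenizer

model_id = "<released-model-id>"  # substitute the id linked in this section
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)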

Training

To reproduce or tweak the baseline model above, you can train your own with our provided scripts. We recommend training on a GPU machine. We trained our models on machines with a 15GB Nvidia Tesla T4 GPU running Ubuntu 18.04.

Single command to train a relation extractor based on PubMedBERT:

python scripts/train.py \
    --model-name pubmedbert_2021 \
    --pretrained-lm microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext \
    --num-train-epochs 10 \
    --lr 2e-4 \
    --batch-size 18 \
    --training-file data/final_train_set.jsonl \
    --test-file data/final_test_set.jsonl \
    --context-window-size 400 \
    --max-seq-length 512 \
    --label2idx data/label2idx.json \
    --seed 2022 \
    --unfreezing-strategy final-bert-layer

Full training script options:

python scripts/train.py \
                --model-name            MODEL_NAME
                [--pretrained-lm        PRETRAINED_LM (defaults to "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")] \
                [--num-train-epochs     NUM_TRAIN_EPOCHS (defaults to 10)] \
                [--lr                   LEARNING RATE (defaults to 2e-4)] \
                [--batch-size           BATCH SIZE (defaults to 18)] \
                [--training-file        data/final_train_set.jsonl] \
                [--test-file            data/final_test_set.jsonl] \
                [--context-window-size  CONTEXT LENGTH (defaults to 400)] \
                [--max-seq-length       512] \
                [--seed                 RANDOM SEED (defaults to 2021)] \
                [--unfreezing-strategy  UNFREEZING STRATEGY (defaults to final-bert-layer)]
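
For orientation, the sketch below shows one generic way a BERT-based drug-combination relation extractor can be framed: wrap the candidate drug mentions in marker tokens, encode the sentence with PubMedBERT, and classify a pooled representation. The marker tokens, label count, and classification head here are assumptions for illustration and need not match what scripts/train.py actually implements:

# illustrative sketch only; not the exact architecture in scripts/train.py
import torch
from transformers import AutoModel, AutoTokenizer

LM = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(LM)
encoder = AutoModel.from_pretrained(LM)

# hypothetical entity markers wrapped around each candidate drug mention
tokenizer.add_special_tokens({"additional_special_tokens": ["<<m>>", "<</m>>"]})
encoder.resize_token_embeddings(len(tokenizer))

sentence = "Combined <<m>> cisplatin <</m>> and <<m>> gemcitabine <</m>> improved survival."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden_size)

# classify the [CLS] vector into relation classes
# (e.g. no combination / combination / positive combination)
classifier = torch.nn.Linear(encoder.config.hidden_size, 3)
logits = classifier(hidden[:, 0])
print(logits.shape)  # torch.Size([1, 3])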

Testing and Evaluation

To evaluate the PubMedBERT-based relation extractor, run the testing script followed by the evaluation script:

python scripts/test_only.py \
    --checkpoint-path checkpoints_pubmedbert_2022 \
    --test-file ${HOME_DIR}/drug-synergy-models/data/final_test_set.jsonl \
    --batch-size 100 \
    --outputs-directory checkpoints_pubmedbert_2022/outputs/ \
    --seed 2022

(cd scripts && ./eval.sh ../data/final_test_set.jsonl ../checkpoints_pubmedbert_2022/outputs/predictions.jsonl > ../checkpoints_pubmedbert_2022/outputs/eval.txt)

Full options for testing and evaluation scripts:

Testing:
python scripts/test_only.py
            [--checkpoint-path      PATH TO CHECKPOINT CREATED IN TRAINING (checkpoints_${MODEL_NAME})] \
            [--test-file            data/final_test_set.jsonl] \
            [--batch-size           TEST BATCH SIZE (100)] \
            [--outputs-directory    OUTPUT DIRECTORY (checkpoints_${MODEL_NAME}/outputs/)] \
            [--seed                 RANDOM SEED (defaults to 2021)]

Evaluation (using exact-match or partial-match metrics):
(cd scripts && ./eval.sh \
            ${HOME_DIR}/drug-synergy-models/data/final_test_set.jsonl \
            checkpoints_${MODEL_NAME}/outputs/predictions.jsonl \
            [${OPTIONAL_OUTPUT_PATH}])
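
As a rough illustration of the distinction between the two metrics (a conceptual sketch, not what eval.sh computes): under exact match, a predicted relation counts only if its drug set and label are identical to a gold relation, while partial match gives credit to overlapping drug sets. A minimal exact-match F1 over (label, drug-set) pairs, with hypothetical label names, might look like:

# conceptual exact-match F1 over relations; not the repository's eval.sh
def exact_match_f1(gold, pred):
    """gold/pred: iterables of (label, frozenset_of_drugs) tuples."""
    gold, pred = set(gold), set(pred)
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)
    precision, recall = tp / len(pred), tp / len(gold)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

gold = [("POS", frozenset({"cisplatin", "gemcitabine"}))]
pred = [("POS", frozenset({"cisplatin", "gemcitabine"})), ("COMB", frozenset({"aspirin"}))]
print(exact_match_f1(gold, pred))  # 0.666... (perfect recall, half precision)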

Analysis

We can now run analysis to characterize model behavior along different aspects, using the outputs of several different models trained with different seeds:

python scripts/produce_gold_jsonl.py ${HOME_DIR}/drug-synergy-models/data/final_test_set.jsonl ${HOME_DIR}/drug-synergy-models/data/final_test_rows.jsonl

python scripts/bucketing_analysis.py --pred-files \
    $MODEL_ONE_OUTPUT/outputs/predictions.jsonl \
    ... \
    $MODEL_N_OUTPUT/outputs/predictions.jsonl \
    --gold-file ${HOME_DIR}/drug-synergy-models/data/final_test_rows.jsonl \
    --bucket-type {arity OR relations_seen_in_training} \
    [--exact-match]
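
Bucketing by arity groups relations by the number of participating drugs. A hedged sketch of that grouping, assuming each gold row in final_test_rows.jsonl lists its participating drugs under a "spans" key (the real field names may differ, so adjust to the actual schema):

# sketch: count gold relations by arity; field names are assumptions
import json
from collections import Counter

arity_counts = Counter()
with open("data/final_test_rows.jsonl") as f:
    for line in f:
        row = json.loads(line)
        arity_counts[len(row.get("spans", []))] += 1

for arity, count in sorted(arity_counts.items()):
    print(f"arity {arity}: {count} relations")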

To train 8 models with different foundation models (SciBERT, PubMedBERT, etc.), run:

./scripts/launch_trainings_with_foundation_models.sh

Domain-Adaptive Pretraining

You can find our strongest domain-adapted contextualized encoder (a PubMedBERT variant adapted via continued pretraining) on Huggingface.

To perform domain-adaptive pretraining yourself, unzip the pretraining data we have prepared in our data directory: continued_pretraining_large_lowercased_train.txt.tgz and continued_pretraining_large_lowercased_val.txt.tgz (these text files are 166M and 42M unzipped).

Then do:

git clone https://github.com/huggingface/transformers.git
cd transformers/examples/pytorch/language-modeling/

Then run:

python run_mlm.py \
    --model_name_or_path microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext \
    --train_file $PATH_TO/continued_pretraining_large_lowercased_train.txt \
    --validation_file $PATH_TO/continued_pretraining_large_lowercased_val.txt \
    --do_train \
    --do_eval \
    --output_dir ~/continued_pretraining_directory_pubmedbert_10_epochs \
    --max_seq_length 512 \
    --overwrite_output_dir
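
Once pretraining finishes, the output directory behaves like any other Hugging Face model path, so it can be passed to scripts/train.py via --pretrained-lm (assuming train.py loads that argument with from_pretrained). A quick check that the adapted checkpoint loads:

# sketch: verify the domain-adapted encoder loads from disk
import os
from transformers import AutoModelForMaskedLM, AutoTokenizer

ckpt = os.path.expanduser("~/continued_pretraining_directory_pubmedbert_10_epochs")
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForMaskedLM.from_pretrained(ckpt)
print("loaded", model.config.model_type, "with", model.num_parameters(), "parameters")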

Cite Our Paper

If you use the data or models from this work, please cite A Dataset for N-ary Relation Extraction of Drug Combinations.

@inproceedings{Tiktinsky2022ADF,
    title = "A Dataset for N-ary Relation Extraction of Drug Combinations",
    author = "Tiktinsky, Aryeh and Viswanathan, Vijay and Niezni, Danna and Meron Azagury, Dana and Shamay, Yosi and Taub-Tabib, Hillel and Hope, Tom and Goldberg, Yoav",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.233",
    doi = "10.18653/v1/2022.naacl-main.233",
    pages = "3190--3203",
}

drug-combo-extraction's Issues

How to get the smallest set of disjoint relations?

How do you use a greedy heuristic to choose the smallest set of disjoint relations whose union covers as many drug entities as possible in the sentence? I'm confused about this, so could you explain how the process works and, if possible, give a simple example? Thanks.
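
For intuition, here is a generic sketch of such a greedy heuristic (an illustration only, not this repository's actual implementation): repeatedly pick the candidate relation that covers the most not-yet-covered drug entities while staying disjoint from the relations already chosen, and stop when no candidate adds coverage.

# generic greedy disjoint-cover heuristic; illustration only
def greedy_disjoint_cover(relations):
    """relations: list of sets of drug entities; returns a small list of
    mutually disjoint relations, greedily maximizing covered entities."""
    chosen, covered = [], set()
    candidates = [set(r) for r in relations]
    while True:
        # keep only candidates disjoint from everything chosen so far
        candidates = [r for r in candidates if not (r & covered)]
        if not candidates:
            break
        best = max(candidates, key=len)  # adds the most new entities
        chosen.append(best)
        covered |= best
    return chosen

rels = [{"A", "B"}, {"B", "C", "D"}, {"E"}]
print(greedy_disjoint_cover(rels))  # e.g. [{'B', 'C', 'D'}, {'E'}]; {'A', 'B'} is dropped (overlap)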

About the leaderboard

Hello, I recently evaluated my model on this dataset and want to confirm whether the F1 value I calculated is accurate. Should I submit the prediction file to your leaderboard page?
By the way, the leaderboard says that predictions of class 0 can be omitted, yet your model still retains some class-0 predictions when running the test script. If I submit, should I delete them or keep them as you do?

Environment error

I tried to run the training command, but encountered this error:
from torchmetrics.utilities.data import get_num_classes as _get_num_classes
ImportError: cannot import name 'get_num_classes' from 'torchmetrics.utilities.data'
