This repository contains the code for a downstream classification task for models trained using https://github.com/amin-nejad/mimic-text-generation. The task is multilabel classification: predicting the phenotypes exhibited by a curated subset of MIMIC-III patients, as defined in https://github.com/sebastianGehrmann/phenotyping.
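For context, multilabel classification means each patient record maps to a multi-hot vector over the phenotype label set, rather than a single class. A minimal sketch of the target encoding (the label names below are illustrative placeholders, not the actual phenotype set from the annotations):

```python
# Encode per-patient phenotype annotations as multi-hot target vectors.
# The label names here are illustrative, not the real phenotype set.
LABELS = ["Advanced.Cancer", "Chronic.Pain", "Depression", "Obesity"]

def to_multi_hot(phenotypes, labels=LABELS):
    """Map a set of phenotype names to a binary indicator vector."""
    present = set(phenotypes)
    return [1 if label in present else 0 for label in labels]

print(to_multi_hot({"Depression", "Obesity"}))  # → [0, 0, 1, 1]
```

Each position in the vector is an independent binary target, which is why the task is trained with a per-label (rather than softmax) objective.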
Follow the steps below:
- Install the environment
conda env create -f environment.yml
- Run the scripts sequentially, in numbered order. This step assumes you have already run the text generation models and generated the synthetic data
- You may find it useful to run the training script as follows:
nohup python 02_train.py > train.out &
- You must install BioBERT separately as instructed below
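The "run the scripts sequentially" step above can be sketched as follows. The numbered-script pattern and the `01_preprocess.py` name are assumptions based on the `02_train.py` name; only `02_train.py` appears in this README:

```python
import re
import subprocess  # used only in the commented-out runner below

def ordered_scripts(filenames):
    """Return the numbered pipeline scripts (e.g. 02_train.py) in run order."""
    numbered = [f for f in filenames if re.match(r"\d{2}_.*\.py$", f)]
    return sorted(numbered)

# "01_preprocess.py" is a hypothetical example filename.
scripts = ordered_scripts(["02_train.py", "environment.yml", "01_preprocess.py"])
print(scripts)  # → ['01_preprocess.py', '02_train.py']

# To actually run them in sequence (once the synthetic data is in place):
# for script in scripts:
#     subprocess.run(["python", script], check=True)
```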
To use BioBERT, follow these steps:
- Download the BioBERT v1.1 (+ PubMed 1M) model (or any other model) from the BioBERT repo
- Extract the downloaded file, e.g. with
tar -xvzf biobert_v1.1_pubmed.tar.gz
- Convert the BioBERT TensorFlow checkpoint to a PyTorch model compatible with PyTorch-Transformers:
pytorch_transformers bert biobert_v1.1_pubmed/model.ckpt-1000000 biobert_v1.1_pubmed/bert_config.json biobert_v1.1_pubmed/pytorch_model.bin
- Rename the config file so that PyTorch-Transformers can find it:
mv biobert_v1.1_pubmed/bert_config.json biobert_v1.1_pubmed/config.json
- Rename the folder:
mv biobert_v1.1_pubmed/ biobert/
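Before loading the model, a quick sanity check that the conversion and rename steps produced the files PyTorch-Transformers expects. The file names come from the steps above (`vocab.txt` ships in the BioBERT tarball); the helper function itself is hypothetical:

```python
import os

def biobert_dir_ready(path):
    """Check that the converted BioBERT folder contains the expected files."""
    required = ["config.json", "pytorch_model.bin", "vocab.txt"]
    return all(os.path.isfile(os.path.join(path, f)) for f in required)

print(biobert_dir_ready("biobert"))  # True once the steps above are done
```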
If everything has worked, the following Python snippet should run without errors:
from pytorch_transformers import BertModel
model = BertModel.from_pretrained('biobert')