COVID-Twitter-BERT

COVID-Twitter-BERT (CT-BERT) is a transformer-based model pretrained on a large corpus of Twitter messages on the topic of COVID-19. The v2 model is trained on 97M tweets (1.2B training examples).

When used on domain-specific datasets, our evaluation shows that this model yields a marginal performance increase of 10–30% compared to the standard BERT-Large model. Most improvements are seen on COVID-19-related and Twitter-like messages.

This repository contains all code and references to models and datasets used in our paper as well as notebooks to finetune CT-BERT on your own datasets. If you end up using our work, please cite it:

Martin Müller, Marcel Salathé, and Per E Kummervold. 
COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter. 
arXiv preprint arXiv:2005.07503 (2020).

Colab

For a demo on how to train a classifier on top of CT-BERT, please take a look at this Colab. It finetunes a model on the SST-2 dataset and can easily be modified for finetuning on your own data. A minimal offline sketch of such a setup follows the Colab links below.

Using Huggingface (on GPU)

Open In Colab

Using TensorFlow 2.2 (on TPUs) - ⚠️ Currently not working due to a TensorFlow 2.3 incompatibility ⚠️

Open In Colab
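
If you only want a rough offline picture of what such a finetuning setup looks like, here is a minimal sketch using PyTorch and the Hugging Face sequence classification head. It is not the Colab itself; the toy texts, labels and hyperparameters are purely illustrative.

# Minimal finetuning sketch (illustrative only, not the Colab notebook)
import torch
from transformers import BertTokenizer, BertForSequenceClassification

model_name = 'digitalepidemiologylab/covid-twitter-bert-v2'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["masks help slow the spread", "this comment is not about health at all"]
labels = torch.tensor([1, 0])  # toy binary labels

batch = tokenizer(texts, padding=True, truncation=True, max_length=96, return_tensors='pt')

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)  # returns loss and logits
outputs.loss.backward()
optimizer.step()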

Usage

If you are familiar with finetuning transformer models, the CT-BERT model is available as a downloadable archive, on TF Hub, and as a model on Hugging Face.

Version Base model Language TF2 Huggingface TFHub
v1 BERT-large-uncased-WWM en TF2 Checkpoint Huggingface TFHub
v2 BERT-large-uncased-WWM en TF2 Checkpoint Huggingface TFHub

Huggingface

You can load the pretrained model from huggingface:

from transformers import BertForPreTraining
model = BertForPreTraining.from_pretrained('digitalepidemiologylab/covid-twitter-bert-v2')
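
The matching tokenizer can be loaded the same way. A minimal sketch (the example sentence is purely illustrative; model is the object loaded above):

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('digitalepidemiologylab/covid-twitter-bert-v2')
inputs = tokenizer("Wear a mask in crowded places", return_tensors='pt')  # illustrative input
outputs = model(**inputs)  # BertForPreTraining returns prediction and seq-relationship logits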

You can predict masked tokens using the built-in fill-mask pipeline:

from transformers import pipeline
import json

pipe = pipeline(task='fill-mask', model='digitalepidemiologylab/covid-twitter-bert-v2')
out = pipe(f"In places with a lot of people, it's a good idea to wear a {pipe.tokenizer.mask_token}")
print(json.dumps(out, indent=4))
[
    {
        "sequence": "[CLS] in places with a lot of people, it's a good idea to wear a mask [SEP]",
        "score": 0.9998226761817932,
        "token": 7308,
        "token_str": "mask"
    },
    ...
]

TF-Hub

import tensorflow as tf
import tensorflow_hub as hub

max_seq_length = 96  # Your choice here.
input_word_ids = tf.keras.layers.Input(
  shape=(max_seq_length,),
  dtype=tf.int32,
  name="input_word_ids")
input_mask = tf.keras.layers.Input(
  shape=(max_seq_length,),
  dtype=tf.int32,
  name="input_mask")
input_type_ids = tf.keras.layers.Input(
  shape=(max_seq_length,),
  dtype=tf.int32,
  name="input_type_ids")
bert_layer = hub.KerasLayer("https://tfhub.dev/digitalepidemiologylab/covid-twitter-bert/1", trainable=True)
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, input_type_ids])
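
From here, a small classification head can be stacked on the pooled output. Below is a minimal sketch assuming a binary task; the dropout rate, learning rate and loss are illustrative choices, not recommendations from the paper.

# Illustrative classification head on top of the TF Hub layer (assumes a binary task)
output = tf.keras.layers.Dropout(0.1)(pooled_output)
output = tf.keras.layers.Dense(1, activation='sigmoid', name='classifier')(output)
model = tf.keras.Model(
    inputs=[input_word_ids, input_mask, input_type_ids],
    outputs=output)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss='binary_crossentropy',
    metrics=['accuracy'])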

Finetune CT-BERT using our scripts

The script run_finetune.py can be used for training a classifier. This code depends on the official tensorflow/models implementation of BERT under TensorFlow 2.2/Keras.

In order to use our code you need to set up:

  • A Google Cloud bucket
  • A Google Cloud VM running TensorFlow 2.2
  • A TPU in the same zone as the VM, also running TensorFlow 2.2

If you are a researcher you may apply for access to TPUs and/or Google Cloud credits.

Install

Clone the repository recursively

git clone https://github.com/digitalepidemiologylab/covid-twitter-bert.git --recursive && cd covid-twitter-bert

Our code was developed using tf-nightly, but we made it backward compatible to run with TensorFlow 2.2. We recommend using Anaconda to manage the Python version:

conda create -n covid-twitter-bert python=3.8
conda activate covid-twitter-bert

Install dependencies

pip install -r requirements.txt

Prepare the data

Split your data into a training set train.tsv and a validation set dev.tsv with the following format:

id      label   text
1224380447930683394     label_a       Example text 1
1224380447930683394     label_a       Example text 2
1220843980633661443     label_b       Example text 3

Place these files into the folder data/finetune/originals/<dataset_name>/(train|dev).tsv (using your own dataset_name).
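
If your labelled data lives in a single file, a short script along these lines can produce the two TSVs. The input file name, dataset name and split ratio below are assumptions, not part of the repository.

# Illustrative split of a single labelled file into train.tsv/dev.tsv
# (file name, columns and split ratio are assumptions)
import os
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('labelled_tweets.csv')          # expects columns: id, label, text
train, dev = train_test_split(df, test_size=0.2, stratify=df['label'], random_state=42)

out_dir = 'data/finetune/originals/my_dataset'   # <dataset_name> = my_dataset
os.makedirs(out_dir, exist_ok=True)
train[['id', 'label', 'text']].to_csv(os.path.join(out_dir, 'train.tsv'), sep='\t', index=False)
dev[['id', 'label', 'text']].to_csv(os.path.join(out_dir, 'dev.tsv'), sep='\t', index=False)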

You can then run

cd preprocess
python create_finetune_data.py \
  --run_prefix test_run \
  --finetune_datasets <dataset_name> \
  --model_class bert_large_uncased_wwm \
  --max_seq_length 96 \
  --asciify_emojis \
  --username_filler twitteruser \
  --url_filler twitterurl \
  --replace_multiple_usernames \
  --replace_multiple_urls \
  --remove_unicode_symbols

This will generate TF record files in data/finetune/run_2020-05-19_14-14-53_517063_test_run/<dataset_name>/tfrecords.

You can now upload the data to your bucket:

cd data
gsutil -m rsync -r finetune/ gs://<bucket_name>/covid-bert/finetune/finetune_data/

Start finetuning

You can now finetune CT-BERT on this data using the following command:

RUN_PREFIX=testrun                                  # Name your run
BUCKET_NAME=                                        # Fill in your bucket's name here (without the gs:// prefix)
TPU_IP=XX.XX.XXX.X                                  # Fill in your TPU's IP here
FINETUNE_DATASET=<dataset_name>                     # Your dataset name
FINETUNE_DATA=<dataset_run>                         # Fill in the dataset run name (e.g. run_2020-05-19_14-14-53_517063_test_run)
MODEL_CLASS=covid-twitter-bert
TRAIN_BATCH_SIZE=32
EVAL_BATCH_SIZE=8
LR=2e-5
NUM_EPOCHS=1

python run_finetune.py \
  --run_prefix $RUN_PREFIX \
  --bucket_name $BUCKET_NAME \
  --tpu_ip $TPU_IP \
  --model_class $MODEL_CLASS \
  --finetune_data ${FINETUNE_DATA}/${FINETUNE_DATASET} \
  --train_batch_size $TRAIN_BATCH_SIZE \
  --eval_batch_size $EVAL_BATCH_SIZE \
  --num_epochs $NUM_EPOCHS \
  --learning_rate $LR

Training logs, run configs, etc. are then stored to gs://<bucket_name>/covid-bert/finetune/runs/run_2020-04-29_21-20-52_656110_<run_prefix>/. Among the TensorFlow logs you will find a file called run_logs.json containing all relevant training information:

{
    "created_at": "2020-04-29 20:58:23",
    "run_name": "run_2020-04-29_20-51-10_405727_test_run",
    "final_loss": 0.19747886061668396,
    "max_seq_length": 96,
    "num_train_steps": 210,
    "eval_steps": 103,
    "steps_per_epoch": 42,
    "training_time_min": 6.77958079179128,
    "f1_macro": 0.7216383309465823,
    "scores_by_label": {
      ...
    },
    ...
}
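
For quick inspection, the same file can be loaded locally after syncing the bucket (see the sync command below). The local path in this sketch is illustrative, and the two fields are among those shown above.

# Illustrative: read a synced run_logs.json and print the headline metrics
import json

path = 'run_logs.json'  # local path after syncing the bucket
with open(path) as f:
    run_logs = json.load(f)
print(run_logs['f1_macro'], run_logs['final_loss'])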

Run the script sync_bucket_data.py from your local computer to download all the training logs to data/<bucket_name>/covid-bert/finetune/<run_names>:

python sync_bucket_data.py --bucket_name <bucket_name>

Datasets

In our preliminary study we evaluated our model on five different classification datasets:

Dataset name Num classes Reference
COVID Category (CC) 2 Read more
Vaccine Sentiment (VS) 3 See ➡️
Maternal vaccine Sentiment (MVS) 4 [not yet public]
Stanford Sentiment Treebank 2 (SST-2) 2 See ➡️
Twitter Sentiment SemEval (SE) 3 See ➡️

If you end up using these datasets, please make sure to properly cite them.

Pretrain

Documentation of how we created CT-BERT can be found here.

How do I cite COVID-Twitter-BERT?

You can cite our preprint:

@article{muller2020covid,
  title={COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter},
  author={M{\"u}ller, Martin and Salath{\'e}, Marcel and Kummervold, Per E},
  journal={arXiv preprint arXiv:2005.07503},
  year={2020}
}

or

Martin Müller, Marcel Salathé, and Per E. Kummervold. 
COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter.
arXiv preprint arXiv:2005.07503 (2020).

Acknowledgement

  • Thanks to Aksel Kummervold for creating the COVID-Twitter-Bert logo.
  • The model has been trained using resources made available by the TPU Research Cloud (TRC) and Google Cloud COVID-19 research credits.
  • The model was trained as a collaboration between Martin Müller, Marcel Salathé and Per Egil Kummervold.
  • PK received funding from the European Commission for the call H2020-MSCA-IF-2017 and the funding scheme MSCA-IF-EF-ST for the VACMA project (grant agreement ID: 797876).
  • MM and MS received funding through the Versatile Emerging infectious disease Observatory grant as a part of the European Commission's Horizon 2020 framework programme (grant agreement ID: 874735).
  • The research was supported with Cloud TPUs from Google's TPU Research Cloud and Google Cloud credits in the context of COVID-19-related research.

Authors

Martin Müller, Marcel Salathé and Per Egil Kummervold

Contributors

arnaudmiribel, francoispichard, mar-muel, marcelsalathe, peregilk

covid-twitter-bert's Issues

No proper encodings for covid-related terms

I have just checked the encodings that the AutoTokenizer produces. It seems that for the words "wuhan", "ncov", "coronavirus", "covid", or "sars-cov-2" it produces more than one token, while it produces a single token for 'conventional' words like "apple".
E.g.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert-v2", do_lower_case=True)
tokenizer(['wuhan', 'covid', 'coronavirus', 'sars-cov-2', 'apple', 'city'], truncation=True, padding=True, max_length=512)

Result:

{'input_ids': [[101, 8814, 4819, 102, 0, 0, 0, 0, 0], [101, 2522, 17258, 102, 0, 0, 0, 0, 0], [101, 21887, 23350, 102, 0, 0, 0, 0, 0], [101, 18906, 2015, 1011, 2522, 2615, 1011, 1016, 102], [101, 6207, 102, 0, 0, 0, 0, 0, 0], [101, 2103, 102, 0, 0, 0, 0, 0, 0]]}. 

As you can see, there are two encoded values for 'wuhan', 'covid' and 'coronavirus' ([8814, 4819], [2522, 17258] and [21887, 23350] respectively), while there is only one id for 'apple' and 'city' (as it should be: [6207] and [2103]).

I have also checked tokenizer dictionary (vocab.txt) from https://huggingface.co/digitalepidemiologylab/covid-twitter-bert-v2/tree/main
and there are no such terms as "wuhan", "ncov", "coronavirus", "covid", or "sars-cov-2" (as mentioned in the readme - https://huggingface.co/digitalepidemiologylab/covid-twitter-bert-v2).

I wonder why the model does not recognize covid-related terms, and how I can make the model 'understand' these terms. It seems that the poor performance of the model in my specific case (web texts that mention covid only once) may be related to this issue.

Failed to download dataset

Hey there,
I am trying to download the SST-2 dataset, but it shows 403 error.
The error message is below:
{
"error": {
"code": 403,
"message": "Permission denied. Could not perform this operation"
}
}

Masked word prediction returning "unused" tokens

Hi, I am trying to run basic masked word prediction in pytorch transformers to compare BERT-large-uncased-WWM and COVID-Twitter-BERT for a publication.

from transformers import pipeline, AutoModel, AutoTokenizer
pipe = pipeline(task='fill-mask', framework='pt',
                 model="bert-large-uncased-whole-word-masking",
                 device=0)
pipe(f"This is the best thing I've {pipe.tokenizer.mask_token} in my life.")

returns meaningful results as:

[{'sequence': "[CLS] this is the best thing i've done in my life. [SEP]",
  'score': 0.45353376865386963,  'token': 2589},
 {'sequence': "[CLS] this is the best thing i've experienced in my life. [SEP]",
  'score': 0.2302728146314621,  'token': 5281},
 {'sequence': "[CLS] this is the best thing i've seen in my life. [SEP]",
  'score': 0.0811614915728569,  'token': 2464},
 {'sequence': "[CLS] this is the best thing i've felt in my life. [SEP]",
  'score': 0.06349574029445648,  'token': 2371},
 {'sequence': "[CLS] this is the best thing i've had in my life. [SEP]",
  'score': 0.058649078011512756,  'token': 2018}]

However

model = AutoModel.from_pretrained(pretrained_model_name_or_path='digitalepidemiologylab/covid-twitter-bert', from_tf=True)
tokenizer = AutoTokenizer.from_pretrained('digitalepidemiologylab/covid-twitter-bert', do_lower_case=True)
pipe = pipeline(task='fill-mask', framework='pt',
                 model=model,
                 tokenizer=tokenizer,
                 device=0)
pipe(f"This is the best thing I've {pipe.tokenizer.mask_token} in my life.")

returns

[{'sequence': "[CLS] this is the best thing i've [unused751] in my life. [SEP]",
  'score': 0.012442857027053833,  'token': 756},
 {'sequence': "[CLS] this is the best thing i've [unused803] in my life. [SEP]",
  'score': 0.00925508514046669,  'token': 808},
 {'sequence': "[CLS] this is the best thing i've [unused465] in my life. [SEP]",
  'score': 0.009094784036278725,  'token': 470},
 {'sequence': "[CLS] this is the best thing i've [unused490] in my life. [SEP]",
  'score': 0.008908418007194996,  'token': 495},
 {'sequence': "[CLS] this is the best thing i've [unused91] in my life. [SEP]",
  'score': 0.008854263462126255,  'token': 92}]

Would really appreciate help here.

License?

Hi all,

Thanks for sharing a great paper and accompanying code! I would love to re-use some of your text processing functions elsewhere (probably as part of a PR to an open source library), however you haven't defined a license file for the repo. Without it:

all rights are reserved and it is not Open Source or Free. You cannot modify or redistribute this code without explicit permission from the copyright holder.

Would you consider a more open source friendly license? :)

CUDA out of memory Error from loading CT-Bert directly from Huggingface

I was trying to fine-tune CT-BERT on my own data and have created a very simple classifier on top of CT-BERT. I used

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert")
model = AutoModel.from_pretrained("digitalepidemiologylab/covid-twitter-bert")

as suggested.
However, when training the model on a cluster GPU with 10 GB of memory I am getting this error:

RuntimeError: CUDA out of memory. Tried to allocate 22.00 MiB (GPU 0; 10.73 GiB total capacity; 9.54 GiB already allocated; 19.56 MiB free; 9.86 GiB reserved in total by PyTorch)

I tried the same classifier with the bert-base-uncased model and did not encounter this problem.
So I was wondering whether the model is too large for my GPU, or whether I should do something extra?

Failed to convert the TensorFlow model to a PyTorch model

Hi,
Thank you for the wonderful model and repository in this time of need.

I tried to convert the covid-twitter-bert TensorFlow model, downloaded from the TF2 Checkpoint link, to a PyTorch model using the Hugging Face TensorFlow checkpoint conversion (with transformers version 3.0.0). But I could not succeed due to the following error:

INFO:transformers.modeling_bert:Skipping _CHECKPOINTABLE_OBJECT_GRAPH
Traceback (most recent call last):
  File "/usr/local/bin/transformers-cli", line 32, in <module>
    service.run()
  File "/usr/local/lib/python3.6/dist-packages/transformers/commands/convert.py", line 78, in run
    convert_tf_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)
  File "/usr/local/lib/python3.6/dist-packages/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py", line 36, in convert_tf_checkpoint_to_pytorch
    load_tf_weights_in_bert(model, config, tf_checkpoint_path)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 124, in load_tf_weights_in_bert
    assert pointer.shape == array.shape
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 594, in __getattr__
    type(self).__name__, name))
AttributeError: 'BertForPreTraining' object has no attribute 'shape'

The same conversion method was used with the BERT-base (cased_L-12_H-768_A-12) TensorFlow model and it converted successfully. Therefore, I assume this error is specific to the covid-twitter-bert model.
Comparing with other BERT TensorFlow models, I found that the covid-twitter-bert TensorFlow model is missing a .meta file and I suspect that could be a possible reason for this error.

If this error occurs due to the missing .meta file, can you please provide this file? Otherwise, do you have any solution for this issue, or can you make a PyTorch model available?

Logic behind [UNK] token for pronouns

Thanks a lot for the nice work!

What would be the logic behind masking pronouns with the unknown token [UNK]? This seems to be a major deviation from standard BERT models.

For example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert", do_lower_case=False)
tokenizer.tokenize('She is cool!')

outputs
['[UNK]', 'is', 'cool', '!']

while

tokenizer2 = AutoTokenizer.from_pretrained("bert-large-cased", do_lower_case=False)
tokenizer2.tokenize('She is cool!')

outputs
['She', 'is', 'cool', '!']

A few observations.

I had a run-through with this model, classifying comments on a petition that had a lot of traction with antivaxers. The results were pretty mixed.

Some observations.
The model really needs an initial classifier to be trained and run first, asking the more basic question "Is this comment about covid at all?". I found it gave negative or positive classifications to comments that really didn't apply (i.e. one stating something to the effect of "I haven't seen my family in a year because of border restrictions"), which I, or the algorithm, really have no way of evaluating for truthfulness.

This is to be expected, as the model is for classifying covid statements and thus has no real frame of reference for dealing with statements that aren't actually about covid per se.

What I DID notice is that, for those misclassified statements, it seems to be honing in on the hostility of the statements. As the petition is largely about dropping Australian border restrictions and vaccine mandates, it's to be expected that a large number of signers are antivax activists, who have a tendency to be somewhat aggressive. That has me wondering whether the model is actually responding to the tone of 'voice' in the comments, producing strong "false" or "misleading" signals if the input text is aggressive in nature.

Edit: For further context, the petition was one claiming to be from "West Australian doctors". A quick plugging of names into the register of medical practitioners revealed that the majority of signers are not actually medical practitioners (and worse, there's some evidence that, of those who are, at least some might have had their names entered onto the petition without consent, with data coming from the registry of practitioners), or are practitioners from non-medical fields like chiropractors and other pseudoscience professions, so I'm not sure how that impacts the result. Perhaps the algorithm is picking up untruthfulness signals that I'm missing myself.

Failed to find data adapter

Hi, I've been trying to use CT-BERT on my local machine and am running into this issue when using my own dataset. I'm using a modified version of the GPU notebook provided here. The issue is:
[screenshot of the error]

And the code I'm using to load the dataset is as follows:
[screenshot of the data-loading code]

I don't understand why the issue is cropping up since everything looks alright.
My system specs are:
OS: Windows 10
GPU: RTX 3080
CUDA: 11.3
CudNN: 8.2
Python: 3.9
TF: 2.5
Any help will be greatly appreciated since I've spent most of my Christmas break on this :D

TypeError

Trying to run the Colab given in the readme file, I am getting this error:
[screenshot of the TypeError]
Any idea how to resolve it?

Strategies for downstream NER tasks with more label

Hi,

It is a very good paper for exploring Twitter-related COVID info. I am wondering whether there are any strategies for downstream NER tasks that contain more labels than those mentioned in the paper. For instance, we would like to explore NER labels such as 'treatment', 'problem', etc.

Thanks!

Loss

Hello,

Thank you for releasing a fine-tuned BERT model.
Could I ask what the loss is capturing, i.e., whether the model is trained on masked language modelling (MLM), next sentence prediction (NSP) or sentence order prediction (SOP)?
Thank you!

What are the tokens for usernames and hashtags in the BERT vocabulary?

In the paper, it is mentioned "Each tweet was pseudonymized by replacing all Twitter usernames with a common text token. A similar procedure was performed on all URLs to web pages. We also replaced all unicode emoticons with textual ASCII representations (e.g. 😄 for ☺️ ) using the Python emoji library"

But it is not exactly clear which tokens are used in place of usernames and URLs. Are they documented anywhere?
