
ernie's Introduction



BERT's best friend.



Sponsored by Sticker Mule.

Installation

Ernie requires Python 3.6 or higher.

pip install ernie


Fine-Tuning

Sentence Classification

from ernie import SentenceClassifier, Models
import pandas as pd

tuples = [
    ("This is a positive example. I'm very happy today.", 1),
    ("This is a negative sentence. Everything was wrong today at work.", 0)
]
df = pd.DataFrame(tuples)

classifier = SentenceClassifier(
    model_name=Models.BertBaseUncased,
    max_length=64,
    labels_no=2
)
classifier.load_dataset(df, validation_split=0.2)
classifier.fine_tune(
    epochs=4,
    learning_rate=2e-5,
    training_batch_size=32,
    validation_batch_size=64
)

Prediction

Predict a single text

text = "Oh, that's great!"

# It returns a tuple with the prediction
probabilities = classifier.predict_one(text)

Predict multiple texts

texts = ["Oh, that's great!", "That's really bad"]

# It returns a generator of tuples with the predictions
probabilities = classifier.predict(texts)
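Since predict returns a generator, you will usually want to consume it and pick the most probable class per text. A minimal sketch, assuming each prediction is a tuple with one probability per label, in label order:

import numpy as np

texts = ["Oh, that's great!", "That's really bad"]

# Each item yielded by predict is assumed to be (p_label_0, p_label_1, ...)
for text, probabilities in zip(texts, classifier.predict(texts)):
    predicted_label = int(np.argmax(probabilities))
    print(f"{text!r} -> label {predicted_label} ({probabilities})")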

Prediction Strategies

If the texts are longer in tokens than the max_length with which the model was fine-tuned, they will be truncated. To avoid losing information, you can use a split strategy and aggregate the predictions in different ways.

Split Strategies

  • SentencesWithoutUrls. The text will be split into sentences.
  • GroupedSentencesWithoutUrls. The text will be split into groups of sentences whose length in tokens is similar to max_length.

Aggregation Strategies

  • Mean: the prediction of the text will be the mean of the predictions of the splits.
  • MeanTopFiveBinaryClassification: the mean is computed over the 5 highest predictions only.
  • MeanTopTenBinaryClassification: the mean is computed over the 10 highest predictions only.
  • MeanTopFifteenBinaryClassification: the mean is computed over the 15 highest predictions only.
  • MeanTopTwentyBinaryClassification: the mean is computed over the 20 highest predictions only.

from ernie import SplitStrategies, AggregationStrategies

texts = ["Oh, that's great!", "That's really bad"]
probabilities = classifier.predict(
    texts,
    split_strategy=SplitStrategies.GroupedSentencesWithoutUrls,
    aggregation_strategy=AggregationStrategies.Mean
) 

You can define your own custom strategies through the AggregationStrategy and SplitStrategy classes.

from ernie import SplitStrategy, AggregationStrategy

my_split_strategy = SplitStrategy(
    split_patterns=...,           # list
    remove_patterns=...,          # list
    remove_too_short_groups=...,  # bool
    group_splits=...              # bool
)
my_aggregation_strategy = AggregationStrategy(
    method=...,                   # function
    max_items=...,                # int
    top_items=...,                # bool
    sorting_class_index=...       # int
)
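As a rough illustration (the argument values below are assumptions, not library defaults), a custom aggregation strategy analogous to the built-in MeanTopFiveBinaryClassification but averaging only the 3 most confident splits, plus a custom split strategy that splits on newlines and strips URLs, could look like this:

import numpy as np
from ernie import AggregationStrategy, SplitStrategy

# Hypothetical "mean of the top 3 splits" aggregation (field meanings follow
# the signature shown above)
mean_top_three = AggregationStrategy(
    method=np.mean,           # how the kept predictions are combined
    max_items=3,              # keep at most 3 splits
    top_items=True,           # keep the highest-ranked splits
    sorting_class_index=1     # rank splits by the probability of class 1
)

# Hypothetical split strategy: split on newlines and remove URLs
newline_splits = SplitStrategy(
    split_patterns=['\n'],
    remove_patterns=[r'https?://\S+'],
    remove_too_short_groups=False,
    group_splits=False
)

probabilities = classifier.predict(
    texts,
    split_strategy=newline_splits,
    aggregation_strategy=mean_top_three
)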

Save and restore a fine-tuned model

Save model

classifier.dump('./model')

Load model

classifier = SentenceClassifier(model_path='./model')

Interrupted Training

Since the execution may break during training (especially if you are using Google Colab), you can opt to save every newly trained epoch, so the training can be resumed without losing all the progress.

classifier = SentenceClassifier(
    model_name=Models.BertBaseUncased,
    max_length=64
)
classifier.load_dataset(df, validation_split=0.2)

for epoch in range(1, 5):
    if epoch == 3:
        raise Exception("Forced crash")

    classifier.fine_tune(epochs=1)
    classifier.dump(f'./my-model/{epoch}')

# After the crash, restore from the last successfully dumped epoch and resume
last_training_epoch = 2

classifier = SentenceClassifier(model_path=f'./my-model/{last_training_epoch}')
classifier.load_dataset(df, validation_split=0.2)

for epoch in range(last_training_epoch + 1, 5):
    classifier.fine_tune(epochs=1)
    classifier.dump(f'./my-model/{epoch}')

Autosave

Even if you do not explicitly dump the model, it will be autosaved into ./ernie-autosave every time fine_tune is successfully executed.

ernie-autosave/
└── model_family/
    └── timestamp/
        ├── config.json
        ├── special_tokens_map.json
        ├── tf_model.h5
        ├── tokenizer_config.json
        └── vocab.txt

You can easily clean the autosaved models by invoking clean_autosave after finishing a session or when starting a new one.

from ernie import clean_autosave
clean_autosave()

Supported Models

You can access some of the official base model names through the Models class. However, you can also pass any Hugging Face model name directly, such as bert-base-uncased or bert-base-chinese, when instantiating a SentenceClassifier.

See all the available models at huggingface.co/models.
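For example, assuming the model is available on the Hugging Face hub, a Chinese BERT can be used by passing its name directly (a minimal sketch):

from ernie import SentenceClassifier

# bert-base-chinese is a model hosted on the Hugging Face hub; any other
# compatible model name should work the same way
classifier = SentenceClassifier(
    model_name='bert-base-chinese',
    max_length=64,
    labels_no=2
)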

Additional Info

Accessing the model and tokenizer

You can directly access both the model and tokenizer objects once the classifier has been instantiated:

classifier.model
classifier.tokenizer
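Since the tokenizer is a regular Hugging Face tokenizer and the model is a regular Keras model, you can use them directly. A small sketch (the exact tokens depend on the chosen model):

text = "Oh, that's great!"

# Inspect the subword tokens and ids produced by the underlying tokenizer
print(classifier.tokenizer.tokenize(text))
print(classifier.tokenizer.encode(text))

# Inspect the underlying Keras model
classifier.model.summary()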

Keras model.fit arguments

You can pass arguments of the Keras model.fit method to the classifier.fine_tune method. For example:

classifier.fine_tune(class_weight={0: 0.2, 1: 0.8})

ernie's People

Contributors

brunneis, marcosfp97


ernie's Issues

Pandas requirement

Hi,

I'm trying to install ernie and it seems to require an inconsistent version of pandas.

https://pasteboard.co/IXXIry1.png

Can you advise on what to do?

It runs on Colab just fine.

It also throws errors saying it cannot find TensorFlow, even though TensorFlow is installed.

Saved_model issue

Hi, after saving the model to a folder, I load it with these lines:

from ernie import SentenceClassifier, Models
classifier=SentenceClassifier('../input/model-predictions/ernie-autosave/bert/1592945713203/')

Every time I load the saved model this way, the predictions seem biased: they are always 1, even for class-0 examples.

Could you look into this issue?

KeyError: 0 in classifier.load_dataset

Hi

  1. How should the training data be defined when loading a custom dataset from a DataFrame?
  2. How should the parameters be defined for multiclass classification?

classifier = SentenceClassifier(model_name=Models.BertBaseUncased, max_length=128, labels_no=2)
classifier.load_dataset(train, validation_split=0.2)
classifier.fine_tune(epochs=4, learning_rate=2e-5, training_batch_size=32, validation_batch_size=64)

error

KeyError                                  Traceback (most recent call last)
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2645             try:
-> 2646                 return self._engine.get_loc(key)
   2647             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-6-17537c46bdcd> in <module>
      1 classifier = SentenceClassifier(model_name=Models.BertBaseUncased, max_length=128, labels_no=2)
----> 2 classifier.load_dataset(train ,validation_split=0.2)
      3 classifier.fine_tune(epochs=4, learning_rate=2e-5, training_batch_size=32, validation_batch_size=64)

/opt/anaconda3/lib/python3.7/site-packages/ernie/ernie.py in load_dataset(self, dataframe, csv_path, validation_split)
    251             raise NotImplementedError
    252 
--> 253         sentences = list(dataframe[0])
    254         labels = dataframe[1].values
    255 

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2798             if self.columns.nlevels > 1:
   2799                 return self._getitem_multilevel(key)
-> 2800             indexer = self.columns.get_loc(key)
   2801             if is_integer(indexer):
   2802                 indexer = [indexer]

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0
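Judging from the traceback, load_dataset reads the sentences from column 0 and the labels from column 1, so a DataFrame with named columns raises KeyError: 0. A minimal workaround, assuming hypothetical text and label columns, is to rebuild the frame positionally:

import pandas as pd

# Hypothetical original frame with named columns
train = pd.DataFrame({
    'text': ["I'm very happy today.", "Everything was wrong today at work."],
    'label': [1, 0]
})

# Rebuild it so the columns are labelled 0 and 1, as load_dataset expects
train = pd.DataFrame(list(zip(train['text'], train['label'])))
classifier.load_dataset(train, validation_split=0.2)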


Kfolds?

How would you fine-tune if you want to use k-fold cross-validation?
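One possible approach (a rough sketch, not an official recipe) is to run an outer k-fold loop with scikit-learn and fine-tune a fresh classifier on each training fold:

from sklearn.model_selection import KFold
from ernie import SentenceClassifier, Models

# df is assumed to be a two-column DataFrame: texts in column 0, labels in column 1
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kfold.split(df)):
    train_df = df.iloc[train_idx].reset_index(drop=True)
    test_df = df.iloc[test_idx].reset_index(drop=True)

    # A fresh model per fold, so folds do not leak into each other
    classifier = SentenceClassifier(
        model_name=Models.BertBaseUncased,
        max_length=64,
        labels_no=2
    )
    classifier.load_dataset(train_df, validation_split=0.1)
    classifier.fine_tune(epochs=4, learning_rate=2e-5, training_batch_size=32)

    # Evaluate on the held-out fold
    fold_predictions = list(classifier.predict(test_df[0]))
    print(f"Fold {fold}: {len(fold_predictions)} predictions")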

using Keras Lr Finder with Ernie

Thanks for this great package.
How could Keras LR Finder be incorporated to find the optimal learning rate?

from keras_lr_finder import LRFinder
classifier = SentenceClassifier(model_name=Models.BertBaseUncased, max_length=256, labels_no=2)
classifier.load_dataset(train1, validation_split=0.1)
lr_finder = LRFinder(classifier)
lr_finder.find(classifier, 0.0001, 1, 5, 1)

This gives an error. Could you please suggest an alternative way to use this, or any other way to find the optimal learning rate?

Language model fine-tuning

Nice work!

I often start out with much more unlabelled than labelled data. Is it possible to start with masked language model fine-tuning (without the classification head) on the full dataset before adding the classifier?

If not, would a second-best approach be to do it iteratively, i.e. train on the small amount of labelled data, predict for the unlabelled data, fine-tune on the labels and predictions, and then re-train on the labelled data only?
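The second, iterative approach can at least be sketched with the current API (an assumed workflow; labelled_df and unlabelled_texts are hypothetical variables):

import numpy as np
import pandas as pd

# 1. Fine-tune on the small labelled set
classifier.load_dataset(labelled_df, validation_split=0.1)
classifier.fine_tune(epochs=4, learning_rate=2e-5)

# 2. Pseudo-label the unlabelled texts with the current model
pseudo_labels = [int(np.argmax(p)) for p in classifier.predict(unlabelled_texts)]
pseudo_df = pd.DataFrame(list(zip(unlabelled_texts, pseudo_labels)))

# 3. Fine-tune on labelled + pseudo-labelled data, then optionally
#    fine-tune again on the labelled data only
combined_df = pd.concat([labelled_df, pseudo_df], ignore_index=True)
classifier.load_dataset(combined_df, validation_split=0.1)
classifier.fine_tune(epochs=1, learning_rate=2e-5)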

AutoModel cannot be imported

When I try to import ernie, I get the following error:

from transformers import (
ImportError: cannot import name 'AutoModel' from 'transformers' (/home/janpaulus/miniconda3/envs/ernie/lib/python3.7/site-packages/transformers/__init__.py)

There seems to be a problem with the AutoModel import, which isn't essential for ernie.py (at least for my usage).

My workaround for the ernie.py file is the "removal" of the AutoModel import:

import tensorflow as tf
import numpy as np
from transformers import (
    AutoTokenizer,
    #AutoModel,
    TFAutoModelForSequenceClassification,
)

Maybe the transformers version 2.4.1 isn't the right one?

faster models

Thank you for the convenient interface for HuggingFace.

In batch mode, DistilBertBaseUncased scores 21 texts/second on a modest CPU, but I would like to score millions of texts, so I would like to compare other models such as SqueezeBERT and MobileBERT. Would you be willing to add support for some more models?

Plans to support other languages?

Hi there, first of all thank you for your great work. It looks like this project only supports English at the moment; are there any plans to support other languages such as Chinese or French?

Support for custom model

Hi,

Thanks for a nice library. Is it possible to use a custom (Hugging Face) BERT model in the sentence classification pipeline?

Not having enough data even with training_batch_size=1

Running it with DistilBERT produced the following error.

Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs

I tried

  1. setting a steps_per_epoch kwarg in classifier.fine_tune, but it was rejected because the steps_per_epoch argument got multiple values in classifier.fine_tune. I haven't looked into the code yet, but do you know how I can fix the "out of data" issue for DistilBERT?
  2. also setting training_batch_size=1, but it still didn't work. With a training batch size of 1 I am only processing one example at a time, so it should have enough data...

Originally posted by @surya-narayanan in #5 (comment)

Train using >1 df columns

In load_dataset.py, sentences and labels are hardcoded to take the first and second columns of the input DataFrame. Is there a way to use Ernie if I have more than one feature?

sentences = list(dataframe[dataframe.columns[0]])
labels = dataframe[dataframe.columns[1]].values
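One workaround (a sketch, not a built-in feature) is to concatenate the extra feature columns into the single text column before calling load_dataset:

import pandas as pd

# Hypothetical frame with two text features and a label
df = pd.DataFrame({
    'title': ['Great product', 'Arrived broken'],
    'review': ['Works exactly as advertised.', 'The screen was cracked on arrival.'],
    'label': [1, 0]
})

# Merge the feature columns into one text field and keep the label, so the
# result has the two positional columns load_dataset expects
merged = pd.DataFrame(list(zip(
    df['title'] + '. ' + df['review'],
    df['label']
)))

classifier.load_dataset(merged, validation_split=0.2)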
