
eda_nlp's Introduction

EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks


For a survey of data augmentation in NLP, see this repository/this paper.

This is the code for the EMNLP-IJCNLP paper EDA: Easy Data Augmentation techniques for boosting performance on text classification tasks.

A blog post that explains EDA is [here].

Update: find an external implementation of EDA in Chinese [here].

By Jason Wei and Kai Zou.

Note: Do not email me with questions, as I will not reply. Instead, open an issue.

We present EDA: easy data augmentation techniques for boosting performance on text classification tasks. These are a generalized set of data augmentation techniques that are easy to implement and have shown improvements on five NLP classification tasks, with substantial improvements on datasets of size N < 500. While other techniques require you to train a language model on an external dataset just to get a small boost, we found that simple text editing operations using EDA result in good performance gains. Given a sentence in the training set, we perform the following operations:

  • Synonym Replacement (SR): Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
  • Random Insertion (RI): Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
  • Random Swap (RS): Randomly choose two words in the sentence and swap their positions. Do this n times.
  • Random Deletion (RD): For each word in the sentence, randomly remove it with probability p.
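The four operations above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the stop-word set and the TOY_SYNONYMS table are hypothetical stand-ins for the real stop-word list and WordNet lookup, and the function signatures (including the rng argument) are assumptions made for the example.

```python
import random

# Hypothetical stand-ins: the repository uses WordNet and a fuller stop-word list.
STOP_WORDS = {"a", "the", "is", "of", "and", "it"}
TOY_SYNONYMS = {"film": ["movie", "picture"], "large": ["big", "huge"]}

def synonym_replacement(words, n, rng):
    """SR: replace up to n non-stop words with a randomly chosen synonym."""
    new_words = list(words)
    candidates = [i for i, w in enumerate(words)
                  if w not in STOP_WORDS and w in TOY_SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        new_words[i] = rng.choice(TOY_SYNONYMS[words[i]])
    return new_words

def random_insertion(words, n, rng):
    """RI: n times, insert a synonym of a random non-stop word at a random position."""
    new_words = list(words)
    for _ in range(n):
        candidates = [w for w in new_words
                      if w not in STOP_WORDS and w in TOY_SYNONYMS]
        if not candidates:
            break  # nothing in the sentence has a known synonym
        synonym = rng.choice(TOY_SYNONYMS[rng.choice(candidates)])
        new_words.insert(rng.randint(0, len(new_words)), synonym)
    return new_words

def random_swap(words, n, rng):
    """RS: n times, swap two distinct random positions."""
    new_words = list(words)
    if len(new_words) < 2:
        return new_words  # a swap needs at least two words
    for _ in range(n):
        i, j = rng.sample(range(len(new_words)), 2)
        new_words[i], new_words[j] = new_words[j], new_words[i]
    return new_words

def random_deletion(words, p, rng):
    """RD: drop each word with probability p, but never return an empty sentence."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]
```

Passing a seeded random.Random instance as rng keeps the output reproducible, which helps when comparing augmentation settings.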


Average performance on 5 datasets with and without EDA, with respect to percent of training data used.

Usage

You can run EDA on any text classification dataset in less than 5 minutes. Just two steps:

Install NLTK (if you don't have it already):

Pip install it.

pip install -U nltk

Download WordNet.

python
>>> import nltk; nltk.download('wordnet')

Run EDA

You can easily write your own implementation, but this one takes input files in the format label\tsentence (note the \t). For instance, your input file should look like this (example from the Stanford Sentiment Treebank):

1   neil burger here succeeded in making the mystery of four decades back the springboard for a more immediate mystery in the present 
0   it is a visual rorschach test and i must have failed 
0   the only way to tolerate this insipid brutally clueless film might be with a large dose of painkillers
...
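If you want to check your own file against this format before running the script, a small reader along these lines may help. It is only a sketch: the function name read_labeled_lines is hypothetical, and it assumes exactly one tab separates the label from the sentence.

```python
def read_labeled_lines(lines):
    """Yield (label, sentence) pairs from tab-separated 'label<TAB>sentence' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the sentence are preserved.
        # A line without any tab raises ValueError here, which flags a bad row.
        label, sentence = line.split("\t", 1)
        yield label, sentence.strip()

sample = [
    "1\tneil burger here succeeded\n",
    "0\tit is a visual rorschach test\n",
]
pairs = list(read_labeled_lines(sample))
```

The label is kept as a string, so the same reader works for datasets with more than two classes.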

Now place this input file into the data folder. Run

python code/augment.py --input=<insert input filename>

The default output filename prepends eda_ to the input filename, but you can specify your own with --output. You can also specify the number of augmented sentences generated per original sentence using --num_aug (default is 9). Furthermore, you can set a separate alpha parameter for each operation; each alpha is approximately the fraction of words in the sentence that the corresponding operation changes (default is 0.1, or 10%). For example, if your input file is sst2_train.txt and you want to output to sst2_augmented.txt with 16 augmented sentences per original sentence, replacing 5% of words with synonyms (alpha_sr=0.05), deleting 10% of words (alpha_rd=0.1, or leave as the default), and applying neither random insertion (alpha_ri=0.0) nor random swap (alpha_rs=0.0), you would run:

python code/augment.py --input=sst2_train.txt --output=sst2_augmented.txt --num_aug=16 --alpha_sr=0.05 --alpha_rd=0.1 --alpha_ri=0.0 --alpha_rs=0.0

Note that at least one augmentation operation is applied per augmented sentence regardless of alpha (if greater than zero). So if you do alpha_sr=0.001 and your sentence only has four words, one augmentation operation will still be performed. Of course, if one particular alpha is zero, nothing will be done. Best of luck!
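The relationship between alpha and the number of edits can be written as a one-line rule. The sketch below follows the max(1, int(alpha * num_words)) pattern that appears in eda.py; the helper name num_operations and the explicit zero check are assumptions added for illustration.

```python
def num_operations(alpha, num_words):
    """Number of edits an operation makes on a sentence of num_words words."""
    if alpha <= 0:
        return 0  # an alpha of zero disables the operation entirely
    # Any positive alpha yields at least one edit, even on very short sentences.
    return max(1, int(alpha * num_words))
```

For example, num_operations(0.001, 4) is 1, matching the four-word case described above, while num_operations(0.0, 4) is 0.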

Citation

If you use EDA in your paper, please cite us:

@inproceedings{wei-zou-2019-eda,
    title = "{EDA}: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks",
    author = "Wei, Jason  and
      Zou, Kai",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1670",
    pages = "6383--6389",
}

Experiments

The code is not documented, but is here for all experiments used in the paper. See this issue for limited guidance.

eda_nlp's People

Contributors

brunobastosg, jasonwei20


eda_nlp's Issues

License of this repo

Hi, thank you for the great repo. Could you add a license to it, like MIT or Apache 2.0? Thank you very much.

About the test dataset

I am curious about your data processing. Do you split the dataset into train and test sets and then augment only the training set, or do you augment the whole dataset first and then split it?
The only difference between these two processes is whether the test set includes augmented data.
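One common way to keep augmented sentences out of the evaluation data is to split first and augment only the training portion. The sketch below illustrates that ordering; it is not a statement of what the authors did, and split_then_augment and its augment callback are hypothetical names.

```python
import random

def split_then_augment(examples, test_fraction, augment, seed=0):
    """Split (label, sentence) pairs first, then augment only the training part."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    augmented_train = []
    for label, sentence in train:
        augmented_train.append((label, sentence))  # keep the original sentence
        augmented_train.extend((label, s) for s in augment(sentence))
    return augmented_train, test  # the test set is never augmented
```

Here augment is any function that maps a sentence to a list of augmented sentences, so the EDA function can be plugged in directly.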

How to change alpha?

Hi again,

It works like a charm!

Just a quick question: how do you change alpha at runtime (that is, as an argument of the command)? As far as I can see from augment.py:

import argparse
ap = argparse.ArgumentParser()
ap.add_argument("--input", required=True, type=str, help="input file of unaugmented data")
ap.add_argument("--output", required=False, type=str, help="output file of unaugmented data")
ap.add_argument("--num_aug", required=False, type=int, help="number of augmented sentences per original sentence")
args = ap.parse_args()

This does not seem to be possible.

Cheers,
M
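For reference, the README usage now shows per-operation flags (--alpha_sr, --alpha_ri, --alpha_rs, --alpha_rd). A parser exposing them could look roughly like the sketch below; the defaults follow the README (num_aug 9, alpha 0.1), but the build_parser helper and the exact argument definitions are assumptions rather than a copy of augment.py.

```python
import argparse

def build_parser():
    ap = argparse.ArgumentParser()
    ap.add_argument("--input", required=True, type=str,
                    help="input file of unaugmented data")
    ap.add_argument("--output", required=False, type=str,
                    help="output file of augmented data")
    ap.add_argument("--num_aug", required=False, type=int, default=9,
                    help="number of augmented sentences per original sentence")
    # One alpha per operation, each defaulting to 0.1 (10% of words).
    for name in ("alpha_sr", "alpha_ri", "alpha_rs", "alpha_rd"):
        ap.add_argument(f"--{name}", required=False, type=float, default=0.1)
    return ap

# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(
    ["--input=sst2_train.txt", "--alpha_sr=0.05", "--alpha_ri=0.0"]
)
```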

Something about the best parameters

Hello Jason Wei, great paper and great idea. I read your paper about Easy Data Augmentation.

I'm trying to reproduce your experimental results, but I don't know how you found the best alpha and num_aug.

In Figures 3 and 4 of the paper, you plot results for different values of alpha and num_aug. How did you choose alpha when testing num_aug, and how did you choose num_aug when testing different values of alpha?

I checked the code and I find you set
"alpha_sr=0.3, alpha_ri=0.2, alpha_rs=0.1, p_rd=0.15"
in Figure 4, and
'size_data_f1/1_tiny': [16, 16, 16, 16, 16], 'size_data_f1/2_small': [16, 16, 16, 16, 16], 'size_data_f1/3_standard': [8, 8, 8, 8, 4], 'size_data_f1/4_full': [8, 8, 8, 8, 4]
in Figure 3.

Could you please explain how you set these parameters? thanks !!

empty range for randrange()

Hello! Thank you for sharing your code.

I got this error on one of my datasets, is this a known problem? I've checked the text file and there are no empty (zero-length or whitespace-only) lines.

Traceback (most recent call last):
  File "code/augment.py", line 55, in <module>
    gen_eda(args.input, output, alpha=alpha, num_aug=num_aug)
  File "code/augment.py", line 44, in gen_eda
    aug_sentences = eda(sentence, alpha_sr=alpha, alpha_ri=alpha, alpha_rs=alpha, p_rd=alpha, num_aug=num_aug)
  File "/home/user/Desktop/eda_nlp/code/eda.py", line 193, in eda
    a_words = random_insertion(words, n_ri)
  File "/home/user/Desktop/eda_nlp/code/eda.py", line 153, in random_insertion
    add_word(new_words)
  File "/home/user/Desktop/eda_nlp/code/eda.py", line 160, in add_word
    random_word = new_words[random.randint(0, len(new_words)-1)]
  File "/home/stc/miniconda3/lib/python3.7/random.py", line 222, in randint
    return self.randrange(a, b+1)
  File "/home/stc/miniconda3/lib/python3.7/random.py", line 200, in randrange
    raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
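The traceback ends in random.randint(0, len(new_words)-1), which raises this error when new_words is empty, because randint(0, -1) is an empty range. A likely trigger, offered here as an assumption rather than a confirmed diagnosis, is a line whose sentence contains no alphabetic words (only digits or punctuation, say), so nothing survives cleaning. A guard in the calling code might look like this sketch; both function names are hypothetical, and the regular expression only approximates letter-only cleaning.

```python
import re

def has_augmentable_words(sentence):
    """True if at least one alphabetic token would survive letter-only cleaning."""
    cleaned = re.sub(r"[^a-z ]", " ", sentence.lower())
    return len(cleaned.split()) > 0

def safe_augment(sentence, augment):
    """Call augment(sentence) only when there is something to augment."""
    if not has_augmentable_words(sentence):
        return [sentence]  # pass the line through unchanged
    return augment(sentence)
```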

Run augmentation on my dataset, unsure how to proceed.

What should I do in order to use my own dataset for the experiments? I placed my dataset in "data" folder, I augmented it but I don't know what to do after that. Are there any specific commands that I should use?

Thank you in advance!

Random insertion is not excluding words from the stop words list

Hi, thanks for the code repository and the paper!

I think that the idea behind Easy Data Augmentation is helpful and I am planning to port/adapt it such that it is usable for the German language as well.

Based on your paper random insertion is done in the following way:

Random Insertion (RI): Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.

However, looking at your implementation, stop words are not excluded:

def random_insertion(words, n):
	new_words = words.copy()
	for _ in range(n):
		add_word(new_words)
	return new_words

def add_word(new_words):
	synonyms = []
	counter = 0
	while len(synonyms) < 1:
		random_word = new_words[random.randint(0, len(new_words)-1)]
		synonyms = get_synonyms(random_word)
		counter += 1
		if counter >= 10:
			return
	random_synonym = synonyms[0]
	random_idx = random.randint(0, len(new_words)-1)
	new_words.insert(random_idx, random_synonym)
def eda(sentence, alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=9):
	
	sentence = get_only_chars(sentence)
	words = sentence.split(' ')
	words = [word for word in words if word is not '']
	num_words = len(words)
	
	augmented_sentences = []
	num_new_per_technique = int(num_aug/4)+1
	n_sr = max(1, int(alpha_sr*num_words))
	n_ri = max(1, int(alpha_ri*num_words))
	n_rs = max(1, int(alpha_rs*num_words))
         
       .........

	#ri
	for _ in range(num_new_per_technique):
		a_words = random_insertion(words, n_ri)
		augmented_sentences.append(' '.join(a_words))

        .........

Do you know how this affects the final results? Thanks!
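For anyone who wants the behaviour described in the paper, one possible variant of add_word filters out stop words before picking a word. This is a sketch, not the repository's code: the stop-word set is a hypothetical subset, get_synonyms is passed in as a parameter so the example is self-contained, and shuffling the candidates (instead of sampling up to ten times) is a design choice of this example.

```python
import random

STOP_WORDS = {"a", "the", "is", "it", "of", "me"}  # hypothetical subset

def add_word_skipping_stop_words(new_words, get_synonyms, rng=random):
    """Insert one synonym of a random non-stop word; return True on success."""
    candidates = [w for w in new_words if w not in STOP_WORDS]
    rng.shuffle(candidates)  # visit each candidate at most once, in random order
    for word in candidates:
        synonyms = get_synonyms(word)
        if synonyms:
            position = rng.randint(0, len(new_words))
            new_words.insert(position, rng.choice(synonyms))
            return True
    return False  # no non-stop word had a synonym; list left unchanged
```

Like the original add_word, this modifies new_words in place. Shuffling guarantees that a synonym is inserted whenever any non-stop word in the sentence has one.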

Performance on trec dataset drops significantly after using eda

model: BertForSequenceClassification
train_set size: 120 of 5452, sampled with sklearn.model_selection.StratifiedShuffleSplit so that the class proportions match the original training set:

def split(self, examples, test_size, train_size, n_splits=2, split_idx=0):
    label_map = {"ENTY": 0, "DESC": 1, 'LOC': 2, 'ABBR': 3, 'NUM': 4, 'HUM': 5}
    labels = [label_map[e.label] for e in examples]
    kf = StratifiedShuffleSplit(n_splits=n_splits, test_size=test_size, train_size=train_size, random_state=C.get()['seed'])
    kf = kf.split(list(range(len(examples))), labels)
    for _ in range(split_idx + 1):  # split_idx equal to cv_fold. this loop is used to get i-th fold
        train_idx, valid_idx = next(kf)
    train_dev_set = np.array(examples)
    return list(train_dev_set[train_idx]), list(train_dev_set[valid_idx])

valid_set size: 180
test_set size: all(500)
alpha: 0.1

augment code:

labels = labels.repeat(n_aug)
aug_texts = [eda(text, alpha_sr=alpha, alpha_ri=alpha, alpha_rs=alpha, p_rd=alpha, num_aug=n_aug - 1) for text in texts]
assert len(labels)==len(aug_texts)

I have checked that the labels and augmented texts are matched correctly.

result when n_aug=16 (the average of three experiments) :

  • before augment: 0.7975
  • after augment: 0.7657

result when n_aug=8 (the average of three experiments) :

  • before augment: 0.813
  • after augment: 0.7044

It's very confusing. In my experiments, EDA seems to perform well when texts are long (such as the IMDB dataset), but performs poorly on datasets like TREC and SST-5. Did I make any mistakes in the experimental setup?

How to use this?

Hi there,

First of all, great paper. I had thought of similar solutions for DA on text, but I'm glad someone put all of them together!

However, I can't seem to run it. First of all, the README mentions python code/1_data_process.py, but there is no such file.

After adding the a_, b_, etc. prefixes, I get the following errors:

Using TensorFlow backend.
Traceback (most recent call last):
  File "code/a_1_data_process.py", line 28, in <module>
    gen_sr_aug(train_orig, output_file, alpha, n_aug)
  File "/myworkingdirectory/eda_nlp/code/methods.py", line 173, in gen_sr_aug
    writer = open(output_file, 'w')
FileNotFoundError: [Errno 2] No such file or directory: 'size_data_f1/1_tiny/cr/train_sr_0.05.txt'

and similar errors for every other prefix.

Thanks!

Moderation of corpus might be required

While trying out your code, which was in fact very helpful in generating a lot of training data for my model, I found one of the generated sentences to be out of place.

Provided sentence:
5 let me start a task

Output:
5 let me start antiophthalmic factor a task
5 let me start a task
5 lashkar e taiba me start a task
5 let me kickoff a task
5 let task start a me
5 let me start a task
5 let me start a task
5 task me start a let
5 let me start a labor
5 antiophthalmic factor let me start a task
5 let me start a task

The phrase "lashkar e taiba" in the output above is the name of a terrorist organization. You can find the details in the link below.
https://en.wikipedia.org/wiki/Lashkar-e-Taiba

If possible can you please consider removing that name from synonyms of word "let"
random word for let : lashkar e taiba

Parameters Used:
--num_aug=10 --alpha=0.01

Tips for Non-English Augmentation

Hi, I hope you're doing great. I've been using your code with English text for a while and now I need to implement it for Persian Language. (hopefully with minimal change!) Your work is truly impressive! Could you kindly provide some advice on customizing your code for Persian and what I need to change? Your insights would be invaluable.

Thanks a lot,

What is the role of label here?

From README:

You can easily write your own implementation, but this one takes input files in the format label\tsentence (note the \t). So for instance, your input file should look like this (example from stanford sentiment treebank):

What does the label signify here? What should I do in the case of more than 2 classes?

Need Code for paper "Good-Enough Example Extrapolation"

Hi Jason!
Sorry to interrupt you. I can't contact you via email. I have to try this place.

I am very interested in your EMNLP paper "Good-Enough Example Extrapolation", which gave me a lot of inspiration.

When reading the paper, I have some questions :

  1. You mentioned that " implement GE3 at this final max-pooled hidden layer, which has size 768. That is, the hidden-space augmentation method only updates classifier weights after the BERT encoder", do you mean the weights of transformer are frozen during training? This is a very important detail when I reimplement your paper.
  2. GE3 needs to average the hidden vectors of all samples in the same class. So how to implement it in a mini-batch training? Or did you implement the GE3 in a two-stage way: First use BERT to get all vectors, and use GE3 for feature augmentation, then use a simple classifier to train on top of these features?
  3. Could you please provide the source code? I am new to this area and I really want to study this method by code.

I would appreciate it if you could answer my questions and provide the source code. In fact, I am also quite interested in data augmentation and have cited your EDA and other works in my paper and working papers. I look forward to communicating with you! Thanks a lot.

Question/Suggestions

Hi! I would like to say thank you for your contribution to the community. I've been going over your code, and have a few questions and potentially suggestions:

  1. For the method "synonym_replacement()", why do you substitute all instances of a word with the selected synonym? For example, if you have the sentence "A B C B" and select B=X as the synonym substitution, you will always end up with "A X C X". There's no possibility of generating "A B C X" or "A X C B". Was that intentional? If not, that could be a potential improvement.

  1.1. Another suggestion for "synonym_replacement()": iterate through random indices of the list rather than the elements themselves. This will allow you to overwrite the words with their synonyms directly in the "new_words" list rather than creating a new list each time a substitution is made. Creating a new list can be expensive, especially if it's large or you're running it thousands of times.

  2. For the method "add_word()", you repeatedly select random indices as opposed to creating a random ordering of the indices. Is there a reason for that? With the current code, there's the possibility that it selects the same index every time and never adds a word, even if the list contains a word with synonyms. Alternatively, you could shuffle a list of indices and iterate through the list until you find one with a synonym. This would ensure you're not redundantly getting synonyms for the same word over and over, and that you always insert a word when the input list contains a word with synonyms.

  3. For the method "random_swap()", a check to ensure the word list has at least 2 words would prevent unnecessary computation in single-word cases (a small thing, but something I added in my code).

  4. For the method "swap_word()", I basically have the same question as (2). In addition, I'm pretty sure you can greatly simplify the method to:

def swap_word(new_words):
    index1, index2 = random.sample(range(len(new_words)), 2)
    new_words[index1], new_words[index2] = new_words[index2], new_words[index1]
    return new_words

Please correct me if the above method doesn't do the same thing. It's what I'm using in my code and I'd prefer to not have missed something :)

  5. I'm intrigued by this code snippet:
        # this is stupid but we need it, trust me
        sentence = ' '.join(new_words)
        new_words = sentence.split(' ')

Could you tell me what was happening that made this necessary? I'm sure you encountered some edge case that this fixed, but I'm curious what that was.

  6. For the method "eda()", when the user specifies "num_aug=0", what are you trying to do in the "else" block at the end of the method? The output ends up being anywhere from 0 to num_aug sentences, depending on chance. Statistically, you will get on average 1 augmentation back, but this isn't guaranteed.

This is all meant as constructive criticism, so I hope you don't feel I'm being negative towards your work. I'd love to hear your feedback on these questions and thoughts; the work I'm doing is highly related to this topic, so I'm curious about some of the decisions you made. Thanks!

BERT + EDA ?

Will Random Swap (RS) and Random Deletion (RD) work well for BERT, given that BERT is based on contextual pre-training?

Thank you very much. @jasonwei20

Removal of apostrophes, hyphens and things.

So in eda.py you remove several things like:
line = line.replace("’", "")
line = line.replace("'", "")
line = line.replace("-", " ")

And I was wondering why that is. While this augmentation method improved my results dramatically, I now need to somehow get data back in that lets the bot learn that "I'm" is the same as "I am" etc., as the data now only ever includes "im".
Is this some limitation of WordNet or something?
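One possible workaround is to expand contractions before the apostrophes are stripped, so that "I'm" reaches the augmenter as "i am" rather than "im". The sketch below is only an illustration: the CONTRACTIONS table is a tiny hypothetical sample, expand_contractions is not part of the repository, and some contractions are ambiguous ("it's" can also mean "it has").

```python
# Hypothetical, far from complete; extend it for your own data.
CONTRACTIONS = {"i'm": "i am", "it's": "it is", "don't": "do not"}

def expand_contractions(line):
    """Expand known contractions token by token, before apostrophes are removed."""
    line = line.replace("’", "'")  # normalise curly apostrophes first
    tokens = line.split()
    # Tokens found in the table are replaced (lower-cased); others are kept as-is.
    return " ".join(CONTRACTIONS.get(tok.lower(), tok) for tok in tokens)
```

Running this on each sentence before augmentation keeps both words of the contraction in the output.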

Will the model be attacked by the adversarial examples?

In a Chinese text corpus, we can generate adversarial examples by random insertion (RI), random deletion (RD), or synonym replacement (SR). I am wondering whether training with EDA leaves a model such as a text classifier vulnerable to adversarial examples generated by RI, RD, or SR, since EDA uses the same operations.
Can you explain this? I ask because I ran some experiments and they show a decrease in performance.
Thank you very much!

Possibility of masking some tokens?

I am working on a problem with heavily imbalanced datasets. I want to use this tool to augment the positive class in my dataset. The problem is some of the tokens are critical to the problem I am trying to solve and I would like to mask those tokens for this tool. Is that currently possible?
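The command line shown in the README has no option for this, but protecting tokens is straightforward to add in your own copy of an operation. As an illustration, here is a sketch of random deletion that never removes tokens from a protected set; the function name and the protected argument are hypothetical, and the same filter could be applied to the other operations.

```python
import random

def random_deletion_protected(words, p, protected, rng=random):
    """Delete each unprotected word with probability p; protected words always stay."""
    if not words:
        return []
    kept = [w for w in words if w in protected or rng.random() > p]
    # Fall back to one random word so the sentence is never empty.
    return kept if kept else [rng.choice(words)]
```

For example, with protected={"500mg"} the token "500mg" survives even when p is 1.0.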

Semi-supervised

In a semi-supervised setting, suppose I first perform EDA on the labeled data.
If I use 500 examples in the fine-tuning stage, will using EDA improve the results? (The BERT model is the officially released pretrained one.)
thank you very much!

experiments section

@jasonwei20 , hi, thank you very much for your project, and I've been waiting for your update. If you can update, thank you very sincerely.

Supported languages.

Does anyone have a list of supported languages for this module? I couldn't find one in the original paper.

Meaning of 0 and 1

Hi,

Can you please explain why some sentences in the dataset have 0 in front and some have 1?

Thank you.

Augmenting non-english datasets

Hello,

The idea of implementing such data augmentations in such an easy script is superb. I would like to use it in some of my applications. My main concern is that my dataset is in Brazilian Portuguese, and I would like to know what I should do to adapt your code.

There are already some embeddings trained in Portuguese (link), by having them, is it easy to adjust your code?

Again, congrats for the great work.

Strange occurrences with I'm and It's (apostrophes)

So I augmented my dataset and found some strange things happening:

  • I'm standing on it
  • It's in front of me

became

  • i m standing on it
  • it s in front of me

I don't know why it was split like that, and in fact I can't reproduce it anymore. At first I thought that I had some weird apostrophes, but my tests don't confirm that.

Basically I am creating this issue to ask if anyone had something like this happen or knows a possible cause?

Mechanism to choose between EDA tasks

Is there a mechanism to choose which types of augmentation to apply to my data? For example, sometimes you might want to apply only Synonym Replacement (SR) and Random Deletion (RD) while ignoring the other two techniques (Random Insertion and Random Swap), as they may completely change the label. One dataset I can think of is the Corpus of Linguistic Acceptability (CoLA), where I believe RI and RS will change the target label.

In the current implementation, the alpha argument is applied equally to each of the augmentations. Passing a separate alpha for each technique as a command-line argument would allow fine-grained control and help achieve the desired result.

A little suggestion about error exception

Hello! Thank you for all your contributions on the eda. It is pretty cool.
However, if a line contains only numbers, which is common in conversations such as asking for a phone number, the code raises an exception. I think it should handle this case, for example with a check in eda() right after get_only_chars().
It is only a suggestion, and thank you for your tools anyway.

Confirmation abt data augmentation

Hi @jasonwei20 thanks for the great work!

I want to confirm my understanding of data augmentation in your paper:
In the experiments (RNN and CNN), do you use ONLY the output (the augmented data, train_aug_st.txt) for training? Or do you mix the original training data with the augmented data generated from it and train on both?
In other words, does your training data in the +EDA experiments include the original training data?

Thanks.

ValueError: empty range for randrange() (0,0, 0)

I have processed the data according to the format you described. Here are my command and the resulting error:

python code/augment.py --input=train_50w.en --output=train_50w._augmented.txt --num_aug=1 --alpha_sr=0.05 --alpha_rd=0.05 --alpha_ri=0 --alpha_rs=0.05

Traceback (most recent call last):
  File "code/augment.py", line 75, in <module>
    gen_eda(args.input, output, alpha_sr=alpha_sr, alpha_ri=alpha_ri, alpha_rs=alpha_rs, alpha_rd=alpha_rd, num_aug=num_aug)
  File "code/augment.py", line 64, in gen_eda
    aug_sentences = eda(sentence, alpha_sr=alpha_sr, alpha_ri=alpha_ri, alpha_rs=alpha_rs, p_rd=alpha_rd, num_aug=num_aug)
  File "/home/tool/eda_nlp-master/code/eda.py", line 201, in eda
    a_words = random_swap(words, n_rs)
  File "/home/tool/eda_nlp-master/code/eda.py", line 130, in random_swap
    new_words = swap_word(new_words)
  File "/home/tool/eda_nlp-master/code/eda.py", line 134, in swap_word
    random_idx_1 = random.randint(0, len(new_words)-1)
  File "/home/miniconda3/envs/eda/lib/python3.6/random.py", line 221, in randint
    return self.randrange(a, b+1)
  File "/home/miniconda3/envs/eda/lib/python3.6/random.py", line 199, in randrange
    raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
ValueError: empty range for randrange() (0,0, 0)

Languages supported

I've seen the documentation refer to input text in English.

Does it scale to other languages too? Or what do you recommend for supporting other languages?

interesting

# this is stupid but we need it, trust me
sentence = ' '.join(new_words)
new_words = sentence.split(' ')

We can't get the 3 improvement rate

Hi,
We tried the PC and SUBJ datasets with 500 training examples each, and ran e_2_rnn_baseline.py and aug.py in experiment 'e' with 16 augmentations per sentence. However, the results are not stable and are sometimes lower than the baseline, and we didn't get the 3 improvement rate. We would like to know what parameters you used in your experiments. Thanks a lot!
