uda's Introduction

Unsupervised Data Augmentation

Overview

Unsupervised Data Augmentation or UDA is a semi-supervised learning method which achieves state-of-the-art results on a wide variety of language and vision tasks.

With only 20 labeled examples, UDA outperforms the previous state-of-the-art on IMDb trained on 25,000 labeled examples.

Model                    Number of labeled examples   Error rate (%)
Mixed VAT (Prev. SOTA)   25,000                       4.32
BERT                     25,000                       4.51
UDA                      20                           4.20

It reduces more than 30% of the error rate of state-of-the-art methods on CIFAR-10 with 4,000 labeled examples and SVHN with 1,000 labeled examples:

Model              CIFAR-10 error (%)   SVHN error (%)
ICT (Prev. SOTA)   7.66±.17             3.53±.07
UDA                4.31±.08             2.28±.10
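
As a quick check of the "more than 30%" claim, the relative error reduction can be recomputed from the two rows above. This is only arithmetic on the reported mean error rates (the ± terms are ignored), and the function name is ours:

```python
def relative_reduction(previous_error, new_error):
    """Fraction of the previous error rate that the new method removes."""
    return (previous_error - new_error) / previous_error


# Mean error rates copied from the table above (ICT vs. UDA).
cifar10_reduction = relative_reduction(7.66, 4.31)
svhn_reduction = relative_reduction(3.53, 2.28)

print("CIFAR-10: {:.1%}".format(cifar10_reduction))
print("SVHN:     {:.1%}".format(svhn_reduction))
```

Both values come out above 0.30, roughly 0.44 for CIFAR-10 and 0.35 for SVHN.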

It leads to significant improvements on ImageNet with 10% labeled data.

Model       top-1 accuracy (%)   top-5 accuracy (%)
ResNet-50   55.09                77.26
UDA         68.78                88.80

How it works

UDA is a semi-supervised learning method that reduces the need for labeled examples and makes better use of unlabeled ones.
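
The sketch below illustrates the consistency-training idea named in the title of the paper cited at the end of this README: a supervised loss on labeled data plus a term that penalizes disagreement between the prediction on an unlabeled example and the prediction on its augmented version. It is a minimal pure-Python illustration, not the code in this repository. The function names and the toy logits are assumptions; only the name unsup_coeff is taken from the hyperparameter guidelines in this README.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def uda_style_loss(labeled_logits, label, orig_logits, aug_logits, unsup_coeff=1.0):
    """Supervised cross-entropy plus a weighted consistency term.

    orig_logits are the model outputs on an unlabeled example and
    aug_logits the outputs on its augmented copy (hypothetical inputs).
    """
    sup_loss = -math.log(softmax(labeled_logits)[label])
    unsup_loss = kl_divergence(softmax(orig_logits), softmax(aug_logits))
    return sup_loss + unsup_coeff * unsup_loss, sup_loss, unsup_loss


total, sup, unsup = uda_style_loss(
    labeled_logits=[2.0, 0.5], label=0,
    orig_logits=[1.5, 0.2], aug_logits=[0.3, 1.0])
print("total={:.3f} sup={:.3f} unsup={:.3f}".format(total, sup, unsup))
```

When the model predicts the same distribution for an example and its augmented copy, the consistency term is zero and only the supervised loss remains.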

What we are releasing

We are releasing the following:

  • Code for text classification based on BERT.
  • Code for image classification on CIFAR-10 and SVHN.
  • Code and checkpoints for our back translation augmentation system.

All of the code in this repository works out-of-the-box with GPU and Google Cloud TPU.

Requirements

The code is tested on Python 2.7 and TensorFlow 1.13. After installing TensorFlow, run the following command to install dependencies:

pip install --user absl-py

Image classification

Preprocessing

We generate 100 augmented examples for every original example. To download all the augmented data, go to the image directory and run

AUG_COPY=100
bash scripts/download_cifar10.sh ${AUG_COPY}

Note that you need 120 GB of disk space for all the augmented data. To save space, you can set AUG_COPY to a smaller number such as 30.
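
For a rough idea of how much space a smaller AUG_COPY needs, the estimate below scales the 120 GB figure linearly with the number of copies. Linear scaling is our assumption (it ignores the original data and any fixed overhead), so treat the result as a ballpark only:

```python
FULL_AUG_COPY = 100   # copies covered by the 120 GB figure above
FULL_SIZE_GB = 120.0  # disk space quoted for all the augmented data


def estimated_disk_gb(aug_copy):
    """Ballpark disk usage, assuming size grows linearly with AUG_COPY."""
    return FULL_SIZE_GB * aug_copy / FULL_AUG_COPY


for copies in (10, 30, 100):
    print("AUG_COPY={:>3}: about {:.0f} GB".format(copies, estimated_disk_gb(copies)))
```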

Alternatively, you can generate the augmented examples yourself by running

AUG_COPY=100
bash scripts/preprocess.sh --aug_copy=${AUG_COPY}

CIFAR-10 with 250, 500, 1000, 2000, 4000 examples on GPUs

GPU command:

# UDA accuracy: 
# 4000: 95.68 +- 0.08
# 2000: 95.27 +- 0.14
# 1000: 95.25 +- 0.10
# 500: 95.20 +- 0.09
# 250: 94.57 +- 0.96
bash scripts/run_cifar10_gpu.sh --aug_copy=${AUG_COPY}

SVHN with 250, 500, 1000, 2000, 4000 examples on GPUs

# UDA accuracy:
# 4000: 97.72 +- 0.10
# 2000: 97.80 +- 0.06
# 1000: 97.77 +- 0.07
# 500: 97.73 +- 0.09
# 250: 97.28 +- 0.40

bash scripts/run_svhn_gpu.sh --aug_copy=${AUG_COPY}

Text classification

Run on GPUs

Memory issues

The movie reviews in IMDb are longer than the texts in many other classification tasks, so a longer sequence length leads to better performance. With BERT, the sequence length is limited by TPU/GPU memory (see the out-of-memory issues of BERT). We therefore provide scripts that run with shorter sequence lengths and smaller batch sizes.

Instructions

If you want to run UDA with BERT base on a GPU with 11 GB memory, go to the text directory and run the following commands:

# Set a larger max_seq_length if your GPU has more than 11 GB of memory
MAX_SEQ_LENGTH=128

# Download data and pretrained BERT checkpoints
bash scripts/download.sh

# Preprocessing
bash scripts/prepro.sh --max_seq_length=${MAX_SEQ_LENGTH}

# Baseline accuracy: around 68%
bash scripts/run_base.sh --max_seq_length=${MAX_SEQ_LENGTH}

# UDA accuracy: around 90%
# Set a larger train_batch_size to achieve better performance if your GPU has more memory.
bash scripts/run_base_uda.sh --train_batch_size=8 --max_seq_length=${MAX_SEQ_LENGTH}

Run on Cloud TPU v3-32 Pod to achieve SOTA performance

The best performance in the paper is achieved by using a max_seq_length of 512 and initializing with BERT large finetuned on in-domain unsupervised data. If you have access to Google Cloud TPU v3-32 Pod, try:

MAX_SEQ_LENGTH=512

# Download data and pretrained BERT checkpoints
bash scripts/download.sh

# Preprocessing
bash scripts/prepro.sh --max_seq_length=${MAX_SEQ_LENGTH}

# UDA accuracy: 95.3% - 95.9%
bash train_large_ft_uda_tpu.sh

Run back translation data augmentation for your dataset

First of all, install the following dependencies:

pip install --user nltk
python -c "import nltk; nltk.download('punkt')"
pip install --user tensor2tensor==1.13.4

The following command translates the provided example file. It automatically splits paragraphs into sentences, translates English sentences to French and then translates them back into English. Finally, it composes the paraphrased sentences into paragraphs. Go to the back_translate directory and run:

bash download.sh
bash run.sh
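
The pipeline described above (split paragraphs into sentences, translate to French, translate back, recompose) can be sketched as follows. This is not the code behind run.sh: the regex splitter stands in for the NLTK sentence tokenizer installed above, and fake_en_to_fr / fake_fr_to_en are hypothetical placeholders for the translation models.

```python
import re


def split_sentences(paragraph):
    """Naive sentence splitter (a stand-in for a real sentence tokenizer)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def back_translate(paragraphs, to_pivot, from_pivot):
    """Paraphrase each paragraph sentence by sentence via a pivot language."""
    paraphrased_paragraphs = []
    for paragraph in paragraphs:
        sentences = split_sentences(paragraph)
        pivot_sentences = [to_pivot(s) for s in sentences]
        paraphrases = [from_pivot(s) for s in pivot_sentences]
        # Recompose the paraphrased sentences into one paragraph.
        paraphrased_paragraphs.append(" ".join(paraphrases))
    return paraphrased_paragraphs


# Hypothetical translators: a real system would call translation models here.
def fake_en_to_fr(sentence):
    return "<fr>" + sentence


def fake_fr_to_en(sentence):
    return sentence[len("<fr>"):] if sentence.startswith("<fr>") else sentence


print(back_translate(["Good film. I liked it!"], fake_en_to_fr, fake_fr_to_en))
```

Working sentence by sentence keeps each translation input short. With real models the round trip returns a paraphrase rather than the original sentence.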

Guidelines for hyperparameters:

The bash file defines a variable, sampling_temp, that controls the trade-off between the diversity and the quality of the paraphrases. Increasing sampling_temp increases diversity but lowers quality. Surprisingly, diversity mattered more than quality for many of the tasks we tried.

We suggest trying sampling_temp values of 0.7, 0.8 and 0.9. If your task is very robust to noise, sampling_temp=0.9 or 0.8 should improve performance. If your task is not robust to noise, setting sampling_temp to 0.7 or 0.6 should work better.
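
To see why a higher sampling temperature means more diversity, the sketch below applies the usual temperature-scaled softmax to a toy next-token distribution. The exact way the translation model applies sampling_temp is not shown in this README, so this illustrates the general technique only, with made-up logits:

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; higher temperature flattens the result."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; larger means more spread-out sampling."""
    return -sum(p * math.log(p) for p in probs if p > 0)


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for temp in (0.6, 0.7, 0.8, 0.9):
    probs = softmax_with_temperature(toy_logits, temp)
    print("temp={:.1f} top_prob={:.3f} entropy={:.3f}".format(
        temp, max(probs), entropy(probs)))
```

As the temperature rises, the top candidate's probability drops and the entropy grows, so sampling picks less likely tokens more often.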

If you want to back-translate a large file, you can change the replicas and worker_id arguments in run.sh. For example, when replicas=3, we divide the data into three parts, and each run.sh will only process one part according to its worker_id.
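
One way to picture the replicas / worker_id split is shown below. The contiguous-chunk scheme and the function name are assumptions made for illustration; the actual partitioning used by the back-translation scripts may differ.

```python
def shard_for_worker(items, replicas, worker_id):
    """Return the contiguous slice of items handled by one worker."""
    if not 0 <= worker_id < replicas:
        raise ValueError("worker_id must be in [0, replicas)")
    per_worker = -(-len(items) // replicas)  # ceiling division
    start = worker_id * per_worker
    return items[start:start + per_worker]


paragraphs = ["paragraph {}".format(i) for i in range(10)]
for worker_id in range(3):
    print(worker_id, shard_for_worker(paragraphs, replicas=3, worker_id=worker_id))
```

Each worker sees a disjoint part and the three parts together cover the whole file, so the runs can be launched independently, one per worker_id.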

General guidelines for setting hyperparameters:

UDA works out of the box and does not require extensive hyperparameter tuning, but to really push the performance, here are some suggestions about hyperparameters:

  • Setting the weight on the unsupervised objective, 'unsup_coeff', to 1 works well.
  • Use a lower learning rate than for purely supervised learning, because there are two loss terms, computed on labeled data and unlabeled data respectively.
  • If you have an extremely small amount of data, try tweaking 'uda_softmax_temp' and 'uda_confidence_thresh' a bit. For more details about these two hyperparameters, search for "Confidence-based masking" and "Softmax temperature control" in the paper.
  • Augmentation that is effective for supervised learning usually works well for UDA.
  • For some tasks, we observed that increasing the batch size for the unsupervised objective leads to better performance. For other tasks, small batch sizes also work well. For example, when we run UDA with GPU on CIFAR-10, the best batch size for the unsupervised objective is 160.
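
The two hyperparameters named in the list above can be pictured with the following sketch of a consistency loss that uses a sharpened target and a confidence mask. It is a simplified reading of "Softmax temperature control" and "Confidence-based masking", not the repository's implementation. Treating -1 as "disabled" and averaging over the whole batch are assumptions made here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def unsup_loss(orig_logits_batch, aug_logits_batch,
               uda_softmax_temp=-1, uda_confidence_thresh=-1):
    """Average consistency loss over a batch of unlabeled examples."""
    total = 0.0
    for orig_logits, aug_logits in zip(orig_logits_batch, aug_logits_batch):
        # Confidence-based masking: skip examples the model is unsure about.
        confidence = max(softmax(orig_logits))
        if uda_confidence_thresh != -1 and confidence < uda_confidence_thresh:
            continue
        # Softmax temperature control: a temperature below 1 sharpens the target.
        temperature = uda_softmax_temp if uda_softmax_temp != -1 else 1.0
        target = softmax(orig_logits, temperature)
        total += kl_divergence(target, softmax(aug_logits))
    return total / len(orig_logits_batch)


orig_batch = [[5.0, 0.0], [0.1, 0.0]]  # one confident, one unsure prediction
aug_batch = [[0.0, 5.0], [0.0, 3.0]]
print(unsup_loss(orig_batch, aug_batch))
print(unsup_loss(orig_batch, aug_batch, uda_confidence_thresh=0.8))
print(unsup_loss(orig_batch, aug_batch, uda_softmax_temp=0.4))
```

With the threshold set to 0.8, the second example (confidence near 0.5) contributes nothing to the loss, so only confident predictions are used as targets.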

Acknowledgement

A large portion of the code is taken from BERT and RandAugment. Thanks!

Citation

Please cite this paper if you use UDA.

@article{xie2019unsupervised,
  title={Unsupervised Data Augmentation for Consistency Training},
  author={Xie, Qizhe and Dai, Zihang and Hovy, Eduard and Luong, Minh-Thang and Le, Quoc V},
  journal={arXiv preprint arXiv:1904.12848},
  year={2019}
}

Please also cite this paper if you use UDA for images.

@article{cubuk2019randaugment,
  title={RandAugment: Practical data augmentation with no separate search},
  author={Cubuk, Ekin D and Zoph, Barret and Shlens, Jonathon and Le, Quoc V},
  journal={arXiv preprint arXiv:1909.13719},
  year={2019}
}

Disclaimer

This is not an officially supported Google product.

uda's People

Contributors

chiragjn, github30, michaelpulsewidth, philipp-eisen, varunnair18

uda's Issues

Have you output the graph to tensorboard?

I opened TensorBoard but found nothing. It said "No graph definition files were found."
I just want to confirm whether I need to write the summary writer myself or whether you have it somewhere in here.

code to reproduce the ImageNet results

I enjoyed reading the paper; the results are very promising. Are you planning to share the code for the ImageNet experiments in the semi-supervised setting? It would be a great help in reproducing the results.

Error occurs when training on images

I0927 03:16:41.333719 140651592308160 tpu_estimator.py:2160] examples/sec: 186.336
INFO:tensorflow:global_step/sec: 5.45765
I0927 03:16:41.516547 140651592308160 tpu_estimator.py:2159] global_step/sec: 5.45765
INFO:tensorflow:examples/sec: 174.645
I0927 03:16:41.516915 140651592308160 tpu_estimator.py:2160] examples/sec: 174.645
ERROR:tensorflow:Error recorded from training_loop: 2 root error(s) found.
(0) Data loss: truncated record at 49210074
[[node IteratorGetNext (defined at main.py:544) ]]
[[IteratorGetNext/_1677]]
(1) Data loss: truncated record at 49210074
[[node IteratorGetNext (defined at main.py:544) ]]
0 successful operations.
0 derived errors ignored.
Original stack trace for u'IteratorGetNext':
File "main.py", line 588, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "main.py", line 583, in main

Hi, how can I solve this problem? Please help.

Hyperparameters for CIFAR10

Hi,

Can you please share the hyperparameters you used for obtaining the performance reported in the paper (error of 5.27)?

Hyperparams in ImageNet experiment?

Thank you for sharing the code!
I have several questions on hyper-parameter settings for ImageNet experiment.

  1. What are the learning rate schedule and optimizer for the ImageNet experiment? (including initial LR, total number of epochs or iterations, when to drop the LR, and the LR decay rate)
  2. What is the batch size (and LR) for the experiments?
     2-1. Batch size for 10% supervised settings
     2-2. Batch size for semi-supervised settings (labeled set and unlabeled set)
  3. Loss balancing? (coefficients for entropy loss, unsupervised loss and supervised loss)

parameter settings for SVHN

Hello! If I run the SVHN dataset on GPU, which value should I set for train_steps? I used a train_steps of 200000, and the model seems to be overfitting. The best accuracy is 96.43 at step 110000.
These are the parameter settings I used during training:

task_name=svhn
python main.py \
  --use_tpu=False \
  --do_eval_along_training=True \
  --do_train=True \
  --do_eval=True \
  --task_name=${task_name} \
  --sup_size=1000 \
  --unsup_ratio=5 \
  --data_dir=data/proc_data/${task_name} \
  --model_dir=ckpt/svhn_gpu \
  --max_save=20 \
  --curr_step=0 \
  --iterations=10000 \
  --train_steps=200000 \
  $@

Running in Python 3.6 and 2.7 causes contextlib error

Hello @qizhex ,

I tried running the image directory (CIFAR10) of UDA using the latest repository and ran into the following error:

image

Changing my environment to Python 2.7 avoids this error, but produces a different one instead. I thought it would be more relevant to post this issue because more members of the community are using Python 3.6+ than 2.7.

The data was successfully downloaded using the provided scripts, but training on my GPU does not work. What would you recommend doing?

Noisy data generated by back translation

Very interesting work and thanks for sharing the code!

I am very interested in translation-based augmentation. I generated some examples by running run.sh, but I found some noisy ones, listed below:

(1) in forward generation: the input "could i get the address , phone number , and postcode of yu garden ?" and the output "The hotel is small location, the location is ideal and the food is fantastic."

(2) in forward generation: the input "hi , i 'm looking for a nice german restaurant ." and the output "I was at listening to my room and we were even coming in the main area from 9 weeks. I also liked this hotel, this is a great boutique hotel."

(3) in forward generation: the input "i do n't care ." and the output "Sinon pour la plupart, je ne pense pas qu'il y ait un tel problème qui se pose à vous. Je n'ai pas l'intention de le faire."

Do you have any suggestions to avoid these errors?

Thanks!

Number of augmented data in text classification tasks

Can you provide some more details about the augmented data used in the text classification tasks?
For example, for each labeled text, how many texts do you generate using back translation, and what is the total number of augmented examples used in each text classification task?

Thank you!

IMDb task hyperparameters about 'uda_softmax_temp' and 'uda_confidence_thresh'

Hi, I'm trying to re-implement UDA for IMDb task.

I have a question about the hyperparameters
(regarding these results):
image

When I looked at the script 'run_base_uda.sh', I found that it sets neither 'uda_softmax_temp' nor 'uda_confidence_thresh'.
Also, 'uda_softmax_temp' and 'uda_confidence_thresh' default to -1 in main.py.

Does the IMDb task not use prediction sharpening, and only use the TSA method?

In image/preprocess.py

record_writer.write(example.SerializeToString()) is repeated on lines 107 and 114 in the save_tfrecord() function.

can't concat str to bytes

When I ran this:
bash scripts/prepro.sh --max_seq_length=${MAX_SEQ_LENGTH}
It gives me the error:
(tf) lab01@ai-lab01:/data/projects/uda/text$ bash scripts/prepro.sh --max_seq_length=${MAX_SEQ_LENGTH}
Traceback (most recent call last):
File "preprocess.py", line 573, in
app.run(main)
File "/data/miniconda3/envs/tf/lib/python3.7/site-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/data/miniconda3/envs/tf/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "preprocess.py", line 540, in main
vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
File "/data/projects/uda/text/utils/tokenization.py", line 69, in init
self.vocab = load_vocab(vocab_file)
File "/data/projects/uda/text/utils/tokenization.py", line 39, in load_vocab
token = reader.readline()
File "/data/miniconda3/envs/tf/lib/python3.7/codecs.py", line 558, in readline
data = self.read(readsize, firstline=True)
File "/data/miniconda3/envs/tf/lib/python3.7/codecs.py", line 500, in read
data = self.bytebuffer + newdata
TypeError: can't concat str to bytes

sup loss and unsup loss are not of the same order of magnitude

INFO:tensorflow:step: 10(global step: 10) sample/sec: 24.330 loss: 0.606 top-1: 0.750 unsup_loss: 0.010
INFO:tensorflow:step: 20(global step: 20) sample/sec: 23.123 loss: 0.757 top-1: 0.500 unsup_loss: 0.021
INFO:tensorflow:step: 30(global step: 30) sample/sec: 23.740 loss: 0.830 top-1: 0.250 unsup_loss: 0.019
INFO:tensorflow:step: 40(global step: 40) sample/sec: 24.192 loss: 0.708 top-1: 0.500 unsup_loss: 0.016
INFO:tensorflow:step: 50(global step: 50) sample/sec: 24.254 loss: 0.987 top-1: 0.250 unsup_loss: 0.017
INFO:tensorflow:step: 60(global step: 60) sample/sec: 23.358 loss: 0.733 top-1: 0.750 unsup_loss: 0.018

The loss is sup_loss + unsup_loss. It seems that sup_loss and unsup_loss are not of the same order of magnitude. Is that reasonable?
Or does it mean I should change uda_coeff from 1 to 10?
Any other suggestions?

UDA for Sequence Classification problems (like NER)

Hi,

Regarding UDA for NLP-SequenceTagging like NER. Any ideas for what kind of augmentation methods would work for them?

e.g. BackTranslation is tough to use for NER because,

  1. we would need to find some way to project labels from the original tokens to the augmented tokens (which might even be different in number)
  2. in general, for each augmented token, it's not clear which original token it should be compared against (especially since the number of tokens itself might change)

Thanks.

failed at example_parsing_ops.cc:240

2019-10-06 15:22:11.087313: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_type_ids. Can't parse serialized Example.
2019-10-06 15:22:11.087317: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_type_ids. Can't parse serialized Example.
2019-10-06 15:22:11.087426: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_type_ids. Can't parse serialized Example.
2019-10-06 15:22:11.087477: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_type_ids. Can't parse serialized Example.
2019-10-06 15:22:11.087437: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_type_ids. Can't parse serialized Example.
2019-10-06 15:22:11.087465: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_type_ids. Can't parse serialized Example.
2019-10-06 15:22:11.087480: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_type_ids. Can't parse serialized Example.
2019-10-06 15:22:11.087536: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_type_ids. Can't parse serialized Example.
2019-10-06 15:22:11.087566: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: input_ids. Can't parse serialized Example.

tensor2tensor reports an error when I run run.sh in back_translate

When I run run.sh in back_translate, tensor2tensor reports an error:

Traceback (most recent call last):
File "/anaconda3/envs/simple/bin/t2t-decoder", line 17, in
tf.app.run()
File "/anaconda3/envs/simple/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/anaconda3/envs/simple/lib/python3.6/site-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/anaconda3/envs/simple/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/anaconda3/envs/simple/bin/t2t-decoder", line 12, in main
t2t_decoder.main(argv)
File "/anaconda3/envs/simple/lib/python3.6/site-packages/tensor2tensor/bin/t2t_decoder.py", line 186, in main
hp = create_hparams()
File "/anaconda3/envs/simple/lib/python3.6/site-packages/tensor2tensor/bin/t2t_decoder.py", line 68, in create_hparams
problem_name=FLAGS.problem)
File "/anaconda3/envs/simple/lib/python3.6/site-packages/tensor2tensor/utils/hparams_lib.py", line 58, in create_hparams
add_problem_hparams(hparams, problem_name)
File "/anaconda3/envs/simple/lib/python3.6/site-packages/tensor2tensor/utils/hparams_lib.py", line 99, in add_problem_hparams
p_hparams = problem.get_hparams(hparams)
File "/anaconda3/envs/simple/lib/python3.6/site-packages/tensor2tensor/data_generators/problem.py", line 524, in get_hparams
self.get_feature_encoders(data_dir)
File "/anaconda3/envs/simple/lib/python3.6/site-packages/tensor2tensor/data_generators/problem.py", line 510, in get_feature_encoders
self._encoders = self.feature_encoders(data_dir)
File "/anaconda3/envs/simple/lib/python3.6/site-packages/tensor2tensor/data_generators/text_problems.py", line 199, in feature_encoders
encoder = self.get_or_create_vocab(data_dir, None, force_get=True)
File "/anaconda3/envs/simple/lib/python3.6/site-packages/tensor2tensor/data_generators/text_problems.py", line 244, in get_or_create_vocab
encoder = text_encoder.SubwordTextEncoder(vocab_filepath)
File "/anaconda3/envs/simple/lib/python3.6/site-packages/tensor2tensor/data_generators/text_encoder.py", line 491, in init
self._load_from_file(filename)
File "/anaconda3/envs/simple/lib/python3.6/site-packages/tensor2tensor/data_generators/text_encoder.py", line 939, in _load_from_file
raise ValueError("File %s not found" % filename)
ValueError: File checkpoints/vocab.translate_enfr_wmt32k.32768.subwords not found

download_cifar10.sh

Hello!
In download_cifar10.sh, what is the meaning of the variable j? The corresponding code is the following:

aug_copy_end=$(expr $aug_copy - 1)
for i in $(seq 0 $aug_copy_end);
do
  for j in $(seq 0 12);
  do
    wget $url_prefix/unsup-$i.tfrecord.$j
  done
done
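
Assuming the two loops iterate over the output of seq (0 to aug_copy - 1 for i, 0 to 12 for j), the files requested by the snippet can be enumerated as below. Reading i as the augmented-copy index and j as a part index within each copy is our interpretation of the file name pattern, not something the script states:

```python
def tfrecord_names(aug_copy, parts_per_copy=13):
    """File names fetched by the nested loops: unsup-<i>.tfrecord.<j>."""
    return ["unsup-{}.tfrecord.{}".format(i, j)
            for i in range(aug_copy)          # seq 0 $aug_copy_end
            for j in range(parts_per_copy)]   # seq 0 12


names = tfrecord_names(aug_copy=100)
print(len(names), names[0], names[-1])
```

Under this reading, every value of i is downloaded as 13 separate files, so AUG_COPY=100 results in 1300 wget calls.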

Code for Implementing RandAugment

Hello,

Congratulations on being accepted as an ICLR 2020 Paper! I noticed in the conference version of the paper, you use RandAugment instead of AutoAugment to generate copies of the unlabeled data.

Will you be uploading the code necessary to duplicate that technique in this repo soon? That would be very useful for the research community as it is much more computationally efficient than running AutoAugment and achieves comparable results.

Thanks!

Data Augmentation

AutoAugment lists 25 policies for CIFAR-10, but there are more than 25 policies in the UDA code. Why?

how to choose learning rate

Thank you very much for your work.
In the README, you give the following advice, but how should the learning rate be chosen? Is there any practical reference?

Use a lower learning rate than pure supervised learning because there are two loss terms computed on labeled data and unlabeled data respectively.

Issue in text word_level_augmentation

Hi,

It seems that "show_example" on line 195 is no longer needed.
Another question: in Appendix B, what is the value of p in p(C − TFIDF(x_i))/Z_1?

Thank you very much!

Not able to run run_base.sh in text section

When I try to run the bash script with bash scripts/run_base.sh --max_seq_length=128,
I get the following error:
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
The configuration of the server I am running on is as follows:

Python version - 2.7.16
tensorflow-base - 1.13.1
tensorflow-estimator - 1.13.0
tensorflow-gpu - 1.13.1
cudatoolkit - 10.0.130
cudnn - 7.6.0

Could you tell me what setup is required for running this code?

UDA for Regression Problem

Hi, thanks for this great research paper. I have a physics-based regression problem (~6 input features and 1 response variable) with only 35 data points from real-world lab test results. We have a 1D simulation tool with which we can generate any number of low-fidelity artificial data points. Unfortunately, the low-fidelity synthesized data points do not have the same distribution as the real-world lab test results. I understand that the concept of labels may not apply to a regression problem, but can we use your UDA technique to make the low-fidelity (unlabeled) data look more like the real-world lab test (labeled) data?

Question about "3.3 Domain-relevance Data Filtering"

Hello,

In your paper, you state that you used out-of-domain data as such:

"We use our baseline model trained on the in-domain data to infer the labels of data in a large out of-domain dataset and pick out the examples (equally distributed among classes) that the model is most confident about."

What did you mean by picking out examples that are equally distributed among classes? My guess is that you created an unlabeled set from the JFT dataset of 1.3M images that were "class-balanced" according to the predictions by the baseline model?
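
One possible reading of the quoted sentence is sketched below: predict a class for every out-of-domain example with the baseline model, then keep the same number of most-confident examples per predicted class. This is only an interpretation for discussion, with hypothetical names and toy probabilities. It is not confirmed as the procedure used in the paper.

```python
def select_confident_balanced(predictions, per_class):
    """Keep the per_class most confident examples for each predicted class.

    predictions is a list of (example_id, class_probabilities) pairs
    produced by some baseline model (hypothetical input format).
    """
    by_class = {}
    for example_id, probs in predictions:
        label = max(range(len(probs)), key=probs.__getitem__)
        by_class.setdefault(label, []).append((probs[label], example_id))
    selected = []
    for label in sorted(by_class):
        ranked = sorted(by_class[label], reverse=True)  # most confident first
        selected.extend(example_id for _, example_id in ranked[:per_class])
    return selected


toy_predictions = [
    ("a", [0.90, 0.10]),
    ("b", [0.60, 0.40]),
    ("c", [0.20, 0.80]),
    ("d", [0.45, 0.55]),
    ("e", [0.99, 0.01]),
]
print(select_confident_balanced(toy_predictions, per_class=1))
```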

Offline vs Online Augmentation

I am trying to write my own implementation for CIFAR-10 (4k), but I do not get similar performance: my error rate is still above 10%.
How much difference does offline augmentation make compared to online augmentation?

releasing other text classification models, datasets & unlabeled corpus

Hi, thanks for releasing the paper & code.

I have tried the IMDb text classification task and UDA achieved quite promising improvements.
Will you release your models and datasets for the other text classification tasks, especially the unlabeled corpus?
That would make your work easier to follow and increase its impact.

Thanks

Python 2? Really?

The code is tested on Python 2.7 and Tensorflow 1.13.

Why use Python 2 for this and not Python 3? Python 2 is less than six months away from expiration. No one prefers to use Python 2 anymore.

Compare with XLNet too

BERT is now far from the leading model, e.g. on the GLUE benchmark. Why not test with XLNet-Large for a more relevant comparison? Outside of Google, all else being equal, people would prefer to use the better model.

Other augmentation methods for images

I have a couple of questions regarding the augmentation method for images.

  1. Have you tried other augmentation tools such as imgaug? If not, do you think it'll work?

  2. Is it a good idea to use pictures of the same object which are taken from different viewpoints as an alternative for the augmentation?

Thank you in advance

Bert baseline variation in result

For IMDB semi-supervised case:

  • Given that baseline gets trained on (unaugmented) 20 examples, it doesn't make sense to me to train it for 3000 (+300 warm up) steps with 32 batch size (looking at run_base.sh). Am I missing something?
  • I ran evaluation every 3 steps (+3 warm up), and found the following mildly surprising results. (BERT base, max_seq_length=512, train_batch_size=6)
    • 3 steps (~1 epoch):
      train_loss = 0.77
      eval_classify_loss = 0.69
      eval_classify_accuracy = 0.56

    • 6 steps (~2 epoch):
      train_loss = 0.61
      eval_classify_loss = 0.82
      eval_classify_accuracy = 0.56

    • 9 steps (~3 epoch):
      train_loss = 0.50
      eval_classify_loss = 0.74
      eval_classify_accuracy = 0.55

    • 12 steps (~4 epoch):
      train_loss = 0.44
      eval_classify_loss = 0.65
      eval_classify_accuracy = 0.59

    • 15 steps (~5 epoch):
      train_loss = 0.18
      eval_classify_loss = 0.48
      eval_classify_accuracy = 0.78

    • 18 steps (~6 epoch):
      train_loss = 0.08
      eval_classify_loss = 0.61
      eval_classify_accuracy = 0.75

    • 21 steps (~6 epoch):
      train_loss = 0.02
      eval_classify_loss = 0.79
      eval_classify_accuracy = 0.77
      .
      .
      .

    • 21 steps (~6 epoch):
      train_loss = 0.0009
      eval_classify_loss = 0.93
      eval_classify_accuracy = 0.82


Although the best accuracy is 82%, it has overfitted. Before overfitting the best accuracy is 78% (15 steps). This is more than the reported accuracy of 72.56% (Table 1), but I suspect there is a lot of variation. How were these results reported, did you take mean/median of a few experiments? How many steps was the BERT base trained for?
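
The step and epoch figures discussed above follow from simple arithmetic on the numbers quoted in this issue (20 labeled examples, batch sizes 32 and 6). The helper below is only that arithmetic, assuming every batch is filled from the 20 labeled examples:

```python
def passes_over_data(steps, batch_size, num_examples):
    """Approximate number of epochs seen after a given number of steps."""
    return steps * batch_size / float(num_examples)


# Setting read from run_base.sh in the issue: 3000 steps at batch size 32.
print(passes_over_data(3000, 32, 20))
# Setting used in the experiment above: evaluation every 3 steps at batch size 6.
print(passes_over_data(3, 6, 20))
```

With 20 examples, 3000 steps at batch size 32 correspond to thousands of passes over the labeled set, while 3 steps at batch size 6 are a little under one pass, which matches the "~1 epoch" annotation in the list.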

Colab notebook with the output:
https://colab.research.google.com/drive/1ysEZeLFmZDUy39ip1J1VLDKLhyUNMss3

Back-translation model source

I noticed that the run.sh script for back translation passes some checkpoints to t2t (e.g. checkpoints/enfr/model.ckpt-500000). If I want to use another t2t problem for the back translation instead of translate_enfr_wmt32k, I would need to provide other checkpoints, right? How are these checkpoints generated? Is there any source code available for the original machine translation model used?

Also, is it possible to skip tensor2tensor and use another model to generate the paraphrases? I think that as long as I keep the structure of the output files it shouldn't be an issue. The reason I'm asking is that I want to try UDA with a Spanish dataset using an ES-EN back-translation augmentation technique, but unfortunately tensor2tensor doesn't include it.

Thanks.

Issue in Word Level Augmentation (standalone)

In the script file prepro.sh, aug_ops=bt-0.9 (which is for sentence-level augmentation).
I would like to run the word-level augmentation (standalone).
What aug_ops value, or token probability, should be used in that case?
I observed no change when trying it out on a custom dataset.

AUG_COPY for SVHN

Hello! In your experiments, is aug_copy for the SVHN dataset still 100?

What's the recommended uniform/tf_idf token_prob in official experiments

./text/augmentation/word_level_augment.py line 238 and 244

The value of token_prob is not given in the paper.
What value should I set for this hyperparameter?

232 def word_level_augment(
233     examples, aug_ops, vocab, data_stats):
234   """Word level augmentations. Used before augmentation."""
235   if aug_ops:
236     if aug_ops.startswith("unif"):
237       tf.logging.info("\n>>Using augmentation {}".format(aug_ops))
238       token_prob = float(aug_ops.split("-")[1])
239       op = UnifRep(token_prob, vocab)
240       for i in range(len(examples)):
241         examples[i] = op(examples[i])
242     elif aug_ops.startswith("tf_idf"):
243       tf.logging.info("\n>>Using augmentation {}".format(aug_ops))
244       token_prob = float(aug_ops.split("-")[1])
245       op = TfIdfWordRep(token_prob, data_stats)
246       for i in range(len(examples)):
247         examples[i] = op(examples[i])
248   return examples

I have tried aug_ops='tf_idf-0.9'; eval_classify_accuracy = 0.49931693 after 10000 training steps.
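
The snippet above derives token_prob from the part of aug_ops after the first "-". The sketch below reproduces that parsing and shows what a per-token replacement probability means under a simple uniform-replacement reading. The uniform_replace function is a hypothetical stand-in. It is not the UnifRep or TfIdfWordRep class from the repository.

```python
import random


def parse_aug_ops(aug_ops):
    """Split a spec such as 'unif-0.9' into its name and token probability."""
    parts = aug_ops.split("-")
    return parts[0], float(parts[1])


def uniform_replace(tokens, token_prob, vocab, rng):
    """Replace each token with a random vocabulary word with prob. token_prob."""
    return [rng.choice(vocab) if rng.random() < token_prob else token
            for token in tokens]


name, token_prob = parse_aug_ops("tf_idf-0.9")
print(name, token_prob)

rng = random.Random(0)
tokens = "the movie was surprisingly good".split()
vocab = ["film", "plot", "actor", "scene"]
print(uniform_replace(tokens, 0.1, vocab, rng))
print(uniform_replace(tokens, 0.9, vocab, rng))
```

Under this reading, a token_prob of 0.9 would replace about nine in ten tokens, so little of the original text survives.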

Out of memory error on custom text classification data with 49 classes. GPU device - GeForce RTX 2080 Ti. 10986MiB memory.

I am getting this error when I try to execute run_base_uda.sh on custom classification data.

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[7168,768] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node gradients/bert_1/encoder/layer_7/output/LayerNorm/moments/SquaredDifference_grad/sub (defined at /media/uda/text/bert/optimization.py:168) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

The readme mentions that the code runs fine on an 11GB device with max_seq_len=128 and batch_size=8. Any suggestions or instructions for running the code on this GPU would be helpful.
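For scale, here is a rough calculation of the size of the tensor named in the error message. It assumes that "type float" means 4-byte float32, and the helper name tensor_mib is hypothetical. The point of the sketch is only that the failing allocation is small on its own, which suggests the GPU memory was already almost full when it was requested.

```python
def tensor_mib(shape, bytes_per_element=4):
    """Size of a dense tensor in MiB (float32 assumed by default)."""
    num_elements = 1
    for dim in shape:
        num_elements *= dim
    return num_elements * bytes_per_element / float(1024 ** 2)


if __name__ == "__main__":
    # Shape taken from the ResourceExhaustedError above.
    print(tensor_mib([7168, 768]))  # 21.0
```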

the meaning of batch size is not very clear

BERT hyperparameters. Following the common BERT fine-tuning procedure, we keep a dropout rate of 0.1, and try learning rate of 1e-5, 2e-5 and 5e-5 and batch size of 32 and 128. We also tune the number of steps ranging from 30 to 100k for various data sizes.

  1. What does the batch size mean?
    Does it mean the total batch size of sup and unsup, or just the sup training batch size?

  2. When comparing UDA with normal training, should the hyperparameters be chosen by total batch size or only by the sup training batch size? For example, if one hyperparameter setting is tuned for batch size 32 and another for batch size 16, and I use a sup batch size of 16 and an unsup batch size of 16 for UDA, which setting should I use?
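To make the two readings of "batch size" concrete, the sketch below counts the examples that pass through the model in one step. It assumes that every unsupervised example is fed twice, once as the original and once as its augmented copy, which is how the consistency loss is described. The function name examples_per_step is hypothetical, and the sketch does not settle which number the paper's hyperparameters refer to.

```python
def examples_per_step(sup_batch_size, unsup_batch_size):
    """Count examples processed per training step under the stated assumption."""
    counts = {
        "sup": sup_batch_size,
        "unsup_original": unsup_batch_size,
        "unsup_augmented": unsup_batch_size,  # assumed: one augmented copy each
    }
    counts["total"] = sum(counts.values())
    return counts


if __name__ == "__main__":
    # The sup 16 / unsup 16 case from the question above.
    print(examples_per_step(16, 16))
```

With sup 16 and unsup 16, the supervised loss sees 16 examples while the model processes 48 in total, so the two readings differ by a factor of three.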

Share similar idea with CVPR 19 paper

Hi,

Very interesting work on applying data augmentation to semi-supervised learning. However, I think the idea is quite similar to our CVPR 19 paper [1], which also tries to align the original unlabelled sample with its augmented sample. Would you please cite our paper in your updated manuscript? Thanks very much.

[1] Unsupervised Embedding Learning via Invariant and Spreading Instance Feature. In CVPR 2019.

How to get reproducible results

I tried training the model several times and it gave me different results each time. I tried adding tf.set_random_seed, but it still didn't work. Any idea? Thanks!
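One common first step, sketched here under assumptions, is to seed every random number generator in the process rather than only TensorFlow's. The helper seed_everything and its extra_seeders parameter are hypothetical names; the idea is to pass in framework seeders such as np.random.seed or tf.set_random_seed so that the sketch itself needs nothing beyond the standard library. Even with all seeds fixed, some GPU operations and multi-threaded input pipelines can remain nondeterministic, so this may reduce the variation without removing it.

```python
import random


def seed_everything(seed, extra_seeders=()):
    """Seed Python's RNG, then call each extra seeder with the same seed.

    extra_seeders is a hypothetical hook: pass callables such as
    np.random.seed or tf.set_random_seed, and call this function
    before any graph or model is built.
    """
    random.seed(seed)
    for seeder in extra_seeders:
        seeder(seed)


if __name__ == "__main__":
    seed_everything(42)
    first = random.random()
    seed_everything(42)
    second = random.random()
    print(first == second)  # True: the same seed gives the same draw
```

Usage with TensorFlow 1.x would look like seed_everything(42, extra_seeders=[np.random.seed, tf.set_random_seed]), assuming both libraries are already imported.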

NotFoundError: IMDB_raw/csv/train.csv not found

Hello,

after downloading the IMDB data and the BERT model an error pops up when running the preprocessing script:
NotFoundError: data/IMDB_raw/csv/train.csv; No such file or directory.

It seems the structure of the data downloaded via the bash script differs from the file path specified in /scripts/prepro.sh. I do have an IMDB_raw folder containing a train_id_list.txt file, as well as an aclImdb folder, but no train.csv file.

Did I miss a step for processing the data to the correct format or is there an issue when downloading the data?

Any help is appreciated.
