kakaobrain / fast-autoaugment
Official Implementation of 'Fast AutoAugment' in PyTorch.
License: MIT License
@ildoonet @sublee Hi, it is very kind of you to release the search code, and it is much appreciated. While running your search code and retraining with the found policies, I still ran into some problems, and hopefully you can help me figure them out.
Looking forward to your reply.
Thanks again.
Hi, the code seems good for research purposes. There is not much documentation, but that's ok.
I think I've managed to understand what it does and how it works.
Now I need to use this package on my own custom datasets (none of the default torchvision.datasets) for production purposes.
Any idea how to run search.py with a custom dataset?
Thanks
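A minimal sketch of one way this could be wired up, assuming you add a branch for your dataset inside get_dataloaders() in FastAutoAugment/data.py; the directory layout, function name, and crop sizes below are hypothetical, not part of the repo:

```python
import os
from torchvision import datasets, transforms

# Hypothetical helper mirroring what a new branch in get_dataloaders() would build:
# an ImageFolder-style dataset with train/ and test/ subdirectories.
def build_custom_dataset(dataroot, transform_train, transform_test):
    total_trainset = datasets.ImageFolder(os.path.join(dataroot, 'train'), transform=transform_train)
    testset = datasets.ImageFolder(os.path.join(dataroot, 'test'), transform=transform_test)
    return total_trainset, testset

transform_train = transforms.Compose([transforms.RandomResizedCrop(32), transforms.ToTensor()])
transform_test = transforms.Compose([transforms.Resize((32, 32)), transforms.ToTensor()])
# trainset, testset = build_custom_dataset('/path/to/mydata', transform_train, transform_test)
```

Presumably a matching conf yaml (dataset name, num_class, model) would also be needed, but that part is guesswork.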
Having prepared environment.yml with:

```yaml
name: fast-aa
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.6.9
  - pytorch=1.2.0
  - torchvision=0.4.0
  - cudatoolkit=10
  - pip
  - pip:
    - git+https://github.com/wbaek/theconf@de32022f8c0651a043dc812d17194cdfd62066e8
    - git+https://github.com/ildoonet/pytorch-gradual-warmup-lr.git@08f7d5e
    - git+https://github.com/ildoonet/pystopwatch2.git
    - git+https://github.com/hyperopt/hyperopt.git
    - pretrainedmodels
    - gorilla
    - tabulate
    - pandas
    - tqdm
    - tensorboardx
    - sklearn
    - ray
    - psutil
    - setproctitle
    - requests
```

and running search.py with:

```sh
python FastAutoAugment/search.py -c confs/wresnet40x2_cifar.yaml
```
I keep receiving warnings or errors about wrong or missing packages, e.g.:

HyperOptSearch DeprecationWarning: This class has been moved.

Could you share the validated package versions?
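For what it's worth, that particular DeprecationWarning usually just points at the class's new import location. A hedged sketch of an import that tolerates both layouts (the exact paths depend on your ray version):

```python
# The old location emits the DeprecationWarning; newer ray moved the class.
try:
    from ray.tune.suggest.hyperopt import HyperOptSearch  # newer ray releases
except ImportError:
    from ray.tune.suggest import HyperOptSearch           # older ray releases
```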
Hi,
Thank you for the work. I just want to share a small issue that confuses me.
fast-autoaugment/FastAutoAugment/data.py
Lines 313 to 314 in 2424224
It would be great if you could kindly take a look. Thank you.
I'd like to train with this config: https://github.com/kakaobrain/fast-autoaugment/blob/master/confs/pyramid272_cifar10_b64.yaml
on multiple GPUs (training on one takes about 600 hours). Are there any adjustments to that config that you would recommend?
Dear authors! I have a question about back-propagation through operations of the PILLOW library. How can the probability parameter p be passed through PIL operations and learned?
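For context, here is a tiny self-contained illustration of why p is hard to learn by back-propagation: the probability gates a hard Bernoulli draw in front of a PIL call, and the PIL call itself operates on images outside the autograd graph. (This is my reading of the situation, not an authors' statement; presumably this is why Fast AutoAugment searches over p rather than differentiating through it.)

```python
import random
from PIL import Image, ImageOps

def maybe_autocontrast(img, p=0.5):
    # p gates a hard random draw (non-differentiable), and the PIL op itself
    # provides no gradient path, so p cannot be learned by back-propagation here.
    if random.random() < p:
        img = ImageOps.autocontrast(img)
    return img

img = maybe_autocontrast(Image.new('RGB', (32, 32)))
```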
I did not find the SamplePairing operation listed in the policies found for CIFAR. From FastAutoAugment/augmentations.py, I also noticed that the code related to the SamplePairing operation is commented out. Does this mean that this operation was not included in the search space in the experiments? I would be grateful for any suggestions.
Hi, I want to run your code for ImageNet, but it seems the ResNet-50 implementation is missing.
In fast-autoaugment/FastAutoAugment/networks/__init__.py:

```python
from pretrainedmodels import models
...
model = models.resnet50(num_classes=num_class, pretrained=None)
```

but pretrainedmodels is not uploaded yet. Is this ResNet-50 from torchvision or your original implementation?
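Until that's clarified, a plausible stand-in (my assumption, not confirmed by the authors) is torchvision's ResNet-50, since Cadene's pretrainedmodels resnet50 mirrors its interface:

```python
import torchvision

# Assumption: plain torchvision ResNet-50 as a stand-in for pretrainedmodels' version.
model = torchvision.models.resnet50(num_classes=1000)
```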
Hello, I see that the searching code sets num-search to 200 and resources_per_trial to 1. Does this mean that Fast AutoAugment needs 200 GPUs for searching?
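For reference, in Ray Tune the number of trials and the per-trial resources are independent knobs; a sketch of what the two numbers mean (API names here follow more recent ray releases than the 0.6.x this repo targets, so treat it as illustrative):

```python
from ray import tune

def my_trainable(config):
    # hypothetical trainable: report a dummy score
    tune.report(score=0.0)

# num_samples=200 schedules 200 trials; resources_per_trial={'gpu': 1} means each
# running trial holds one GPU, so trials queue up over however many GPUs exist.
analysis = tune.run(
    my_trainable,
    num_samples=200,
    resources_per_trial={'cpu': 2, 'gpu': 1},
)
```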
I trained ImageNet using 32 GPUs via Horovod (8×V100 × 4) but got 77.1% accuracy, much less than the 78.6% reported in your paper, by running:

```sh
python train.py -c confs/resnet50_imagenet_b4096.yaml --aug fa_reduced_imagenet --horovod
```

Moreover, according to your yaml config the lr schedule should be multistep (adjust_learning_rate_resnet, as seen in train.py), but I saw cosine lr decay being used when I ran your code.
Waiting for your reply, thanks.
Hi, I've tried to run the code for searching policies, but I have trouble initializing the Ray server.
It seems that there is something wrong with the call ray.init(redis_address=args.redis) in search.py line 164:

```
Traceback (most recent call last):
  File "search.py", line 166, in <module>
    ray.init(redis_address=args.redis)
  File "/home/xcq/anaconda3/envs/pytorch-video/lib/python3.6/site-packages/ray/worker.py", line 1425, in init
    redis_address = services.address_to_ip(redis_address)
  File "/home/xcq/anaconda3/envs/pytorch-video/lib/python3.6/site-packages/ray/services.py", line 145, in address_to_ip
    ip_address = socket.gethostbyname(address_parts[0])
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label empty or too long)
```

Any suggestion about what I could be doing wrong or how to work around it?
ray version == 0.6.5
Hello, I have some questions.
In the paper, it is written that the train dataset (D_train) is split into k folds, each divided into D_M and D_A according to a ratio, and the policy search is run on each fold. However, looking at the data.py code, it doesn't seem that the data is actually split into k folds of that size.
Also, for the final training on D_train with all the found policies merged, the implementation does not apply every policy to D_train; instead it randomly selects one of the sub-policies from the merged set, applies that transform, and trains on the result.
I am wondering if I am misunderstanding these two things.
Thank you.
According to the following line, your code doesn't update batch-norm parameters. What is the real reason for that?

```python
params_without_bn = [params for name, params in model.named_parameters() if not ('_bn' in name or '.bn' in name)]
```
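For what it's worth, a common reading of such a filter (an assumption on my part, not confirmed by the authors) is that BN parameters are excluded from weight decay rather than from updates entirely, via a second parameter group; a sketch of that pattern:

```python
import torchvision
import torch.optim as optim

model = torchvision.models.resnet18()
bn_params = [p for n, p in model.named_parameters() if '_bn' in n or '.bn' in n]
other_params = [p for n, p in model.named_parameters() if not ('_bn' in n or '.bn' in n)]

# BN params are still updated by SGD; they just receive no weight decay.
# (The name-based filter is the repo's heuristic and may miss top-level BN layers.)
optimizer = optim.SGD([
    {'params': other_params, 'weight_decay': 5e-4},
    {'params': bn_params, 'weight_decay': 0.0},
], lr=0.1, momentum=0.9)
```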
Hello,
running the code I encounter this message from the torch implementation:

```
what(): owning_ptr == NullType::singleton() || owning_ptr->refcount_.load() > 0 INTERNAL ASSERT FAILED at /pytorch/c10/util/intrusive_ptr.h:348, please report a bug to PyTorch. intrusive_ptr:
```

I used the suggested versions. Do you have some advice?
Thank you.
@ildoonet, I am referring to the paper.
Can you explain this? And which one should I follow?
Many thanks.
Hello!
I use Python 3.6.5.
I installed with pip install git+https://github.com/wbaek/theconf.git
and I get:

```
Collecting git+https://github.com/wbaek/theconf.git
  Cloning https://github.com/wbaek/theconf.git to /tmp/pip-req-build-pun46x94
  Complete output from command python setup.py egg_info:
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-req-build-pun46x94/setup.py", line 19, in <module>
      long_description = fp.read()
    File "/opt/conda/lib/python3.6/encodings/ascii.py", line 26, in decode
      return codecs.ascii_decode(input, self.errors)[0]
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 449: ordinal not in range(128)
```
Hi @ildoonet, thanks for the work first. While running

```sh
python search.py -c confs/wresnet40x2_cifar10_b512.yaml --dataroot ... --redis ...
```

on a Ray cluster, I find that the head node can't gather the models trained on the worker nodes for the subsequent policy-search stage; the main process on the head node throws an exception:

No such file or directory: '/FastAutoAugment/models/cifar10_wresnet40_2_ratio0.4_fold0.model'

So can you tell me how you trained end-to-end with that command on a Ray cluster with multiple nodes working in parallel?
Hi, I've tried to run the code for searching policies, but there is no way I can make it run on a single machine with several GPUs, and the problem seems to be with Ray. I do initialize the Ray server correctly, but apparently the trouble is with the train_and_eval function.
Any suggestion about what I could be doing wrong or how to work around it?

```
2020-08-15 10:18:09,387 ERROR worker.py:1717 -- Possible unhandled error from worker: ray_worker (pid=13478, host=macaron)
  File "FastAutoAugment/search.py", line 66, in train_model
    result = train_and_eval(None, dataroot, cv_ratio_test, cv_fold, save_path=save_path, only_eval=skip_exist)
  File "/home/gim282/data_augmentation/good/FastAutoAugment/train.py", line 123, in train_and_eval
    add_filehandler(logger, args.save + '.log')
NameError: name 'args' is not defined
```
Intuitively, the optimizer could simply choose no augmentation at all to achieve higher validation accuracy. Why does the method still work? Looking forward to your answer.
I installed the package 'theconf' from here: https://github.com/ildoonet/theconf. However, there is no class named 'Config'. Could you help me figure this out? Many thanks!
Hi, I want to execute search.py on CIFAR-10. I notice that fa_reduced_cifar10 is the result you got, but in wresnet40x2_cifar10_b512.yaml there already exists a key aug: fa_reduced_cifar10. Should I delete it or leave its value empty?
Could you provide an example of how to use your technique to find the best policy for another dataset, or maybe for CIFAR?
Thanks for your work!
1. cp search.py ../search.py
2. In the directory ..../fast-autoaugment, execute the following command:

```sh
python search.py -c confs/wresnet40x2_cifar10_b512.yaml
```

and got this error:

```
 46%|████████████████████████████████████████████████████████████████████████▎ | 91/200 [30:06<00:30, 3.54it/s, cv1=200, cv2=200, cv3=90, cv4=200, cv5=200]
(pid=31751) 0200]: 80%|████████ | 16/20 [00:04<00:01, 3.63it/s, loss=0.299, top1=0.905, top5=0.997]
(pid=31751) 0200]: 90%|█████████ | 18/20 [00:04<00:00, 4.60it/s, loss=0.298, top1=0.905, top5=0.997]
 46%|████████████████████████████████████████████████████████████████████████▎ | 91/200 [30:07<00:30, 3.54it/s, cv1=200, cv2=200, cv3=90, cv4=200, cv5=200]
[*test 0000/0200]: 100%|██████████| 20/20 [00:06<00:00, 3.31it/s, loss=0.298, top1=0.904, top5=0.997]
(pid=31751) 2019-12-26 15:58:24,892 ERROR worker.py:433 -- SystemExit was raised from the worker
(pid=31751) Traceback (most recent call last):
(pid=31751)   File "python/ray/_raylet.pyx", line 711, in ray._raylet.task_execution_handler
(pid=31751)   File "python/ray/_raylet.pyx", line 694, in ray._raylet.execute_task
(pid=31751) SystemExit: 0
170500096it [30:00, 94677.34it/s]
 46%|████████████████████████████████████████████████████████████████████████▎
```
I tried the following script with the found policy on CIFAR-10, as described in the README:

```sh
python FastAutoAugment/train.py -c confs/wresnet40x2_cifar10_b512.yaml --aug fa_reduced_cifar10 --dataset cifar10
```

With this script, I achieved 92.89% accuracy (i.e., a 7.11% error rate) on the test set. However, a 3.7% error rate is reported in the README. This gap seems too large, so it looks like a bug.
How can it be fixed?
(My PyTorch version is 1.1.0 and the machine has 4 Titan Xp GPUs.)
The provided code first subsamples 6,000 images and only then filters out the irrelevant classes. Therefore, the resulting dataset is much smaller:
https://github.com/kakaobrain/fast-autoaugment/blob/master/FastAutoAugment/data.py#L132-L163
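A sketch of the order swap this report suggests, with hypothetical stand-in data: filtering to the retained classes first and only then drawing the 6,000-image subsample keeps the reduced set at its intended size.

```python
import random

labels = [random.randrange(1000) for _ in range(50000)]   # stand-in for dataset labels
kept_classes = set(range(120))                            # stand-in for the sampled classes

# Filter first, then subsample, so the final size really is min(6000, available).
candidate_idx = [i for i, y in enumerate(labels) if y in kept_classes]
subset_idx = random.sample(candidate_idx, min(6000, len(candidate_idx)))
print(len(subset_idx))
```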
In search.py, around L259, the code is final_policy_set.extend(final_policy); I think it should be final_policy_set.append(final_policy).
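The behavioral difference being pointed out, in isolation (the policy contents are hypothetical):

```python
final_policy = [[('Invert', 0.2, 3)], [('Rotate', 0.7, 2)]]  # hypothetical sub-policies

a = []
a.extend(final_policy)   # merges one level: a == [[('Invert', ...)], [('Rotate', ...)]]
b = []
b.append(final_policy)   # nests: b == [[[('Invert', ...)], [('Rotate', ...)]]]
print(len(a), len(b))    # 2 1
```

Which one is correct depends on whether downstream code expects a flat list of sub-policies or a list of policy sets.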
When I use

```sh
python search.py -c confs/wresnet40x2_cifar10_b512.yaml --dataroot ... --redis ...
```

I get results like:

```
[Errno 2] No such file or directory: '/home/kaijie.tang/code/fast-autoaugment/FastAutoAugment/models/cifar10_wresnet40_2_ratio0.4_fold0.model'
[Errno 2] No such file or directory: '/home/kaijie.tang/code/fast-autoaugment/FastAutoAugment/models/cifar10_wresnet40_2_ratio0.4_fold1.model'
[Errno 2] No such file or directory: '/home/kaijie.tang/code/fast-autoaugment/FastAutoAugment/models/cifar10_wresnet40_2_ratio0.4_fold2.model'
[Errno 2] No such file or directory: '/home/kaijie.tang/code/fast-autoaugment/FastAutoAugment/models/cifar10_wresnet40_2_ratio0.4_fold3.model'
[Errno 2] No such file or directory: '/home/kaijie.tang/code/fast-autoaugment/FastAutoAugment/models/cifar10_wresnet40_2_ratio0.4_fold4.model'
```

The failing code is around search.py line 186:

```python
for cv_idx in range(cv_num):
    try:
        # load the checkpoint produced by the train-without-augmentation stage
        latest_ckpt = torch.load(paths[cv_idx])
        if 'epoch' not in latest_ckpt:
            epochs_per_cv['cv%d' % (cv_idx + 1)] = C.get()['epoch']
            continue
        epochs_per_cv['cv%d' % (cv_idx + 1)] = latest_ckpt['epoch']
    except Exception as e:
        # checkpoint missing or unreadable: skip this fold for now
        continue
```

Why does the code load the checkpoints before training? Where can I find or generate those model checkpoints?
Hi Ildoo, thank you for sharing this great code.
I tried ResNet-18 on ImageNet, but the result is not good. Have you ever experimented with ResNet-18, or do you have any suggestions?
Hyperparameters: SGD with linear lr=0.4, batch=1024, weight decay=1e-4, epochs=120.

| method   | top1  |
|----------|-------|
| baseline | 70.68 |
| fast aa  | 70.22 |
Hi~ I have some questions about the paper and code.

Questions about the code:
1. What is eval_tta() (in search.py) for?
2. Why are loops like for _ in range(1): used in some places, e.g. in search.py and in class Augmentation in data.py?

Questions about the algorithm:
The paper says: "our goal is to improve the generalization ability by searching the augmentation policies that match the density of $D_{train}$ with the density of augmented $D_{valid}$". However, the FAA algorithm just seems to fit the already-trained model. It may only pick augmented data on which the model easily scores well, and such 'easy' augmented data may not match the training data. Is there any theoretical guarantee that this algorithm works?

Inconsistency (maybe) between code and paper:
The paper says "$\mathcal{T}$ indicates a set of augmented images of dataset $D$ transformed by every sub-policy $\tau \in \mathcal{T}$". However, in class Augmentation in data.py, policy = random.choice(self.policies) is used, so only one of the five policies is applied when searching test-time augmentation policies. policy in the code is the same as sub-policy in the paper, right? But isn't this method actually the one used in AutoAugment, not FAA?

Thanks very much if you can offer some help! (A sketch of the selection logic in question follows below.)
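For reference, the per-image selection being described looks roughly like this (a simplified sketch of class Augmentation in data.py, assuming apply_augment is the op dispatcher in FastAutoAugment/augmentations.py): each call picks one sub-policy at random, so across many epochs every sub-policy gets applied some of the time.

```python
import random
from FastAutoAugment.augmentations import apply_augment  # repo's op dispatcher (assumed path)

class Augmentation:
    def __init__(self, policies):
        self.policies = policies                     # list of sub-policies

    def __call__(self, img):
        policy = random.choice(self.policies)        # one sub-policy per image per call
        for name, pr, level in policy:
            if random.random() > pr:                 # each op fires with its probability
                continue
            img = apply_augment(img, name, level)
        return img
```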
I ran the testing code using your provided models.

CIFAR-10:

| Model | Baseline | Cutout | AutoAugment | Fast AutoAugment (transfer/direct) | |
|---|---|---|---|---|---|
| Wide-ResNet-28-10 | 3.9 | 3.1 | 2.6 | 2.7 / 2.7 | Download |

CIFAR-100:

| Model | Baseline | Cutout | AutoAugment | Fast AutoAugment (transfer/direct) | |
|---|---|---|---|---|---|
| Wide-ResNet-28-10 | 18.8 | 18.4 | 17.1 | 17.3 / 17.3 | Download |

But I can't get the paper's results. Is something wrong?
Looking forward to your reply, thank you~
The results for CIFAR-10 are below:

```
[2020-11-12 06:10:48,729] [Fast AutoAugment] [WARNING] tag not provided, no tensorboard log.
[2020-11-12 06:10:48,730] [Fast AutoAugment] [INFO] ./FAA_Paper_models/cifar10_wresnet28x10_top1.pth file found. loading...
[2020-11-12 06:10:48,934] [Fast AutoAugment] [INFO] checkpoint epoch@10
[2020-11-12 06:10:48,941] [Fast AutoAugment] [INFO] optimizer.load_state_dict+
[2020-11-12 06:10:48,950] [Fast AutoAugment] [INFO] evaluation only+
[2020-11-12 06:11:57,150] [Fast AutoAugment] [INFO] done.
[2020-11-12 06:11:57,151] [Fast AutoAugment] [INFO] model: {'type': 'wresnet28_10'}
[2020-11-12 06:11:57,151] [Fast AutoAugment] [INFO] augmentation: fa_reduced_cifar10
[2020-11-12 06:11:57,151] [Fast AutoAugment] [INFO]
{
    "loss_train": NaN,
    "loss_valid": 0.0,
    "loss_test": NaN,
    "top1_train": 0.09475160256410256,
    "top1_valid": 0.0,
    "top1_test": 0.0909,
    "top5_train": 0.4834735576923077,
    "top5_valid": 0.0,
    "top5_test": 0.4729,
    "epoch": 0
}
[2020-11-12 06:11:57,152] [Fast AutoAugment] [INFO] elapsed time: 0.021 Hours
[2020-11-12 06:11:57,152] [Fast AutoAugment] [INFO] top1 error in testset: 0.9091
[2020-11-12 06:11:57,152] [Fast AutoAugment] [INFO] ./FAA_Paper_models/cifar10_wresnet28x10_top1.pth
```
After the iterative search in the parameter space completes, it gets stuck with no error message (399 is the last iteration):

```
iter 397 ma=0.509 OrderedDict([('RUNNING', 1), ('TERMINATED', 198), ('PENDING', 1), ('PAUSED', 0), ('ERROR', 0)])
2021-05-07 16:49:31,787 WARNING logger.py:126 -- Couldn't import TensorFlow - disabling TensorBoard logging.
2021-05-07 16:49:31,787 WARNING logger.py:220 -- Could not instantiate <class 'ray.tune.logger.TFLogger'> - skipping.
iter 398 ma=0.509 OrderedDict([('RUNNING', 2), ('TERMINATED', 198), ('PENDING', 0), ('PAUSED', 0), ('ERROR', 0)])
2021-05-07 16:49:48,651 INFO ray_trial_executor.py:178 -- Destroying actor for trial search_par_resnet50_fold1_ratio0.4_200_cv_fold=1,cv_ratio_test=0.4,dataroot=_home_ccf_project_SB_PAR_data_rapv2_,level_0_0=0.77372,level_0_1=0.45162,level_1_0=0.00049368,level_1_1=0.39083,level_2_0=0.46218,level_2_1=0.69141,level_3_0=0.0028208,level_3_1=0.27047,level_4_0=0.65674,level_4_1=0.84919,num_op=2,num_policy=5,policy_0_0=3,policy_0_1=7,policy_1_0=0,policy_1_1=10,policy_2_0=13,policy_2_1=7,policy_3_0=5,policy_3_1=12,policy_4_0=11,policy_4_1=1,prob_0_0=0.40213,prob_0_1=0.39349,prob_1_0=0.47788,prob_1_1=0.63856,prob_2_0=0.6497,prob_2_1=0.50779,prob_3_0=0.58183,prob_3_1=0.30122,prob_4_0=0.62576,prob_4_1=0.92233,save_path=_home_ccf_project_fastautoaugment_models_par_resnet50_ratio0.4_fold1.model. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
iter 399 ma=0.509 OrderedDict([('RUNNING', 1), ('TERMINATED', 199), ('PENDING', 0), ('PAUSED', 0), ('ERROR', 0)])
```

I found that if the following errors are reported before the iterations complete, it gets stuck; if there are no errors, it continues to the next stage:

```
iter 364 ma=0.509 OrderedDict([('RUNNING', 2), ('TERMINATED', 181), ('PENDING', 17), ('PAUSED', 0), ('ERROR', 0)])
(pid=45772) WARNING: Logging before InitGoogleLogging() is written to STDERR
(pid=45772) E0507 16:44:34.685359 45846 raylet_client.cc:345] IOError: [RayletClient] Connection closed unexpectedly. [RayletClient] Failed to push profile events.
```

Environment:
ray==0.6.5
python=3.6.9
TensorFlow is not installed.
CentOS 7
Hi, thank you for your work.
Here are my problems:

(1)

```
from watch import PyStopwatch
ImportError: cannot import name 'PyStopwatch' from 'watch' (D:\Anaconda3\lib\site-packages\watch\__init__.py)
```

(2)

```
[2021-07-24 14:43:40,018] [Fast AutoAugment] [INFO] initialize ray...
2021-07-24 14:43:43,768 INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265
[2021-07-24 14:43:55,301] [Fast AutoAugment] [INFO] search augmentation policies, dataset=cifar10 model=wresnet40_2
[2021-07-24 14:43:55,301] [Fast AutoAugment] [INFO] ----- Train without Augmentations cv=5 ratio(test)=0.4 -----
['C:\Users\djr83\Desktop\fast-autoaugment-master\FastAutoAugment\models/cifar10_wresnet40_2_ratio0.4_fold0.model', 'C:\Users\djr83\Desktop\fast-autoaugment-master\FastAutoAugment\models/cifar10_wresnet40_2_ratio0.4_fold1.model', 'C:\Users\djr83\Desktop\fast-autoaugment-master\FastAutoAugment\models/cifar10_wresnet40_2_ratio0.4_fold2.model', 'C:\Users\djr83\Desktop\fast-autoaugment-master\FastAutoAugment\models/cifar10_wresnet40_2_ratio0.4_fold3.model', 'C:\Users\djr83\Desktop\fast-autoaugment-master\FastAutoAugment\models/cifar10_wresnet40_2_ratio0.4_fold4.model']
0%| | 0/2 [00:00<?, ?it/s]2021-07-24 14:43:55,344 WARNING worker.py:1123 -- The actor or task with ID a67dc375e60ddd1affffffffffffffffffffffff01000000 cannot be scheduled right now. It requires {GPU: 4.000000}, {CPU: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.
0%| | 0/2 [18:40<?, ?it/s]
```

The version of Ray I use is 1.4, running under Windows 10.
Could you describe the running environment in detail?
Thank you.
Could you please tell me the proportions of D_M and D_A in each fold? Are they evenly split?
When I run search.py, I get an error from register_trainable:

```
ValueError: Unknown argument found in the Trainable function. The function args must include a 'config' positional parameter. Any other args must be 'checkpoint_dir'. Found: ['augs', 'rpt']
```

Any ideas how to fix this?
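A hedged sketch of one possible workaround, assuming the extra arguments can be bound before registration so that the visible signature is just (config, checkpoint_dir=None), which is what newer Ray insists on; the argument values here are illustrative, not from the repo:

```python
from functools import partial
from ray import tune

def eval_tta(config, checkpoint_dir=None, augs=None, rpt=None):
    # ... original body, using augs/rpt as before ...
    pass

# Bind the offending extra arguments up front; tune then sees only
# (config, checkpoint_dir=None) when it inspects the trainable.
tune.register_trainable('eval_tta', partial(eval_tta, augs='fa_reduced_cifar10', rpt=0))
```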
Hi, @ildoonet
Thanks for your great work; it has inspired me a lot. Recently I have been trying to reproduce the search results. When I use ray.tune's HyperOptSearch as the search method, I cannot get higher accuracy after augmenting the validation data compared with "without augmentation". However, as mentioned in your paper, the results should be better.
Is this phenomenon normal? If not, how did you use the ray package to implement the search process?
Looking forward to your reply.
Thanks.
In Imagenet.py, the ImageNet files are fetched through a pre-defined URL, but when I try to run Imagenet.py, the pre-defined URL is not found.
Any suggestion for a working ImageNet URL?
Thank you!
fast-autoaugment/FastAutoAugment/archive.py
Line 106 in e79d0c7
Hello,
I have been trying FAA in my application.
I noticed that in search.py, line 116, you're basically taking the minimum loss over all the losses for each image.
Since your loss function defined in line 95 has no reduction, losses ends up being a vector of shape (num_policies*batch_size,). Therefore you just get the minimum loss of a single image as the minimum-loss metric. Hence, if you are truly minimizing loss (as the paper says) using this code, then you're mostly getting random noise as your reward_attr, since there will almost always be at least one very good prediction giving a low loss.
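To make the reduction issue concrete, a small self-contained illustration (shapes chosen arbitrarily): with reduction='none', a global .min() rewards the single luckiest image, whereas a per-policy mean would be the statistic one presumably wants.

```python
import torch
import torch.nn.functional as F

num_policies, batch_size, num_classes = 5, 8, 10
logits = torch.randn(num_policies * batch_size, num_classes)
labels = torch.randint(num_classes, (num_policies * batch_size,))

losses = F.cross_entropy(logits, labels, reduction='none')  # shape: (num_policies*batch_size,)
print(losses.min())                                          # single luckiest image: a noisy signal
print(losses.view(num_policies, batch_size).mean(dim=1))     # per-policy means: arguably intended
```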
This may help explain why your CIFAR-10 and SVHN policies are basically random. (I am using the policies from archive.py here.)
Each augmentation appears roughly the same number of times, with an almost uniform distribution of strength and probability. What's plotted is the normalized probability of each augmentation ((number of times it appears / total augmentations) × mean probability of the augmentation) on the y-axis, versus the average strength of each augmentation on the x-axis. The same graph holds for SVHN.
On ImageNet, the distribution does seem a little less random; perhaps this is because the loss is a little more meaningful with 1000 classes, so the minimum loss is a slightly less noisy reward signal.
By contrast, the augmentations from AutoAug:
This makes more sense to me: there should be some terrible operations that don't get used much, and some that are valuable and get used more. The fact that they are roughly equal for CIFAR-10 is surprising.
So the question: did you use top_1_valid to get the policies in archive.py, or minus_loss? If it's the latter, was the published code the code actually used?
Thanks!
Sean
```
[2020-08-03 21:23:54,603] [Fast AutoAugment] [INFO] processed in 76.2692 secs
[2020-08-03 21:23:54,603] [Fast AutoAugment] [INFO] ----- Search Test-Time Augmentation Policies -----
search_cifar10_wresnet40_2_fold0_ratio0.1
Traceback (most recent call last):
  File "search.py", line 230, in <module>
    algo = HyperOptSearch(space, max_concurrent=4*20, reward_attr=reward_attr)
TypeError: __init__() got an unexpected keyword argument 'reward_attr'
[*test 0000/0010]: 100%|██████████| 79/79 [00:01<00:00, 50.82it/s, loss=0.459, top1=0.848, top5=0.994, loss_ema=0.423]
```
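That TypeError matches a known Ray Tune API change: newer versions dropped reward_attr in favor of metric plus mode. A hedged sketch of the adjusted call, reusing the variables from the traceback (argument availability varies by ray version; max_concurrent was itself later moved into a ConcurrencyLimiter):

```python
from ray.tune.suggest.hyperopt import HyperOptSearch

# old: HyperOptSearch(space, max_concurrent=4*20, reward_attr=reward_attr)
algo = HyperOptSearch(space, max_concurrent=4*20, metric=reward_attr, mode='max')
```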
When I use python search.py -c confs/wresnet40x2_cifar.yaml --aug default,
there are some errors. Also, I want to know where I can see the AutoAugment policy that I searched.
@ildoonet From the config files, it seems there is no cutout in the search stage on reduced ImageNet, nor in the eval stage on ImageNet. Could you confirm or share detailed information about cutout on ImageNet?
I am applying this algorithm to a voice anti-spoofing problem.
Please answer the following questions, as I could not work them out from your paper (Fast AutoAugment):
I hope I described it clearly.
With love, Makarov Rostislav.
Hi guys! Nice paper.
I was wondering if you considered using kornia.augmentation for your project? I believe it can help with differentiating through the whole set of augmentation operators.
It would also help us a lot to test the robustness of our API in real use cases.
Thanks in advance,
Edgar
Thanks for publishing the code.
To evaluate performance on CIFAR-100, we need the hyper-parameters used for training on CIFAR-100. Could you provide the training confs for CIFAR-100?
This code seems broken. There were many minor bugs that I fixed, but now I see this error when running search.py, even though I still have disk space left:

```
OSError: [Errno 28] No space left on device
2020-03-13 03:06:59,252 ERROR trial_runner.py:345 -- Trial Runner checkpointing failed.
78), ('PAUSED', 0), ('ERROR', 3)])
Traceback (most recent call last):
  File "/home/zhasan/anaconda3/envs/cs234/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 343, in step
    self.checkpoint()
  File "/home/zhasan/anaconda3/envs/cs234/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 272, in checkpoint
    json.dump(runner_state, f, indent=2, cls=_TuneFunctionEncoder)
  File "/home/zhasan/anaconda3/envs/cs234/lib/python3.6/json/__init__.py", line 180, in dump
    fp.write(chunk)
OSError: [Errno 28] No space left on device
```
Hi,
I am currently trying to use PyramidNet + ShakeDrop. However, I am getting the following error:

```
RuntimeError: Output 0 of ShakeDropFunctionBackward is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can remove this warning by cloning the output of the custom Function.
```

If I try to fix the error by changing some lines, memory usage increases a lot. So I was wondering whether you also encountered this error.
Thank you!
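The error message's suggested fix, in a minimal self-contained form (this is a toy Function, not the repo's ShakeDrop): returning a clone instead of the input itself avoids the view+inplace conflict, at the cost of an extra copy, which is consistent with the memory increase observed above.

```python
import torch

class Passthrough(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Returning `x` as-is would make the output a view of the input, and a
        # later in-place op on it triggers the RuntimeError quoted above.
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

x = torch.randn(4, requires_grad=True)
y = Passthrough.apply(x)
y += 1                     # in-place modification is now safe
y.sum().backward()
print(x.grad)              # tensor([1., 1., 1., 1.])
```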
First of all, thank you very much for generously sharing your code publicly.
My problem happens when I try to run search.py: it returns the error shown in the image below. I don't know how to get the models folder inside the FastAutoAugment folder.
I hope you can answer soon. Thank you very much!
P.S.: I ran your code in Google Colaboratory.