walkerning / aw_nas

aw_nas: A Modularized and Extensible NAS Framework
License: MIT License
Dear authors,
Thanks for this nice work.
I am trying to run the GATES experiments on the NAS benchmarks. Although I have successfully repeated the GATES experiments on NAS-Bench-201, I found that there is no data-dumping code for NAS-Bench-201. Specifically, how are the PKL files 'nasbench201.pkl' and 'nasbench201_valid.pkl' generated for running scripts/nasbench/train_nasbench201_pkl.py?
Looking forward to your reply.
Shun Lu
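For reference, a rough sketch of one way such a pickle might be produced from the NAS-Bench-201 API. This is an assumption about the file contents, not the authors' actual dumping script; the nas_201_api accessors and metric key names may also differ across package versions:

```python
import pickle
from nas_201_api import NASBench201API as API

# Hypothetical reconstruction: dump (arch_string, valid_accuracy) pairs.
# The exact record format expected by train_nasbench201_pkl.py is an assumption.
api = API("NAS-Bench-201-v1_1-096897.pth")

data = []
for index in range(len(api)):
    arch_str = api[index]  # architecture string of the index-th architecture
    # `get_more_info` returns a metrics dict; key names vary across API versions
    info = api.get_more_info(index, "cifar10-valid", hp="200", is_random=False)
    data.append((arch_str, info["valid-accuracy"]))

with open("nasbench201_valid.pkl", "wb") as f:
    pickle.dump(data, f)
```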
In the function assemble_profiling_nets (aw_nas/aw_nas/hardware/utils.py, line 104 in 9feb249), the first conv layer is hard-coded. But profiling_primitives also contains a first conv layer, such as:
{'prim_type': 'conv_3x3', 'spatial_size': 224, 'C': 3, 'C_out': 41, 'stride': 2, 'affine': True}
This layer is never added to geno, which causes an infinite loop at aw_nas/aw_nas/hardware/utils.py, line 193 in 9feb249, because the input first conv layer will never be put into geno in the subsequent net generations.
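A minimal sketch of the failure mode (a hypothetical simplification of the assembly loop, not the actual utils.py code): if one primitive can never be consumed, the outer loop never terminates. Pre-filtering primitives that duplicate the hard-coded stem would avoid this:

```python
# Hypothetical simplification of the profiling-net assembly loop.
# `profiling_primitives` is a list of primitive dicts; each generated geno
# consumes some of them until none remain.
def assemble(profiling_primitives, stem_cfg):
    remaining = [
        p for p in profiling_primitives
        # Suggested guard: drop primitives equivalent to the hard-coded stem,
        # otherwise they can never be consumed and the loop never terminates.
        if not (p["prim_type"] == stem_cfg["prim_type"] and p["C"] == stem_cfg["C"])
    ]
    genos = []
    while remaining:
        geno, remaining = assemble_one_net(remaining)  # hypothetical helper
        genos.append(geno)
    return genos
```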
The trainer/controller/rollout code can be shared. Other things need new implementations: the rnn_super_net and diff_rnn_super_net implementations...
Hi! The presented DAG encoding scheme is very similar to the asynchronous message passing proposed in our existing paper "D-VAE: A Variational Autoencoder for Directed Acyclic Graphs", published in NeurIPS 2019. Unfortunately, it is not cited or discussed. Could you have a check?
Best,
Muhan
In every one-shot parameter training step, only a subset of the parameters is active, especially when mepa_sample_size is small. By default, we apply weight decay to all of the super net's parameters in every training step. Is this an "over-regularization" or a desired behavior (which I will refer to as "auto-regularization")? When some parameters are not active in any of the sampled architectures, maybe they should not be regularized, at least at the very beginning of training. Otherwise, the unsampled paths may become under-trained while the architectures sampled more often are trained even better, which could lead to insufficient exploration. A sketch of decaying only the active parameters is given after this paragraph.
However, once the controller is somehow well trained, a less-sampled path means it simply does not work well in the architecture, and thus the reduced training and extra regularization these paths receive act as an "auto-regularization" of the super network. (But do we really need this auto-regularization in this super network, given that the only usage of the super network is to be a performance indicator of its sub-networks?)
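A minimal sketch of the alternative, assuming a plain PyTorch training step and a candidate network exposing its active parameters through a hypothetical active_parameters() accessor; this is illustrative, not the aw_nas implementation:

```python
import torch

def step_with_active_only_decay(cand_net, loss, lr=0.05, weight_decay=3e-4):
    """One SGD step that applies weight decay only to the sampled sub-network.

    Assumes `cand_net.active_parameters()` yields the parameters used by the
    currently sampled architecture (hypothetical accessor).
    """
    params = list(cand_net.active_parameters())
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            # Decay is applied here, so never-sampled paths are never decayed.
            p.add_(g + weight_decay * p, alpha=-lr)
```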
When I run awnas search examples/nasbench/nasbench-101_sa.yaml --gpu 0 --save-every 10 --train-dir /public/data1/users/ziyechen/awnas/logs/nasbench-101_sa, an error occurs:
```
Traceback (most recent call last):
  File "/public/data1/users/ziyechen/.conda/envs/aw_nas/bin/awnas", line 33, in <module>
    sys.exit(load_entry_point('aw-nas', 'console_scripts', 'awnas')())
  File "/public/data1/users/ziyechen/.conda/envs/aw_nas/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/public/data1/users/ziyechen/.conda/envs/aw_nas/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/public/data1/users/ziyechen/.conda/envs/aw_nas/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/public/data1/users/ziyechen/.conda/envs/aw_nas/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/public/data1/users/ziyechen/.conda/envs/aw_nas/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/public/data1/users/ziyechen/aw_nas-master/aw_nas/main.py", line 235, in search
    trainer.train()
  File "/public/data1/users/ziyechen/aw_nas-master/aw_nas/trainer/simple.py", line 293, in train
    finished_e_steps, finished_c_steps)
  File "/public/data1/users/ziyechen/aw_nas-master/aw_nas/trainer/simple.py", line 204, in _controller_update
    step_loss=step_loss))
  File "/public/data1/users/ziyechen/aw_nas-master/aw_nas/btcs/nasbench_101.py", line 706, in evaluate_rollouts
    query_res = rollout.search_space.nasbench.query(rollout.genotype)
  File "/public/data1/users/ziyechen/nasbench/nasbench/api.py", line 237, in query
    fixed_stat, computed_stat = self.get_metrics_from_spec(model_spec)
  File "/public/data1/users/ziyechen/nasbench/nasbench/api.py", line 364, in get_metrics_from_spec
    self._check_spec(model_spec)
  File "/public/data1/users/ziyechen/nasbench/nasbench/api.py", line 391, in _check_spec
    % (op, self.config['available_ops']))
nasbench.api.OutOfDomainError: unsupported op none (available ops = ['conv3x3-bn-relu', 'conv1x1-bn-relu', 'maxpool3x3'])
```
sub_named_parameters: From a simple profiling run, named_parameters actually takes a significant amount of time, but I'm not sure it's due to the sub_named_parameters calculation. Profile what performance difference the candidate_cached_named_parameters and candidate_member_mask switches bring. For sub_named_parameters, we can also run profiling to see whether using a weakref.WeakValueDictionary to cache the op_type-to-named_parameters mapping helps with the performance. Calling del candidate_net after use can help reduce the memory usage significantly.
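Could try something like the following for the caching idea (a hypothetical sketch with made-up names, not the current code):

```python
import weakref

# Hypothetical cache: op_type -> module whose named_parameters we reuse.
# WeakValueDictionary entries disappear once the module is garbage-collected,
# so the cache never keeps candidate networks alive on its own.
_param_cache = weakref.WeakValueDictionary()

def cached_module_for(op_type, build_module):
    mod = _param_cache.get(op_type)
    if mod is None:
        mod = build_module(op_type)  # hypothetical module factory
        _param_cache[op_type] = mod
    return mod

def sub_named_parameters(op_types, build_module):
    # Reuse the cached module per op_type instead of re-walking the super net.
    for op_type in op_types:
        mod = cached_module_for(op_type, build_module)
        for name, param in mod.named_parameters():
            yield "{}.{}".format(op_type, name), param
```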
Dear TA-GATES authors,
I'm trying to figure out how your pre-processed pickle files of the ENAS and NB301 datasets for TA-GATES are formed. Can you provide the pre-processing code for these datasets?
Cheers.
Hi, helpful author, we meet again... I don't mean to keep raising questions...
The error message is as follows:
The error seems to occur in your eval_queue method. I compared this method before and after your modification: the version before the modification works normally, but this problem occurs after the modification.
If you have time, could you confirm whether this method really has the small problem I described? Thank you very much.
Maybe this is a problem on my side...
To search on ImageNet, this seems useful. All the interfaces can be left unchanged, except that when a weights manager is initialized, it should call DataParallel on itself. And when assembling the candidate network, it should also pass the parallelized weights manager to the candidate network.
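A minimal sketch of the idea in plain PyTorch, with hypothetical weights-manager and candidate-network classes (not the aw_nas interfaces):

```python
import torch.nn as nn

class SuperNet(nn.Module):
    """Hypothetical stand-in for a weights manager holding the shared weights."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, 32, 3, padding=1)

    def forward(self, x, rollout=None):
        # A real weights manager would use `rollout` to pick the active path;
        # the stem alone is enough for illustration.
        return self.stem(x)

class CandidateNet(nn.Module):
    """Hypothetical candidate net forwarding through the parallelized super net."""
    def __init__(self, parallel_supernet, rollout):
        super().__init__()
        self.parallel_supernet = parallel_supernet
        self.rollout = rollout

    def forward(self, x):
        # DataParallel handles the scatter/gather across GPUs.
        return self.parallel_supernet(x, self.rollout)

# Wrap once when the weights manager is initialized; every assembled candidate
# then reuses the same parallelized shared weights.
supernet = SuperNet().cuda()
parallel_supernet = nn.DataParallel(supernet, device_ids=[0, 1])
candidate = CandidateNet(parallel_supernet, rollout=None)
```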
trainer/archnetwork_trainer.py, evaluator/bftune.py, controller/compare.py, controller/archnetwork_controller.py ...
Research docs
Implement biased path reparametrization instead of learning-signal-based optimization for learning the controller. Differentiable relaxation of the sampling provides a more controllable way of backpropagating gradients through the discrete arch r.v.s, as the "hard/soft" level can control the bias-variance trade-off of the learning process. A sketch of such a relaxation is given after this note.
This needs a DifferentiableSuperNet weights manager and a new DifferentiableController controller. In DifferentiableController, the sampling probabilities of the node/op on each edge are modeled as global parameters. At first, we can sample the operation only.
Rollout needs changes too, as the arch representation is different now... So rollouts must have their subclasses too... The weights manager (the consumer/assembler) and the controller (the producer/sampler) are generally not agnostic to the rollout type, so it's reasonable to add an interface to specify which type of rollout a controller produces and a weights manager can take. The main script can be responsible for checking whether these rollout interfaces match. The handling of DifferentiableRollout in the trainer is different too... e.g., the mepa trainer should call set_perf with an in-graph loss tensor when using a differentiable rollout (eval should pass self._criterion instead of _ce_loss_mean in), but call set_perf with the accuracy or a detached loss when using a DiscreteRollout...
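A minimal sketch of the relaxation in plain PyTorch (Gumbel-Softmax; an assumption about what the relaxation could look like, not the actual DifferentiableController code):

```python
import torch
import torch.nn.functional as F

# Global architecture parameters: one logit per op choice on each edge.
num_edges, num_ops = 14, 8
alphas = torch.nn.Parameter(torch.zeros(num_edges, num_ops))

def sample_relaxed(temperature=1.0, hard=False):
    """Sample relaxed one-hot op choices for every edge.

    The temperature (and the `hard` straight-through variant) are the knobs
    that control the bias-variance trade-off of the gradient estimate
    flowing back through the discrete arch r.v.s.
    """
    return F.gumbel_softmax(alphas, tau=temperature, hard=hard, dim=-1)

weights = sample_relaxed(temperature=0.5)  # differentiable w.r.t. alphas
```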
Currently, the evolutionary controller for each search space is implemented separately! This is not elegant, since there are multiple copies of the same logic (e.g., tournament selection in controller.sample, the killing mechanism in controller.step, and the population save & load utility). Let us change evolution to use a single controller (implement RegularizedEvoController), and implement a mutate method in the Rollout class definitions. A sketch of this interface split is given after this note.
Also, implement ParetoEvoController for Pareto evolutionary search over multiple objectives!
The ModelRecord class should be changed into a search-space-agnostic one too (seems okay now? needs some tests). The current population requires a template final-training cfg to work; actually this can be unnecessary. A mutation rollout should proxy method calls and attribute lookups to the mutated rollout. But to enable access to the mutation and the parent rollout, it should also provide several special calls/attributes. This info should actually be put into the ModelRecord by the evaluator, and also be managed by Population. The controller does not need to access the ModelRecord.
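A minimal sketch of the interface split (hypothetical simplified classes, not the final aw_nas design): all selection logic lives in one controller, and only mutate is search-space-specific:

```python
import random

class Rollout:
    """Hypothetical base rollout; each search space implements `mutate`."""
    def __init__(self, arch, search_space):
        self.arch = arch
        self.search_space = search_space

    def mutate(self):
        raise NotImplementedError("search-space-specific")

class RegularizedEvoController:
    """Search-space-agnostic aging evolution: shared selection/killing logic."""
    def __init__(self, population, tournament_size=10):
        self.population = population          # list of (rollout, perf) pairs
        self.tournament_size = tournament_size

    def sample(self):
        # Tournament selection: mutate the best rollout of a random tournament.
        tournament = random.sample(self.population, self.tournament_size)
        parent, _ = max(tournament, key=lambda item: item[1])
        return parent.mutate()

    def step(self, rollout, perf):
        # Aging: add the newly evaluated model, kill the oldest one.
        self.population.append((rollout, perf))
        self.population.pop(0)
```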
Hi, thank you for sharing this great work for NAS!
While running the following command for generating configs:
awnas gen-sample-config -r nasbench-101 -d image ./sample_nb101.yaml
I get the following error:
Error: Invalid value for "-r" / "--rollout-type": invalid choice: nasbench-101. (choose from discrete, differentiable, mutation, dense_discrete, dense_mutation, ofa, ssd_ofa, compare, general, layer2, layer2-differentiable, macro-stagewise, macro-stagewise-diff, macro-sink-connect-diff, micro-dense, micro-dense-diff)
It seems that nasbench-101 is not supported as the -r argument; what would be the way to generate configs for other NAS approaches like nasbench101 and FBNet?
Thank you!
First of all, thank you for taking the time to modify your FTT-NAS guidance document.
After you modified the FTT-NAS README, I proceeded step by step according to your instructions, but in the end there was an error like the previous question, which is stuck at aw_nas/aw_nas/objective/fault_injection.py, line 159: np.random.randint(0, 2 * _n_err, size=size_)
I think the problem originates from aw_nas/aw_nas/objective/fault_injection.py, line 152:
size_ = fault_ind.sum().cpu().data
The size parameter of np.random.randint does not seem to support Tensor-typed variables; it only supports int or tuple... So I tried the following modification of your code:
size_ = fault_ind.sum().cpu().data --> size_ = fault_ind.sum().cpu().numpy()
After this modification, the bug seems to disappear, and now the project works normally.
So, if you have time, please confirm whether this bug appears when FTT-NAS uses the GPU to search, and whether my solution is appropriate. Thank you very much!
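For reference, a minimal reproduction of the type issue along with an alternative fix (converting to a plain Python int, which np.random.randint's size parameter always accepts); whether the tensor form fails depends on the torch/numpy versions:

```python
import numpy as np
import torch

fault_ind = torch.tensor([True, False, True, True])

size_ = fault_ind.sum().cpu().data       # still a torch.Tensor
# np.random.randint(0, 10, size=size_)   # may fail: size expects int or tuple

size_ = int(fault_ind.sum().item())      # plain Python int always works
noise = np.random.randint(0, 10, size=size_)
```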
Hi, I tried to set "inner_controller_type: pareto-evo" and "evaluator_type: tune" for robust predictor-based search, but it seems multishot-robnas does not support these. Could you elaborate on how to tune each rollout and how to use the Pareto-based EA? Thank you so much.
It's been a long time since I last used the ray dispatcher. We should test whether it still works, and maybe add a unit test.
In aw_nas/aw_nas/objective/zerocost.py, the module foresight cannot be imported.
The corresponding .py file also cannot be found.
How can I address this?
Hello,
I hope I am not missing something, but I cannot find the code/documentation for the TA-GATES paper. (https://openreview.net/forum?id=74fJwNrBlPI)
Has it been released yet? If not, when is the planned release?
Occasionally, the data loader in the search process gets stuck at some point... usually the first time the controller queue is used. This might be related to this issue: pytorch/pytorch/issues/1355.
pytorch/pytorch/issues/1355#issuecomment-308587289 says this issue might be related to shm running out. But there are 32G of shm configured, and the actual usage never gets close to that.
Tried adding some swap space to avoid the data-handling thread being killed due to running out of memory (did not work).
It seems this might also be due to calling iter on the data loader too early and then not using it for a long while. A commonly suggested workaround is sketched after this note.
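The workaround most often suggested in that upstream issue is switching the worker sharing strategy; this is a general PyTorch knob rather than anything aw_nas-specific, and whether it fixes this particular hang is an assumption:

```python
import torch.multiprocessing

# Use the file-system sharing strategy instead of file descriptors, the
# workaround commonly suggested in pytorch/pytorch#1355 for data-loader
# workers hanging or dying when fd/shm limits are hit.
torch.multiprocessing.set_sharing_strategy("file_system")
```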
Hi, I discovered a few small issues while running OFA latency profiling with the mobilenet v2 block.
aw_nas/aw_nas/ops/baseline_ops.py, line 482 in 9feb249, causes an error:
AssertionError: The passed parameters are different from the formal parameter list of primitive type `mobilenet_v2_block`, expected odict_keys(['expansion', 'C', 'C_out', 'stride', 'kernel_size', 'affine']), got ['C', 'C_out', 'stride', 'affine', 'kernel_size', 'activation', 'use_se', 'expansion']
Another one is a typo:
https://github.com/walkerning/aw_nas/blob/master/aw_nas/hardware/ofa_obj.py#L82
might need to be
```python
use_ses = use_ses or [
    None,
] * len(strides)
```
otherwise it causes a problem:
TypeError: 'NoneType' object is not subscriptable
AWNAS_HOME is an environment variable that defaults to ~/awnas. How can I customize this?
Thank you!
When I run "awnas search examples/nasbench/nasbench-101_gates_sa.yaml --gpu 0 --save-every 10 --train-dir /public/data1/users/ziyechen/awnas/logs/nasbench-101_gates_sa", the program is stuck in an endless loop.
I find the program is stuck in the 633 line of https://github.com/walkerning/aw_nas/blob/master/aw_nas/btcs/nasbench_101.py
try:
ss.nasbench._check_spec(new_rollout.genotype)
except api.OutOfDomainError:
continue
I print the mutated genotype, and find many 'none' operations.
I guess the mutation of operation in the 'else' clause may be wrong, since it may change the old operation with the 'none' opearion. And I think we should change 'new_ops = np.random.randint(0, ss.num_op_choices, size=1)[0]' to 'new_ops = np.random.randint(0, ss.num_op_choices-1, size=1)[0]', since the last operation is the 'none' operation.
Dear author,
Thanks a lot for developing such a good framework. I am interested in adversarial robustness. Could you show me how to implement adversarial supernet training using the plugins? Thanks a lot for your time.
Best wishes,
Jia
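Not the authors' plugin mechanism, but for concreteness, a minimal PGD adversarial-training step in plain PyTorch; an aw_nas plugin would presumably wrap something like this inside a custom objective placed under ~/awnas/plugins:

```python
import torch
import torch.nn.functional as F

def pgd_attack(net, inputs, targets, eps=8 / 255, alpha=2 / 255, steps=7):
    """Generate PGD adversarial examples (standard formulation, not aw_nas code)."""
    adv = inputs + torch.empty_like(inputs).uniform_(-eps, eps)
    adv = adv.clamp(0, 1).detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(net(adv), targets)
        grad = torch.autograd.grad(loss, adv)[0]
        # Ascend the loss, then project back into the eps-ball and valid range.
        adv = adv.detach() + alpha * grad.sign()
        adv = (inputs + (adv - inputs).clamp(-eps, eps)).clamp(0, 1).detach()
    return adv

def adversarial_supernet_loss(cand_net, inputs, targets):
    # Train the sampled candidate on adversarial examples instead of clean ones.
    adv = pgd_attack(cand_net, inputs, targets)
    return F.cross_entropy(cand_net(adv), targets)
```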
Hi, thanks for your great work! I set up awnas and the environment.
When preparing the environment, I used "ln -s `readlink -f plugins/robustness` ${HOME}/awnas/plugins".
However, I get the following error:
"aw_nas.utils.registry.RegistryError: No registry item dense_rob available in registry search_space."
How can I solve this problem?
Best wishes!
Hello,
The pre-processed data you provided seems to have a specific representation.
Can you provide the preprocessing code, or describe how you made it?
Thank you!
Hello author, after completing all the environment installation and configuration according to your requirements, I first used the CPU to search for architectures and found that both ENAS and FTT-NAS work fine. When I use the GPU to search, ENAS still works, but FTT-NAS has a problem, and I don't know how to solve it now...
The following is the error message:
fault_injection.py", line 244, in get_reward perfs = self.get_perfs(inputs, outputs, targets, cand_net)
fault_injection.py", line 254, in get_perfs outputs_f = cand_net.forward_one_step_callback(inputs, callback=self.inject) super_net.py", line 175, in forward_one_step_callback callback(context.last_state, context)
fault_injection.py", line 337, in inject context.last_state = self.injector.inject(state, n_mac=n_mac)
fault_injection.py", line 180, in inject return eval("self.inject_" + self.mode)(out, **kwargs)
fault_injection.py", line 159, in inject_fixed size=size_)])
prod() received an invalid combination of arguments - got (out=NoneType, axis=NoneType, )
Could I know the ray version?