Using deep reinforcement learning to design a broadband acoustic cloak. Created under the supervision of Dr. Feruza Amirkulova and Dr. Peter Gerstoft. With the help of: Linwei Zhou, Peter Lai, and Amaris De La Rosa.
This is the first code written for this RL project, using a static dataset. Somehow it performs better than the methods we have now. It is simply a critic network that learns the resulting change in mean TSCS caused by an action. Optimized configurations are discovered by sampling random actions from a starting configuration and choosing the one that suppresses the scattering the most.
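For context, the search procedure described above could be sketched roughly as follows; `critic`, `sample_action`, and `apply_action` are placeholders for whatever the static-dataset code actually uses, not names from the repo.

```python
import numpy as np

def greedy_step(critic, config, sample_action, apply_action, n_candidates=100):
    """Pick the random candidate action with the lowest predicted change in mean TSCS."""
    best_action, best_score = None, np.inf
    for _ in range(n_candidates):
        action = sample_action()
        # The critic predicts the change in mean TSCS caused by this action;
        # more negative means more scattering suppression.
        score = critic(config, action)
        if score < best_score:
            best_action, best_score = action, score
    return apply_action(config, best_action)
```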
Create some code which allows us to save (state, action, reward, next_state, done) tuples to a database of some kind, and also include the ability to save images. A sketch of one possible logger follows the test cases below.
Try using some RL techniques to test different hyperparameters on this data.
Test cases:
Able to save data.
Able to easily read the data back into Python.
Run experiments on the data to determine which hyperparameters are best; specifically the number of hidden layers, neurons per layer, gamma, and optimizer type.
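Something along these lines might serve as a starting point for the logger; the class name, the pickle file format, and the optional image field are assumptions for illustration, not what the repo currently does.

```python
import pickle
from collections import namedtuple

Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state', 'done'))

class TransitionLogger:
    """Accumulates transitions (and optional configuration images) and saves them to disk."""
    def __init__(self, path='transitions.pkl'):
        self.path = path
        self.buffer = []

    def add(self, state, action, reward, next_state, done, image=None):
        # Optionally attach a rendered image of the cylinder configuration.
        self.buffer.append((Transition(state, action, reward, next_state, done), image))

    def save(self):
        with open(self.path, 'wb') as f:
            pickle.dump(self.buffer, f)

    @staticmethod
    def load(path='transitions.pkl'):
        with open(path, 'rb') as f:
            return pickle.load(f)
```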
Current behavior is that states are passed to the Actor network, which generates an action and scales it to a specified range. To simplify the code and add the ability to have differently scaled actions, we need to use gym env action spaces (see the sketch after the test cases below).
Test cases:
Change environments to gym.Env
Scale actions to action range specified in gym action space
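A minimal sketch of what the gym-based environment and action scaling could look like, assuming a Box action space with one (dx, dy) displacement per cylinder; the bounds, shapes, and the omitted step/reset methods are illustrative assumptions only.

```python
import gym
import numpy as np
from gym import spaces

class TSCSEnv(gym.Env):
    """Cylinder-configuration environment with a continuous Box action space (step/reset omitted)."""
    def __init__(self, n_cylinders=4, max_step=0.5):
        super().__init__()
        # One (dx, dy) displacement per cylinder.
        self.action_space = spaces.Box(low=-max_step, high=max_step,
                                       shape=(2 * n_cylinders,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf,
                                            shape=(2 * n_cylinders,), dtype=np.float32)

def scale_action(raw_action, action_space):
    """Map a tanh-squashed actor output in [-1, 1] to the Box bounds."""
    low, high = action_space.low, action_space.high
    return low + (raw_action + 1.0) * 0.5 * (high - low)
```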
Currently the code operates as one agent interacting with one environment, which slows training down significantly. If we have multiple environments generating data asynchronously and update the agent at each learning step, we will speed up training time.
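One possible way to parallelize data collection with plain multiprocessing is sketched below; `TSCSEnv`, the old-style 4-tuple step return, and how the policy is shared between processes are all assumptions that would need adjusting for our actual code.

```python
import multiprocessing as mp

def worker(env_fn, policy_fn, queue, n_steps):
    """Run one environment and push transitions onto a shared queue."""
    env = env_fn()
    state = env.reset()
    for _ in range(n_steps):
        action = policy_fn(state)
        next_state, reward, done, _ = env.step(action)
        queue.put((state, action, reward, next_state, done))
        state = env.reset() if done else next_state

def collect(env_fn, policy_fn, n_workers=4, n_steps=1000):
    """Collect transitions from several environments running in parallel."""
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(env_fn, policy_fn, queue, n_steps))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    # Drain the queue before joining to avoid blocking on a full queue.
    transitions = [queue.get() for _ in range(n_workers * n_steps)]
    for p in procs:
        p.join()
    return transitions
```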
Currently the way we apply an action to a configuration is to simply add the action vector to the coordinates of the current configuration. If the resulting configuration is invalid (overlapping cylinders or cylinders beyond the walls), we reject it, revert to the original configuration, and give a negative reward. This system probably causes the agent to see fewer states, since every illegal move returns the environment to the same state it was in before the move. We need a way to apply partial actions to the environment in a consistent and time-efficient manner.
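One candidate for partial actions is to shrink the displacement until the configuration becomes valid; `is_valid` is assumed here to wrap the existing overlap and wall checks.

```python
import numpy as np

def apply_partial_action(config, action, is_valid, n_tries=10):
    """Return the largest fraction of `action` that keeps the configuration valid."""
    for i in range(n_tries, -1, -1):
        fraction = i / n_tries
        candidate = config + fraction * action
        if is_valid(candidate):
            return candidate, fraction
    # A zero step should always be valid, but fall back to the original config just in case.
    return config, 0.0
```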
Try creating a better reward function which is universal across all wavenumber ranges. So far I have been getting OK results with simple reward functions, but maybe there is a better solution.
You can modify the reward function (getReward) in the env.py file in the DDPG folder.
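As a starting point, a reward based on the relative suppression of the mean TSCS is scale-free across wavenumber ranges. The attribute names (`self.TSCS`, `self.initial_TSCS`), the method signature, and the invalid-move penalty below are assumptions about env.py, not its current contents.

```python
def getReward(self, invalid):
    """Reward relative TSCS suppression so the scale is comparable across wavenumber ranges."""
    if invalid:
        return -1.0  # penalty for overlapping cylinders or leaving the walls
    mean_tscs = self.TSCS.mean()
    # Relative suppression, roughly in [0, 1] regardless of the absolute TSCS scale.
    return float(1.0 - mean_tscs / self.initial_TSCS.mean())
```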
In models.py in the DDPG folder you can create two more models (ImageActor, ImageCritic) which are able to process images in addition to the standard state data; a sketch is given after the notes below.
Create a new ImageDDPG object which inherits from DDPG and overrides any methods you need to change.
Notes:
We already have a function in env.py which produces an image from a configuration of cylinders. Call env.getImage(env.config) to produce an image.
You will also need to modify the way we store data; I suggest adding two additional fields to the namedtuple on line 51 in ddpg.py.
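A rough sketch of what ImageActor could look like, fusing a small CNN over the configuration image with the existing state vector; the layer sizes and action scaling are illustrative assumptions, and ImageCritic would follow the same pattern with the action concatenated into the fully connected part.

```python
import torch
import torch.nn as nn

class ImageActor(nn.Module):
    """Actor that combines CNN features of the configuration image with the state vector."""
    def __init__(self, state_dim, action_dim, action_range):
        super().__init__()
        self.action_range = action_range
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc = nn.Sequential(
            nn.Linear(32 + state_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh())

    def forward(self, state, image):
        img_features = self.conv(image)               # (batch, 32)
        x = torch.cat([img_features, state], dim=-1)  # fuse image and state features
        return self.fc(x) * self.action_range         # scale tanh output to action range
```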
Currently the way our DDPG explores is that the actor generates an action, represented by an 8-by-1 vector, and noise sampled from a normal distribution with mean 0 and scale epsilon is added to it. This is an OK way to explore, but perhaps there is a better way. I found a paper by OpenAI which uses parameter noise for exploration: https://openai.com/blog/better-exploration-with-parameter-noise/
Read this paper and implement it on our DDPG by adding new noisy neural networks to the models.py file.
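A very simple version of parameter-space noise (without the adaptive sigma scheme from the paper) is to perturb a copy of the actor's weights before each rollout; the actor is assumed here to be a standard nn.Module.

```python
import copy
import torch

def make_perturbed_actor(actor, sigma=0.05):
    """Return a copy of the actor whose parameters are perturbed with Gaussian noise."""
    noisy_actor = copy.deepcopy(actor)
    with torch.no_grad():
        for param in noisy_actor.parameters():
            param.add_(torch.randn_like(param) * sigma)
    return noisy_actor
```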
Currently, with DDQN using discrete actions and a step size of 0.5, the lowest scattering the agent can find is ~0.45. This is not low enough. With a continuous action space we may get better results.
As described above, when an action produces an invalid configuration (overlapping cylinders or cylinders beyond the walls), we reject it, revert to the original configuration, and give a negative reward. This probably causes the agent to see fewer states, since every illegal move returns the environment to the state it was in before the move.
To solve this issue, we can first train the agent to learn how to output valid actions. Instead of going back to the original state when an invalid action is given, we can execute the invalid action and give the agent a penalty. This way the agent can learn from its own mistakes.
After this training, we can transfer the weights to a new agent. This way the number of invalid actions is minimized. We can use this new agent to speed up training and reduce the exploration problem. A rough sketch of the weight transfer is shown below.
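The weight transfer itself could be as simple as the following, assuming the agent exposes actor and critic modules in the usual DDPG layout; the attribute names and checkpoint path are placeholders.

```python
import torch

def transfer_weights(pretrained_agent, new_agent, path='pretrained.pt'):
    """Save the pretrained actor/critic weights and load them into a fresh agent."""
    torch.save({'actor': pretrained_agent.actor.state_dict(),
                'critic': pretrained_agent.critic.state_dict()}, path)
    checkpoint = torch.load(path)
    new_agent.actor.load_state_dict(checkpoint['actor'])
    new_agent.critic.load_state_dict(checkpoint['critic'])
```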
Attempting to increase the number of cylinders in the environment with a single agent shows no sign of convergence. Perhaps this is because the problem becomes much more complex when we increase the number of design parameters.
If we increase the number of agents, maybe the problem will become simple enough for each agent to solve.