shariqiqbal2810 / MAAC
Code for "Actor-Attention-Critic for Multi-Agent Reinforcement Learning" (ICML 2019)
License: MIT License
Hello, thank you very much for open-sourcing the code for this paper. This is very good work!
When running this code on the Cooperative Treasure Collection multi-agent environment, my results are as follows:
These results are quite different from the average reward in the paper, which is about 100, and I have not changed any parameters. Is there anything special about how the average reward is calculated?
I noticed that the code crashes when 2 agents are used, since there is a dimension problem with the sum function in critics.py, line 138.
I managed to sort it out this way:
for i, a_i in enumerate(agents):
    if max(agents) == 1:  # i.e., only 2 agents
        head_entropies = [(-((probs + 1e-8).log() * probs).squeeze().sum(0)
                           .mean()) for probs in all_attend_probs[i]]
    else:
        head_entropies = [(-((probs + 1e-8).log() * probs).squeeze().sum(1)
                           .mean()) for probs in all_attend_probs[i]]
Does this look right to you?
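A possibly simpler alternative, under the assumption that each entry of all_attend_probs[i] has shape (batch, 1, n_agents - 1), is to squeeze only the known singleton dimension so the agent axis survives regardless of the agent count:

# Sketch, assuming probs has shape (batch, 1, n_agents - 1): squeeze only
# the known singleton dim, so the agent axis survives even when
# n_agents - 1 == 1 and no branching on the agent count is needed.
head_entropies = [(-((probs + 1e-8).log() * probs).squeeze(1).sum(1)
                   .mean()) for probs in all_attend_probs[i]]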
I read your MAAC algorithm. Is this algorithm based on SAC, and if so, how is the formula in the paper derived from SAC?
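For context, this is one reader's paraphrase (not the authors' statement): SAC maximizes an entropy-regularized return, and the per-agent gradient in the MAAC paper has the same structure with a centralized attention critic and a counterfactual baseline:

J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}\big[ r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t)) \big]

\nabla_{\theta_i} J(\pi_\theta) = \mathbb{E}_{o \sim D,\, a \sim \pi}\Big[ \nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid o_i)\, \big( -\alpha \log \pi_{\theta_i}(a_i \mid o_i) + Q_i^{\psi}(o, a) - b(o, a_{\setminus i}) \big) \Big]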
Episodes 1-13 of 50000
Process Process-1:
Traceback (most recent call last):
File "/home/gezhixin/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/gezhixin/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/gezhixin/MAAC-master/utils/env_wrappers.py", line 20, in worker
ob = env.reset()
File "/home/gezhixin/anaconda3/lib/python3.6/site-packages/gym/core.py", line 66, in reset
raise NotImplementedError
NotImplementedError
(The identical traceback repeats for Process-2 and Process-3.)
How can I visualize during training, and how can I test the trained model?
Hi,
When I run the code in the fullobs_collect_treasure domain on CPU only, I noticed a memory leak inside the model.update_critic and model.update_policies functions. Even after the buffer is completely filled, memory usage keeps going up and eventually exhausts my RAM. I don't know which line of the code leads to this problem.
Has anyone else run into this issue? Thank you!
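In case it helps with debugging, here is a small diagnostic (my own sketch, not from this repo) that counts live tensors between updates; a steadily growing count usually means tensors are being kept alive together with their autograd graphs (e.g., accumulating a loss without .item() or .detach()):

import gc
import torch

def count_live_tensors():
    """Count tensors currently tracked by the Python garbage collector."""
    n = 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                n += 1
        except Exception:  # some objects raise on inspection; skip them
            pass
    return n

# Usage: compare counts around an update step, e.g.
#   before = count_live_tensors()
#   model.update_critic(sample)
#   print('tensor delta:', count_live_tensors() - before)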
The names of the "update_policies" and "update_critic" functions are swapped.
Since the two functions are always used together, this does not affect correctness.
The error log is as follows:
File "<frozen importlib._bootstrap_external>", line 678, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "D:\project_code\python\yiwei\MAAC-master\envs\mpe_scenarios\fullobs_collect_treasure.py", line 3, in <module>
from multiagent.core import World, Agent, Landmark, Wall
ImportError: cannot import name 'Wall'
It seems that there is no "Wall" class in the multiagent repo. I searched the core.py file in the multiagent repo (https://github.com/openai/multiagent-particle-envs) and found World, Agent, and Landmark, but there is no Wall class. Did you change your multiagent.core file? What is the definition of the Wall class? Thanks.
Hi, I am wondering whether you have a visualization of the results, so that the effectiveness of the algorithm on the two new environments can be seen directly?
I noticed that in baselines.common.vec_env, the VecEnv class does not define a render function. Is there any way I can do the visualization?
Thank you so much!
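For anyone else stuck here, one possible approach (my own sketch, not part of this repo) is to add a 'render' command to the worker loop in utils/env_wrappers.py and a matching method on the vectorized env, assuming the worker dispatches on (cmd, data) tuples as in the baselines-style SubprocVecEnv:

# Hypothetical branch inside the worker loop in utils/env_wrappers.py:
#     elif cmd == 'render':
#         remote.send(env.render(mode='rgb_array'))

class SubprocVecEnvWithRender(SubprocVecEnv):  # hypothetical subclass
    def render(self):
        """Ask every worker env for an RGB frame and collect the results."""
        for remote in self.remotes:
            remote.send(('render', None))
        return [remote.recv() for remote in self.remotes]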
Why are the average rewards from the code so much lower than those reported in the paper? I get ~6 after training, but the paper reports 125. Did you change the reward in the environment?
Also, at the end of each episode, for example in the multi_speaker_listener environment, the listener cannot reach its target position. Is this the same as your results?
Hi, in the code:

class SubprocVecEnv(VecEnv):
    def __init__(self, env_fns, spaces=None):
        """
        envs: list of gym environments to run in subprocesses
        """
        self.waiting = False
        self.closed = False
        nenvs = len(env_fns)
        self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
        self.ps = [Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
                   for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
        for p in self.ps:
            p.daemon = True  # if the main process crashes, we should not cause things to hang
            p.start()
        for remote in self.work_remotes:
            remote.close()
Why do we have the following?

for remote in self.work_remotes:
    remote.close()

If the remote pipe is closed, how can the messages be sent?
Thanks for sharing the source code of MAAC. This is a very interesting paper. When I reproduce the experiments, the result for Cooperative Treasure Collection is quite different from the paper's. The episode_length parameter is 100, and the statistics in the source code are computed over the last 100 steps for each agent:
def get_average_rewards(self, N):
    if self.filled_i == self.max_steps:
        inds = np.arange(self.curr_i - N, self.curr_i)  # allow for negative indexing
    else:
        inds = np.arange(max(0, self.curr_i - N), self.curr_i)
    return [self.rew_buffs[i][inds].sum() for i in range(self.num_agents)]
Therefore, I sum the values of each agent to get the result shown in the figure.
So, I want to know how the results in the original paper were calculated!
Hoping for your reply!
Wei Zhou,
[email protected]
Hi, thanks for this great code. I've been using it for some experiments and have been having issues with reproducibility. One thing I notice is that the learning curves differ even when I set the same random seed. I still get different results even if I do:
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU
np.random.seed(seed)  # NumPy module
random.seed(seed)  # Python random module
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
Have you noticed this issue and is there a way to resolve it?
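One thing the snippet above cannot control is the rollout subprocesses: each SubprocVecEnv worker carries its own RNG state. A sketch of per-worker seeding, where make_env stands in for however a single environment is constructed (the name is hypothetical here):

import numpy as np

def make_seeded_env_fn(env_id, seed, rank):
    """Build an env constructor that seeds this worker's env and NumPy RNG."""
    def init_env():
        env = make_env(env_id)              # hypothetical single-env constructor
        env.seed(seed + rank * 1000)        # distinct seed per worker process
        np.random.seed(seed + rank * 1000)  # worker-local NumPy RNG
        return env
    return init_env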
Also, a quick unrelated question: for Treasure Collection I notice that by default you use substantial reward shaping. Was this shaping used for the Actor-Attention-Critic paper? Were you able to successfully train without it?
Hello! I ran your MADDPG on the simple_spread environment in your fork of MPE, but it doesn't work the way it does in the original MPE. Could you please help me fix this problem?
When I run your multi-agent particle environments, I get this error:
Traceback (most recent call last):
File "/home/cherry/multiagent-particle-envs-master/bin/interactive.py", line 26, in
env.render()
File "/home/cherry/anaconda3/envs/shyang/lib/python3.6/site-packages/gym/core.py", line 108, in render
raise NotImplementedError
NotImplementedError
Asking for advice about inps (list of PyTorch Matrices): the "inputs to each agent's encoder (batch of obs + ac)". How is this parameter set? I tried:

A = np.mat(obs1)
B = np.mat(action)
inps = list((A, B))

but states = [s for s, a in inps] fails with ValueError: too many values to unpack (expected 2).
Looking forward to your reply!
o_dim = 12
a_dim = 6
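As far as I can tell from the docstring quoted above, inps should be one (obs, action) pair per agent, not a flat [obs, action] list; a hypothetical sketch with the shapes from this post:

import torch

batch_size, o_dim, a_dim, n_agents = 32, 12, 6, 2  # o_dim/a_dim from this post

# One (observation_batch, action_batch) tuple per agent, so that
# `states = [s for s, a in inps]` can unpack each pair.
inps = [(torch.randn(batch_size, o_dim), torch.randn(batch_size, a_dim))
        for _ in range(n_agents)]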
Dear Shariq,
In your article "Actor-Attention-Critic for Multi-Agent Reinforcement Learning", figure 1., there is a MLP, which is said to be unique per agent according to the legend (blue background color), that realize a state-action encoding from (o_i, a_i) to e_i.
I believed this corresponds to what is named "critic_encoders" in your code. However, these encoders are listed in the shared_modules here:
Line 73 in 6174a01
Is it normal ? From my understanding, the consequence from being listed in shared_modules is that the gradients are scaled by (1/agents_numbers), so I believe this would have only minor consequences on the behavior of the algorithm.
Best regards
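For readers following along, the scaling mentioned above amounts to something like this (a paraphrase of the shared-gradient logic, not a verbatim copy of the repo):

def scale_shared_grads(shared_modules, nagents):
    """Scale gradients of parameters shared across agents by 1/nagents,
    so the effective step size does not grow with the number of agents."""
    for module in shared_modules:
        for p in module.parameters():
            if p.grad is not None:
                p.grad.data.mul_(1.0 / nagents)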
I wonder how the gradient back-propagates from Q to agent i's action.
Tracing back from Q (Lines 149 to 150 in 105d60e):
critic_in: Line 148 in 105d60e
s_encoding doesn't contain input from other_all_values[i]: Lines 125 to 141 in 105d60e
keys and values don't contain agent i's action as input, and the selector uses only observations as input: Lines 118 to 119 in 105d60e
So, is there a gradient from Q to agent i's action?
Hi Shariq, first, thank you for your code! It works well. But when optimizing the policy, shouldn't it be probs * (-pol_target)? Why do we use log_pi here?
MAAC/algorithms/attention_sac.py
Line 150 in bd263af
Dear Author,
I took a quick look at your code for the actor updates. It seems that you use an advantage soft actor-critic, i.e.,
Advantage: pol_target = q - v
Loss: pol_loss = (log_pi * (log_pi / self.reward_scale - pol_target).detach()).mean()
If you use the above updates, I think this is an on-policy soft A2C, so an unbiased actor should only be updated with freshly collected data rather than data from the replay buffer. Otherwise, it will be a biased estimate of the real policy gradient. Right?
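For reference, my reading of that loss as a gradient estimator, with \alpha = 1/\text{reward\_scale} and pol_target = Q - V as above:

\nabla_\theta \, \text{pol\_loss} = \mathbb{E}\Big[ \nabla_\theta \log \pi_\theta(a \mid o)\, \big( \alpha \log \pi_\theta(a \mid o) - (Q(o, a) - V(o)) \big) \Big]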
Best,
Hui
Hi, thanks for your great work! I have a question about how to visualize the attention weights between agents in the testing phase, i.e., Figure 6 in your article. Could you please give me some advice? Thank you very much!
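In case it is useful, here is a sketch of one way to do this, assuming you can get the per-head attention probabilities out of the critic (the all_attend_probs tensors mentioned in an earlier issue); the matplotlib usage and shapes are my assumptions:

import matplotlib.pyplot as plt

def plot_attention(attend_probs, agent_idx):
    """Heatmap of one agent's attention over the other agents.

    attend_probs: array of shape (n_heads, n_agents - 1), e.g. per-head
    weights averaged over a batch of evaluation states.
    """
    fig, ax = plt.subplots()
    im = ax.imshow(attend_probs, aspect='auto', cmap='viridis')
    ax.set_xlabel('other agents')
    ax.set_ylabel('attention head')
    ax.set_title('Attention weights of agent %d' % agent_idx)
    fig.colorbar(im, ax=ax)
    plt.show()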
Hi,
Your code is written very well! Thanks for your great work! But in all the folders I only found code for training the model; I didn't find code for testing or evaluating the trained model in the environments. Would you mind uploading an evaluation code file for testing or evaluating?
Thank you~
Hi! I would like to know how you implemented MADDPG+SAC and COMA+SAC. I cannot find those implementations in the source code.
Dear Shariq,
I am trying to run this code and have already installed all the requirements, but this is the error I get:
python3.7/multiprocessing/connection.py", line 383, in _recv
raise EOFError
I was wondering if you could let me know how I can solve it.
Best,
Azadeh
main.py: error: the following arguments are required: env_id, model_name
(Both positional arguments must be supplied, e.g. python main.py fullobs_collect_treasure run1, following the pattern of the command quoted in a later issue; "run1" is just an example model name.)
I was going through your code and I am having a difficult time understanding one part of the critic. If you look at
Line 148 in 1006cff
you are using just the agent's state encoding, together with the joint embedding of the state-action pairs of all other agents, as input to the Q-function. If I recall correctly, Equation 5 takes an embedding of the current agent's state-action pair alongside the joint embedding. Can you please explain what is going on here?
Hello, I tried to run your code on simple_reference and simple_world_comm, but the code reported an error. Can MAAC handle scenarios with both physical actions and communication actions?
Hi, thanks for your great work! I ran into a problem when running your code on other scenarios, i.e., simple_spread. The command I used is "python main.py simple_spread dataxx --use_gpu". Is it caused by the gym version? I have tested both gym 0.9.4 and 0.12.5, but the following problem still exists. Could you please give me some advice on the problem? Thanks very much!
Hi,
I'm very sorry to trouble you. I am reading your MAAC paper and I see that the results are compared with COMA, so I want to ask whether you can open-source the COMA code.
Thanks!
In the code:
- the input of sel_ext (query) is state_encodings
- the input of k_ext (key) is state_action_encodings
- the input of v_ext (value) is state_action_encodings
In the paper, the inputs of both key and query should be state_action_encodings. I think the correct inputs should be:
- the input of sel_ext (query) is state_action_encodings (changed)
- the input of k_ext (key) is state_action_encodings
- the input of v_ext (value) is state_encodings (changed)
Could you explain why this is done in the code?
Dear Shariq,
In your article, no bias is used to compute x_i. However, in the code the bias is not set to False for the value extractors, and I believe the default value is True:
Line 68 in 6174a01
Is there a reason for that? Thank you.
Hi Shariq,
In your implementation and in the MAAC paper, you use expected discounted returns to learn the state-action Q-function, e.g., Eq. (2) and (7), instead of the maximum of Q(s, a) with respect to the action a. Could you explain this or give a reference?
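For comparison, my understanding of the SAC-style target that takes an expectation over the policy rather than a max (notation mine, not the paper's):

y = r(o, a) + \gamma \, \mathbb{E}_{a' \sim \pi(\cdot \mid o')}\big[ Q_{\bar{\psi}}(o', a') - \alpha \log \pi(a' \mid o') \big]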
Best,
Yesiam
Hi,
Regarding the OpenAI baselines, I know you recommend commit hash 98257ef8c9bd23a24a330731ae54ed086d9ce4a7.
If I use the latest version of the OpenAI baselines, it takes much more memory (more than 10 times as much). Do you have any idea why this happens?
Thanks!
Hi, in your implementation SAC is used, but V is estimated via the Q-function when updating the critic and calculating the target Q, instead of with a separate value network as in the original SAC paper. Would you please explain this or give some references? Thanks!
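For reference, the identity this presumably relies on (the soft value expressed through Q, as in later SAC variants that drop the separate value network; notation mine):

V(o) = \mathbb{E}_{a \sim \pi(\cdot \mid o)}\big[ Q(o, a) - \alpha \log \pi(a \mid o) \big]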