uoe-agents / epymarl
An extension of the PyMARL codebase that includes additional algorithms and environment support
License: Apache License 2.0
Thank you for sharing!!
Using the colored RWARE environment, how do I load and visualize trained models?
import time

import gym

env = gym.make("rware-1color-tiny-4ag-v1", sensor_range=3, request_queue_size=6)
n_step = 1000
model = load_something(model_path)  # placeholder: how do I load the model?
obs = env.reset()  # reset once, before the loop
for _ in range(n_step):
    actions = model.forward(obs)
    obs, reward, done, info = env.step(actions)
    print(f"{info=}")
    env.render()
    time.sleep(0.5)
    if all(done):  # assuming done holds one flag per agent
        break
env.close()
Setting runner: "parallel" in maddpg_ns.yaml causes this error:
TypeError: select_actions() got an unexpected keyword argument 'bs'
Thanks for this impactful work. I look forward to your reply and thank you in advance.
Hi,
Thanks for the great code! However, when I run the given command to train on the MPE SimpleSpread task, the converged performance is far from the results in the paper. For example, the average returns with QMIX and MADDPG are around -500 and -400 respectively, and a random policy gets around -700. The paper reports around -120.
Could you tell me what the problem might be?
Thanks!
Command run:
python src/main.py --config=qmix --env-config=gymma with env_args.time_limit=500 env_args.key="rware:rware-tiny-2ag-v1"
Error:
Traceback (most recent call last):
File "/home/kinal/miniconda3/envs/epymarl/lib/python3.9/site-packages/sacred/experiment.py", line 312, in run_commandline
return self.run(
File "/home/kinal/miniconda3/envs/epymarl/lib/python3.9/site-packages/sacred/experiment.py", line 276, in run
run()
File "/home/kinal/miniconda3/envs/epymarl/lib/python3.9/site-packages/sacred/run.py", line 238, in __call__
self.result = self.main_function(*args)
File "/home/kinal/miniconda3/envs/epymarl/lib/python3.9/site-packages/sacred/config/captured_function.py", line 42, in captured_function
result = wrapped(*args, **kwargs)
File "/home/kinal/Desktop/marl/quick_results/epymarl/src/main.py", line 36, in my_main
run(_run, config, _log)
File "/home/kinal/Desktop/marl/quick_results/epymarl/src/run.py", line 55, in run
run_sequential(args=args, logger=logger)
File "/home/kinal/Desktop/marl/quick_results/epymarl/src/run.py", line 87, in run_sequential
runner = r_REGISTRY[args.runner](args=args, logger=logger)
File "/home/kinal/Desktop/marl/quick_results/epymarl/src/runners/episode_runner.py", line 15, in __init__
self.env = env_REGISTRY[self.args.env](**self.args.env_args)
File "/home/kinal/Desktop/marl/quick_results/epymarl/src/envs/__init__.py", line 17, in env_fn
return env(**kwargs)
File "/home/kinal/Desktop/marl/quick_results/epymarl/src/envs/__init__.py", line 84, in __init__
self._env = TimeLimit(gym.make(f"{key}"), max_episode_steps=time_limit)
File "/home/kinal/miniconda3/envs/epymarl/lib/python3.9/site-packages/gym/envs/registration.py", line 676, in make
return registry.make(id, **kwargs)
File "/home/kinal/miniconda3/envs/epymarl/lib/python3.9/site-packages/gym/envs/registration.py", line 490, in make
versions = self.env_specs.versions(namespace, name)
File "/home/kinal/miniconda3/envs/epymarl/lib/python3.9/site-packages/gym/envs/registration.py", line 220, in versions
self._assert_name_exists(namespace, name)
File "/home/kinal/miniconda3/envs/epymarl/lib/python3.9/site-packages/gym/envs/registration.py", line 297, in _assert_name_exists
raise error.NameNotFound(message)
gym.error.NameNotFound: Environment `rware:rware-tiny-2ag` doesn't exist. Did you mean: `rware-tiny-2ag`?
During handling of the above exception, another exception occurred:
Environment:
python 3.9
Package Version
------------------ --------------
absl-py 0.5.0
atomicwrites 1.2.1
attrs 21.4.0
brotlipy 0.7.0
certifi 2018.8.24
cffi 1.15.0
chardet 3.0.4
charset-normalizer 2.0.4
cloudpickle 2.0.0
colorama 0.4.4
cryptography 36.0.0
cycler 0.10.0
deepdiff 5.8.0
docopt 0.6.2
enum34 1.1.6
fonttools 4.32.0
future 0.16.0
gitdb 4.0.9
GitPython 3.1.27
gym 0.23.1
gym-notices 0.0.6
idna 2.7
imageio 2.17.0
importlib-metadata 4.11.3
iniconfig 1.1.1
jsonpickle 0.9.6
kiwisolver 1.0.1
lbforaging 1.1.0
matplotlib 3.5.1
mkl-fft 1.3.1
mkl-random 1.2.2
mkl-service 2.4.0
mock 2.0.0
more-itertools 4.3.0
mpyq 0.2.5
munch 2.3.2
networkx 2.8
numpy 1.22.3
ordered-set 4.1.0
packaging 21.3
pathlib2 2.3.2
pbr 4.3.0
Pillow 9.1.0
pip 21.2.4
pluggy 0.7.1
portpicker 1.2.0
probscale 0.2.5
protobuf 3.6.1
py 1.6.0
py-cpuinfo 8.0.0
pycparser 2.21
pygame 2.1.2
pyglet 1.5.23
pyOpenSSL 22.0.0
pyparsing 2.2.2
PySC2 3.0.0
PySocks 1.7.1
pytest 4.3.1
python-dateutil 2.7.3
PyYAML 5.3.1
requests 2.19.1
rware 1.0.3
s2clientprotocol 4.10.1.75800.0
s2protocol 5.0.9.87702.0
sacred 0.8.0
scipy 1.8.0
setuptools 61.2.0
six 1.11.0
sk-video 1.1.10
SMAC 1.0.0
smmap 5.0.0
snakeviz 2.1.1
tensorboard-logger 0.1.0
tomli 2.0.1
torch 1.11.0
torchaudio 0.11.0
torchvision 0.12.0
tornado 5.1.1
typing_extensions 4.1.1
urllib3 1.23
websocket-client 0.53.0
wheel 0.37.1
whichcraft 0.5.2
wrapt 1.10.11
zipp 3.8.0
Hi,
I have some doubts about the number of evaluations for off-policy and on-policy algorithms. The paper explains that:
To account for the improved sample efficiency of off-policy over on-policy algorithms and to allow for fair comparisons, we train off-policy algorithms for two million steps and on-policy algorithms for 20 million steps. We evaluate off-policy algorithms every 50,000 steps and on-policy algorithms every 500,000 steps. As a result, a total of 40 evaluations are executed during the training of each algorithm
But with the code as it is in the repository, changing only t_max to 2M for QMIX and 20M for MAPPO, I get 41 test_return_mean entries for QMIX and 401 for MAPPO. So I do not know whether I have to change another parameter or whether these numbers are correct.
Thanks
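The 41 and 401 counts are consistent with simple arithmetic, under the assumption (not confirmed in this thread) that one evaluation runs at step 0 and another every test_interval steps after that. The sketch below uses a hypothetical helper, n_evaluations; 50,000 is the test_interval value that appears in the configs posted on this page, and 500,000 is the on-policy interval quoted from the paper.

```python
def n_evaluations(t_max: int, test_interval: int) -> int:
    """Count evaluations, assuming one at step 0 and one every test_interval steps."""
    return t_max // test_interval + 1


print(n_evaluations(2_000_000, 50_000))    # 41
print(n_evaluations(20_000_000, 50_000))   # 401
print(n_evaluations(20_000_000, 500_000))  # 41
```

If this assumption holds, the 401 figure would come from test_interval staying at 50,000 for the 20M-step run, and the extra evaluation beyond the paper's 40 would be the one at step 0.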
Hey guys, me again. Thanks for closing the last issue so quickly, I hope this one is just as simple. I am trying to train qtran on the LBF environment and epymarl keeps crashing when I do so. There seems to be some issue with the argument args.use_rnn in the basic controller that is causing the program to fail. I am getting the following error: AttributeError: 'types.SimpleNamespace' object has no attribute 'use_rnn'
I hope it's not too much of a problem, and I look forward to seeing this resolved :)
thanks,
Peter
May I ask which version of Python we should use?
I'm trying to train the MADDPG algorithm on the SMAC environment with the command below:
python src/main.py --config=maddpg --env-config=sc2 with env_args.map_name="corridor"
but I got this error:
RuntimeError: [enforce fail at alloc_cpu.cpp:75] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 75067200000 bytes. Error code 12 (Cannot allocate memory)
my system:
ubuntu: 20.04
python: 3.7.12
torch: 1.13.1
Other algorithms run fine. I only get this problem with MADDPG, and I get it with both the main epymarl code and the addtional_algo code.
The full error output is below:
[DEBUG 10:22:34] git.cmd Popen(['git', 'version'], cwd=/home/abdulghani/epymarl_addtional_algo, universal_newlines=False, shell=None, istream=None)
[DEBUG 10:22:34] git.cmd Popen(['git', 'version'], cwd=/home/abdulghani/epymarl_addtional_algo, universal_newlines=False, shell=None, istream=None)
src/main.py:81: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config_dict = yaml.load(f)
src/main.py:50: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config_dict = yaml.load(f)
src/main.py:58: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
if isinstance(v, collections.Mapping):
[INFO 10:22:34] root Saving to FileStorageObserver in results/sacred.
[DEBUG 10:22:34] pymarl Using capture mode "fd"
[INFO 10:22:34] pymarl Running command 'my_main'
[INFO 10:22:34] pymarl Started run with ID "4"
[DEBUG 10:22:34] pymarl Starting Heartbeat
[DEBUG 10:22:34] my_main Started
[INFO 10:22:34] my_main Experiment Parameters:
[INFO 10:22:34] my_main
{ 'add_value_last_step': True,
'agent': 'rnn',
'agent_output_type': 'pi_logits',
'batch_size': 32,
'batch_size_run': 1,
'buffer_cpu_only': True,
'buffer_size': 50000,
'checkpoint_path': '',
'critic_type': 'maddpg_critic',
'env': 'sc2',
'env_args': { 'continuing_episode': False,
'debug': False,
'difficulty': '7',
'game_version': None,
'heuristic_ai': False,
'heuristic_rest': False,
'map_name': 'corridor',
'move_amount': 2,
'obs_all_health': True,
'obs_instead_of_state': False,
'obs_last_action': False,
'obs_own_health': True,
'obs_pathing_grid': False,
'obs_terrain_height': False,
'obs_timestep_number': False,
'replay_dir': '',
'replay_prefix': '',
'reward_death_value': 10,
'reward_defeat': 0,
'reward_negative_scale': 0.5,
'reward_only_positive': True,
'reward_scale': True,
'reward_scale_rate': 20,
'reward_sparse': False,
'reward_win': 200,
'seed': 36613826,
'state_last_action': True,
'state_timestep_number': False,
'step_mul': 8},
'evaluate': False,
'gamma': 0.99,
'grad_norm_clip': 10,
'hidden_dim': 128,
'hypergroup': None,
'label': 'default_label',
'learner': 'maddpg_learner',
'learner_log_interval': 10000,
'load_step': 0,
'local_results_path': 'results',
'log_interval': 50000,
'lr': 0.0005,
'mac': 'maddpg_mac',
'name': 'maddpg',
'obs_agent_id': True,
'obs_individual_obs': False,
'obs_last_action': False,
'optim_alpha': 0.99,
'optim_eps': 1e-05,
'reg': 0.001,
'repeat_id': 1,
'runner': 'episode',
'runner_log_interval': 10000,
'save_model': True,
'save_model_interval': 50000,
'save_replay': True,
'seed': 36613826,
'standardise_returns': False,
'standardise_rewards': True,
't_max': 2050000,
'target_update_interval_or_tau': 200,
'test_greedy': True,
'test_interval': 50000,
'test_nepisode': 100,
'use_cuda': True,
'use_rnn': True,
'use_tensorboard': True}
2023-04-28 10:22:34.886446: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-28 10:22:34.972517: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0
.
2023-04-28 10:22:35.382670: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64:
2023-04-28 10:22:35.382712: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64:
2023-04-28 10:22:35.382715: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
[DEBUG 10:22:35] h5py._conv Creating converter from 7 to 5
[DEBUG 10:22:35] h5py._conv Creating converter from 5 to 7
[DEBUG 10:22:35] h5py._conv Creating converter from 7 to 5
[DEBUG 10:22:35] h5py._conv Creating converter from 5 to 7
[DEBUG 10:22:39] pymarl Stopping Heartbeat
[ERROR 10:22:39] pymarl Failed after 0:00:05!
Traceback (most recent calls WITHOUT Sacred internals):
File "src/main.py", line 36, in my_main
run(_run, config, _log)
File "/home/abdulghani/epymarl_addtional_algo/src/run.py", line 55, in run
run_sequential(args=args, logger=logger)
File "/home/abdulghani/epymarl_addtional_algo/src/run.py", line 117, in run_sequential
device="cpu" if args.buffer_cpu_only else args.device,
File "/home/abdulghani/epymarl_addtional_algo/src/components/episode_buffer.py", line 209, in __init__
super(ReplayBuffer, self).__init__(scheme, groups, buffer_size, max_seq_length, preprocess=preprocess, device=device)
File "/home/abdulghani/epymarl_addtional_algo/src/components/episode_buffer.py", line 28, in __init__
self._setup_data(self.scheme, self.groups, batch_size, max_seq_length, self.preprocess)
File "/home/abdulghani/epymarl_addtional_algo/src/components/episode_buffer.py", line 75, in _setup_data
self.data.transition_data[field_key] = th.zeros((batch_size, max_seq_length, *shape), dtype=dtype, device=self.device)
RuntimeError: [enforce fail at alloc_cpu.cpp:75] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 75067200000 bytes. Error code 12 (Cannot allocate memory)
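The requested allocation can be reproduced with back-of-the-envelope arithmetic. buffer_size=50000 comes from the config printed above; the remaining figures (401 stored timesteps per episode, 6 agents, 156 float32 values per observation) are assumptions about the SMAC corridor map that are not stated in this issue, and replay_field_bytes is a hypothetical helper.

```python
def replay_field_bytes(buffer_size, max_seq_length, shape, bytes_per_element=4):
    """Bytes needed by one zero-initialised (buffer_size, max_seq_length, *shape) tensor."""
    n_elements = buffer_size * max_seq_length
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element


obs_bytes = replay_field_bytes(
    buffer_size=50_000,   # from the posted config
    max_seq_length=401,   # assumed: episode limit of 400 plus one extra step
    shape=(6, 156),       # assumed: 6 agents, 156-dim float32 observations
)
print(obs_bytes)          # 75067200000
print(obs_bytes / 1e9)    # roughly 75 GB for this single field
```

If these assumptions hold, a single buffer field would need about 75 GB of RAM, so the buffer_size setting would be the first thing to look at.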
Dear Authors,
To train QMIX on LBF, I found that it uses the concatenated observations as the state. For fully observable settings, there is no problem doing so. Did you try using the true global state as input in LBF? Is the performance good?
Thank you in advance.
Awesome work extending the original pymarl! It would be even better if MPE could be included.
Is there any plan to open source the MPE-related code in this repository?
In the LBF environment, it raises this error. I want to know the meaning of 'unit_dim'.
Hi, thanks for sharing your wonderful work. If we don't share parameters, I would expect multiple agents, but in your NonSharedMAC class definition there seems to be only one agent, exactly as in the parameter-sharing case. Please help me understand this better, thanks.
BTW, ia2c does not work if I set mac to non_shared_mac with the Foraging-8x8-2p-3f-v2 env.
Hi,
Thank you so much for this wonderful repo! Could you advise on the recommended version of torch, so that first-time users can use this repo more smoothly? That would be extremely helpful.
Thank you in advance!
Hi, I recently tried to reproduce the relevant results. But when I tested the IPPO algorithm in the SMAC 2s1sc scenario, I found that the training performance was very unstable. I configured it according to the parameters in your paper, but the performance is still not ideal. Can you help me to see what is wrong with this configuration? Thanks.
# --- IPPO specific parameters ---
action_selector: "soft_policies"
mask_before_softmax: True
runner: "parallel"
buffer_size: 10
batch_size_run: 10
batch_size: 10
env_args:
  state_last_action: False # critic adds last action internally
target_update_interval_or_tau: 0.01
lr: 0.0005
hidden_dim: 128
obs_agent_id: True
obs_last_action: False
obs_individual_obs: False
agent_output_type: "pi_logits"
learner: "ppo_learner"
entropy_coef: 0.001
standardise_returns: False
standardise_rewards: False
use_rnn: True
q_nstep: 10 # 1 corresponds to normal r + gammaV
critic_type: "ac_critic"
epochs: 4
eps_clip: 0.2
name: "ippo"
t_max: 20050000
Hi, thanks for the contribution to the community!
I didn't find the environments of repeated matrix games. Although they are simple and may be created by users, it would be great if we can use them directly from this repo since the results from the paper show the suboptimal performance of many algorithms. Thus, I think these matrix games are worth a try.
Awesome work extending the original pymarl! But my experimental results with the MAPPO algorithm fail to reach the return or battle_won values described in the paper. I have adjusted some parameters according to chapter I (selected hyperparameters) in the paper, but it didn't work. Is there any detail I have overlooked that could be adjusted to achieve the results in the paper, especially for the SMAC env (MMM2 and corridor)?
Awesome work! Just having trouble running MAPPO on the RWARE environment. My run.json is below. The main error is the nan causing ValueError appearing in the middle of training. I am wondering if maybe using the latest version of rware instead of the one used in the paper might be the problem, and if so, why. Command run was:
python3 src/main.py --config=mappo --env-config=gymma with env_args.time_limit=500 env_args.key="rware:rware-tiny-2ag-v1"
{
"artifacts": [],
"command": "my_main",
"experiment": {
"base_dir": "/home/x/epymarl/src",
"dependencies": [
"munch==2.5.0",
"numpy==1.21.6",
"PyYAML==5.3.1",
"sacred==0.8.0",
"torch==1.12.1",
"wqmix==0.1.0"
],
"mainfile": "main.py",
"name": "pymarl",
"repositories": [
{
"commit": "f047072aedc9a128d28b01ba42ef3381dcad2328",
"dirty": true,
"url": "https://github.com/uoe-agents/epymarl.git"
},
{
"commit": "f047072aedc9a128d28b01ba42ef3381dcad2328",
"dirty": true,
"url": "https://github.com/uoe-agents/epymarl.git"
}
],
"sources": [
[
"main.py",
"_sources/main_663aa94901b58b1db134be7c43e7a0df.py"
],
[
"run.py",
"_sources/run_4f45b371acea3e76064abb7e3b6103b1.py"
]
]
},
"fail_trace": [
"Traceback (most recent call last):\n",
" File \"/home/x/anaconda3/envs/x/lib/python3.7/site-packages/sacred/stdout_capturing.py\", line 163, in tee_output_fd\n yield out # let the caller do their printing\n",
" File \"/home/x/anaconda3/envs/x/lib/python3.7/site-packages/sacred/run.py\", line 238, in __call__\n self.result = self.main_function(*args)\n",
" File \"/home/x/anaconda3/envs/x/lib/python3.7/site-packages/sacred/config/captured_function.py\", line 42, in captured_function\n result = wrapped(*args, **kwargs)\n",
" File \"src/main.py\", line 36, in my_main\n run(_run, config, _log)\n",
" File \"/home/x/epymarl/src/run.py\", line 55, in run\n run_sequential(args=args, logger=logger)\n",
" File \"/home/x/epymarl/src/run.py\", line 185, in run_sequential\n episode_batch = runner.run(test_mode=False)\n",
" File \"/home/x/epymarl/src/runners/parallel_runner.py\", line 104, in run\n actions = self.mac.select_actions(self.batch, t_ep=self.t, t_env=self.t_env, bs=envs_not_terminated, test_mode=test_mode)\n",
" File \"/home/x/epymarl/src/controllers/basic_controller.py\", line 23, in select_actions\n chosen_actions = self.action_selector.select_action(agent_outputs[bs], avail_actions[bs], t_env, test_mode=test_mode)\n",
" File \"/home/x/epymarl/src/components/action_selectors.py\", line 73, in select_action\n m = Categorical(agent_inputs)\n",
" File \"/home/x/anaconda3/envs/x/lib/python3.7/site-packages/torch/distributions/categorical.py\", line 64, in __init__\n super(Categorical, self).__init__(batch_shape, validate_args=validate_args)\n",
" File \"/home/x/anaconda3/envs/x/lib/python3.7/site-packages/torch/distributions/distribution.py\", line 56, in __init__\n f\"Expected parameter {param} \"\n",
"ValueError: Expected parameter probs (Tensor of shape (10, 2, 5)) of distribution Categorical(probs: torch.Size([10, 2, 5])) to satisfy the constraint Simplex(), but found invalid values:\ntensor([[[nan, nan, nan, nan, nan],\n [nan, nan, nan, nan, nan]],\n\n [[nan, nan, nan, nan, nan],\n [nan, nan, nan, nan, nan]],\n\n [[nan, nan, nan, nan, nan],\n [nan, nan, nan, nan, nan]],\n\n [[nan, nan, nan, nan, nan],\n [nan, nan, nan, nan, nan]],\n\n [[nan, nan, nan, nan, nan],\n [nan, nan, nan, nan, nan]],\n\n [[nan, nan, nan, nan, nan],\n [nan, nan, nan, nan, nan]],\n\n [[nan, nan, nan, nan, nan],\n [nan, nan, nan, nan, nan]],\n\n [[nan, nan, nan, nan, nan],\n [nan, nan, nan, nan, nan]],\n\n [[nan, nan, nan, nan, nan],\n [nan, nan, nan, nan, nan]],\n\n [[nan, nan, nan, nan, nan],\n [nan, nan, nan, nan, nan]]], device='cuda:0', grad_fn=<DivBackward0>)\n",
"\nDuring handling of the above exception, another exception occurred:\n\n",
"Traceback (most recent call last):\n",
" File \"/home/x/anaconda3/envs/x/lib/python3.7/contextlib.py\", line 130, in __exit__\n self.gen.throw(type, value, traceback)\n",
" File \"/home/x/anaconda3/envs/x/lib/python3.7/site-packages/sacred/stdout_capturing.py\", line 175, in tee_output_fd\n tee_stdout.wait(timeout=1)\n",
" File \"/home/x/anaconda3/envs/x/lib/python3.7/subprocess.py\", line 1019, in wait\n return self._wait(timeout=timeout)\n",
" File \"/home/x/anaconda3/envs/x/lib/python3.7/subprocess.py\", line 1645, in _wait\n raise TimeoutExpired(self.args, timeout)\n",
"subprocess.TimeoutExpired: Command '['tee', '-a', '/tmp/tmpz5nib2ap']' timed out after 1 seconds\n"
],
"heartbeat": "2022-09-21T15:11:48.887817",
"host": {
"ENV": {},
"cpu": "Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz",
"gpus": {
"driver_version": "471.41",
"gpus": [
{
"model": "NVIDIA GeForce RTX 2070",
"persistence_mode": false,
"total_memory": 8192
}
]
},
"hostname": "x",
"os": [
"Linux",
"Linux-5.10.60.1-microsoft-standard-WSL2-x86_64-with-debian-bullseye-sid"
],
"python_version": "3.7.13"
},
"meta": {
"command": "my_main",
"options": {
"--beat-interval": null,
"--capture": null,
"--comment": null,
"--debug": false,
"--enforce_clean": false,
"--file_storage": null,
"--force": false,
"--help": false,
"--loglevel": null,
"--mongo_db": null,
"--name": null,
"--pdb": false,
"--print-config": false,
"--priority": null,
"--queue": false,
"--s3": null,
"--sql": null,
"--tiny_db": null,
"--unobserve": false,
"COMMAND": null,
"UPDATE": [
"env_args.time_limit=500",
"env_args.key=rware:rware-tiny-2ag-v1"
],
"help": false,
"with": true
}
},
"resources": [],
"result": null,
"start_time": "2022-09-21T11:53:23.830946",
"status": "FAILED",
"stop_time": "2022-09-21T15:11:48.976631"
Edit: Ran it again with the exact environment (https://github.com/uoe-agents/robotic-warehouse) but still got the same error.
[INFO 17:09:58] my_main t_env: 7055000 / 20050000
[INFO 17:09:58] my_main Estimated time left: 7 hours, 20 minutes, 15 seconds. Time passed: 4 hours, 0 seconds
[INFO 17:11:32] my_main Recent Stats | t_env: 7100000 | Episode: 14200
advantage_mean: -0.0393 agent_grad_norm: 0.1806 critic_grad_norm: 18.5048 critic_loss: 2.0751
ep_length_mean: 500.0000 pg_loss: 0.0272 pi_max: 0.6237 q_taken_mean: 21.0140
return_mean: 0.1333 return_std: 0.3700 target_mean: 20.9746 td_error_abs: 0.6127
test_ep_length_mean: 500.0000 test_return_mean: 0.0500 test_return_std: 0.2126
[DEBUG 17:11:33] pymarl Stopping Heartbeat
[ERROR 17:11:34] pymarl Failed after 4:01:40!
Traceback (most recent calls WITHOUT Sacred internals):
File "src/main.py", line 36, in my_main
run(_run, config, _log)
File "/home/x/epymarl/src/run.py", line 55, in run
run_sequential(args=args, logger=logger)
File "/home/x/epymarl/src/run.py", line 185, in run_sequential
episode_batch = runner.run(test_mode=False)
File "/home/x/epymarl/src/runners/parallel_runner.py", line 104, in run
actions = self.mac.select_actions(self.batch, t_ep=self.t, t_env=self.t_env, bs=envs_not_terminated, test_mode=test_mode)
File "/home/x/epymarl/src/controllers/basic_controller.py", line 23, in select_actions
chosen_actions = self.action_selector.select_action(agent_outputs[bs], avail_actions[bs], t_env, test_mode=test_mode)
File "/home/x/epymarl/src/components/action_selectors.py", line 73, in select_action
m = Categorical(agent_inputs)
File "/home/x/anaconda3/envs/qmix/lib/python3.7/site-packages/torch/distributions/categorical.py", line 64, in __init__
super(Categorical, self).__init__(batch_shape, validate_args=validate_args)
File "/home/x/anaconda3/envs/qmix/lib/python3.7/site-packages/torch/distributions/distribution.py", line 56, in __init__
f"Expected parameter {param} "
ValueError: Expected parameter probs (Tensor of shape (10, 2, 5)) of distribution Categorical(probs: torch.Size([10, 2, 5])) to satisfy the constraint Simplex(), but found invalid values:
tensor([[[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan]],
[[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan]],
[[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan]],
[[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan]],
[[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan]],
[[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan]],
[[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan]],
[[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan]],
[[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan]],
[[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan]]], device='cuda:0', grad_fn=<DivBackward0>)
During handling of the above exception, another exception occurred:
Traceback (most recent calls WITHOUT Sacred internals):
File "/home/x/anaconda3/envs/qmix/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/home/x/anaconda3/envs/qmix/lib/python3.7/subprocess.py", line 1019, in wait
return self._wait(timeout=timeout)
File "/home/x/anaconda3/envs/qmix/lib/python3.7/subprocess.py", line 1645, in _wait
raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['tee', '-a', '/tmp/tmpr0gvavq4']' timed out after 1 seconds
My steps so far:
env = runner.env._envs
env = gym.wrappers.Monitor(env, '/tmp/policy.mp4')
rec = VideoRecorder(env, enabled=True)
Then I hijack runner.run(test_mode=True, rec=rec) in EpisodeRunner by adding rec.capture_frame() in the while not terminated loop. Afterwards I close the VideoRecorder.
My expectation would be an .mp4 video. Instead I just get the .json files, which are empty.
Has anybody managed to get a video out of epymarl?
Hello again, this time I have a question about batch reward standardization: why does it work?
if self.args.standardise_rewards:
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-5)
This normalizes rewards per batch sampled from the buffer. But if an episode is sampled in different batches, the normalized rewards of that episode will differ between batches (because episodes collected by different policies are not i.i.d.). I would expect this to distort the environment's reward and cause some instability, but it does improve the performance of my MARL algorithms. So could you explain why batch reward standardization is used and why it works? Thanks!
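The effect described above can be seen in a small plain-Python sketch. It mirrors the standardization formula from the question on two made-up batches that share one reward; standardise, batch_a and batch_b are hypothetical names and values, and statistics.stdev is used as a stand-in for the tensor .std() call.

```python
from statistics import mean, stdev


def standardise(rewards, eps=1e-5):
    """Standardise a batch of rewards with the batch's own mean and sample std."""
    mu = mean(rewards)
    sigma = stdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# The first reward (1.0) is the same transition, sampled into two different batches.
batch_a = [1.0, 0.0, 0.0, 0.0]
batch_b = [1.0, 1.0, 1.0, 0.0]

print(standardise(batch_a)[0])  # close to 1.5
print(standardise(batch_b)[0])  # close to 0.5
```

The same raw reward ends up with a different standardized value depending on which other transitions share its batch, which is the inconsistency the question is asking about.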
Why is there a target network in the A2C / MAA2C / PPO / MAPPO implementations?
So far I haven't found any target network used in previous works such as MAPPO's original implementation.
Is there any explanation?
Thanks for the outstanding work and continuing support -- I would like to report a possible typo:
Those values are mutually inconsistent -- Moreover, I was able to verify the maximum returns in Table 7 and not Table 8.
Hi, contributors of pymarl!
I am currently working on a project where we try to adapt a single-agent RL framework to a multi-agent one and hope to compare different MA algorithms on our specific problem. After reading through both papers and the corresponding implementations (mainly QMIX and COMA), I have some trouble understanding the implementation of the modules, which contain agents, critics and mixers.
My first concern is the RNN agents. In COMA and QMIX, agents actually play different roles in the algorithm. In QMIX, agents are just local Q-functions, which take the observations and actions as input and output the corresponding Q-values (state-action value function) to the mixer, where we take the argmax to obtain the optimal policy (this is closer to the behaviour of the implemented RNN agents, which output q). However, in COMA, agents are defined in an actor-critic way, just parameterizing a policy, which means they output an action (perhaps as logits). How can QMIX and COMA both use the same RNN agent (both algorithms init agents in the controller to interact with the env)? Am I misunderstanding something?
My second confusion is about the non-shared COMA (coma_ns.py from the module dir). In COMA, the critic is defined as a centralized critic Q(U,s). How could this critic be defined in a decentralized way? From my perspective, the non-shared modules should only be the agents, not something defined to be centralized. In COMA_learner.py, a single centralized critic would make more sense to me.
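One possible reading of the first question, based only on the agent_output_type values that appear in the configs on this page ("q" and "pi_logits"): the agent network may simply emit one number per action, and the controller decides how to interpret that vector. The sketch below is a hypothetical illustration in plain Python, not EPyMARL's actual controller code; select_action and its arguments are made-up names.

```python
import math
import random


def select_action(outputs, agent_output_type, rng=random):
    """Interpret one agent's per-action outputs as Q-values or as policy logits."""
    if agent_output_type == "q":
        # Value-based reading: act greedily on the Q-values.
        return max(range(len(outputs)), key=lambda a: outputs[a])
    if agent_output_type == "pi_logits":
        # Policy reading: softmax the logits, then sample an action.
        shift = max(outputs)  # subtract the max for numerical stability
        weights = [math.exp(o - shift) for o in outputs]
        return rng.choices(range(len(outputs)), weights=weights, k=1)[0]
    raise ValueError(f"unknown agent_output_type: {agent_output_type}")


outputs = [0.1, 2.0, -1.0]
print(select_action(outputs, "q"))          # 1
print(select_action(outputs, "pi_logits"))  # sampled; index 1 is the most likely
```

Under this reading, the same network architecture can serve both algorithms because only the interpretation of its output vector changes.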
I'm trying to run MAPPO or IPPO with the sc2 environment but I couldn't figure out how to specify the maps.
I already use pymarl and run the algorithms with this command:
python3 src/main.py --config=qmix --env-config=sc2 with env_args.map_name=2s3z
I want to run epymarl with the MAPPO or IPPO algorithms on the same map, 2s3z, from sc2. Could anyone help me with that?
I'm using this command:
python3.7 src/main.py --config=mappo --env-config=sc2 with env_args.time_limit=25 env_args.key="SMAC:2s3z"
I get this error:
sacred.utils.ConfigAddedError: Added new config entry that is not used anywhere
Conflicting configuration values:
env_args.key=SMAC:2s3z
Hi,
I would like to ask, are there any plans to make this library compatible with PettingZoo? I've read the overview and docs but I can't seem to find any reference to that at all, and it's a shame since PZ would work extremely well (in principle) with another dedicated MARL library.
The conflict is caused by:
The user requested protobuf==3.6.1
pysc2 3.0.0 depends on protobuf>=2.6
s2clientprotocol 4.10.1.75800.0 depends on protobuf
smac 1.0.0 depends on protobuf==3.19.5
The SMAC repo requirements were recently updated to include protobuf==3.19.5. I think the SMAC version that EPyMARL uses should be specified in requirements.txt.
When I was running the MPE example
python3 src/main.py --config=qmix --env-config=gymma with env_args.time_limit=25 env_args.key="mpe:SimpleSpeakerListener-v0"
I ran into this issue:
[ERROR 08:07:39] pymarl Failed after 0:00:00!
Traceback (most recent calls WITHOUT Sacred internals):
File "src/main.py", line 36, in my_main
run(_run, config, _log)
File "/root/code/epymarl/src/run.py", line 55, in run
run_sequential(args=args, logger=logger)
File "/root/code/epymarl/src/run.py", line 87, in run_sequential
runner = r_REGISTRY[args.runner](args=args, logger=logger)
File "/root/code/epymarl/src/runners/episode_runner.py", line 15, in __init__
self.env = env_REGISTRY[self.args.env](**self.args.env_args)
File "/root/code/epymarl/src/envs/__init__.py", line 14, in env_fn
return env(**kwargs)
File "/root/code/epymarl/src/envs/__init__.py", line 81, in __init__
self._env = TimeLimit(gym.make(f"{key}"), max_episode_steps=time_limit)
File "/root/anaconda3/envs/pymarl/lib/python3.6/site-packages/gym/envs/registration.py", line 235, in make
return registry.make(id, **kwargs)
File "/root/anaconda3/envs/pymarl/lib/python3.6/site-packages/gym/envs/registration.py", line 128, in make
spec = self.spec(path)
File "/root/anaconda3/envs/pymarl/lib/python3.6/site-packages/gym/envs/registration.py", line 143, in spec
mod_name
gym.error.Error: A module (mpe) was specified for the environment but was not found, make sure the package is installed with pip install before calling gym.make()
Did anyone meet this issue?
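The error message suggests that the part of env_args.key before the colon is treated as a Python module that must be importable before the environment id can be looked up. A minimal sketch of that idea in plain Python follows; split_env_key and module_is_importable are hypothetical helpers written for illustration, not gym or EPyMARL functions.

```python
import importlib.util


def split_env_key(key):
    """Split 'module:EnvId-v0' into (module, env_id); module is None without a colon."""
    module, sep, env_id = key.partition(":")
    return (module, env_id) if sep else (None, key)


def module_is_importable(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


module, env_id = split_env_key("mpe:SimpleSpeakerListener-v0")
print(module, env_id)
if module is not None and not module_is_importable(module):
    print(f"module {module!r} is not installed in this Python environment")
```

Running a check like this inside the same conda environment used for training would show whether the mpe package is visible to that interpreter.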
Hi!
I'm trying to train the 'rware:rware-tiny-2ag-v1' environment with MAPPO, and I'm following the 'Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks' paper hyperparameters tables!
One of the configurations is the network type. As I understand the paper, two types of networks are used: GRU and FC. However, I can't find where to set the type of network I want to use in EPyMARL.
Thanks in advance!
Thanks for your work.
I am trying to train algorithms on the LBF environment, but when I test them the results are significantly worse than those in Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks. For example, when I run 10 seeds of VDN on 15x15-3p-5f for 2M steps, none of those seeds achieves a return higher than 0.2, while the average return in the paper is 0.58. I wonder if I have set any essential parameters incorrectly?
Here's one of my config.json files:
{
"action_selector": "epsilon_greedy",
"agent": "rnn",
"agent_output_type": "q",
"batch_size": 32,
"batch_size_run": 1,
"buffer_cpu_only": true,
"buffer_size": 5000,
"checkpoint_path": "",
"double_q": true,
"env": "gymma",
"env_args": {
"key": "lbforaging:Foraging-15x15-3p-5f-v1",
"pretrained_wrapper": null,
"time_limit": 50
},
"epsilon_anneal_time": 200000,
"epsilon_finish": 0.05,
"epsilon_start": 1.0,
"evaluate": false,
"evaluation_epsilon": 0.0,
"gamma": 0.99,
"grad_norm_clip": 10,
"hidden_dim": 128,
"hypergroup": null,
"label": "default_label",
"learner": "q_learner",
"learner_log_interval": 10000,
"load_step": 0,
"local_results_path": "results",
"log_interval": 50000,
"lr": 0.0003,
"mac": "basic_mac",
"mixer": "vdn",
"name": "vdn",
"obs_agent_id": true,
"obs_individual_obs": false,
"obs_last_action": false,
"optim_alpha": 0.99,
"optim_eps": 1e-05,
"repeat_id": 1,
"runner": "episode",
"runner_log_interval": 10000,
"save_model": true,
"save_model_interval": 50000,
"save_replay": false,
"seed": 291174067,
"standardise_rewards": true,
"t_max": 2050000,
"target_update_interval_or_tau": 0.01,
"test_greedy": true,
"test_interval": 50000,
"test_nepisode": 100,
"use_cuda": true,
"use_rnn": true,
"use_tensorboard": true
}
Hello
I have a clean conda environment with python 3.9 and I am getting the following error. Any pointers on how to resolve this?
Also, are there any future plans to make this repo compatible with gymnasium?
Collecting gym==0.21.0 (from -r requirements.txt (line 3))
Using cached gym-0.21.0.tar.gz (1.5 MB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [1 lines of output]
error in gym setup command: 'extras_require' must be a dictionary whose values are strings or lists of strings containing valid project/version requirement specifiers.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
Hi, I am using epymarl to train on Melting Pot, and I replaced the rnn+mlp network with rnn+cnn. With IPPO and IA2C, I found that the time taken to learn from each batch increases over time, as shown below. I tried my best to debug it but could not find the cause. I even used th.cuda.empty_cache() and th.cuda.synchronize(device=th.device("cuda")), but neither helped. The following figure shows the average time cost of the past 10 updates.
Have you also come across this issue?
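One way to produce the "average time cost of the past 10 updates" mentioned above is a fixed-size window over per-update wall-clock times. The sketch below is a generic illustration, not code from epymarl: UpdateTimer and fake_update are hypothetical names, and fake_update stands in for the real training step. When timing GPU work, the th.cuda.synchronize call mentioned above would normally go right before each clock reading so that queued kernels are included in the measurement.

```python
import time
from collections import deque


class UpdateTimer:
    """Rolling mean of the wall-clock time of the most recent updates."""

    def __init__(self, window=10):
        # deque(maxlen=...) drops the oldest sample once the window is full.
        self.samples = deque(maxlen=window)

    def record(self, seconds):
        self.samples.append(seconds)

    def rolling_mean(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def time_call(self, fn, *args, **kwargs):
        """Run fn, record how long it took, and return its result."""
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.record(time.perf_counter() - start)
        return result


def fake_update(n):
    # Hypothetical stand-in for the learner's training step.
    return sum(i * i for i in range(n))


timer = UpdateTimer(window=10)
for step in range(25):
    timer.time_call(fake_update, 1000)
print(f"mean of last {len(timer.samples)} updates: {timer.rolling_mean():.6f}s")
```

Logging rolling_mean() alongside the training step count makes it easier to see whether the slowdown is gradual or starts at a specific point.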
Hi! Thank you for your work!
I am new to MARL and am studying MAPPO. Why does the MAPPO implementation here include a target critic, which I could not find in the original paper? Also, I thought MAPPO was an on-policy algorithm, so why is there a replay buffer to store episodes?
I look forward to your reply.
I was unable to verify the results reported for the MAA2C_NS algorithm on the Tag task, even after setting add_value_last_step=False as per issue #43.
Upon cross-validation, I found evidence suggesting that the maximum returns for the shared-parameter modality (Table 3) and the maximum returns for the non-shared-parameter modality (Table 7) may have been swapped.
maa2c_ns.yaml
according to Section C.1, subsectiontime_limit=25
gymma.yaml.
add_value_last_step=False
{
"action_selector": "soft_policies",
"add_value_last_step": false,
"agent": "rnn_ns",
"agent_output_type": "pi_logits",
"batch_size": 10,
"batch_size_run": 10,
"buffer_cpu_only": true,
"buffer_size": 10,
"checkpoint_path": "",
"critic_type": "cv_critic_ns",
"entropy_coef": 0.01,
"env": "gymma",
"env_args": { "key": "mpe:SimpleTag-v0",
"pretrained_wrapper": "PretrainedTag",
"seed": 343532797,
"state_last_action": false,
"time_limit": 25},
"evaluate": false,
"gamma": 0.99,
"grad_norm_clip": 10,
"hidden_dim": 128,
"hypergroup": null,
"label": "default_label",
"learner": "actor_critic_learner",
"learner_log_interval": 10000,
"load_step": 0,
"local_results_path": "results",
"log_interval": 250000,
"lr": 0.0003,
"mac": "non_shared_mac",
"mask_before_softmax": true,
"name": "maa2c_ns",
"obs_agent_id": false,
"obs_individual_obs": false,
"obs_last_action": false,
"optim_alpha": 0.99,
"optim_eps": 1e-05,
"q_nstep": 5,
"repeat_id": 1,
"runner": "parallel",
"runner_log_interval": 10000,
"save_model": false,
"save_model_interval": 500000,
"save_replay": false,
"seed": 343532797,
"standardise_returns": false,
"standardise_rewards": true,
"t_max": 20050000,
"target_update_interval_or_tau": 0.01,
"test_greedy": true,
"test_interval": 500000,
"test_nepisode": 100,
"use_cuda": false,
"use_rnn": true,
"use_tensorboard": true
}
The first consideration is that I have run experiments for both MAA2C and MAA2C_NS, and got better results for MAA2C.
The second consideration is the consistency of results for the Tag task. As the paper reports: "We observe that in all environments except the matrix games, parameter sharing improves the returns over no parameter sharing. While the average values presented in Figure 3 do not seem statistically significant, by looking closer in Tables 3 and 7 we observe that in several cases of algorithm-task pairs the improvement due to parameter sharing seems significant. Such improvements can be observed for most algorithms in MPE tasks, especially in Speaker-Listener and Tag."
Table A groups the results for all the algorithms, minus COMA, for both modalities on the MPE environment and shows the variation of the results. A positive change means that the maximum return with parameter sharing exceeds the maximum return without it.
Table A: Maximum returns over five seeds for eight algorithms with
parameter sharing (PS), without parameter sharing (NS), and the change in
excess of returns for MPE tasks.
Algorithm | Task | PS | NS | Change (%) |
---|---|---|---|---|
IQL | Speaker-Listener | -18.36 | -18.61 | 1.36% |
IQL | Spread | -132.63 | -141.87 | 6.97% |
IQL | Adversary | 9.38 | 9.09 | 3.09% |
IQL | Tag | 22.18 | 19.18 | 13.53% |
IA2C | Speaker-Listener | -12.6 | -17.08 | 35.56% |
IA2C | Spread | -134.43 | -131.74 | -2.00% |
IA2C | Adversary | 12.12 | 10.8 | 10.89% |
IA2C | Tag | 17.44 | 16.04 | 8.03% |
IPPO | Speaker-Listener | -13.1 | -15.56 | 18.78% |
IPPO | Spread | -133.86 | -132.46 | -1.05% |
IPPO | Adversary | 12.17 | 11.17 | 8.22% |
IPPO | Tag | 19.44 | 18.46 | 5.04% |
MADDPG | Speaker-Listener | -13.56 | -12.73 | -6.12% |
MADDPG | Spread | -141.7 | -136.73 | -3.51% |
MADDPG | Adversary | 8.97 | 8.81 | 1.78% |
MADDPG | Tag | 12.5 | 2.82 | 77.44% |
MAA2C | Speaker-Listener | -10.71 | -13.66 | 27.54% |
MAA2C | Spread | -129.9 | -130.88 | 0.75% |
MAA2C | Adversary | 12.06 | 10.88 | 9.78% |
MAA2C | Tag | 19.95 | 26.5 | -32.83% |
MAPPO | Speaker-Listener | -10.68 | -14.35 | 34.36% |
MAPPO | Spread | -133.54 | -128.64 | -3.67% |
MAPPO | Adversary | 11.3 | 12.04 | -6.55% |
MAPPO | Tag | 18.52 | 17.96 | 3.02% |
VDN | Speaker-Listener | -15.95 | -15.47 | -3.01% |
VDN | Spread | -131.03 | -142.13 | 8.47% |
VDN | Adversary | 9.28 | 9.34 | -0.65% |
VDN | Tag | 24.5 | 18.44 | 24.73% |
QMIX | Speaker-Listener | -11.56 | -11.59 | 0.26% |
QMIX | Spread | -126.62 | -130.97 | 3.44% |
QMIX | Adversary | 9.67 | 11.32 | -17.06% |
QMIX | Tag | 31.18 | 26.88 | 13.79% |
The differences are even larger when we consider only the Tag task.
Table B: Maximum returns over five seeds for the Tag task with parameter sharing (PS),
without parameter sharing (NS), the excess of returns of PS over NS, and the change in
excess of returns for the eight algorithms.
Algorithm | PS | NS | Excess of Returns | Change (%) |
---|---|---|---|---|
IQL | 22.18 | 19.18 | 3 | 13.53% |
IA2C | 17.44 | 16.04 | 1.4 | 8.03% |
IPPO | 19.44 | 18.46 | 0.98 | 5.04% |
MADDPG | 12.5 | 2.82 | 9.68 | 77.44% |
MAA2C | 19.95 | 26.5 | -6.55 | -32.83% |
MAPPO | 18.52 | 17.96 | 0.56 | 3.02% |
VDN | 24.5 | 18.44 | 6.06 | 24.73% |
QMIX | 31.18 | 26.88 | 4.3 | 13.79% |
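The "Excess of Returns" and "Change (%)" columns appear to follow excess = PS - NS and change = 100 * excess / |PS|. That formula is inferred from the tabulated numbers, not stated in the post, so treat the sketch below as an assumption; relative_change is a hypothetical helper name.

```python
def relative_change(ps, ns):
    """Return (excess, change in percent) of PS over NS.

    Assumed formula: excess = PS - NS, change = 100 * excess / |PS|.
    Dividing by |PS| keeps the sign meaningful for negative returns.
    """
    excess = ps - ns
    return excess, 100.0 * excess / abs(ps)


# Maximum returns on the Tag task, copied from Table B as (PS, NS).
tag_results = {
    "IQL": (22.18, 19.18),
    "IA2C": (17.44, 16.04),
    "IPPO": (19.44, 18.46),
    "MADDPG": (12.5, 2.82),
    "MAA2C": (19.95, 26.5),
    "MAPPO": (18.52, 17.96),
    "VDN": (24.5, 18.44),
    "QMIX": (31.18, 26.88),
}

for algorithm, (ps, ns) in tag_results.items():
    excess, change = relative_change(ps, ns)
    print(f"{algorithm:7s} excess={excess:6.2f} change={change:7.2f}%")
```

MAA2C is the only algorithm for which the change is negative on Tag, which is the anomaly the post is asking about.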
Can you confirm whether that is indeed the case, or point me in the right direction?
Thanks,
I'm getting
AttributeError: 'Box' object has no attribute 'n'
when I try to use a spaces.Tuple of spaces.Box spaces as the action space of a gym environment. I suspect (from some of the issues on this repo, some of the code, and discussion elsewhere) that this is because the repo does not support continuous action spaces. Is that right?
Hey y'all,
I'm trying to replicate the results from your paper but am unfamiliar with torch. I encountered an error when trying to run mappo and mappo_ns:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument mat1 in method wrapper_addmm)
I have tested all the other algorithms and none of them produce this error. Have you encountered it before, and do you know of a way to force all the tensors onto either the CPU or (preferably) the GPU?
The system I am running on has a GPU with CUDA 11.4 and cuDNN 8.2 and runs Windows 10.
Thanks for your attention,
Peter
This is really nice work on benchmarking multi-agent RL algorithms. When running the code, I was confused about the parameters of the on-policy algorithms. In particular, with MAPPO the win rate stays at 0 throughout training on 2s_vs_1sc and 5m_vs_6m. I tried both the mappo.yaml from the source code and the relevant parameters given in the paper, but neither helped. Could you please provide the parameters for training MAPPO on these two scenarios? Thank you.
Hi,
I used these lines in my custom environment:
self.observation_space = gym.spaces.Box(np.array([0,0,0,0,0,0,0,0,0,0]), np.array([9,9,9,9,9,9,9,9,9,9]), shape=(10,), dtype=np.int64)
self.action_space = gym.spaces.Box(low=0, high=9, shape=(1,), dtype=np.int64)
but I received this error:
.....\epymarl-main\src\envs\__init__.py", line 56, in __init__
for sa_obs in env.observation_space:
TypeError: 'Box' object is not iterable
Thank you.
The "return_mean" value stays constant at 45.0 at every step in metric.json with COMA or MAA2C. Why?
class ClassName(MultiAgentEnv):
    def __init__(self):
        self.n_agents = 10
        self.observation_space = gym.spaces.Tuple(tuple(
            [gym.spaces.Box(np.array([0,0,0,0,0,0,0,0,0,0]),
                            np.array([3,108,6,8,3,2,3,17,19,17]), shape=(10,), dtype=np.int64)] * self.n_agents))
        self.action_space = gym.spaces.Tuple(tuple([
            gym.spaces.Discrete(4),
            gym.spaces.Discrete(109),
            gym.spaces.Discrete(7),
            gym.spaces.Discrete(9),
            gym.spaces.Discrete(4),
            gym.spaces.Discrete(3),
            gym.spaces.Discrete(4),
            gym.spaces.Discrete(18),
            gym.spaces.Discrete(20),
            gym.spaces.Discrete(18)
        ]))
        self.state = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
        self.episode = 0
        self.thereshold = -10000
        super().__init__()

    def reset(self):
        self.state = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
        return self.observation_space.sample()

    def step(self, actions):
        cooperation_reward = Compute_reward_function(actions)
        self.state = actions
        obs, rew, dones, info = {}, {}, {}, {}
        for i in range(10):
            obs[i] = self.observation_space.sample()
            rew[i] = cooperation_reward
            dones[i] = False
            info[i] = {}
        if cooperation_reward > self.thereshold:
            self.thereshold = cooperation_reward
            dones = self.n_agents * [True]
            self.episode += 1
        dones = self.n_agents * [True]
        return obs, rew, dones, info
This is really nice work on benchmarking multi-agent RL algorithms, and I really like it. Going through the code, I found that epymarl implements only the basic version of each algorithm and omits many algorithm-specific implementation tricks, such as the value normalization trick mentioned in MAPPO or the value clipping trick mentioned in IPPO. Will future versions of epymarl support such tricks?
Hi,
Because I have had some problems running an experiment with the bash run.sh command, could you please let me know whether the standard command python3 src/main.py --config=qmix --env-config=sc2 with env_args.map_name=2s3z automatically checks whether a GPU is available and then uses it to run the experiment?
I checked the GPU usage in real time, and utilisation fluctuates between 7% and 40%, so I don't know whether the GPU not being fully used is due to a problem on my side or to the way the code is designed.
Thanks!
There are many config files under ./src/config/algs and Python modules under ./src/modules that have the suffix ns. I went through the paper, the code comments, the README, and even the PyMARL project, but could not find its meaning. What does it stand for?
I would like to report an issue that has a decisive impact on the critic-based algorithms for the MPE PredatorPrey task. The target_value variable, built for training the critic network, is about three orders of magnitude larger than return_mean, the long-term sample return. This behavior is abnormal for three reasons: (i) return_mean is not discounted, whereas target_value is discounted; the rewards used to estimate the value are (ii) unbiased, where we subtract a non-negative value, and (iii) re-scaled, where we divide by a standard deviation larger than one. Figure 1 depicts the behavior:
maa2c_ns.yaml
according to Section C.1, subsectiontime_limit=25
gymma.yaml.
{
"action_selector": "soft_policies",
"add_value_last_step": true,
"agent": "rnn_ns",
"agent_output_type": "pi_logits",
"batch_size": 10,
"batch_size_run": 10,
"buffer_cpu_only": true,
"buffer_size": 10,
"checkpoint_path": "",
"critic_type": "cv_critic_ns",
"entropy_coef": 0.01,
"env": "gymma",
"env_args": { "key": "mpe:SimpleTag-v0",
"pretrained_wrapper": "PretrainedTag",
"seed": 853609918,
"state_last_action": false,
"time_limit": 25},
"evaluate": false,
"gamma": 0.99,
"grad_norm_clip": 10,
"hidden_dim": 128,
"hypergroup": null,
"label": "default_label",
"learner": "actor_critic_learner",
"learner_log_interval": 10000,
"load_step": 0,
"local_results_path": "results",
"log_interval": 250000,
"lr": 0.0003,
"mac": "non_shared_mac",
"mask_before_softmax": true,
"name": "maa2c_ns",
"obs_agent_id": false,
"obs_individual_obs": false,
"obs_last_action": false,
"optim_alpha": 0.99,
"optim_eps": 1e-05,
"q_nstep": 5,
"repeat_id": 1,
"runner": "parallel",
"runner_log_interval": 10000,
"save_model": false,
"save_model_interval": 500000,
"save_replay": false,
"seed": 853609918,
"standardise_returns": false,
"standardise_rewards": true,
"t_max": 20050000,
"target_update_interval_or_tau": 0.01,
"test_greedy": true,
"test_interval": 500000,
"test_nepisode": 100,
"use_cuda": false,
"use_rnn": true,
"use_tensorboard": true
}
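As a rough illustration of point (i), the sketch below compares an undiscounted episode return with its discounted counterpart for a 25-step episode with gamma = 0.99 (both values taken from the configuration above). The constant reward of 1.0 per step is a made-up placeholder, not data from the experiments, and the function names are hypothetical.

```python
def undiscounted_return(rewards):
    """Plain sum of rewards over the episode."""
    return sum(rewards)


def discounted_return(rewards, gamma):
    """Discounted sum, accumulated backwards: G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Placeholder episode: 25 steps (time_limit) with a constant reward of 1.0.
rewards = [1.0] * 25
gamma = 0.99

plain = undiscounted_return(rewards)
disc = discounted_return(rewards, gamma)
print(f"undiscounted={plain:.3f} discounted={disc:.3f}")
```

When all rewards share the same sign, discounting can only shrink the magnitude of the return, which is why a target value far larger than the undiscounted return looks suspicious.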
The process is better controlled when the option add_value_last_step is set to false, as shown in Figure 2.
{
"action_selector": "soft_policies",
"add_value_last_step": false,
"agent": "rnn_ns",
"agent_output_type": "pi_logits",
"batch_size": 10,
"batch_size_run": 10,
"buffer_cpu_only": true,
"buffer_size": 10,
"checkpoint_path": "",
"critic_type": "cv_critic_ns",
"entropy_coef": 0.01,
"env": "gymma",
"env_args": { "key": "mpe:SimpleTag-v0",
"pretrained_wrapper": "PretrainedTag",
"seed": 932101488,
"state_last_action": false,
"time_limit": 25},
"evaluate": false,
"gamma": 0.99,
"grad_norm_clip": 10,
"hidden_dim": 128,
"hypergroup": null,
"label": "default_label",
"learner": "actor_critic_learner",
"learner_log_interval": 10000,
"load_step": 0,
"local_results_path": "results",
"log_interval": 250000,
"lr": 0.0003,
"mac": "non_shared_mac",
"mask_before_softmax": true,
"name": "maa2c_ns",
"obs_agent_id": false,
"obs_individual_obs": false,
"obs_last_action": false,
"optim_alpha": 0.99,
"optim_eps": 1e-05,
"q_nstep": 5,
"repeat_id": 1,
"runner": "parallel",
"runner_log_interval": 10000,
"save_model": false,
"save_model_interval": 500000,
"save_replay": false,
"seed": 932101488,
"standardise_returns": false,
"standardise_rewards": true,
"t_max": 20050000,
"target_update_interval_or_tau": 0.01,
"test_greedy": true,
"test_interval": 500000,
"test_nepisode": 100,
"use_cuda": false,
"use_rnn": true,
"use_tensorboard": true
}
This is the output from the diff command between the two configuration files.
1,3c1,3
< {
< "action_selector": "soft_policies",
< "add_value_last_step": true,
---
> {
> "action_selector": "soft_policies",
> "add_value_last_step": false,
16c16
< "seed": 853609918,
---
> "seed": 932101488,
46c46
< "seed": 853609918,
---
> "seed": 932101488,
Unfortunately, I wasn't able to verify the published numbers even after correcting this flag. Could you point me in the right direction?
Thanks in advance.
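For reference, the comparison between the two configuration files above can also be made key by key in Python, which reports nested entries such as env_args.seed by name instead of by line number. This is a minimal sketch using only the standard library; flatten and diff_configs are hypothetical helpers, and the two JSON strings are abbreviated copies of the configurations in this post.

```python
import json

MISSING = "<missing>"  # marker for a key present in only one config


def flatten(config, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. env_args.seed."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def diff_configs(a, b):
    """Return {key: (value_in_a, value_in_b)} for every differing key."""
    flat_a, flat_b = flatten(a), flatten(b)
    return {
        key: (flat_a.get(key, MISSING), flat_b.get(key, MISSING))
        for key in sorted(flat_a.keys() | flat_b.keys())
        if flat_a.get(key, MISSING) != flat_b.get(key, MISSING)
    }


run_1 = json.loads(
    '{"add_value_last_step": true,'
    ' "env_args": {"seed": 853609918, "time_limit": 25},'
    ' "seed": 853609918}'
)
run_2 = json.loads(
    '{"add_value_last_step": false,'
    ' "env_args": {"seed": 932101488, "time_limit": 25},'
    ' "seed": 932101488}'
)

for key, (old, new) in diff_configs(run_1, run_2).items():
    print(f"{key}: {old} -> {new}")
```

With real runs, the two dictionaries would be loaded from the respective config.json files with json.load.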
I am having some problems with your code base and I hope you can help me.
I have tried to run MADDPG on both the "SimpleTag-v0" and "SimpleSpread-v0" scenarios, but I have not achieved the results reported in your paper. SimpleSpread-v0 gets a return mean of -400 and SimpleTag-v0 gets a return mean of 5.
I am using the following run commands:
python3 src/main.py --config=maddpg --env-config=gymma with env_args.time_limit=25 env_args.key="mpe:SimpleTag-v0" env_args.pretrained_wrapper="PretrainedTag"
python3 src/main.py --config=maddpg --env-config=gymma with env_args.time_limit=25 env_args.key="mpe:SimpleSpread-v0"
Could you please tell me what is wrong? My hyperparameters are set according to the values given in your paper.
Thanks!
I've just started using this repository and would like to view rendered results from the simple_tag environment. I've run the command provided for the pretrained models, python3 src/main.py --config=qmix --env-config=gymma with env_args.time_limit=25 env_args.key="mpe:SimpleTag-v0" env_args.pretrained_wrapper="PretrainedTag", but have not had any luck.
It does not look like rendering during training/evaluation is supported. Is this the case? When trying to save a GIF of a trained model after the fact, I do not see anything saved that I can view after setting save_replay: True. What am I missing?
I noticed that in the MADDPG paper, when updating the actors, only the action of agent i is generated by the policy, while the actions of the other agents come from the replay buffer.
However, in this codebase, the actions of the other agents also come from their respective policies. Will this cause any problems?
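The two ways of assembling the joint action for an actor update can be contrasted with a toy sketch. It is based only on the description in this question, not on the actual learner code, and all names, policies, and numbers are hypothetical.

```python
def joint_action_paper(agent_idx, obs, buffer_actions, policies):
    """Only agent_idx's action is recomputed; the others are replayed."""
    actions = list(buffer_actions)
    actions[agent_idx] = policies[agent_idx](obs[agent_idx])
    return actions


def joint_action_all_policies(obs, policies):
    """Every agent's action is recomputed from its current policy."""
    return [policy(o) for policy, o in zip(policies, obs)]


# Toy deterministic policies: each maps a scalar observation to a scalar action.
policies = [lambda o: 2.0 * o, lambda o: -1.0 * o, lambda o: o + 1.0]
obs = [0.5, 1.0, 2.0]
buffer_actions = [0.9, -0.7, 2.5]  # actions stored when the data was collected

print(joint_action_paper(0, obs, buffer_actions, policies))
print(joint_action_all_policies(obs, policies))
```

The two variants feed different joint actions to the centralised critic whenever the other agents' current policies have drifted from the behaviour that filled the buffer.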
Hi,
I have a question about the evaluation of a model. I use the command as described:
python3 src/main.py --config=qmix --env-config=sc2 with env_args.map_name=Multi_task_6m1M_vs_12m1M checkpoint_path=results/models/qmix_best/ save_replay=True test_nepisode=5 evaluate=True
So I run the model for 5 evaluation episodes, but return_mean and the other metrics each have only one value.
I tried some modifications to the config but got the same results.
What I am trying to obtain is as many results as there are episodes, that is, the return and the other metrics for each individual episode.
Thanks!
Hi!
Whenever I try to run the MAPPO algorithm, executing:
python3 src/main.py --config=mappo --env-config=gymma with env_args.time_limit=500 env_args.key="rware:rware-tiny-2ag-v1"
I obtain the following error:
Traceback (most recent call last):
File "/Users/miguelferreira/PycharmProjects/AASMA/venv/lib/python3.8/site-packages/sacred/experiment.py", line 312, in run_commandline
return self.run(
File "/Users/miguelferreira/PycharmProjects/AASMA/venv/lib/python3.8/site-packages/sacred/experiment.py", line 276, in run
run()
File "/Users/miguelferreira/PycharmProjects/AASMA/venv/lib/python3.8/site-packages/sacred/run.py", line 238, in __call__
self.result = self.main_function(*args)
File "/Users/miguelferreira/PycharmProjects/AASMA/venv/lib/python3.8/site-packages/sacred/config/captured_function.py", line 42, in captured_function
result = wrapped(*args, **kwargs)
File "src/main.py", line 36, in my_main
run(_run, config, _log)
File "/Users/miguelferreira/PycharmProjects/AASMA/epymarl/src/run.py", line 55, in run
run_sequential(args=args, logger=logger)
File "/Users/miguelferreira/PycharmProjects/AASMA/epymarl/src/run.py", line 127, in run_sequential
learner = le_REGISTRY[args.learner](mac, buffer.scheme, logger, args)
File "/Users/miguelferreira/PycharmProjects/AASMA/epymarl/src/learners/ppo_learner.py", line 37, in __init__
if self.args.standardise_rewards:
AttributeError: 'types.SimpleNamespace' object has no attribute 'standardise_rewards'
During handling of the above exception, another exception occurred:
Does anyone have any tips on how to fix it?
Thanks in advance!
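The traceback shows that the config namespace passed to ppo_learner.py has no standardise_rewards entry. The sketch below reproduces that failure on a stand-in namespace and shows one generic way to guard against it by filling in defaults before any component reads them. The args object and the DEFAULTS values are assumptions: the values mirror other configs quoted on this page and are not necessarily the right settings for MAPPO.

```python
from types import SimpleNamespace

# Hypothetical, trimmed-down stand-in for the parsed experiment config.
args = SimpleNamespace(name="mappo", lr=0.0005)

# Direct attribute access raises AttributeError when the key is absent.
try:
    args.standardise_rewards
except AttributeError as err:
    print(f"missing key: {err}")

# Assumed defaults, copied from other configs on this page.
DEFAULTS = {"standardise_rewards": True, "standardise_returns": False}

# Fill in only the keys that are missing, leaving existing values untouched.
for key, value in DEFAULTS.items():
    if not hasattr(args, key):
        setattr(args, key, value)

print(args.standardise_rewards, args.standardise_returns)
```

Adding the missing key to the algorithm's YAML config file would have the same effect without touching any code.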
Hi,
I would like to congratulate the authors on the article and on the initiative of maintaining the benchmark environments the community sorely needs. For the first time ever, I haven't had any major issues replicating the results of a reinforcement learning article, with the exception of the Level-Based Foraging environment, namely the Foraging-15x15-4p-5f-v1 task with the IA2C and MAA2C algorithms with parameter sharing. The plots that I got were:
When we contrast with yours:
We can draw the following conclusions:
By using the evaluation metrics I was able to get:
Metric | IA2C (Mine) | IA2C (Yours) | MAA2C (Mine) | MAA2C (Yours) |
---|---|---|---|---|
Maximum Return | 0.92 ± 0.05 | 0.93 ± 0.03 | 0.92 ± 0.02 | 0.95 ± 0.01 |
Average Return | 0.65 ± 0.04 | 0.59 ± 0.06 | 0.67 ± 0.04 | 0.73 ± 0.02 |
The random seeds used for both experiments are:
The configuration for the IA2C:
{
"action_selector": "soft_policies",
"add_value_last_step": true,
"agent": "rnn",
"agent_output_type": "pi_logits",
"batch_size": 10,
"batch_size_run": 10,
"buffer_cpu_only": true,
"buffer_size": 10,
"checkpoint_path": "",
"critic_type": "ac_critic",
"entropy_coef": 0.001,
"env": "gymma",
"env_args": {
"key": "lbforaging:Foraging-15x15-4p-5f-v1",
"pretrained_wrapper": null,
"state_last_action": false,
"time_limit": 50
},
"evaluate": false,
"gamma": 0.99,
"grad_norm_clip": 10,
"hidden_dim": 128,
"hypergroup": null,
"label": "default_label",
"learner": "actor_critic_learner",
"learner_log_interval": 10000,
"load_step": 0,
"local_results_path": "results",
"log_interval": 500000,
"lr": 0.0005,
"mac": "basic_mac",
"mask_before_softmax": true,
"name": "ia2c",
"obs_agent_id": true,
"obs_individual_obs": false,
"obs_last_action": false,
"optim_alpha": 0.99,
"optim_eps": 1e-05,
"q_nstep": 5,
"repeat_id": 1,
"runner": "parallel",
"runner_log_interval": 10000,
"save_model": true,
"save_model_interval": 500000,
"save_replay": false,
"seed": 291174067,
"standardise_returns": false,
"standardise_rewards": true,
"t_max": 20500000,
"target_update_interval_or_tau": 0.01,
"test_greedy": true,
"test_interval": 500000,
"test_nepisode": 100,
"use_cuda": false,
"use_rnn": true,
"use_tensorboard": true
}
Note that I wasn't sure why the configs in src/config/algs/ have a larger t_max than reported. In this case I dropped the last evaluation from this analysis.
The output of diff ia2c/1/config.json maa2c/1/config:
11,12c11,12
< "critic_type": "ac_critic",
< "entropy_coef": 0.001,
---
> "critic_type": "cv_critic",
> "entropy_coef": 0.01,
34c34
< "name": "ia2c",
---
> "name": "maa2c",
40c40
< "q_nstep": 5,
---
> "q_nstep": 10,
50c50
< "t_max": 20500000,
---
> "t_max": 20050000,
Unless I missed something, critic_type isn't covered in the article; the other differences in configuration are consistent with the configurations reported in Tables 14 and 24.
Could you clarify what may be happening, and how I can approximate your results?
I'm reading the version at https://arxiv.org/pdf/2006.07869.pdf. The first time the term MAA2C appears is in the following paragraph:
But I could not find a cited paper for MAA2C.