
epymarl's People

Contributors

kinalmehta, laurimi, lukasschaefer, micadam, papoudakis, rezatorbati, semitable, svalbrecht


epymarl's Issues

How to load model and visualize results

Thank you for sharing!!

I am using the colored RWARE environment, and I want to ask how to load and visualize trained models on it:

import time

import gym

env = gym.make("rware-1color-tiny-4ag-v1", sensor_range=3, request_queue_size=6)
n_step = 1000
model = load_something(model_path)  # how do I load the model?
obs = env.reset()
for _ in range(n_step):
    actions = model.forward(obs)
    obs, reward, done, info = env.step(actions)
    print(f"{info=}")
    env.render()
    time.sleep(0.5)
    if all(done):  # rware returns one done flag per agent
        break
env.close()
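A hedged pointer rather than an answer: as far as I can tell from the evaluation command quoted in a later issue on this page (so treat the exact flags as an assumption), EPyMARL restores saved models through its Sacred config rather than through an explicit load call, i.e. by passing checkpoint_path together with evaluate=True, for example:

python3 src/main.py --config=qmix --env-config=gymma with env_args.time_limit=500 env_args.key="rware:rware-1color-tiny-4ag-v1" checkpoint_path=results/models/<your_run>/ evaluate=True test_nepisode=5

The key format mirrors the other gymma commands on this page and <your_run> is a placeholder; rendering would still have to be hooked into the episode runner, since the gym-style loop above bypasses EPyMARL's controllers entirely.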

Cannot obtain the reported results on MPE:SimpleSpread task

Hi,

Thanks for the great code! However, when I run the given command to train on the MPE:SimpleSpread task, the converged performance is far from the results in the paper. For example, the average returns using QMIX and MADDPG are around -500 and -400 respectively, while a random policy gets around -700. The number from the paper is around -120.

Could you tell me what the possible problem might be?

Thanks!

How could I render the lbf environment?

When I set render=True to render the LBF environment, I ran into the following problem. Has anyone else hit the same problem before?
[screenshot of the error omitted]
How can I render the test result?

error running example rware/lbforaging

Command run:

python src/main.py --config=qmix --env-config=gymma with env_args.time_limit=500 env_args.key="rware:rware-tiny-2ag-v1"

Error:

Traceback (most recent call last):
  File "/home/kinal/miniconda3/envs/epymarl/lib/python3.9/site-packages/sacred/experiment.py", line 312, in run_commandline
    return self.run(
  File "/home/kinal/miniconda3/envs/epymarl/lib/python3.9/site-packages/sacred/experiment.py", line 276, in run
    run()
  File "/home/kinal/miniconda3/envs/epymarl/lib/python3.9/site-packages/sacred/run.py", line 238, in __call__
    self.result = self.main_function(*args)
  File "/home/kinal/miniconda3/envs/epymarl/lib/python3.9/site-packages/sacred/config/captured_function.py", line 42, in captured_function
    result = wrapped(*args, **kwargs)
  File "/home/kinal/Desktop/marl/quick_results/epymarl/src/main.py", line 36, in my_main
    run(_run, config, _log)
  File "/home/kinal/Desktop/marl/quick_results/epymarl/src/run.py", line 55, in run
    run_sequential(args=args, logger=logger)
  File "/home/kinal/Desktop/marl/quick_results/epymarl/src/run.py", line 87, in run_sequential
    runner = r_REGISTRY[args.runner](args=args, logger=logger)
  File "/home/kinal/Desktop/marl/quick_results/epymarl/src/runners/episode_runner.py", line 15, in __init__
    self.env = env_REGISTRY[self.args.env](**self.args.env_args)
  File "/home/kinal/Desktop/marl/quick_results/epymarl/src/envs/__init__.py", line 17, in env_fn
    return env(**kwargs)
  File "/home/kinal/Desktop/marl/quick_results/epymarl/src/envs/__init__.py", line 84, in __init__
    self._env = TimeLimit(gym.make(f"{key}"), max_episode_steps=time_limit)
  File "/home/kinal/miniconda3/envs/epymarl/lib/python3.9/site-packages/gym/envs/registration.py", line 676, in make
    return registry.make(id, **kwargs)
  File "/home/kinal/miniconda3/envs/epymarl/lib/python3.9/site-packages/gym/envs/registration.py", line 490, in make
    versions = self.env_specs.versions(namespace, name)
  File "/home/kinal/miniconda3/envs/epymarl/lib/python3.9/site-packages/gym/envs/registration.py", line 220, in versions
    self._assert_name_exists(namespace, name)
  File "/home/kinal/miniconda3/envs/epymarl/lib/python3.9/site-packages/gym/envs/registration.py", line 297, in _assert_name_exists
    raise error.NameNotFound(message)
gym.error.NameNotFound: Environment `rware:rware-tiny-2ag` doesn't exist. Did you mean: `rware-tiny-2ag`?

During handling of the above exception, another exception occurred:

Environment:
python 3.9

Package            Version
------------------ --------------
absl-py            0.5.0
atomicwrites       1.2.1
attrs              21.4.0
brotlipy           0.7.0
certifi            2018.8.24
cffi               1.15.0
chardet            3.0.4
charset-normalizer 2.0.4
cloudpickle        2.0.0
colorama           0.4.4
cryptography       36.0.0
cycler             0.10.0
deepdiff           5.8.0
docopt             0.6.2
enum34             1.1.6
fonttools          4.32.0
future             0.16.0
gitdb              4.0.9
GitPython          3.1.27
gym                0.23.1
gym-notices        0.0.6
idna               2.7
imageio            2.17.0
importlib-metadata 4.11.3
iniconfig          1.1.1
jsonpickle         0.9.6
kiwisolver         1.0.1
lbforaging         1.1.0
matplotlib         3.5.1
mkl-fft            1.3.1
mkl-random         1.2.2
mkl-service        2.4.0
mock               2.0.0
more-itertools     4.3.0
mpyq               0.2.5
munch              2.3.2
networkx           2.8
numpy              1.22.3
ordered-set        4.1.0
packaging          21.3
pathlib2           2.3.2
pbr                4.3.0
Pillow             9.1.0
pip                21.2.4
pluggy             0.7.1
portpicker         1.2.0
probscale          0.2.5
protobuf           3.6.1
py                 1.6.0
py-cpuinfo         8.0.0
pycparser          2.21
pygame             2.1.2
pyglet             1.5.23
pyOpenSSL          22.0.0
pyparsing          2.2.2
PySC2              3.0.0
PySocks            1.7.1
pytest             4.3.1
python-dateutil    2.7.3
PyYAML             5.3.1
requests           2.19.1
rware              1.0.3
s2clientprotocol   4.10.1.75800.0
s2protocol         5.0.9.87702.0
sacred             0.8.0
scipy              1.8.0
setuptools         61.2.0
six                1.11.0
sk-video           1.1.10
SMAC               1.0.0
smmap              5.0.0
snakeviz           2.1.1
tensorboard-logger 0.1.0
tomli              2.0.1
torch              1.11.0
torchaudio         0.11.0
torchvision        0.12.0
tornado            5.1.1
typing_extensions  4.1.1
urllib3            1.23
websocket-client   0.53.0
wheel              0.37.1
whichcraft         0.5.2
wrapt              1.10.11
zipp               3.8.0
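A possible explanation, inferred from the traceback and the gym 0.23.1 shown in the package list (so an assumption, not a confirmed diagnosis): requirements.txt pins gym==0.21, where the "rware:" prefix means "import the rware module, then look up the id", whereas the 0.23 registry parses it as a namespace and fails. The error's own suggestion ("Did you mean: rware-tiny-2ag?") indicates the bare id is already registered, so either downgrading to the pinned gym or dropping the prefix from env_args.key should work. A quick check outside EPyMARL:

import gym
import rware  # assumption: importing the package registers its environment ids

env = gym.make("rware-tiny-2ag-v1")  # bare id, no "rware:" namespace prefix
print(env.reset())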

Number of Evaluations between Off-Policy and On-Policy

Hi,

I have some doubts about the number of evaluations between off-policy and on-policy algorithms. In the paper is explained that:

To account for the improved sample efficiency of off-policy over on-policy algorithms and to allow for fair comparisons, we train off-policy algorithms for two million steps and on-policy algorithms for 20 million steps. We evaluate off-policy algorithms every 50,000 steps and on-policy algorithms every 500,000 steps. As a result, a total of 40 evaluations are executed during the training of each algorithm

But with the code as it is in the repository, changing only t_max to 2M for QMIX and 20M for MAPPO, I get 41 test_return_mean entries for QMIX and 401 for MAPPO. So I am not sure whether I have to change another parameter or whether those numbers are correct.

Thanks
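For what it's worth, those counts are consistent with one evaluation at step 0 plus one every test_interval: 2,000,000 / 50,000 + 1 = 41 and 20,000,000 / 50,000 + 1 = 401. So the likely missing change (an assumption about the configs, not a confirmed answer) is that test_interval is still 50,000 in the on-policy config; setting it to 500,000 for the 20M-step runs would give 20,000,000 / 500,000 + 1 = 41 points, matching the off-policy schedule up to the extra initial evaluation.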

qtran rnn error

Hey guys, me again. Thanks for closing the last issue so quickly; I hope this one is just as simple. I am trying to train QTRAN on the LBF environment and epymarl keeps crashing when I do so. There seems to be an issue with the argument args.use_rnn in the basic controller that is causing the program to fail. I am getting the following error: AttributeError: 'types.SimpleNamespace' object has no attribute 'use_rnn'

I hope it's not too much of a problem and I look forward to seeing this resolved :)

thanks,
Peter
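In case it helps while waiting for a fix: the traceback suggests the QTRAN algorithm config simply does not define the use_rnn entry that the basic controller reads, so adding use_rnn: True to src/config/algs/qtran.yaml, or reading the flag defensively in the controller, should unblock training. Both the file path and the default value are assumptions on my part, not a maintainer-confirmed fix:

# sketch of a defensive read in the controller, falling back to the repo-wide default
use_rnn = getattr(self.args, "use_rnn", True)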

MADDPG algorithm problem

I'm trying to train the MADDPG algorithm on the SMAC environment with the command below:

python src/main.py --config=maddpg --env-config=sc2 with env_args.map_name="corridor"

but I got this error:
RuntimeError: [enforce fail at alloc_cpu.cpp:75] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 75067200000 bytes. Error code 12 (Cannot allocate memory)

my system:
ubuntu: 20.04
python: 3.7.12
torch: 1.13.1

I tried other algorithms and they run fine. I only get this problem when I use the MADDPG algorithm, and it happens both with the main epymarl code and with the addtional_algo code.

The full error log is below.

[DEBUG 10:22:34] git.cmd Popen(['git', 'version'], cwd=/home/abdulghani/epymarl_addtional_algo, universal_newlines=False, shell=None, istream=None)
[DEBUG 10:22:34] git.cmd Popen(['git', 'version'], cwd=/home/abdulghani/epymarl_addtional_algo, universal_newlines=False, shell=None, istream=None)
src/main.py:81: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config_dict = yaml.load(f)
src/main.py:50: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config_dict = yaml.load(f)
src/main.py:58: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
if isinstance(v, collections.Mapping):
[INFO 10:22:34] root Saving to FileStorageObserver in results/sacred.
[DEBUG 10:22:34] pymarl Using capture mode "fd"
[INFO 10:22:34] pymarl Running command 'my_main'
[INFO 10:22:34] pymarl Started run with ID "4"
[DEBUG 10:22:34] pymarl Starting Heartbeat
[DEBUG 10:22:34] my_main Started
[INFO 10:22:34] my_main Experiment Parameters:
[INFO 10:22:34] my_main

{ 'add_value_last_step': True,
'agent': 'rnn',
'agent_output_type': 'pi_logits',
'batch_size': 32,
'batch_size_run': 1,
'buffer_cpu_only': True,
'buffer_size': 50000,
'checkpoint_path': '',
'critic_type': 'maddpg_critic',
'env': 'sc2',
'env_args': { 'continuing_episode': False,
'debug': False,
'difficulty': '7',
'game_version': None,
'heuristic_ai': False,
'heuristic_rest': False,
'map_name': 'corridor',
'move_amount': 2,
'obs_all_health': True,
'obs_instead_of_state': False,
'obs_last_action': False,
'obs_own_health': True,
'obs_pathing_grid': False,
'obs_terrain_height': False,
'obs_timestep_number': False,
'replay_dir': '',
'replay_prefix': '',
'reward_death_value': 10,
'reward_defeat': 0,
'reward_negative_scale': 0.5,
'reward_only_positive': True,
'reward_scale': True,
'reward_scale_rate': 20,
'reward_sparse': False,
'reward_win': 200,
'seed': 36613826,
'state_last_action': True,
'state_timestep_number': False,
'step_mul': 8},
'evaluate': False,
'gamma': 0.99,
'grad_norm_clip': 10,
'hidden_dim': 128,
'hypergroup': None,
'label': 'default_label',
'learner': 'maddpg_learner',
'learner_log_interval': 10000,
'load_step': 0,
'local_results_path': 'results',
'log_interval': 50000,
'lr': 0.0005,
'mac': 'maddpg_mac',
'name': 'maddpg',
'obs_agent_id': True,
'obs_individual_obs': False,
'obs_last_action': False,
'optim_alpha': 0.99,
'optim_eps': 1e-05,
'reg': 0.001,
'repeat_id': 1,
'runner': 'episode',
'runner_log_interval': 10000,
'save_model': True,
'save_model_interval': 50000,
'save_replay': True,
'seed': 36613826,
'standardise_returns': False,
'standardise_rewards': True,
't_max': 2050000,
'target_update_interval_or_tau': 200,
'test_greedy': True,
'test_interval': 50000,
'test_nepisode': 100,
'use_cuda': True,
'use_rnn': True,
'use_tensorboard': True}

2023-04-28 10:22:34.886446: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-28 10:22:34.972517: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-04-28 10:22:35.382670: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64:
2023-04-28 10:22:35.382712: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64:
2023-04-28 10:22:35.382715: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
[DEBUG 10:22:35] h5py._conv Creating converter from 7 to 5
[DEBUG 10:22:35] h5py._conv Creating converter from 5 to 7
[DEBUG 10:22:35] h5py._conv Creating converter from 7 to 5
[DEBUG 10:22:35] h5py._conv Creating converter from 5 to 7
[DEBUG 10:22:39] pymarl Stopping Heartbeat
[ERROR 10:22:39] pymarl Failed after 0:00:05!
Traceback (most recent calls WITHOUT Sacred internals):
File "src/main.py", line 36, in my_main
run(_run, config, _log)
File "/home/abdulghani/epymarl_addtional_algo/src/run.py", line 55, in run
run_sequential(args=args, logger=logger)
File "/home/abdulghani/epymarl_addtional_algo/src/run.py", line 117, in run_sequential
device="cpu" if args.buffer_cpu_only else args.device,
File "/home/abdulghani/epymarl_addtional_algo/src/components/episode_buffer.py", line 209, in init
super(ReplayBuffer, self).init(scheme, groups, buffer_size, max_seq_length, preprocess=preprocess, device=device)
File "/home/abdulghani/epymarl_addtional_algo/src/components/episode_buffer.py", line 28, in init
self._setup_data(self.scheme, self.groups, batch_size, max_seq_length, self.preprocess)
File "/home/abdulghani/epymarl_addtional_algo/src/components/episode_buffer.py", line 75, in _setup_data
self.data.transition_data[field_key] = th.zeros((batch_size, max_seq_length, *shape), dtype=dtype, device=self.device)
RuntimeError: [enforce fail at alloc_cpu.cpp:75] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 75067200000 bytes. Error code 12 (Cannot allocate memory)
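A hedged reading of the failure: the traceback shows the replay buffer being pre-allocated as one dense th.zeros((batch_size, max_seq_length, *shape)) tensor per field, so with the episodic buffer_size of 50000 from the config above and corridor's long episodes and large per-step fields, the up-front allocation alone reaches roughly 75 GB of RAM. The per-step field size below is inferred from the error message, not read from the code:

# back-of-the-envelope for the failing allocation
buffer_size = 50000            # episodes, from the MADDPG config above
max_seq_length = 401           # corridor episode_limit + 1 (assumption)
floats_per_step = 936          # size of the largest per-step field (inferred from the error)
bytes_needed = buffer_size * max_seq_length * floats_per_step * 4
print(f"{bytes_needed / 1e9:.1f} GB")  # ~75.1 GB, matching the 75067200000 bytes in the error

Reducing buffer_size in the MADDPG config (or choosing a smaller map) is the obvious mitigation, assuming a smaller buffer is acceptable for your experiment.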

Did you try to use the true global state as input in LBF?

Dear Authors,

To train QMIX on LBF, I found that it uses the concatenated observations as the state. For fully observable settings there is no problem doing so. I am curious whether you tried using the true global state as input in LBF, and whether the performance was good.

Thank you in advance.

Hi, why does NonSharedMAC only have one agent? Thanks

Hi, thanks for your wonderful sharing. I am wondering: if we don't share parameters, there should be multiple agent networks, but in your NonSharedMAC class definition there is only one agent, exactly like the parameter-sharing case. Please help me understand this better, thanks.
BTW, ia2c does not work if I set mac to non_shared_mac with the Foraging-8x8-2p-3f-v2 env.
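A hedged explanation, based on the usual pattern for non-shared implementations rather than a line-by-line reading of this repo: the controller still builds a single "agent" object, but that object internally holds one independent network per agent, so parameters are not shared even though there is only one module from the controller's point of view. A minimal sketch of the pattern:

import torch.nn as nn

class NonSharedAgent(nn.Module):
    """One wrapper module that owns an independent sub-network per agent (sketch)."""
    def __init__(self, input_shape, n_agents, make_agent):
        super().__init__()
        self.agents = nn.ModuleList([make_agent(input_shape) for _ in range(n_agents)])

    def forward(self, per_agent_inputs, per_agent_hidden):
        # each agent's input slice goes through that agent's own parameters
        outs = [a(x, h) for a, x, h in zip(self.agents, per_agent_inputs, per_agent_hidden)]
        qs, hs = zip(*outs)
        return qs, hs

The ia2c + non_shared_mac crash on Foraging-8x8-2p-3f-v2 sounds like a separate bug and is probably worth its own issue with the full traceback.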

Recommended torch version?

Hi,

Thank you so much for this wonderful repo! I was wondering whether you could advise on the recommended version of torch, so that first-time users can use this repo more smoothly? That would be extremely helpful.

Thank you in advance!

Result with IPPO in SMAC: 2s_vs_1sc

Hi, I recently tried to reproduce the relevant results, but when I tested the IPPO algorithm on the SMAC 2s_vs_1sc scenario I found the training performance very unstable. I configured it according to the parameters in your paper, but the performance is still not ideal. Could you help me check what is wrong with this configuration? Thanks.

--- IPPO specific parameters ---

action_selector: "soft_policies"
mask_before_softmax: True

runner: "parallel"

buffer_size: 10
batch_size_run: 10
batch_size: 10

env_args:
  state_last_action: False # critic adds last action internally

target_update_interval_or_tau: 0.01

lr: 0.0005
hidden_dim: 128

obs_agent_id: True
obs_last_action: False
obs_individual_obs: False

agent_output_type: "pi_logits"
learner: "ppo_learner"
entropy_coef: 0.001
standardise_returns: False
standardise_rewards: False
use_rnn: True
q_nstep: 10 # 1 corresponds to normal r + gammaV
critic_type: "ac_critic"
epochs: 4
eps_clip: 0.2
name: "ippo"

t_max: 20050000

repeated matrix game

Hi, thanks for the contribution to the community!

I didn't find the repeated matrix game environments. Although they are simple and could be created by users, it would be great if we could use them directly from this repo, since the results in the paper show the suboptimal performance of many algorithms on them. I think these matrix games are worth a try.
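In the meantime, here is a minimal sketch of such an environment, using the classic climbing-game payoff matrix and the Tuple-space gym interface that the gymma wrapper appears to expect; both the payoffs and the interface details are assumptions, not the exact environments used in the paper:

import numpy as np
import gym
from gym import spaces

class RepeatedClimbingGame(gym.Env):
    """Two agents repeatedly play a 3x3 common-payoff matrix game (sketch)."""
    PAYOFF = np.array([[11, -30, 0],
                       [-30, 7, 6],
                       [0, 0, 5]], dtype=np.float32)

    def __init__(self, episode_limit=25):
        self.n_agents = 2
        self.episode_limit = episode_limit
        self.action_space = spaces.Tuple([spaces.Discrete(3)] * self.n_agents)
        # stateless game: each agent just observes a constant dummy feature
        self.observation_space = spaces.Tuple(
            [spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)] * self.n_agents
        )
        self._t = 0

    def reset(self):
        self._t = 0
        return [np.zeros(1, dtype=np.float32) for _ in range(self.n_agents)]

    def step(self, actions):
        self._t += 1
        reward = float(self.PAYOFF[actions[0], actions[1]])
        done = self._t >= self.episode_limit
        obs = [np.zeros(1, dtype=np.float32) for _ in range(self.n_agents)]
        return obs, [reward] * self.n_agents, [done] * self.n_agents, {}

Registering it with gym and pointing env_args.key at the registered id should be enough to run it through the gymma config, assuming the wrapper accepts list-style rewards and dones as the other custom-environment issues on this page suggest.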

MAPPO algorithm cannot reach the results in the paper?

Awesome work on extending the original pymarl! However, my experimental results fail to match the return or battle_won reported in the paper for the MAPPO algorithm. I adjusted some parameters according to the selected-hyperparameters section of the paper, but it didn't help. Is there any detail I have overlooked that could be adjusted to achieve the results in the paper, especially for the SMAC env (MMM2 and corridor)?

nan while training MAPPO on RWARE

Awesome work! I'm just having trouble running MAPPO on the RWARE environment. My run.json is below. The main error is a NaN causing a ValueError in the middle of training. I am wondering whether using the latest version of rware, instead of the one used in the paper, might be the problem, and if so, why. The command run was

python3 src/main.py --config=mappo --env-config=gymma with env_args.time_limit=500 env_args.key="rware:rware-tiny-2ag-v1"
{
  "artifacts": [],
  "command": "my_main",
  "experiment": {
    "base_dir": "/home/x/epymarl/src",
    "dependencies": [
      "munch==2.5.0",
      "numpy==1.21.6",
      "PyYAML==5.3.1",
      "sacred==0.8.0",
      "torch==1.12.1",
      "wqmix==0.1.0"
    ],
    "mainfile": "main.py",
    "name": "pymarl",
    "repositories": [
      {
        "commit": "f047072aedc9a128d28b01ba42ef3381dcad2328",
        "dirty": true,
        "url": "https://github.com/uoe-agents/epymarl.git"
      },
      {
        "commit": "f047072aedc9a128d28b01ba42ef3381dcad2328",
        "dirty": true,
        "url": "https://github.com/uoe-agents/epymarl.git"
      }
    ],
    "sources": [
      [
        "main.py",
        "_sources/main_663aa94901b58b1db134be7c43e7a0df.py"
      ],
      [
        "run.py",
        "_sources/run_4f45b371acea3e76064abb7e3b6103b1.py"
      ]
    ]
  },
  "fail_trace": [
    "Traceback (most recent call last):\n",
    "  File \"/home/x/anaconda3/envs/x/lib/python3.7/site-packages/sacred/stdout_capturing.py\", line 163, in tee_output_fd\n    yield out  # let the caller do their printing\n",
    "  File \"/home/x/anaconda3/envs/x/lib/python3.7/site-packages/sacred/run.py\", line 238, in __call__\n    self.result = self.main_function(*args)\n",
    "  File \"/home/x/anaconda3/envs/x/lib/python3.7/site-packages/sacred/config/captured_function.py\", line 42, in captured_function\n    result = wrapped(*args, **kwargs)\n",
    "  File \"src/main.py\", line 36, in my_main\n    run(_run, config, _log)\n",
    "  File \"/home/x/epymarl/src/run.py\", line 55, in run\n    run_sequential(args=args, logger=logger)\n",
    "  File \"/home/x/epymarl/src/run.py\", line 185, in run_sequential\n    episode_batch = runner.run(test_mode=False)\n",
    "  File \"/home/x/epymarl/src/runners/parallel_runner.py\", line 104, in run\n    actions = self.mac.select_actions(self.batch, t_ep=self.t, t_env=self.t_env, bs=envs_not_terminated, test_mode=test_mode)\n",
    "  File \"/home/x/epymarl/src/controllers/basic_controller.py\", line 23, in select_actions\n    chosen_actions = self.action_selector.select_action(agent_outputs[bs], avail_actions[bs], t_env, test_mode=test_mode)\n",
    "  File \"/home/x/epymarl/src/components/action_selectors.py\", line 73, in select_action\n    m = Categorical(agent_inputs)\n",
    "  File \"/home/x/anaconda3/envs/x/lib/python3.7/site-packages/torch/distributions/categorical.py\", line 64, in __init__\n    super(Categorical, self).__init__(batch_shape, validate_args=validate_args)\n",
    "  File \"/home/x/anaconda3/envs/x/lib/python3.7/site-packages/torch/distributions/distribution.py\", line 56, in __init__\n    f\"Expected parameter {param} \"\n",
    "ValueError: Expected parameter probs (Tensor of shape (10, 2, 5)) of distribution Categorical(probs: torch.Size([10, 2, 5])) to satisfy the constraint Simplex(), but found invalid values:\ntensor([[[nan, nan, nan, nan, nan],\n         [nan, nan, nan, nan, nan]],\n\n        [[nan, nan, nan, nan, nan],\n         [nan, nan, nan, nan, nan]],\n\n        [[nan, nan, nan, nan, nan],\n         [nan, nan, nan, nan, nan]],\n\n        [[nan, nan, nan, nan, nan],\n         [nan, nan, nan, nan, nan]],\n\n        [[nan, nan, nan, nan, nan],\n         [nan, nan, nan, nan, nan]],\n\n        [[nan, nan, nan, nan, nan],\n         [nan, nan, nan, nan, nan]],\n\n        [[nan, nan, nan, nan, nan],\n         [nan, nan, nan, nan, nan]],\n\n        [[nan, nan, nan, nan, nan],\n         [nan, nan, nan, nan, nan]],\n\n        [[nan, nan, nan, nan, nan],\n         [nan, nan, nan, nan, nan]],\n\n        [[nan, nan, nan, nan, nan],\n         [nan, nan, nan, nan, nan]]], device='cuda:0', grad_fn=<DivBackward0>)\n",
    "\nDuring handling of the above exception, another exception occurred:\n\n",
    "Traceback (most recent call last):\n",
    "  File \"/home/x/anaconda3/envs/x/lib/python3.7/contextlib.py\", line 130, in __exit__\n    self.gen.throw(type, value, traceback)\n",
    "  File \"/home/x/anaconda3/envs/x/lib/python3.7/site-packages/sacred/stdout_capturing.py\", line 175, in tee_output_fd\n    tee_stdout.wait(timeout=1)\n",
    "  File \"/home/x/anaconda3/envs/x/lib/python3.7/subprocess.py\", line 1019, in wait\n    return self._wait(timeout=timeout)\n",
    "  File \"/home/x/anaconda3/envs/x/lib/python3.7/subprocess.py\", line 1645, in _wait\n    raise TimeoutExpired(self.args, timeout)\n",
    "subprocess.TimeoutExpired: Command '['tee', '-a', '/tmp/tmpz5nib2ap']' timed out after 1 seconds\n"
  ],
  "heartbeat": "2022-09-21T15:11:48.887817",
  "host": {
    "ENV": {},
    "cpu": "Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz",
    "gpus": {
      "driver_version": "471.41",
      "gpus": [
        {
          "model": "NVIDIA GeForce RTX 2070",
          "persistence_mode": false,
          "total_memory": 8192
        }
      ]
    },
    "hostname": "x",
    "os": [
      "Linux",
      "Linux-5.10.60.1-microsoft-standard-WSL2-x86_64-with-debian-bullseye-sid"
    ],
    "python_version": "3.7.13"
  },
  "meta": {
    "command": "my_main",
    "options": {
      "--beat-interval": null,
      "--capture": null,
      "--comment": null,
      "--debug": false,
      "--enforce_clean": false,
      "--file_storage": null,
      "--force": false,
      "--help": false,
      "--loglevel": null,
      "--mongo_db": null,
      "--name": null,
      "--pdb": false,
      "--print-config": false,
      "--priority": null,
      "--queue": false,
      "--s3": null,
      "--sql": null,
      "--tiny_db": null,
      "--unobserve": false,
      "COMMAND": null,
      "UPDATE": [
        "env_args.time_limit=500",
        "env_args.key=rware:rware-tiny-2ag-v1"
      ],
      "help": false,
      "with": true
    }
  },
  "resources": [],
  "result": null,
  "start_time": "2022-09-21T11:53:23.830946",
  "status": "FAILED",
  "stop_time": "2022-09-21T15:11:48.976631"

Edit: Ran it again with the exact environment (https://github.com/uoe-agents/robotic-warehouse) but still got the same error.

[INFO 17:09:58] my_main t_env: 7055000 / 20050000
[INFO 17:09:58] my_main Estimated time left: 7 hours, 20 minutes, 15 seconds. Time passed: 4 hours, 0 seconds
[INFO 17:11:32] my_main Recent Stats | t_env:    7100000 | Episode:    14200
advantage_mean:           -0.0393       agent_grad_norm:           0.1806       critic_grad_norm:         18.5048       critic_loss:               2.0751
ep_length_mean:          500.0000       pg_loss:                   0.0272       pi_max:                    0.6237       q_taken_mean:             21.0140
return_mean:               0.1333       return_std:                0.3700       target_mean:              20.9746       td_error_abs:              0.6127
test_ep_length_mean:     500.0000       test_return_mean:          0.0500       test_return_std:           0.2126
[DEBUG 17:11:33] pymarl Stopping Heartbeat
[ERROR 17:11:34] pymarl Failed after 4:01:40!
Traceback (most recent calls WITHOUT Sacred internals):
  File "src/main.py", line 36, in my_main
    run(_run, config, _log)
  File "/home/x/epymarl/src/run.py", line 55, in run
    run_sequential(args=args, logger=logger)
  File "/home/x/epymarl/src/run.py", line 185, in run_sequential
    episode_batch = runner.run(test_mode=False)
  File "/home/x/epymarl/src/runners/parallel_runner.py", line 104, in run
    actions = self.mac.select_actions(self.batch, t_ep=self.t, t_env=self.t_env, bs=envs_not_terminated, test_mode=test_mode)
  File "/home/x/epymarl/src/controllers/basic_controller.py", line 23, in select_actions
    chosen_actions = self.action_selector.select_action(agent_outputs[bs], avail_actions[bs], t_env, test_mode=test_mode)
  File "/home/x/epymarl/src/components/action_selectors.py", line 73, in select_action
    m = Categorical(agent_inputs)
  File "/home/x/anaconda3/envs/qmix/lib/python3.7/site-packages/torch/distributions/categorical.py", line 64, in __init__
    super(Categorical, self).__init__(batch_shape, validate_args=validate_args)
  File "/home/x/anaconda3/envs/qmix/lib/python3.7/site-packages/torch/distributions/distribution.py", line 56, in __init__
    f"Expected parameter {param} "
ValueError: Expected parameter probs (Tensor of shape (10, 2, 5)) of distribution Categorical(probs: torch.Size([10, 2, 5])) to satisfy the constraint Simplex(), but found invalid values:
tensor([[[nan, nan, nan, nan, nan],
         [nan, nan, nan, nan, nan]],

        [[nan, nan, nan, nan, nan],
         [nan, nan, nan, nan, nan]],

        [[nan, nan, nan, nan, nan],
         [nan, nan, nan, nan, nan]],

        [[nan, nan, nan, nan, nan],
         [nan, nan, nan, nan, nan]],

        [[nan, nan, nan, nan, nan],
         [nan, nan, nan, nan, nan]],

        [[nan, nan, nan, nan, nan],
         [nan, nan, nan, nan, nan]],

        [[nan, nan, nan, nan, nan],
         [nan, nan, nan, nan, nan]],

        [[nan, nan, nan, nan, nan],
         [nan, nan, nan, nan, nan]],

        [[nan, nan, nan, nan, nan],
         [nan, nan, nan, nan, nan]],

        [[nan, nan, nan, nan, nan],
         [nan, nan, nan, nan, nan]]], device='cuda:0', grad_fn=<DivBackward0>)

During handling of the above exception, another exception occurred:

Traceback (most recent calls WITHOUT Sacred internals):
  File "/home/x/anaconda3/envs/qmix/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/x/anaconda3/envs/qmix/lib/python3.7/subprocess.py", line 1019, in wait
    return self._wait(timeout=timeout)
  File "/home/x/anaconda3/envs/qmix/lib/python3.7/subprocess.py", line 1645, in _wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['tee', '-a', '/tmp/tmpr0gvavq4']' timed out after 1 seconds

Gym Videorecorder not working with LBFG environment

My steps so far:

  • Trained and saved a model for LBF
  • in run.py I hijack evaluate_sequential()

from gym.wrappers.monitoring.video_recorder import VideoRecorder

env = runner.env._envs
env = gym.wrappers.Monitor(env, '/tmp/videos/')  # Monitor expects a directory, not an .mp4 path
rec = VideoRecorder(env, path='/tmp/policy.mp4', enabled=True)

then I hijack runner.run(test_mode=True, rec=rec) in EpisodeRunner
by adding
rec.capture_frame()
in the while not terminated loop

afterwards I close the VideoRecorder.

My expectation would be an .mp4 video. Instead I just get the .json files, which are empty.

Has anybody managed to get a video out of epymarl?

Why does batch reward standardisation work?

Hello again, and this time I have another question about batch reward standardisation: why does it work?

if self.args.standardise_rewards:
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-5)

This normalises rewards per batch sampled from the buffer. But if an episode is sampled in different batches, the normalised rewards of that episode will differ (since episodes sampled by different policies are not i.i.d.). I would think this breaks the environment's reward signal and causes some instability, but it does improve the performance of my MARL algorithms. Could you give me more explanation of why batch reward standardisation is used and why it works? Thanks!

Are values from Tables 7-8 for task LBF-15x15-3p-5f and algorithm VDN_NS switched?

Thanks for the outstanding work and continuing support -- I would like to report a possible typo:

  • Table 7 reports the maximum returns for LBF-15x15-3p-5f and algorithm VDN_NS as 0.11
  • Table 8 reports the average returns for LBF-15x15-3p-5f and algorithm VDN_NS as 0.18

Those values are mutually inconsistent. Moreover, I was able to verify the maximum returns in Table 7 but not those in Table 8.

Question with the implentation of Modules w.r.t different Multi-agent algorithms

Hi, contributors of pymarl!

I am currently working on a project where we try to adapt a single-agent RL framework into a multi-agent one, and I hope to compare different MA algorithms on our specific problem. After reading both papers and the corresponding implementations (mainly QMIX and COMA), I have some trouble understanding the implementation of the module part, which contains agents, critics and mixers.

My first concern is about the RNN agents. In COMA and QMIX, agents actually play different roles in the algorithm. In QMIX, agents are just local Q-functions, which take the obs and actions as input and output the corresponding Q-values (action-state value function) to the mixer, where we argmax to obtain the greedy policy; this matches the behaviour of the implemented RNN agents, which output q. However, in COMA, agents are defined in an actor-critic way, just parameterising a policy, which means they obviously output an action (perhaps as logits). How can QMIX and COMA both use the same RNN agent (both algorithms initialise agents in the controller to interact with the env)? Am I misunderstanding something?

My second confusion is about the non-shared COMA (coma_ns.py in the modules dir). In COMA, the critic is clearly defined as a centralised critic Q(U, s). How can this critic be defined in a decentralised way? From my perspective, non-shared modules should only be the agents, not something defined to be centralised. In coma_learner.py, a single centralised critic would make more sense to me.
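On the first point, a hedged observation based on the agent_output_type values ("q" vs "pi_logits") that appear in the config dumps throughout these issues: the shared RNN agent just emits one score per action, and it is the controller/learner that decides whether that vector is a row of Q-values (argmax for QMIX-style methods) or the logits of a stochastic policy (softmax/sample for COMA-style methods). Roughly:

import torch as th

agent_outs = th.randn(4, 5)       # dummy per-agent network output: (batch, n_actions)
agent_output_type = "pi_logits"   # "q" for QMIX-style, "pi_logits" for COMA-style

if agent_output_type == "q":
    # value-based: the scores are Q-values, the greedy policy is an argmax
    chosen = agent_outs.argmax(dim=-1)
else:
    # actor-critic: the same vector is read as logits of a categorical policy
    chosen = th.distributions.Categorical(logits=agent_outs).sample()
print(chosen)

On the second point I can only guess, but "_ns" elsewhere in the repo means the networks are not shared across agents, so coma_ns presumably keeps one critic per agent (each still conditioned on the central state) rather than decentralising the critic's information.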

SMAC environment problem

I'm trying to run MAPPO or IPPO on the SC2 environment, but I couldn't figure out how to specify the maps.
I already use pymarl and run the algorithms with this command:

python3 src/main.py --config=qmix --env-config=sc2 with env_args.map_name=2s3z

I want to run epymarl with the MAPPO or IPPO algorithms on the same map, 2s3z, from SC2. Could anyone help me with that?
I'm using this command:

python3.7 src/main.py --config=mappo --env-config=sc2 with env_args.time_limit=25 env_args.key="SMAC:2s3z"

I get this error:

sacred.utils.ConfigAddedError: Added new config entry that is not used anywhere
Conflicting configuration values:
env_args.key=SMAC:2s3z
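For SMAC the gymma-style env_args.key is not used at all; judging from the MADDPG and QMIX commands elsewhere in this issue list, the sc2 env config takes the map via env_args.map_name, so the command should most likely be:

python3.7 src/main.py --config=mappo --env-config=sc2 with env_args.map_name=2s3z

(I am inferring this from the other commands on this page rather than from the docs, so treat it as a suggestion.)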

Support for PettingZoo?

Hi,

I would like to ask, are there any plans to make this library compatible with PettingZoo? I've read the overview and docs but I can't seem to find any reference to that at all, and it's a shame since PZ would work extremely well (in principle) with another dedicated MARL library.

Unable to install requirements.txt

The conflict is caused by:
    The user requested protobuf==3.6.1
    pysc2 3.0.0 depends on protobuf>=2.6
    s2clientprotocol 4.10.1.75800.0 depends on protobuf
    smac 1.0.0 depends on protobuf==3.19.5

The SMAC repo requirements were recently updated to include protobuf==3.19.5. I think the SMAC version that EPyMARL uses should be specified in requirements.txt.

mpe gym.error

When I was running the MPE example, python3 src/main.py --config=qmix --env-config=gymma with env_args.time_limit=25 env_args.key="mpe:SimpleSpeakerListener-v0",
I hit this issue:
[ERROR 08:07:39] pymarl Failed after 0:00:00!
Traceback (most recent calls WITHOUT Sacred internals):
File "src/main.py", line 36, in my_main
run(_run, config, _log)
File "/root/code/epymarl/src/run.py", line 55, in run
run_sequential(args=args, logger=logger)
File "/root/code/epymarl/src/run.py", line 87, in run_sequential
runner = r_REGISTRY[args.runner](args=args, logger=logger)
File "/root/code/epymarl/src/runners/episode_runner.py", line 15, in init
self.env = env_REGISTRYself.args.env
File "/root/code/epymarl/src/envs/init.py", line 14, in env_fn
return env(**kwargs)
File "/root/code/epymarl/src/envs/init.py", line 81, in init
self._env = TimeLimit(gym.make(f"{key}"), max_episode_steps=time_limit)
File "/root/anaconda3/envs/pymarl/lib/python3.6/site-packages/gym/envs/registration.py", line 235, in make
return registry.make(id, **kwargs)
File "/root/anaconda3/envs/pymarl/lib/python3.6/site-packages/gym/envs/registration.py", line 128, in make
spec = self.spec(path)
File "/root/anaconda3/envs/pymarl/lib/python3.6/site-packages/gym/envs/registration.py", line 143, in spec
mod_name
gym.error.Error: A module (mpe) was specified for the environment but was not found, make sure the package is installed with pip install before calling gym.make()

Has anyone else met this issue?

network type configuration

Hi!

I'm trying to train on the 'rware:rware-tiny-2ag-v1' environment with MAPPO, and I'm following the hyperparameter tables from the 'Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks' paper!

One of the configurations is the network type. From the paper I understand that two types of networks are used, GRU and FC, but I cannot find where to set the type of network I want to use in EPyMARL.

Thanks in advance!
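As far as I can tell (an assumption based on the use_rnn flag and the rnn agent name that appear in the config dumps across these issues), the GRU-vs-FC choice is the use_rnn entry in the algorithm config: True gives the recurrent (GRU) agent and False a plain feed-forward network. The pattern inside the agent presumably looks something like this sketch, which is illustrative rather than copied from the repo:

import torch.nn as nn
import torch.nn.functional as F

class RNNAgent(nn.Module):
    """GRU agent when use_rnn=True, plain two-layer MLP otherwise (sketch)."""
    def __init__(self, input_shape, hidden_dim, n_actions, use_rnn=True):
        super().__init__()
        self.use_rnn = use_rnn
        self.fc1 = nn.Linear(input_shape, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim) if use_rnn else nn.Linear(hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, n_actions)

    def forward(self, inputs, hidden_state):
        x = F.relu(self.fc1(inputs))
        h = self.rnn(x, hidden_state) if self.use_rnn else F.relu(self.rnn(x))
        return self.fc2(h), h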

Problem on reproducing LBF results

Thanks for your work.

I am trying to train algorithms on the LBF environment, but when I test them I find the results are significantly worse than those in Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks. For example, when I run 10 seeds of VDN on 15x15-3p-5f for 2M steps, none of the seeds achieves a return higher than 0.2, while the average return in the paper is 0.58. I wonder if I have got any essential parameters wrong.
Here's one of my config.json files:

{
  "action_selector": "epsilon_greedy",
  "agent": "rnn",
  "agent_output_type": "q",
  "batch_size": 32,
  "batch_size_run": 1,
  "buffer_cpu_only": true,
  "buffer_size": 5000,
  "checkpoint_path": "",
  "double_q": true,
  "env": "gymma",
  "env_args": {
    "key": "lbforaging:Foraging-15x15-3p-5f-v1",
    "pretrained_wrapper": null,
    "time_limit": 50
  },
  "epsilon_anneal_time": 200000,
  "epsilon_finish": 0.05,
  "epsilon_start": 1.0,
  "evaluate": false,
  "evaluation_epsilon": 0.0,
  "gamma": 0.99,
  "grad_norm_clip": 10,
  "hidden_dim": 128,
  "hypergroup": null,
  "label": "default_label",
  "learner": "q_learner",
  "learner_log_interval": 10000,
  "load_step": 0,
  "local_results_path": "results",
  "log_interval": 50000,
  "lr": 0.0003,
  "mac": "basic_mac",
  "mixer": "vdn",
  "name": "vdn",
  "obs_agent_id": true,
  "obs_individual_obs": false,
  "obs_last_action": false,
  "optim_alpha": 0.99,
  "optim_eps": 1e-05,
  "repeat_id": 1,
  "runner": "episode",
  "runner_log_interval": 10000,
  "save_model": true,
  "save_model_interval": 50000,
  "save_replay": false,
  "seed": 291174067,
  "standardise_rewards": true,
  "t_max": 2050000,
  "target_update_interval_or_tau": 0.01,
  "test_greedy": true,
  "test_interval": 50000,
  "test_nepisode": 100,
  "use_cuda": true,
  "use_rnn": true,
  "use_tensorboard": true
}

Unable to install requirements.txt

Hello

I have a clean conda environment with python 3.9 and I am getting the following error. Any pointers on how to resolve this?

Also, are there any future plans to make this repo compatible with gymnasium?

Collecting gym==0.21.0 (from -r requirements.txt (line 3))
  Using cached gym-0.21.0.tar.gz (1.5 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [1 lines of output]
      error in gym setup command: 'extras_require' must be a dictionary whose values are strings or lists of strings containing valid project/version requirement specifiers.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.
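A commonly reported cause (an assumption, not verified against this repo): gym 0.21's setup.py metadata is rejected by recent setuptools/pip, so the usual workaround is to pin the build tooling first, e.g. pip install "setuptools==65.5.0" "wheel==0.38.4" (and, if needed, an older pip) inside the fresh environment, and only then re-run pip install -r requirements.txt. As for gymnasium support, only the maintainers can answer that.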

Training slows down

Hi, I am using epymarl to train on Melting Pot, and I replaced the rnn+mlp network with rnn+cnn. In IPPO and IA2C I found that the time cost of learning each batch increases over time, as shown below. I tried my best to debug it but found it very hard to pin down the reason. I even used th.cuda.empty_cache() and th.cuda.synchronize(device=th.device("cuda")), but it didn't help. The figure below shows the average time cost of the past 10 updates.

[Figure: average time cost of the past 10 updates]

Did you also run into such an issue?

Question about MAPPO centralized critic's target network

Hi! Thank you for your work!

I am new to MARL and I am studying MAPPO. I am wondering why the MAPPO algorithm as implemented has a target critic, which I did not find in the original paper. Also, I thought MAPPO is an on-policy algorithm, so why is there a replay buffer to store episodes?

Looking forward to your reply.

Are values from Tables 3 and 7 for task MPE Tag, algorithms MAA2C and MAA2C_NS, swapped?

I was unable to verify the results reported for algorithm MAA2C_NS and the TAG
task, even after correcting for add_value_last_step=False as per issue #43.
Upon cross-validation I found evidence pointing to the possibility of
swapped values between the maximum returns for the shared-parameters modality,
Table 3, and the maximum returns for the non-shared-parameters modality, Table 7.

Reproduce:

  • Commit reference: 3d1463d
  • Divide the rewards by a factor of 3.0 according to issue #29
  • Set configurations: (i) maa2c_ns.yaml according to Section C.1, subsection
    MPE PredatorPrey, and Table 23 from the Supplemental. (ii) Set time_limit=25
    in gymma.yaml.
  • Set add_value_last_step=False

Config:

{   
    "action_selector": "soft_policies",
    "add_value_last_step": false,
    "agent": "rnn_ns",
    "agent_output_type": "pi_logits",
    "batch_size": 10,
    "batch_size_run": 10,
    "buffer_cpu_only": true,
    "buffer_size": 10,
    "checkpoint_path": "",
    "critic_type": "cv_critic_ns",
    "entropy_coef": 0.01,
    "env": "gymma",
    "env_args": {   "key": "mpe:SimpleTag-v0",
                    "pretrained_wrapper": "PretrainedTag",
                    "seed": 343532797,
                    "state_last_action": false,
                    "time_limit": 25},
    "evaluate": false,
    "gamma": 0.99,
    "grad_norm_clip": 10,
    "hidden_dim": 128,
    "hypergroup": null,
    "label": "default_label",
    "learner": "actor_critic_learner",
    "learner_log_interval": 10000,
    "load_step": 0,
    "local_results_path": "results",
    "log_interval": 250000,
    "lr": 0.0003,
    "mac": "non_shared_mac",
    "mask_before_softmax": true,
    "name": "maa2c_ns",
    "obs_agent_id": false,
    "obs_individual_obs": false,
    "obs_last_action": false,
    "optim_alpha": 0.99,
    "optim_eps": 1e-05,
    "q_nstep": 5,
    "repeat_id": 1,
    "runner": "parallel",
    "runner_log_interval": 10000,
    "save_model": false,
    "save_model_interval": 500000,
    "save_replay": false,
    "seed": 343532797,
    "standardise_returns": false,
    "standardise_rewards": true,
    "t_max": 20050000,
    "target_update_interval_or_tau": 0.01,
    "test_greedy": true,
    "test_interval": 500000,
    "test_nepisode": 100,
    "use_cuda": false,
    "use_rnn": true,
    "use_tensorboard": true
}

Considerations

The first consideration is that I have run experiments for both MAA2C and MAA2C_NS,
and got better results for MAA2C.

The second consideration is the consistency of results for the Tag task, as reported:
"We observe that in all environments except the matrix games, parameter sharing
improves the returns over no parameter sharing. While the average values
presented in Figure 3 do not seem statistically significant, by looking closer
in Tables 3 and 7 we observe that in several cases of algorithm-task pairs the
improvement due to parameter sharing seems significant. Such improvements can
be observed for most algorithms in MPE tasks, especially in Speaker-Listener
and Tag."

Table A groups the results for all the algorithms except COMA, for both
modalities on the MPE environment, and shows the variation of the results. A
positive change means that the parameter-sharing variant has an excess of
maximum returns over the non-shared-parameters variant.

Table A: Maximum returns over five seeds for eight algorithms with
parameter sharing (PS), without parameter sharing (NS), and the change in
excess of returns for MPE tasks.

Algorithm   Task              PS        NS        Change (%)
IQL         Speaker-Listener  -18.36    -18.61    1.36%
            Spread            -132.63   -141.87   6.97%
            Adversary         9.38      9.09      3.09%
            Tag               22.18     19.18     13.53%
IA2C        Speaker-Listener  -12.6     -17.08    35.56%
            Spread            -134.43   -131.74   -2.00%
            Adversary         12.12     10.8      10.89%
            Tag               17.44     16.04     8.03%
IPPO        Speaker-Listener  -13.1     -15.56    18.78%
            Spread            -133.86   -132.46   -1.05%
            Adversary         12.17     11.17     8.22%
            Tag               19.44     18.46     5.04%
MADDPG      Speaker-Listener  -13.56    -12.73    -6.12%
            Spread            -141.7    -136.73   -3.51%
            Adversary         8.97      8.81      1.78%
            Tag               12.5      2.82      77.44%
MAA2C       Speaker-Listener  -10.71    -13.66    27.54%
            Spread            -129.9    -130.88   0.75%
            Adversary         12.06     10.88     9.78%
            Tag               19.95     26.5      -32.83%
MAPPO       Speaker-Listener  -10.68    -14.35    34.36%
            Spread            -133.54   -128.64   -3.67%
            Adversary         11.3      12.04     -6.55%
            Tag               18.52     17.96     3.02%
VDN         Speaker-Listener  -15.95    -15.47    -3.01%
            Spread            -131.03   -142.13   8.47%
            Adversary         9.28      9.34      -0.65%
            Tag               24.5      18.44     24.73%
QMIX        Speaker-Listener  -11.56    -11.59    0.26%
            Spread            -126.62   -130.97   3.44%
            Adversary         9.67      11.32     -17.06%
            Tag               31.18     26.88     13.79%
  • Average Change (%): 7.51%
  • Total Change (%): 240.40%

More strictly, the differences are even larger when we take into account only the Tag task.

Table B: Maximum returns over five seeds for the Tag task with parameter sharing (PS),
without parameter sharing (NS), the excess of returns of PS over NS, and the change in
excess of returns for the eight algorithms.

Algorithm   PS      NS      Excess of Returns   Change (%)
IQL         22.18   19.18   3                   13.53%
IA2C        17.44   16.04   1.4                 8.03%
IPPO        19.44   18.46   0.98                5.04%
MADDPG      12.5    2.82    9.68                77.44%
MAA2C       19.95   26.5    -6.55               -32.83%
MAPPO       18.52   17.96   0.56                3.02%
VDN         24.5    18.44   6.06                24.73%
QMIX        31.18   26.88   4.3                 13.79%
  • Average: 2.42875 (Excess of Returns), 14.09% (Change)
  • Total: 19.43 (Excess of Returns), 112.75% (Change)

Can you confirm that this is indeed the case, or point me in the right direction?

Thanks,

Does this library support continuous action spaces?

I'm getting

AttributeError: 'Box' object has no attribute 'n'

I get this when I try to use a spaces.Tuple(tuple(spaces.Box(...))) as the action space for a gym environment, and I suspect (from some of the issues on this repo, some of the code, and discussion elsewhere) that it is because this repo does not support continuous action spaces. Is that right?
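That matches my reading too: the controllers and action selectors index action_space.n, so only Discrete (or Tuple-of-Discrete) action spaces are handled. If the underlying task tolerates it, one workaround is to discretise each Box dimension before handing the environment to EPyMARL; the wrapper below is a sketch under that assumption, applied per agent, not a supported feature of the repo:

import numpy as np
import gym
from gym import spaces

class DiscretiseAction(gym.ActionWrapper):
    """Expose a Discrete(n_bins) action space on top of a 1-D Box action space (sketch)."""
    def __init__(self, env, n_bins=9):
        super().__init__(env)
        assert isinstance(env.action_space, spaces.Box) and env.action_space.shape == (1,)
        self._bins = np.linspace(env.action_space.low[0], env.action_space.high[0], n_bins)
        self.action_space = spaces.Discrete(n_bins)

    def action(self, act):
        # map the discrete choice back to a continuous value for the wrapped env
        return np.array([self._bins[int(act)]], dtype=self.env.action_space.dtype)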

mappo and mappo_ns don't run

Hey y'all,

I'm trying to replicate the results from your paper but am unfamiliar with torch. I encountered an error when trying to run mappo and mappo_ns:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument mat1 in method wrapper_addmm)

I have tested all the other algorithms and none of them has this error. I was wondering whether you have encountered this error before and know of a way to force all the tensors onto either the CPU or (preferably) the GPU.

The system that I am running has a GPU with CUDA 11.4 and CuDNN 8.2 and is running Windows 10.

thanks for your attention,

Peter

Parameters of MAPPO in different scenarios

It's really nice work on benchmarking multi-agent RL algorithms. When I was running the code, I had some confusion about the parameters of the on-policy algorithms. In particular, when using the MAPPO algorithm, the win rate is always 0 while training on 2s_vs_1sc and 5m_vs_6m. I tried the mappo.yaml from the source code and the relevant parameters given in the paper, but it had no effect. Could you please provide the parameters for training MAPPO on the 2s_vs_1sc and 5m_vs_6m scenarios? Thank you.

TypeError: 'Box' object is not iterable, WHY?

Hi,
I used these lines in my custom environment:
self.observation_space = gym.spaces.Box(np.array([0,0,0,0,0,0,0,0,0,0]), np.array([9,9,9,9,9,9,9,9,9,9]), shape=(10,), dtype=np.int64)
self.action_space = gym.spaces.Box(low=0, high=9, shape=(1,), dtype=np.int64)

but I received this error:
.....\epymarl-main\src\envs\__init__.py", line 56, in __init__
for sa_obs in env.observation_space:
TypeError: 'Box' object is not iterable

thank you
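The traceback shows the wrapper iterating over one observation space per agent (for sa_obs in env.observation_space), so it expects a Tuple of per-agent spaces rather than a single Box, and likewise per-agent Discrete action spaces rather than a Box. A sketch of the expected shape, modelled on the custom environment in the next issue below (the exact requirements are an assumption on my part):

import numpy as np
import gym

n_agents = 3
# one Box per agent, wrapped in a Tuple, so iterating over the space yields per-agent spaces
observation_space = gym.spaces.Tuple(tuple(
    [gym.spaces.Box(low=0, high=9, shape=(10,), dtype=np.int64)] * n_agents
))
action_space = gym.spaces.Tuple(tuple(
    [gym.spaces.Discrete(10)] * n_agents
))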

why "return" remains constant in my custom environment?

"return_mean" value remains constant value = 45.0 in all steps in metric.json with COMA or MAA2C. but why?

class ClassName(MultiAgentEnv):
    def __init__(self):
        self.n_agents = 10
        self.observation_space = gym.spaces.Tuple(tuple(
            [gym.spaces.Box(np.array([0,0,0,0,0,0,0,0,0,0]),
                            np.array([3,108,6,8,3,2,3,17,19,17]), shape=(10,), dtype=np.int64)] * self.n_agents))
        self.action_space = gym.spaces.Tuple(tuple([
            gym.spaces.Discrete(4),
            gym.spaces.Discrete(109),
            gym.spaces.Discrete(7),
            gym.spaces.Discrete(9),
            gym.spaces.Discrete(4),
            gym.spaces.Discrete(3),
            gym.spaces.Discrete(4),
            gym.spaces.Discrete(18),
            gym.spaces.Discrete(20),
            gym.spaces.Discrete(18)
        ]))
        self.state = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
        self.episode = 0
        self.thereshold = -10000
        super().__init__()

    def reset(self):
        self.state = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
        return self.observation_space.sample()

    def step(self, actions):
        cooperation_reward = Compute_reward_function(actions)
        self.state = actions
        obs, rew, dones, info = {}, {}, {}, {}
        for i in range(10):
            obs[i] = self.observation_space.sample()
            rew[i] = cooperation_reward
            dones[i] = False
            info[i] = {}
        if cooperation_reward > self.thereshold:
            self.thereshold = cooperation_reward
            dones = self.n_agents * [True]
        self.episode += 1
        dones = self.n_agents * [True]
        return obs, rew, dones, info
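One thing stands out in the snippet, offered as an observation rather than a confirmed diagnosis: the final dones = self.n_agents * [True] runs unconditionally, so every episode terminates after a single step regardless of the threshold logic above it, and since reset() returns a fresh random sample, the logged return is just a single step's reward. Making termination depend on the threshold (or a step counter) would be the first thing to try:

# inside step(), replacing the threshold block and the final unconditional dones line (sketch)
improved = cooperation_reward > self.thereshold
if improved:
    self.thereshold = cooperation_reward
dones = self.n_agents * [improved]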

Implementation tricks for different algorithms

It's really nice work on benchmarking multi-agent RL algorithms and I really like it. Going through the code, I found that epymarl basically only implements the basic version of each algorithm and leaves out many algorithm-specific implementation tricks, such as the value-normalisation trick for MAPPO mentioned in the MAPPO paper, or the value-clipping trick for IPPO mentioned in the IPPO paper. Will a future version of epymarl support such tricks?
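Until such tricks land upstream, the value-clipping trick from the IPPO paper is small enough to patch into a PPO-style learner locally. A minimal sketch (illustrative variable names, not EPyMARL's code):

import torch as th

def clipped_value_loss(values, old_values, returns, clip_eps=0.2):
    """PPO-style value clipping: take the worse of the clipped and unclipped squared errors."""
    values_clipped = old_values + th.clamp(values - old_values, -clip_eps, clip_eps)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    return th.max(loss_unclipped, loss_clipped).mean()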

GPU usage without run.sh

Hi,

Because I have some problems running an experiment using the bash run.sh command, could you please let me know whether the standard running command python3 src/main.py --config=qmix --env-config=sc2 with env_args.map_name=2s3z automatically checks whether there is a GPU device and then uses that GPU to run the experiment?

I checked the GPU usage in real time and it is true that the usage varies from 7% to 40%, so I don't know whether the fact that it is not 100% used is due to some problem on my side or is just the way the code is designed.

Thanks!

What does ns stand for?

There are many config files under ./src/config/algs and Python modules under ./src/modules with the suffix ns. I went through the paper, the code comments, the README and even the PyMARL project, but failed to find its meaning. What does it stand for?

Exploding targets for A2C and task Tag when add_value_last_step: True

I would like to report an issue which has a defining impact on the critic-based
algorithms for the MPE PredatorPrey task. The target_value variable, built
for training the critic network, is about three orders of magnitude larger than
return_mean, the long-term sample return. This behaviour is abnormal for
three reasons: (i) return_mean is not discounted whereas target_value
is discounted; the rewards used to estimate the value are (ii) unbiased,
since we subtract a non-negative value, (iii) and re-scaled, since we divide by
a standard deviation larger than one. Figure 1 depicts the behaviour:

Figure 1

Reproduce:

  • Commit reference: 3d1463d
  • Divide the rewards by a factor of 3.0 according to issue #29
  • Set configurations: (i) maa2c_ns.yaml according to Section C.1, subsection
    MPE PredatorPrey, and Table 23 from the Supplemental. (ii) Set time_limit=25
    in gymma.yaml.

Config:

{   
    "action_selector": "soft_policies",
    "add_value_last_step": true,
    "agent": "rnn_ns",
    "agent_output_type": "pi_logits",
    "batch_size": 10,
    "batch_size_run": 10,
    "buffer_cpu_only": true,
    "buffer_size": 10,
    "checkpoint_path": "",
    "critic_type": "cv_critic_ns",
    "entropy_coef": 0.01,
    "env": "gymma",
    "env_args": {   "key": "mpe:SimpleTag-v0",
                    "pretrained_wrapper": "PretrainedTag",
                    "seed": 853609918,
                    "state_last_action": false,
                    "time_limit": 25},
    "evaluate": false,
    "gamma": 0.99,
    "grad_norm_clip": 10,
    "hidden_dim": 128,
    "hypergroup": null,
    "label": "default_label",
    "learner": "actor_critic_learner",
    "learner_log_interval": 10000,
    "load_step": 0,
    "local_results_path": "results",
    "log_interval": 250000,
    "lr": 0.0003,
    "mac": "non_shared_mac",
    "mask_before_softmax": true,
    "name": "maa2c_ns",
    "obs_agent_id": false,
    "obs_individual_obs": false,
    "obs_last_action": false,
    "optim_alpha": 0.99,
    "optim_eps": 1e-05,
    "q_nstep": 5,
    "repeat_id": 1,
    "runner": "parallel",
    "runner_log_interval": 10000,
    "save_model": false,
    "save_model_interval": 500000,
    "save_replay": false,
    "seed": 853609918,
    "standardise_returns": false,
    "standardise_rewards": true,
    "t_max": 20050000,
    "target_update_interval_or_tau": 0.01,
    "test_greedy": true,
    "test_interval": 500000,
    "test_nepisode": 100,
    "use_cuda": false,
    "use_rnn": true,
    "use_tensorboard": true
}

Correction

The process is better controlled when the option add_value_last_step is set to false, as shown in Figure 2.

Figure 2

Config

{    
    "action_selector":  "soft_policies",
    "add_value_last_step": false,
    "agent": "rnn_ns",
    "agent_output_type": "pi_logits",
    "batch_size": 10,
    "batch_size_run": 10,
    "buffer_cpu_only": true,
    "buffer_size": 10,
    "checkpoint_path": "",
    "critic_type": "cv_critic_ns",
    "entropy_coef": 0.01,
    "env": "gymma",
    "env_args": {   "key": "mpe:SimpleTag-v0",
                    "pretrained_wrapper": "PretrainedTag",
                    "seed": 932101488,
                    "state_last_action": false,
                    "time_limit": 25},
    "evaluate": false,
    "gamma": 0.99,
    "grad_norm_clip": 10,
    "hidden_dim": 128,
    "hypergroup": null,
    "label": "default_label",
    "learner": "actor_critic_learner",
    "learner_log_interval": 10000,
    "load_step": 0,
    "local_results_path": "results",
    "log_interval": 250000,
    "lr": 0.0003,
    "mac": "non_shared_mac",
    "mask_before_softmax": true,
    "name": "maa2c_ns",
    "obs_agent_id": false,
    "obs_individual_obs": false,
    "obs_last_action": false,
    "optim_alpha": 0.99,
    "optim_eps": 1e-05,
    "q_nstep": 5,
    "repeat_id": 1,
    "runner": "parallel",
    "runner_log_interval": 10000,
    "save_model": false,
    "save_model_interval": 500000,
    "save_replay": false,
    "seed": 932101488,
    "standardise_returns": false,
    "standardise_rewards": true,
    "t_max": 20050000,
    "target_update_interval_or_tau": 0.01,
    "test_greedy": true,
    "test_interval": 500000,
    "test_nepisode": 100,
    "use_cuda": false,
    "use_rnn": true,
    "use_tensorboard": true
}

This is the output from the diff command between the two configuration files.

1,3c1,3
< {   
<     "action_selector": "soft_policies",
<     "add_value_last_step": true,
---
> {    
>     "action_selector":  "soft_policies",
>     "add_value_last_step": false,
16c16
<                     "seed": 853609918,
---
>                     "seed": 932101488,
46c46
<     "seed": 853609918,
---
>     "seed": 932101488,

Unfortunately, I wasn't able to verify the published numbers even after correcting this flag. Could you point me in the right direction?

Thanks in advance.

About training the MADDPG in MPE.

I am having some problems with your code base and I hope you can help me.
I have tried to run MADDPG in both the "SimpleTag-v0" and "SimpleSpread-v0" scenarios, but I have not achieved the results mentioned in your paper: SimpleSpread-v0 gets a return mean of -400 and SimpleTag-v0 gets a return mean of 5.

I am using the run command :
python3 src/main.py --config=maddpg --env-config=gymma with env_args.time_limit=25 env_args.key="mpe:SimpleTag-v0" env_args.pretrained_wrapper="PretrainedTag"

python3 src/main.py --config=maddpg --env-config=gymma with env_args.time_limit=25 env_args.key="mpe:SimpleSpread-v0"

Could you please tell me what is wrong? My hyperparameters are set according to the values mentioned in your paper.

Thanks!

How to render or view gifs?

I've just started using this repository and would like to view rendered results from the simple_tag environment. I've run the line provided for the pretrained models, python3 src/main.py --config=qmix --env-config=gymma with env_args.time_limit=25 env_args.key="mpe:SimpleTag-v0" env_args.pretrained_wrapper="PretrainedTag", but have not had any luck.

It does not look like rendering during training/evaluation is supported; is this the case? When trying to save a gif of a trained model after the fact, I do not see anything saved that I can view after setting save_replay: True. What am I missing?
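As far as I can tell (an assumption pieced together from the runner code paths quoted in other issues here, not from documentation), nothing in the episode runner calls env.render(), and save_replay only drives SMAC's own replay mechanism, so for gym-based envs like MPE the usual approach is to patch a render call into the runner's episode loop when evaluating, e.g.:

# inside the episode loop of the runner, guarded by a (hypothetical) render flag
if test_mode and getattr(self.args, "render", False):
    self.env.render()

Frames captured this way can then be stitched into a gif with imageio, which already appears in the dependency list of another issue above.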

Problem with the implementation of MADDPG

I noticed that, when updating the actors in the MADDPG paper, only the action of agent i is generated by its current policy, while the actions of the other agents come from the buffer.

However, in this codebase the actions of the other agents also come from their respective current policies. Will this cause problems?

Evaluate results question

Hi,

I have a question about the evaluation of a model. I use the command as described:

python3 src/main.py --config=qmix --env-config=sc2 with env_args.map_name=Multi_task_6m1M_vs_12m1M checkpoint_path=results/models/qmix_best/ save_replay=True test_nepisode=5 evaluate=True

So I run the model for evaluation over 5 episodes, but the results for return_mean and the other metrics have only one value.

[screenshot omitted]

I tried some modifications to the config but I get the same results.

What I am trying to do is obtain as many results as there are episodes, that is, the return and the other metrics obtained in each of the episodes.

Thanks!

RWARE with MAPPO training

Hi!

Whenever I try to run the MAPPO algorithm, executing:

python3 src/main.py --config=mappo --env-config=gymma with env_args.time_limit=500 env_args.key="rware:rware-tiny-2ag-v1"

I obtain the following error:

Traceback (most recent call last):
  File "/Users/miguelferreira/PycharmProjects/AASMA/venv/lib/python3.8/site-packages/sacred/experiment.py", line 312, in run_commandline
    return self.run(
  File "/Users/miguelferreira/PycharmProjects/AASMA/venv/lib/python3.8/site-packages/sacred/experiment.py", line 276, in run
    run()
  File "/Users/miguelferreira/PycharmProjects/AASMA/venv/lib/python3.8/site-packages/sacred/run.py", line 238, in __call__
    self.result = self.main_function(*args)
  File "/Users/miguelferreira/PycharmProjects/AASMA/venv/lib/python3.8/site-packages/sacred/config/captured_function.py", line 42, in captured_function
    result = wrapped(*args, **kwargs)
  File "src/main.py", line 36, in my_main
    run(_run, config, _log)
  File "/Users/miguelferreira/PycharmProjects/AASMA/epymarl/src/run.py", line 55, in run
    run_sequential(args=args, logger=logger)
  File "/Users/miguelferreira/PycharmProjects/AASMA/epymarl/src/run.py", line 127, in run_sequential
    learner = le_REGISTRY[args.learner](mac, buffer.scheme, logger, args)
  File "/Users/miguelferreira/PycharmProjects/AASMA/epymarl/src/learners/ppo_learner.py", line 37, in __init__
    if self.args.standardise_rewards:
AttributeError: 'types.SimpleNamespace' object has no attribute 'standardise_rewards'

During handling of the above exception, another exception occurred:

Does anyone have any tips on how to fix this?

Thanks in advance!

LBF results with IA2C and MAA2C: Foraging-15x15-4p-5f-v1

Hi,

I would like to congratulate the authors on the article and on the initiative of maintaining the benchmark environments the community sorely needs. For the first time ever, I haven't had any major issues replicating the results from a reinforcement learning article, with the exception of the Level-Based Foraging environment, namely the task Foraging-15x15-4p-5f-v1 and the algorithms IA2C and MAA2C with parameter sharing. The plots that I got were:

Foraging-15x15-4p-5f-v1-mine

When we contrast with yours:

Foraging-15x15-4p-5f-v1-theirs

We can draw the following conclusions:

  • For both algorithms there is a lot more of standard-error. Prominently at the beginning of the training.
  • Judging by the average returns: my IA2C slightly over-performs yours while my MAA2C slightly under-performs yours.
  • As a result, while your plots show that MAA2C is a better option we cannot draw the same conclusion by looking at mine.

By using the evaluation metrics I was able to get:

                   IA2C            IA2C            MAA2C           MAA2C
                   Mine            Yours           Mine            Yours
Maximum Return     0.92 ± 0.05     0.93 ± 0.03     0.92 ± 0.02     0.95 ± 0.01
Average Return     0.65 ± 0.04     0.59 ± 0.06     0.67 ± 0.04     0.73 ± 0.02

The random seeds used for both experiments are:

  • 291174067
  • 392184168
  • 402285178
  • 493194269
  • 503295279

The configuration for the IA2C:

{
  "action_selector": "soft_policies",
  "add_value_last_step": true,
  "agent": "rnn",
  "agent_output_type": "pi_logits",
  "batch_size": 10,
  "batch_size_run": 10,
  "buffer_cpu_only": true,
  "buffer_size": 10,
  "checkpoint_path": "",
  "critic_type": "ac_critic",
  "entropy_coef": 0.001,
  "env": "gymma",
  "env_args": {
    "key": "lbforaging:Foraging-15x15-4p-5f-v1",
    "pretrained_wrapper": null,
    "state_last_action": false,
    "time_limit": 50
  },
  "evaluate": false,
  "gamma": 0.99,
  "grad_norm_clip": 10,
  "hidden_dim": 128,
  "hypergroup": null,
  "label": "default_label",
  "learner": "actor_critic_learner",
  "learner_log_interval": 10000,
  "load_step": 0,
  "local_results_path": "results",
  "log_interval": 500000,
  "lr": 0.0005,
  "mac": "basic_mac",
  "mask_before_softmax": true,
  "name": "ia2c",
  "obs_agent_id": true,
  "obs_individual_obs": false,
  "obs_last_action": false,
  "optim_alpha": 0.99,
  "optim_eps": 1e-05,
  "q_nstep": 5,
  "repeat_id": 1,
  "runner": "parallel",
  "runner_log_interval": 10000,
  "save_model": true,
  "save_model_interval": 500000,
  "save_replay": false,
  "seed": 291174067,
  "standardise_returns": false,
  "standardise_rewards": true,
  "t_max": 20500000,
  "target_update_interval_or_tau": 0.01,
  "test_greedy": true,
  "test_interval": 500000,
  "test_nepisode": 100,
  "use_cuda": false,
  "use_rnn": true,
  "use_tensorboard": true
}

Note that I wasn't sure why the configs in src/config/algos/ had a larger t_max than reported. In this case I dropped the last evaluation from this analysis.

The diff between ia2c/1/config.json and maa2c/1/config.json:

11,12c11,12
<   "critic_type": "ac_critic",
<   "entropy_coef": 0.001,
---
>   "critic_type": "cv_critic",
>   "entropy_coef": 0.01,
34c34
<   "name": "ia2c",
---
>   "name": "maa2c",
40c40
<   "q_nstep": 5,
---
>   "q_nstep": 10,
50c50
<   "t_max": 20500000,
---
>   "t_max": 20050000,

Unless I missed something, the critic_type isn't covered in the article; the differences in configuration are compatible with the different configurations reported in Tables 14 and 24.

Could you clarify what may be happening, and how I can approximate your results?
