lucasalegre / morl-baselines Goto Github PK

View Code? Open in Web Editor NEW

232.0 6.0 32.0 22.73 MB

Multi-Objective Reinforcement Learning algorithms implementations.

Home Page: https://lucasalegre.github.io/morl-baselines

License: MIT License

Python 100.00%

rl gym morl multi-objective-rl pytorch rl-algorithms gymnasium reinforcement-learning mo-gymnasium multi-objective

morl-baselines's People

Contributors

Stargazers

Watchers

morl-baselines's Issues

Why is PCN only working for deterministic envs?

Hi,

I'm benchmarking some of the algorithms in this repository, and I noticed that the README mentions that the current PCN implementation only works for environments with deterministic transitions. However, I don't see an issue with the code that would make it unsuitable for stochastic envs. If there is still such an issue, what would be a fix for it?

The only thing that I can think of is evaluation: if the transitions are stochastic, then we have to sample multiple rollouts to find the average reward for each policy and better approximate metrics such as hypervolume. Was that what you had in mind?

Thanks!

Refactor Multi-Policy MO-Qlearning

Refactor MPMOQLearning such that is can use OLS or GPI-LS inside train() method.

Automatic Hyper-parameters tuning

Tryout optuna or wandb sweeps

Migration from gym to gymnasium (also from mo-gym to mo-gymnasium)

As the new mo-gymnasium been released (0.3.0), I think the library names here for gym and mo-gym should also be updated to gymnasium and mo-gymnasim, respectively. Otherwise issues like ModuleNotFoundError: No module named 'gym' will pop up. @LucasAlegre

Missing instructions on how to train CAPQL agent

Could you please provide an example code on how to train CAPQL agent? I don't find it in the README file or in other documentation. Thank you.

References points for the Environments

Hi! please could you provide the reference points for each of the environments on Mo-gym or as used in the MORL-baselines

Visualizing the generated Pareto front

Hello, thanks so much for sharing the codes!

I am a beginner in multi-objective reinforcement learning and would like to apply CAPQL in my domain. I have three questions:

(1) is the CAPQL code able to optimize three or more objectives currently?
(2) how to visualize the finally generated Pareto front?
(3) it that possible to visualize the Pareto front with three objectives?

Custom Enviroment

Hi I'm a new user, I will like to apply MORL to my custom problem, is the library easy to adapt to a custom environment?

Utility for visualizing the evolution of the Pareto front

Plot 2D, 3D fronts with sns or similar.
Idea: find a way to visualize the evolution of the front over the training phase in wandb (kind of like a video recorder for the PF).

Idea2: an interactive tool for executing the policy being chosen on the Pareto front (hovered with mouse pointer or something similar).

Add CAPQL

It should be very easy to add CAPQL from https://openreview.net/pdf?id=TjEzIsyEsQ6 . In a nutshell, it is a MO version of SAC, and nothing else.

Code: https://github.com/haoyelu/CAPQL

Performance report issue tracker

This issue is there to allow to coordinate who is running what and see a more or less live update of the performances being uploaded to openrlbenchmark.

See all runs: openrlbenchmark

How to help?

Mark your name on an algo/env combination and state the runs as you make them.

Run command with benchmark script:

python benchmark/launch_experiment.py --algo <ALGO> --env-id <ENV_ID> --num-timesteps 1000000 --gamma 0.99 --ref-point ... --auto-tag True --wandb-entity openrlbenchmark --seed <0 to 9> --init-hyperparams ... --train-hyperparams ...

Deterministic envs

For all deterministic environments, we push the learning rate to 1.0 and exploration rate higher since it's all about exploring fast in these cases. Our deterministic envs:

deep-sea-treasure-v0
deep-sea-treasure-concave-v0
four-room-v0
fruit-tree-v0

Multi-policy

✅ CAPQL

--env-id mo-lunar-lander-continuous-v2 --num-timesteps 50000 --ref-point -110 -400 -100 -100 --init-hyperparams "alpha:0.2" 10/10
--env-id mo-halfcheetah-v4 --num-timesteps 200000 --ref-point -100 -100 --init-hyperparams "alpha:0.2" 10/10
--env-id mo-hopper-2d-v4 --num-timesteps 200000 --ref-point -100 -100 --init-hyperparams "alpha:0.2" 10/10
--env-id mo-hopper-v4 --num-timesteps 200000 --ref-point -100 -100 -100 --init-hyperparams "alpha:0.2" 10/10

✅ GPI-LS continuous

--algo gpi_ls_continuous

--env-id mo-lunar-lander-continuous-v2 --num-timesteps 200000 --ref-point -110 -400 -100 -100 --init-hyperparams "per:False" 10/10
--env-id mo-halfcheetah-v4 --num-timesteps 200000 --ref-point -100 -100 --init-hyperparams "per:False" 10/10
--env-id mo-hopper-2d-v4 --num-timesteps 200000 --ref-point -100 -100 --init-hyperparams "per:False" 10/10
--env-id mo-hopper-v4 --num-timesteps 200000 --ref-point -100 -100 -100 --init-hyperparams "per:False" 10/10

✅ GPI-PD continuous

--algo gpi_pd_continuous

--env-id mo-lunar-lander-continuous-v2 --num-timesteps 200000 --ref-point -110 -400 -100 -100 10/10
--env-id mo-halfcheetah-v4 --num-timesteps 100000 --ref-point -100 -100 10/10
--env-id mo-hopper-2d-v4 --num-timesteps 100000 --ref-point -100 -100 10/10
--env-id mo-hopper-v4 --num-timesteps 100000 --ref-point -100 -100 -100 10/10

✅ GPI-LS discrete

--algo gpi_ls_discrete

--env-id mo-mountaincar-v0 --num-timesteps 200000 --ref-point -200 -200 -200 --init-hyperparams "per:False" "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:100000" "target_net_update_freq:200" "gradient_updates:10" 10/10
--env-id mo-lunar-lander-v2 --num-timesteps 200000 --ref-point -101 -1001 -101 -101 --init-hyperparams "per:False" "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:100000" "target_net_update_freq:200" "gradient_updates:10" 10/10
--env-id minecart-v0 --num-timesteps 200000 --gamma 0.98 --ref-point -1 -1 -200 --init-hyperparams "per:False" "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:100000" "target_net_update_freq:200" "gradient_updates:10" 10/10
--env-id mo-highway-fast-v0 --num-timesteps 200000 --ref-point -1 -1 -40 --init-hyperparams "per:False" "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:100000" "target_net_update_freq:200" "gradient_updates:10" 10/10
--env-id mo-reacher-v4 --num-timesteps 200000 --ref-point -50 -50 -50 -50 --init-hyperparams "per:False" "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:100000" "target_net_update_freq:200" "gradient_updates:10" 10/10

✅ GPI-PD discrete

--algo gpi_pd_discrete

--env-id mo-mountaincar-v0 --num-timesteps 50000 --ref-point -200 -200 -200 10/10
--env-id mo-lunar-lander-v2 --num-timesteps 200000 --ref-point -101 -1001 -101 -101 --init-hyperparams "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:100000" "target_net_update_freq:200" "gradient_updates:10" 10/10
--env-id minecart-v0 --num-timesteps 200000 --gamma 0.98 --ref-point -1 -1 -200 --init-hyperparams "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:100000" "target_net_update_freq:200" "gradient_updates:10" 10/10
--env-id mo-highway-fast-v0 --num-timesteps 100000 --ref-point -1 -1 -40 10/10
--env-id mo-reacher-v4 --num-timesteps 200000 --ref-point -50 -50 -50 -50 --init-hyperparams "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:100000" "target_net_update_freq:200" "gradient_updates:10" 10/10

✅ Envelope

--algo envelope

--env-id mo-mountaincar-v0 --num-timesteps 1000000 --ref-point -200 -200 -200 --init-hyperparams "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:500000" 10/10
--env-id mo-lunar-lander-v2 --num-timesteps 1000000 --ref-point -101 -1001 -101 -101 --init-hyperparams "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:500000" 10/10
--env-id minecart-v0 --gamma 0.98 --num-timesteps 1000000 --ref-point -1 -1 -200 --init-hyperparams "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:500000" 10/10
--env-id mo-highway-fast-v0 --num-timesteps 1000000 --ref-point -1 -1 -40 --init-hyperparams "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:500000" 10/10
--env-id mo-reacher-v4 --num-timesteps 1000000 --ref-point -50 -50 -50 -50 --init-hyperparams "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:500000" 10/10

✅ PGMORL

--algo pgmorl

--env-id mo-mountaincarcontinuous-v0 --num-timesteps 3000000 --ref-point -110 -110 10/10
--env-id mo-halfcheetah-v4 --num-timesteps 5000000 --ref-point -100 -100 10/10
--env-id mo-hopper-2d-v4 --num-timesteps 5000000 --ref-point -100 -100 10/10

PCN

--algo pcn

--env-id mo-mountaincar-v0 --init-hyperparams "scaling_factor:np.array([...]) 0/10
--env-id mo-lunar-lander-v2 --init-hyperparams "scaling_factor:np.array([...]) 0/10
`--env-id mo-highway-fast-v0
`--env-id mo-reacher-v4
--algo pcn --env-id minecart-v0 --gamma 0.98 --ref-point -1 -1 -200 --num-timesteps 10000000 --auto-tag True --wandb-entity openrlbenchmark --seed 0 --init-hyperparams "scaling_factor:np.array([1, 1, 0.1, 0.1])" --train-hyperparams "max_return:1.5" 0/10

✅ PQL (deterministic envs)

--algo pql

--env-id deep-sea-treasure-v0 --num-timesteps 200000 --ref-point 0 -50 --init-hyperparams "ref_point:np.array([0, -50])" 10/10 (deterministic env)
--env-id deep-sea-treasure-concave-v0 --num-timesteps 200000 --ref-point 0 -50 --init-hyperparams "ref_point:np.array([0, -50])" 10/10 (deterministic env)
--env-id fruit-tree-v0 --num-timesteps 150000 --ref-point -1 -1 -1 -1 -1 -1 --init-hyperparams "ref_point:np.array([-1, -1, -1, -1, -1, -1])" 10/10 (deterministic env)

✅ GPI-LS tabular

--algo gpi-ls --init-hyperparams "use_gpi_policy:True"

--env-id deep-sea-treasure-v0 --num-timesteps 400000 --gamma 0.99 --ref-point 0 -50 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" "use_gpi_policy:True" --train-hyperparams "timesteps_per_iteration:int(1e4)"10/10 (deterministic env)
--env-id deep-sea-treasure-concave-v0 --num-timesteps 400000 --gamma 0.99 --ref-point 0 -50 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" "use_gpi_policy:True" --train-hyperparams "timesteps_per_iteration:int(1e4)" 10/10 (deterministic env)
--env-id resource-gathering-v0 --num-timesteps 1000000 --ref-point -1 -1 -2 --init-hyperparams "use_gpi_policy:True" --train-hyperparams "timesteps_per_iteration:int(1e4)" "num_eval_episodes_for_front:20" 10/10
--env-id fruit-tree-v0 --num-timesteps 400000 --gamma 0.99 --ref-point -1 -1 -1 -1 -1 -1 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" "use_gpi_policy:True" --train-hyperparams "timesteps_per_iteration:int(1e4)" 10/10 (deterministic env)
--env-id four-room-v0 --num-timesteps 400000 --ref-point -1 -1 -1 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" "use_gpi_policy:True" --train-hyperparams "timesteps_per_iteration:int(1e4)" 10/10 (deterministic env)

✅ MPMOQL

--algo mpmoql

--env-id deep-sea-treasure-v0 --num-timesteps 1000000 --gamma 0.99 --ref-point 0 -50 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" --train-hyperparams "timesteps_per_iteration:int(1e4)"10/10 (deterministic env)
--env-id deep-sea-treasure-concave-v0 --num-timesteps 1000000 --gamma 0.99 --ref-point 0 -50 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" --train-hyperparams "timesteps_per_iteration:int(1e4)" 10/10 (deterministic env)
--env-id resource-gathering-v0 --num-timesteps 1000000 --ref-point -1 -1 -2 --train-hyperparams "timesteps_per_iteration:int(1e4)" "num_eval_episodes_for_front:20" 10/10
--env-id fruit-tree-v0 --num-timesteps 1000000 --gamma 0.99 --ref-point -1 -1 -1 -1 -1 -1 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" --train-hyperparams "timesteps_per_iteration:int(1e4)" 10/10 (deterministic env)
--env-id four-room-v0 --num-timesteps 1000000 --ref-point -1 -1 -1 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" --train-hyperparams "timesteps_per_iteration:int(1e4)" 10/10 (deterministic env)

✅ OLS

--algo ols --init-hyperparams "weight_selection_algo:'ols'" "epsilon_ols:0.0"

--env-id deep-sea-treasure-v0 --num-timesteps 1000000 --gamma 0.99 --ref-point 0 -50 --init-hyperparams "learning_rate:1.0" "weight_selection_algo:'ols'" "epsilon_ols:0.0" "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" --train-hyperparams "timesteps_per_iteration:int(1e4)" 10/10 (deterministic env)
--env-id deep-sea-treasure-concave-v0 --num-timesteps 1000000 --gamma 0.99 --ref-point 0 -50 --init-hyperparams "learning_rate:1.0" "weight_selection_algo:'ols'" "epsilon_ols:0.0" "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" --train-hyperparams "timesteps_per_iteration:int(1e4)" 10/10 (deterministic env)
--env-id resource-gathering-v0 --num-timesteps 1000000 --ref-point -1 -1 -2 --train-hyperparams "timesteps_per_iteration:int(1e4)" "num_eval_episodes_for_front:20" 10/10
--env-id fruit-tree-v0 --num-timesteps 1000000 --ref-point -1 -1 -1 -1 -1 -1 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" "weight_selection_algo:'ols'" "epsilon_ols:0.0" --train-hyperparams "timesteps_per_iteration:int(1e4)" 10/10 (deterministic env)
--env-id four-room-v0 resource-gathering-v0 --num-timesteps 1000000 --ref-point -1 -1 -1 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" "weight_selection_algo:'ols'" "epsilon_ols:0.0" --train-hyperparams "timesteps_per_iteration:int(1e4)" 10/10 (deterministic env)

Single-policy

MOQL

EUPG

deep-sea-treasure-concave-v0 0/10
fishwood-v0 0/10

Bug in PCN due to logging

Hi, it seems I introduced a bug in PCN when extending the logger to also log the training parameters. I will open a pull request in a minute with a quick fix but I wanted to have this issue as a reference.

The problem starts at the following line that initialises a parameter with type np.ndarray to a float.

max_return: np.ndarray = 100.0

When we want to log this parameter, it is assumed that max_return is indeed an array and that tolist() is available. However, since it is initialised to a float, this raises an error at the following line:

"max_return": max_return.tolist(),

Add documentation

TODO:

Fix bug https://github.com/LucasAlegre/morl-baselines/actions/runs/4000541688/jobs/6865789098#step:6:41

Does anyone know how to load the trained model?

I follow the instruction on this page: https://mo-gymnasium.farama.org/introduction/api/
to run the code, but I got the following error:
Envelope.act() missing 1 required positional argument: 'w'
Thank you very much if anyone could help!!

Error when running CAPQL-based algorithm on Hopper

Hello,
I received this error when evaluating an algorithm based on CAPQL on hopper. Any insights what may have caused this?

File "/home/sahandr/workspace/morl-baselines/morl_baselines/common/evaluation.py", line 168, in log_all_multi_policy_metrics
    hv = hypervolume(hv_ref_point, filtered_front)
  File "/home/sahandr/workspace/morl-baselines/morl_baselines/common/performance_indicators.py", line 24, in hypervolume
    return HV(ref_point=ref_point * -1)(np.array(points) * -1)
  File "/home/sahandr/env/morlenv/lib/python3.9/site-packages/pymoo/core/indicator.py", line 15, in __call__
    return self.do(F, *args, **kwargs)
  File "/home/sahandr/env/morlenv/lib/python3.9/site-packages/pymoo/core/indicator.py", line 30, in do
    return self._do(F, *args, **kwargs)
  File "/home/sahandr/env/morlenv/lib/python3.9/site-packages/pymoo/indicators/hv/__init__.py", line 43, in _do
    val = hv.compute(F)
  File "/home/sahandr/env/morlenv/lib/python3.9/site-packages/pymoo/vendor/hv.py", line 56, in compute
    if weaklyDominates(point, referencePoint):
  File "/home/sahandr/env/morlenv/lib/python3.9/site-packages/pymoo/vendor/hv.py", line 47, in weaklyDominates
    if point[i] > other[i]:
IndexError: index 2 is out of bounds for axis 0 with size 2

Thanks in advance!
Sahand

Pareto Q-Learning fails with any environment that doesn't have 2D state/obs space

As the title suggests running PQL with an environment that doesn't have a 2D state/obs space will run into an error during training (pql.py):

state = int(np.ravel_multi_index(state, self.env_shape))

ravel_multi_index expects the first parameter to be of the same sequence length as the second otherwise it will throw a ValueError. Since env_shape is hard-coded to be 2D

self.env_shape = ( int(high_bound[0] - low_bound[0] + 1), int(high_bound[1] - low_bound[1] + 1), )

this assumes the state/obs is always of length 2. I assume something is off here, or that I've got something completely wrong but I'm consistently getting errors in environments that don't have a 2D obs space.

why do the GPI-LS and GPI-PD run for a very long time?

Hi Dears,

I would like to say thanks for your contributions to maintaining the code library, which helped me to understand MORLs.

When I try to run the GPI-LS and GPI-PD on the env namely Deep Sea Treasure, I find the running time is very long with total timestep=1e5. It takes 7-8 hours to run on my PC with a single NVIDIA A30 GPU, is this common case?

Looking forward to your reply, and thanks again.

Thanks,
Tianyang

Help Regarding Interpreting PGMORL Convergence

Hi,

I am a novice in the multi-objective RL realm, albeit I have quite a bit of experience working with single-objective RL.

I started off with single-objective/regular RL on one of my projects involving a drone performing a specific task. I was employing PPO (single-objective from stable_baselines3) and with a few experiments, the algorithm converged decently with the ESR - Expected Scalarized Return approach, i.e. first scalarize the returns (weighted sum) and then the expectation.

Then, I was curious to try out multi-objective RL as it made sense to tune one of the objectives I had in my reward function.
So, I converted my environment to a multi-objective one by simply extending from the base environment and redefining the reward function. However, on training an array of agents using PGMORL, I observed that none of the agents managed to converge for either of the objectives even after training for a really long time (1e7 timesteps). The entropy graph looked startling to me, the fact that the policy entropy keeps going up and down, ideally it should be reducing to some lower value.
See the entropy graph below. This is just one of the individuals, though it was the same case for all the learned policies.

I figured it could have been an underfit scenario and tried expanding the network archs, but it not help.

I then reverted to testing out one of the examples provided by PGMORL to observe the results. I ran the halfcheetah example as follows (default parameters listed here: #43 for PGMOL halfcheetah), except that I changed the origin to [-0, -0]

python benchmark/launch_experiment.py --algo pgmorl --env-id mo-halfcheetah-v4 --num-timesteps 5000000 --gamma 0.99 --ref-point -0 -0 --auto-tag True --seed 0 --init-hyperparams "project_name:'mo-halfcheetah'"

And the results I observed are quite similar. The entropy loss keeps fluctuating back and forth.

more results for half-cheetah

Do these results imply that the algorithm is unable to converge? Or should I just run the training for even longer?

Thanks,
Arshad

Multi Discrete action space

I have a question concerning the multi-discrete action space .I'm supposed to have an action which is a vector not a scalar but for example in pql.py the action is a scalar (for different functions ). multiplying the values of the vector is not logic it's like working on a discrete space but with more actions.

Evaluation of the generated Pareto front

Hello! Thank you for your codes!

As a beginner in MORL, I have two questions:

By referring to Xu et al. (2020), "a desired Pareto set approximation is expected to have high hypervolume metric and low sparsity metric." When I use CAPQL to generate Pareto fronts, in many cases, one generated Pareto front may have the highest hypervolume but its sparsity is not the lowest. I am wondering if there is a comprehensive indicator to evaluate the Pareto front.
I applied to CAPQL to a customized environment with four objectives where three of them are always positive and another is between [-1,1]. Below is a generated Pareto front. You can see some points even form a cluster. Theoretically, it should be something like a quarter circle. I think that might be due to my reward setting. Are there any requirements on the range (e.g., all should be positive) or the function format (e.g., linear function. Currently, three of my objectives are quadratic.) of the objective for CAPQL?

Thanks in advance!

Using mo-highway environment with morl baselines

Hi!
Thanks for your hard work on these implementations!
I've been trying to train an agent in the mo-highway-fast-v0 environment using both Envelope and GPI-PD algorithms, but I get the same error when trying to run it:
RuntimeError: Calculated padded input size per channel: (5 x 5). Kernel size: (8 x 8). Kernel size can't be greater than actual input size
The error is coming from the QNet definition, more precisely from line 75 in morl_baselines_main/morl_baselines/common/networks.py:
self.feature_extractor = NatureCNN(self.obs_shape, features_dim=512)

When running other environments it works ok, so I was wondering if there is something wrong with my code or environment.
Thanks in advance for your help!

AttributeError: module 'cdd' has no attribute 'Matrix'

Hi team, I am working on testing highway-env under python gpi_pd_highway.py gpi-ls false 1

It encounter the error as here around 10000 steps. Do you have any idea to overide this error below ?

Traceback (most recent call last):
  File "gpi_pd_highway_bs.py", line 69, in <module>
    fire.Fire(main)
  File "/Users/zi6106738/.pyenv/versions/3.8.16/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/Users/zi6106738/.pyenv/versions/3.8.16/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/Users/zi6106738/.pyenv/versions/3.8.16/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "gpi_pd_highway_bs.py", line 58, in main
    agent.train(
  File "/Users/zi6106738/.pyenv/versions/3.8.16/lib/python3.8/site-packages/morl_baselines/multi_policy/gpi_pd/gpi_pd.py", line 844, in train
    w = linear_support.next_weight(
  File "/Users/zi6106738/.pyenv/versions/3.8.16/lib/python3.8/site-packages/morl_baselines/multi_policy/linear_support/linear_support.py", line 72, in next_weight
    W_corner = self.compute_corner_weights()
  File "/Users/zi6106738/.pyenv/versions/3.8.16/lib/python3.8/site-packages/morl_baselines/multi_policy/linear_support/linear_support.py", line 345, in compute_corner_weights
    vertices = compute_poly_vertices(A, b)
  File "/Users/zi6106738/.pyenv/versions/3.8.16/lib/python3.8/site-packages/morl_baselines/multi_policy/linear_support/linear_support.py", line 332, in compute_poly_vertices
    mat = cdd.Matrix(np.hstack([b, -A]), number_type="float")
AttributeError: module 'cdd' has no attribute 'Matrix'

I manually reinstall cdd as

zi6106738@***** morl-baselines % pip show cdd          
Name: cdd
Version: 0.1.12
Summary: improved file system navigation with cd
Home-page: https://github.com/daltonserey/cdd
Author: Dalton Serey
Author-email: [email protected]
License: GNU Affero General Public License v3
Location: /Users/zi6106738/.pyenv/versions/3.8.16/lib/python3.8/site-packages
Requires: 
Required-by:

But still have the error as here.

Thanks

Remove tensorboard writer

I believe there is no reason for us to use tensorboard + wandb. We could refactor the code to log everything purely using wandb.

issue when calculating the `log_probs` in EUPG.

I was studding the implementation of EUPG algorithm and I noticed that there is a problem when calculating the log_probs in line 179:
log_probs = current_distribution.log_prob(actions) using the fishwood environment. When updating the network, the size of actions is [200,1], and the size of current_distribution is [200,2]. But the size of log_probs is [200,200]. I suggest to use current_distribution.log_prob(actions.squeeze()) instead of current_distribution.log_prob(actions) in the implementation, but I'm not sure if it is correct or not.

Base testing

Implement tests which run algorithms very quickly, just to ensure they still do run after we make breaking changes.

Benchmark algorithms + wandb reports

Multi-core Training PGMORL

Hi,

I would to know if the algorithms support multi-core CPU processing or GPUs.

Currently, I am running PGMORL and I observe that my the computation is not utilized to its full potential. Initially, I was on M1 architecture but then switched to a 24 core 64-bit Linux machine, but I still observe that not all of the cores are being used, in fact just one of the cores reaches 100% during the training phase.

Since PGMORL trains several agents simultaneously, shouldn't it be possible to train them in parallel (on different cores)?

I looked up the documentation here PGMORL and found the reference to device parameter. So does it mean that GPU is supported?
Also is it possible to do multi-core training on CPUs using raylib? (I could not find any doc references to this)

Regards,
Arshad

how to run the file named gpi_pd_hopper.py in the example folder?

I try to run the code gpi_pd_hopper.py in the example folder, but got the following error:
ERROR: The function received no value for the required argument: algo
Usage: gpi_pd_minecart.py ALGO GPI_PD G
optional flags: --timesteps_per_iter | --seed

For detailed information on this command, run:
gpi_pd_minecart.py --help
An exception has occurred, use %tb to see the full traceback.

It seems that I have to pass some value to algo, could anyone give more instructions about how to run this code? Thank you very much!

Write general overview of each algorithm in the docs

The Pareto front generated by minecart is not optimal

I discovered that the Pareto front generated by the minecart environment is not optimal. I found this out by comparing the hypervolume of the Pareto front generated by GPI-LS with the one included in the environment itself. If the Pareto front generated by the environment was optimal, its hypervolume would be greater than or equal to the one resulting from GPI-LS. Instead, I observed that the hypervolume obtained by GPI-LS is greater.

Here is some code to reproduce it. I downloaded the Pareto front generated by the following run of GPI-LS: https://wandb.ai/openrlbenchmark/MORL-Baselines/runs/y6s3uaty
Note that the same gamma is used in both solution sets, so the comparison is valid.

import pygmo as pg
import pandas as pd
import mo_gymnasium as mo_gym


env = mo_gym.make("minecart-v0")
pf = np.array(env.unwrapped.pareto_front(gamma=0.98, symmetric=True))

ref_point = np.array([-1, -1, -200])
hv_pf = pg.hypervolume(-pf).compute(-ref_point)

print("Minecart")
print(f"True PF HV: {hv_pf}")
print("----------")

FILE_NAME = "GPI_LS_front.csv"  # Put the filename of the CSV here.
found_vecs = pd.read_csv(FILE_NAME).to_numpy()
hv_pf = pg.hypervolume(-found_vecs).compute(-ref_point)
print(f'GPI-LS PF HV: {hv_pf}')

I think it would be worthwhile to also verify whether the convex hull generated by minecart is correct, but I did not check this myself.

Finish GPI-PD tabular version

The current tabular version of GPI-PD does not have all the steps mentioned in the paper.

Unify exploration strategies

Use linear decay in PQL

Help Regarding GPILSContinuousAction Model Convergence

Hi,

I am fairly new to multi-objective RL, but I have quite a bit of experience working with single-objective tasks.

I was trying to train the GPILSContinousAction model on one of my quadrotor tasks, but the model doesn't seem to converge completely.

The task includes two objectives, and the single-objective PPO (ESR approach) had no problems converging, while the multi-objective RL seems to have a hard time. Initially, I tried using PGMORL since I was familiar with the PPO algorithm, but the performance was lacklustre with poor convergence. With GPILSContinuousAction model the reward seems to stagnate at $-39$ while the highest reward is 0.

I have also tried setting the weight supports prior to training to debug the convergence by having the weights as [[1.0, 0.0], [1.0, 0.25]], with the first weight it essentially converts the problem into a single-objective, however neither of the agents trained converge!

I would really appreciate your help!