lucasalegre / morl-baselines
Multi-Objective Reinforcement Learning algorithms implementations.
Home Page: https://lucasalegre.github.io/morl-baselines
License: MIT License
Hi,
I'm benchmarking some of the algorithms in this repository, and I noticed that the README mentions that the current PCN implementation only works for environments with deterministic transitions. However, I don't see an issue with the code that would make it unsuitable for stochastic envs. If there is still such an issue, what would be a fix for it?
The only thing that I can think of is evaluation: if the transitions are stochastic, then we have to sample multiple rollouts to find the average reward for each policy and better approximate metrics such as hypervolume. Was that what you had in mind?
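For concreteness, here is a minimal sketch of what I mean by averaging rollouts; the policy callable and episode count are placeholders, not the actual PCN API:

import numpy as np

def average_vector_return(env, policy, num_episodes=20, gamma=1.0):
    # Estimate the expected vector return of one policy by averaging several rollouts
    # in a (possibly stochastic) MO-Gymnasium environment.
    returns = []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        total, discount, done = None, 1.0, False
        while not done:
            action = policy(obs)  # placeholder: any callable mapping observation -> action
            obs, vec_reward, terminated, truncated, _ = env.step(action)
            vec_reward = discount * np.asarray(vec_reward, dtype=np.float64)
            total = vec_reward if total is None else total + vec_reward
            discount *= gamma
            done = terminated or truncated
        returns.append(total)
    return np.mean(returns, axis=0)  # use these averaged points for hypervolume etc.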
Thanks!
Refactor MPMOQLearning such that it can use OLS or GPI-LS inside the train() method.
Try out Optuna or wandb sweeps.
As the new mo-gymnasium has been released (0.3.0), I think the library names used here for gym and mo-gym should also be updated to gymnasium and mo-gymnasium, respectively. Otherwise, errors like ModuleNotFoundError: No module named 'gym'
will pop up. @LucasAlegre
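For anyone hitting this in the meantime, the user-side change is just to import the renamed packages (a minimal sketch, assuming mo-gymnasium >= 0.3.0):

# Old imports (now fail with ModuleNotFoundError: No module named 'gym'):
# import gym
# import mo_gym

# New package names:
import gymnasium as gym
import mo_gymnasium as mo_gym

env = mo_gym.make("minecart-v0")
obs, info = env.reset()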
Could you please provide example code for how to train a CAPQL agent? I can't find it in the README file or in the rest of the documentation. Thank you.
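In case it helps other readers, here is roughly what I pieced together from the examples of the other agents; the import path, constructor arguments and train() signature are my guesses and may not match the actual CAPQL API:

import numpy as np
import mo_gymnasium as mo_gym
from morl_baselines.multi_policy.capql.capql import CAPQL  # assumed import path

env = mo_gym.make("mo-hopper-v4")
eval_env = mo_gym.make("mo-hopper-v4")

agent = CAPQL(env, alpha=0.2, seed=0)  # alpha is SAC's entropy temperature; 0.2 is an example value
agent.train(
    total_timesteps=200_000,
    eval_env=eval_env,
    ref_point=np.array([-100.0, -100.0, -100.0]),  # example reference point for mo-hopper-v4
)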
Hi! Could you please provide the reference points for each of the environments in mo-gym, or the ones used in MORL-Baselines?
Hello, thanks so much for sharing the codes!
I am a beginner in multi-objective reinforcement learning and would like to apply CAPQL in my domain. I have three questions:
(1) Is the CAPQL code currently able to optimize three or more objectives?
(2) How can I visualize the final generated Pareto front?
(3) Is it possible to visualize a Pareto front with three objectives?
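For (3), something like this matplotlib sketch is what I have in mind (the front array is just placeholder data):

import numpy as np
import matplotlib.pyplot as plt

# Placeholder front: one row per policy, one column per objective.
front = np.array([[0.5, 1.2, 0.3], [0.8, 0.9, 0.6], [1.1, 0.4, 0.9]])

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(front[:, 0], front[:, 1], front[:, 2])
ax.set_xlabel("objective 1")
ax.set_ylabel("objective 2")
ax.set_zlabel("objective 3")
plt.show()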
Hi, I'm a new user. I would like to apply MORL to my custom problem; is the library easy to adapt to a custom environment?
Plot 2D, 3D fronts with sns or similar.
Idea: find a way to visualize the evolution of the front over the training phase in wandb (kind of like a video recorder for the PF).
Idea2: an interactive tool for executing the policy being chosen on the Pareto front (hovered with mouse pointer or something similar).
It should be very easy to add CAPQL from https://openreview.net/pdf?id=TjEzIsyEsQ6 . In a nutshell, it is an MO version of SAC, and nothing else.
This issue is here to coordinate who is running what and to provide a more or less live view of the performance runs being uploaded to openrlbenchmark.
See all runs: openrlbenchmark
Mark your name on an algo/env combination and report the runs as you complete them.
Run command with benchmark script:
python benchmark/launch_experiment.py --algo <ALGO> --env-id <ENV_ID> --num-timesteps 1000000 --gamma 0.99 --ref-point ... --auto-tag True --wandb-entity openrlbenchmark --seed <0 to 9> --init-hyperparams ... --train-hyperparams ...
For all deterministic environments, we push the learning rate to 1.0 and set the exploration rate higher, since fast exploration is all that matters in these cases. Our deterministic envs:
deep-sea-treasure-v0
deep-sea-treasure-concave-v0
four-room-v0
fruit-tree-v0
Runs (command | status):

--algo (not specified above)
  --env-id mo-lunar-lander-continuous-v2 --num-timesteps 50000 --ref-point -110 -400 -100 -100 --init-hyperparams "alpha:0.2" | 10/10
  --env-id mo-halfcheetah-v4 --num-timesteps 200000 --ref-point -100 -100 --init-hyperparams "alpha:0.2" | 10/10
  --env-id mo-hopper-2d-v4 --num-timesteps 200000 --ref-point -100 -100 --init-hyperparams "alpha:0.2" | 10/10
  --env-id mo-hopper-v4 --num-timesteps 200000 --ref-point -100 -100 -100 --init-hyperparams "alpha:0.2" | 10/10

--algo gpi_ls_continuous
  --env-id mo-lunar-lander-continuous-v2 --num-timesteps 200000 --ref-point -110 -400 -100 -100 --init-hyperparams "per:False" | 10/10
  --env-id mo-halfcheetah-v4 --num-timesteps 200000 --ref-point -100 -100 --init-hyperparams "per:False" | 10/10
  --env-id mo-hopper-2d-v4 --num-timesteps 200000 --ref-point -100 -100 --init-hyperparams "per:False" | 10/10
  --env-id mo-hopper-v4 --num-timesteps 200000 --ref-point -100 -100 -100 --init-hyperparams "per:False" | 10/10

--algo gpi_pd_continuous
  --env-id mo-lunar-lander-continuous-v2 --num-timesteps 200000 --ref-point -110 -400 -100 -100 | 10/10
  --env-id mo-halfcheetah-v4 --num-timesteps 100000 --ref-point -100 -100 | 10/10
  --env-id mo-hopper-2d-v4 --num-timesteps 100000 --ref-point -100 -100 | 10/10
  --env-id mo-hopper-v4 --num-timesteps 100000 --ref-point -100 -100 -100 | 10/10

--algo gpi_ls_discrete
  --env-id mo-mountaincar-v0 --num-timesteps 200000 --ref-point -200 -200 -200 --init-hyperparams "per:False" "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:100000" "target_net_update_freq:200" "gradient_updates:10" | 10/10
  --env-id mo-lunar-lander-v2 --num-timesteps 200000 --ref-point -101 -1001 -101 -101 --init-hyperparams "per:False" "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:100000" "target_net_update_freq:200" "gradient_updates:10" | 10/10
  --env-id minecart-v0 --num-timesteps 200000 --gamma 0.98 --ref-point -1 -1 -200 --init-hyperparams "per:False" "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:100000" "target_net_update_freq:200" "gradient_updates:10" | 10/10
  --env-id mo-highway-fast-v0 --num-timesteps 200000 --ref-point -1 -1 -40 --init-hyperparams "per:False" "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:100000" "target_net_update_freq:200" "gradient_updates:10" | 10/10
  --env-id mo-reacher-v4 --num-timesteps 200000 --ref-point -50 -50 -50 -50 --init-hyperparams "per:False" "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:100000" "target_net_update_freq:200" "gradient_updates:10" | 10/10

--algo gpi_pd_discrete
  --env-id mo-mountaincar-v0 --num-timesteps 50000 --ref-point -200 -200 -200 | 10/10
  --env-id mo-lunar-lander-v2 --num-timesteps 200000 --ref-point -101 -1001 -101 -101 --init-hyperparams "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:100000" "target_net_update_freq:200" "gradient_updates:10" | 10/10
  --env-id minecart-v0 --num-timesteps 200000 --gamma 0.98 --ref-point -1 -1 -200 --init-hyperparams "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:100000" "target_net_update_freq:200" "gradient_updates:10" | 10/10
  --env-id mo-highway-fast-v0 --num-timesteps 100000 --ref-point -1 -1 -40 | 10/10
  --env-id mo-reacher-v4 --num-timesteps 200000 --ref-point -50 -50 -50 -50 --init-hyperparams "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:100000" "target_net_update_freq:200" "gradient_updates:10" | 10/10

--algo envelope
  --env-id mo-mountaincar-v0 --num-timesteps 1000000 --ref-point -200 -200 -200 --init-hyperparams "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:500000" | 10/10
  --env-id mo-lunar-lander-v2 --num-timesteps 1000000 --ref-point -101 -1001 -101 -101 --init-hyperparams "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:500000" | 10/10
  --env-id minecart-v0 --gamma 0.98 --num-timesteps 1000000 --ref-point -1 -1 -200 --init-hyperparams "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:500000" | 10/10
  --env-id mo-highway-fast-v0 --num-timesteps 1000000 --ref-point -1 -1 -40 --init-hyperparams "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:500000" | 10/10
  --env-id mo-reacher-v4 --num-timesteps 1000000 --ref-point -50 -50 -50 -50 --init-hyperparams "initial_epsilon:1.0" "final_epsilon:0.05" "epsilon_decay_steps:500000" | 10/10

--algo pgmorl
  --env-id mo-mountaincarcontinuous-v0 --num-timesteps 3000000 --ref-point -110 -110 | 10/10
  --env-id mo-halfcheetah-v4 --num-timesteps 5000000 --ref-point -100 -100 | 10/10
  --env-id mo-hopper-2d-v4 --num-timesteps 5000000 --ref-point -100 -100 | 10/10

--algo pcn
  --env-id mo-mountaincar-v0 --init-hyperparams "scaling_factor:np.array([...])" | 0/10
  --env-id mo-lunar-lander-v2 --init-hyperparams "scaling_factor:np.array([...])" | 0/10
  --algo pcn --env-id minecart-v0 --gamma 0.98 --ref-point -1 -1 -200 --num-timesteps 10000000 --auto-tag True --wandb-entity openrlbenchmark --seed 0 --init-hyperparams "scaling_factor:np.array([1, 1, 0.1, 0.1])" --train-hyperparams "max_return:1.5" | 0/10

--algo pql
  --env-id deep-sea-treasure-v0 --num-timesteps 200000 --ref-point 0 -50 --init-hyperparams "ref_point:np.array([0, -50])" | 10/10 (deterministic env)
  --env-id deep-sea-treasure-concave-v0 --num-timesteps 200000 --ref-point 0 -50 --init-hyperparams "ref_point:np.array([0, -50])" | 10/10 (deterministic env)
  --env-id fruit-tree-v0 --num-timesteps 150000 --ref-point -1 -1 -1 -1 -1 -1 --init-hyperparams "ref_point:np.array([-1, -1, -1, -1, -1, -1])" | 10/10 (deterministic env)

--algo gpi-ls --init-hyperparams "use_gpi_policy:True"
  --env-id deep-sea-treasure-v0 --num-timesteps 400000 --gamma 0.99 --ref-point 0 -50 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" "use_gpi_policy:True" --train-hyperparams "timesteps_per_iteration:int(1e4)" | 10/10 (deterministic env)
  --env-id deep-sea-treasure-concave-v0 --num-timesteps 400000 --gamma 0.99 --ref-point 0 -50 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" "use_gpi_policy:True" --train-hyperparams "timesteps_per_iteration:int(1e4)" | 10/10 (deterministic env)
  --env-id resource-gathering-v0 --num-timesteps 1000000 --ref-point -1 -1 -2 --init-hyperparams "use_gpi_policy:True" --train-hyperparams "timesteps_per_iteration:int(1e4)" "num_eval_episodes_for_front:20" | 10/10
  --env-id fruit-tree-v0 --num-timesteps 400000 --gamma 0.99 --ref-point -1 -1 -1 -1 -1 -1 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" "use_gpi_policy:True" --train-hyperparams "timesteps_per_iteration:int(1e4)" | 10/10 (deterministic env)
  --env-id four-room-v0 --num-timesteps 400000 --ref-point -1 -1 -1 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" "use_gpi_policy:True" --train-hyperparams "timesteps_per_iteration:int(1e4)" | 10/10 (deterministic env)

--algo mpmoql
  --env-id deep-sea-treasure-v0 --num-timesteps 1000000 --gamma 0.99 --ref-point 0 -50 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" --train-hyperparams "timesteps_per_iteration:int(1e4)" | 10/10 (deterministic env)
  --env-id deep-sea-treasure-concave-v0 --num-timesteps 1000000 --gamma 0.99 --ref-point 0 -50 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" --train-hyperparams "timesteps_per_iteration:int(1e4)" | 10/10 (deterministic env)
  --env-id resource-gathering-v0 --num-timesteps 1000000 --ref-point -1 -1 -2 --train-hyperparams "timesteps_per_iteration:int(1e4)" "num_eval_episodes_for_front:20" | 10/10
  --env-id fruit-tree-v0 --num-timesteps 1000000 --gamma 0.99 --ref-point -1 -1 -1 -1 -1 -1 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" --train-hyperparams "timesteps_per_iteration:int(1e4)" | 10/10 (deterministic env)
  --env-id four-room-v0 --num-timesteps 1000000 --ref-point -1 -1 -1 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" --train-hyperparams "timesteps_per_iteration:int(1e4)" | 10/10 (deterministic env)

--algo ols --init-hyperparams "weight_selection_algo:'ols'" "epsilon_ols:0.0"
  --env-id deep-sea-treasure-v0 --num-timesteps 1000000 --gamma 0.99 --ref-point 0 -50 --init-hyperparams "learning_rate:1.0" "weight_selection_algo:'ols'" "epsilon_ols:0.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" --train-hyperparams "timesteps_per_iteration:int(1e4)" | 10/10 (deterministic env)
  --env-id deep-sea-treasure-concave-v0 --num-timesteps 1000000 --gamma 0.99 --ref-point 0 -50 --init-hyperparams "learning_rate:1.0" "weight_selection_algo:'ols'" "epsilon_ols:0.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" --train-hyperparams "timesteps_per_iteration:int(1e4)" | 10/10 (deterministic env)
  --env-id resource-gathering-v0 --num-timesteps 1000000 --ref-point -1 -1 -2 --train-hyperparams "timesteps_per_iteration:int(1e4)" "num_eval_episodes_for_front:20" | 10/10
  --env-id fruit-tree-v0 --num-timesteps 1000000 --ref-point -1 -1 -1 -1 -1 -1 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" "weight_selection_algo:'ols'" "epsilon_ols:0.0" --train-hyperparams "timesteps_per_iteration:int(1e4)" | 10/10 (deterministic env)
  --env-id four-room-v0 --num-timesteps 1000000 --ref-point -1 -1 -1 --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0" "epsilon_decay_steps:100000" "final_epsilon:0.1" "weight_selection_algo:'ols'" "epsilon_ols:0.0" --train-hyperparams "timesteps_per_iteration:int(1e4)" | 10/10 (deterministic env)

Remaining combinations (algo column not recovered):
  deep-sea-treasure-v0 | 0/10
  deep-sea-treasure-concave-v0 | 0/10
  resource-gathering-v0 |
  fruit-tree-v0 | 0/10
  four-room-v0 | 0/10
  deep-sea-treasure-concave-v0 | 0/10
  fishwood-v0 | 0/10
Hi, it seems I introduced a bug in PCN when extending the logger to also log the training parameters. I will open a pull request in a minute with a quick fix, but I wanted to have this issue as a reference.
The problem starts at the following line, which initialises a parameter annotated as np.ndarray with a float:
max_return: np.ndarray = 100.0
When we want to log this parameter, it is assumed that max_return is indeed an array and that tolist()
is available. However, since it is initialised to a float, this raises an error at the following line:
"max_return": max_return.tolist(),
I followed the instructions on this page: https://mo-gymnasium.farama.org/introduction/api/
to run the code, but I got the following error:
Envelope.act() missing 1 required positional argument: 'w'
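For reference, this is roughly how I am calling it; from the error it looks like act() also expects a preference-weight vector w, so presumably something like the last line is needed (the weights are just an example, and the import path and constructor defaults are my assumptions):

import numpy as np
import mo_gymnasium as mo_gym
from morl_baselines.multi_policy.envelope.envelope import Envelope  # assumed import path

env = mo_gym.make("minecart-v0")
agent = Envelope(env)

obs, info = env.reset()
w = np.array([0.5, 0.3, 0.2])  # example preference weights over the 3 objectives
action = agent.act(obs, w)     # passing w explicitly avoids the missing-argument error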
Thank you very much to anyone who can help!
Hello,
I received this error when evaluating an algorithm based on CAPQL on hopper. Any insights into what may have caused it?
File "/home/sahandr/workspace/morl-baselines/morl_baselines/common/evaluation.py", line 168, in log_all_multi_policy_metrics
hv = hypervolume(hv_ref_point, filtered_front)
File "/home/sahandr/workspace/morl-baselines/morl_baselines/common/performance_indicators.py", line 24, in hypervolume
return HV(ref_point=ref_point * -1)(np.array(points) * -1)
File "/home/sahandr/env/morlenv/lib/python3.9/site-packages/pymoo/core/indicator.py", line 15, in __call__
return self.do(F, *args, **kwargs)
File "/home/sahandr/env/morlenv/lib/python3.9/site-packages/pymoo/core/indicator.py", line 30, in do
return self._do(F, *args, **kwargs)
File "/home/sahandr/env/morlenv/lib/python3.9/site-packages/pymoo/indicators/hv/__init__.py", line 43, in _do
val = hv.compute(F)
File "/home/sahandr/env/morlenv/lib/python3.9/site-packages/pymoo/vendor/hv.py", line 56, in compute
if weaklyDominates(point, referencePoint):
File "/home/sahandr/env/morlenv/lib/python3.9/site-packages/pymoo/vendor/hv.py", line 47, in weaklyDominates
if point[i] > other[i]:
IndexError: index 2 is out of bounds for axis 0 with size 2
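In case it helps triage: the IndexError looks like a dimensionality mismatch between the reference point and the points on the front (e.g. a 2-dimensional ref point against the 3-objective returns of hopper). A tiny reproduction of my guess with the same helper:

import numpy as np
from morl_baselines.common.performance_indicators import hypervolume

front = [np.array([100.0, 50.0, 20.0])]  # 3 objectives, as in mo-hopper-v4
ref_point = np.array([-100.0, -100.0])   # only 2 dimensions -> raises the same IndexError

hypervolume(ref_point, front)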
Thanks in advance!
Sahand
As the title suggests, running PQL with an environment that doesn't have a 2D state/obs space will run into an error during training (pql.py):
state = int(np.ravel_multi_index(state, self.env_shape))
ravel_multi_index expects the first parameter to have the same sequence length as the second; otherwise it throws a ValueError. Since env_shape is hard-coded to be 2D,
self.env_shape = ( int(high_bound[0] - low_bound[0] + 1), int(high_bound[1] - low_bound[1] + 1), )
this assumes the state/obs is always of length 2. I assume something is off here, or that I've got something completely wrong, but I'm consistently getting errors in environments that don't have a 2D obs space.
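A sketch of the generalisation I had in mind: build env_shape from every observation dimension instead of only the first two, after which ravel_multi_index works for any state length (a standalone numpy illustration, not a tested patch of pql.py):

import numpy as np

low_bound = np.array([0, 0, 0])
high_bound = np.array([10, 10, 4])

# Use all dimensions, not just the first two, to build the discretisation shape.
env_shape = tuple(int(h - l + 1) for h, l in zip(high_bound, low_bound))  # (11, 11, 5)

state = np.array([3, 7, 2])
flat_state = int(np.ravel_multi_index(state, env_shape))
print(flat_state)  # 202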
Hi all,
I would like to say thanks for your contributions to maintaining this code library, which has helped me understand MORL.
When I run GPI-LS and GPI-PD on the Deep Sea Treasure env, I find the running time is very long for a total of 1e5 timesteps: it takes 7-8 hours on my PC with a single NVIDIA A30 GPU. Is this a common case?
Looking forward to your reply, and thanks again.
Thanks,
Tianyang
Hi,
I am a novice in the multi-objective RL realm, although I have quite a bit of experience working with single-objective RL.
I started off with single-objective/regular RL on one of my projects involving a drone performing a specific task. I was employing PPO (single-objective, from stable_baselines3) and, after a few experiments, the algorithm converged decently with the ESR (Expected Scalarized Return) approach, i.e. first scalarize the returns (weighted sum) and then take the expectation.
Then, I was curious to try out multi-objective RL as it made sense to tune one of the objectives I had in my reward function.
So, I converted my environment to a multi-objective one by simply extending the base environment and redefining the reward function. However, when training an array of agents using PGMORL, I observed that none of the agents managed to converge for either of the objectives, even after training for a really long time (1e7 timesteps). The entropy graph looked startling to me: the policy entropy keeps going up and down, whereas ideally it should be decreasing towards some lower value.
See the entropy graph below. This is just one of the individuals, though it was the same case for all the learned policies.
I figured it could have been an underfitting scenario and tried expanding the network architectures, but it did not help.
I then reverted to testing one of the examples provided for PGMORL to observe the results. I ran the halfcheetah example as follows (default parameters listed here: #43 for PGMORL halfcheetah), except that I changed the origin to [-0, -0]:
python benchmark/launch_experiment.py --algo pgmorl --env-id mo-halfcheetah-v4 --num-timesteps 5000000 --gamma 0.99 --ref-point -0 -0 --auto-tag True --seed 0 --init-hyperparams "project_name:'mo-halfcheetah'"
And the results I observed are quite similar. The entropy loss keeps fluctuating back and forth.
more results for half-cheetah
Thanks,
Arshad
I have a question concerning multi-discrete action spaces. I'm supposed to have an action that is a vector, not a scalar, but in pql.py, for example, the action is a scalar (in several functions). Multiplying the values of the vector does not seem logical; it's like working on a discrete space with more actions.
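To illustrate what I mean, a MultiDiscrete action is a vector, and keeping the current scalar-indexed code would require flattening/unflattening it explicitly (a numpy sketch of that idea, not how pql.py currently works):

import numpy as np
from gymnasium.spaces import MultiDiscrete

space = MultiDiscrete([3, 4])          # each action is a length-2 vector, e.g. [2, 1]
n_actions = int(np.prod(space.nvec))   # 12 flattened scalar actions

vector_action = np.array([2, 1])
flat = int(np.ravel_multi_index(vector_action, space.nvec))  # vector -> scalar index
back = np.unravel_index(flat, space.nvec)                    # scalar index -> vector components
print(n_actions, flat, back)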
Hello! Thank you for your codes!
As a beginner in MORL, I have two questions:
According to Xu et al. (2020), "a desired Pareto set approximation is expected to have high hypervolume metric and low sparsity metric." When I use CAPQL to generate Pareto fronts, in many cases one generated Pareto front may have the highest hypervolume while its sparsity is not the lowest. I am wondering whether there is a comprehensive indicator to evaluate the Pareto front.
I applied CAPQL to a customized environment with four objectives, where three of them are always positive and the other is in [-1, 1]. Below is a generated Pareto front. You can see some points even form a cluster, while theoretically it should look something like a quarter circle. I think that might be due to my reward setting. Are there any requirements on the range of the objectives (e.g., all should be positive) or their functional form (e.g., linear; currently three of my objectives are quadratic) for CAPQL?
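Regarding the first question, what I currently do is simply report hypervolume and sparsity side by side (a sketch, assuming the hypervolume and sparsity helpers in performance_indicators; the front below is placeholder data):

import numpy as np
from morl_baselines.common.performance_indicators import hypervolume, sparsity

front = [
    np.array([1.0, 0.2, 0.5, 0.1]),  # placeholder 4-objective returns
    np.array([0.4, 0.9, 0.3, 0.2]),
    np.array([0.2, 0.3, 1.1, 0.0]),
]
ref_point = np.array([0.0, 0.0, 0.0, -1.0])

print("hypervolume:", hypervolume(ref_point, front))
print("sparsity:", sparsity(front))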
Thanks in advance!
Hi!
Thanks for your hard work on these implementations!
I've been trying to train an agent in the mo-highway-fast-v0 environment using both Envelope and GPI-PD algorithms, but I get the same error when trying to run it:
RuntimeError: Calculated padded input size per channel: (5 x 5). Kernel size: (8 x 8). Kernel size can't be greater than actual input size
The error is coming from the QNet definition, more precisely from line 75 in morl_baselines_main/morl_baselines/common/networks.py:
self.feature_extractor = NatureCNN(self.obs_shape, features_dim=512)
When running other environments it works ok, so I was wondering if there is something wrong with my code or environment.
Thanks in advance for your help!
Hi team, I am testing highway-env with python gpi_pd_highway.py gpi-ls false 1
It encounters the error below at around 10000 steps. Do you have any idea how to get around this error?
Traceback (most recent call last):
File "gpi_pd_highway_bs.py", line 69, in <module>
fire.Fire(main)
File "/Users/zi6106738/.pyenv/versions/3.8.16/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/Users/zi6106738/.pyenv/versions/3.8.16/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/Users/zi6106738/.pyenv/versions/3.8.16/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "gpi_pd_highway_bs.py", line 58, in main
agent.train(
File "/Users/zi6106738/.pyenv/versions/3.8.16/lib/python3.8/site-packages/morl_baselines/multi_policy/gpi_pd/gpi_pd.py", line 844, in train
w = linear_support.next_weight(
File "/Users/zi6106738/.pyenv/versions/3.8.16/lib/python3.8/site-packages/morl_baselines/multi_policy/linear_support/linear_support.py", line 72, in next_weight
W_corner = self.compute_corner_weights()
File "/Users/zi6106738/.pyenv/versions/3.8.16/lib/python3.8/site-packages/morl_baselines/multi_policy/linear_support/linear_support.py", line 345, in compute_corner_weights
vertices = compute_poly_vertices(A, b)
File "/Users/zi6106738/.pyenv/versions/3.8.16/lib/python3.8/site-packages/morl_baselines/multi_policy/linear_support/linear_support.py", line 332, in compute_poly_vertices
mat = cdd.Matrix(np.hstack([b, -A]), number_type="float")
AttributeError: module 'cdd' has no attribute 'Matrix'
I manually reinstalled cdd, as shown here:
zi6106738@***** morl-baselines % pip show cdd
Name: cdd
Version: 0.1.12
Summary: improved file system navigation with cd
Home-page: https://github.com/daltonserey/cdd
Author: Dalton Serey
Author-email: [email protected]
License: GNU Affero General Public License v3
Location: /Users/zi6106738/.pyenv/versions/3.8.16/lib/python3.8/site-packages
Requires:
Required-by:
But I still get the error shown above.
Thanks
I believe there is no reason for us to use tensorboard + wandb. We could refactor the code to log everything purely using wandb.
I was studying the implementation of the EUPG algorithm and I noticed that there is a problem when calculating the log_probs in line 179:
log_probs = current_distribution.log_prob(actions)
using the fishwood environment. When updating the network, the size of actions is [200, 1] and the size of current_distribution is [200, 2], but the size of log_probs is [200, 200]. I suggest using current_distribution.log_prob(actions.squeeze()) instead of current_distribution.log_prob(actions) in the implementation, but I'm not sure if it is correct or not.
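A tiny standalone check of the shape issue with torch.distributions.Categorical, using the same sizes as reported:

import torch
from torch.distributions import Categorical

dist = Categorical(logits=torch.zeros(200, 2))   # batch of 200 distributions over 2 actions
actions = torch.zeros(200, 1, dtype=torch.long)  # actions stored with a trailing singleton dim

print(dist.log_prob(actions).shape)            # torch.Size([200, 200]) due to broadcasting
print(dist.log_prob(actions.squeeze()).shape)  # torch.Size([200]) as intended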
Implement tests which run the algorithms very quickly, just to ensure they still run after we make breaking changes.
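A rough sketch of what such a smoke test could look like (pytest-style; the import path, constructor arguments and train() signature are guesses based on the examples, not the actual test suite):

import numpy as np
import mo_gymnasium as mo_gym


def test_pcn_smoke():
    # Hypothetical smoke test: train for a handful of steps just to catch import/API breakage.
    from morl_baselines.multi_policy.pcn.pcn import PCN  # assumed import path

    env = mo_gym.make("deep-sea-treasure-v0")
    agent = PCN(env, scaling_factor=np.array([1.0, 1.0, 0.1]), seed=0)
    agent.train(
        total_timesteps=200,
        eval_env=mo_gym.make("deep-sea-treasure-v0"),
        ref_point=np.array([0.0, -50.0]),
    )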
Hi,
I would like to know if the algorithms support multi-core CPU processing or GPUs.
Currently, I am running PGMORL and I observe that the computation is not utilized to its full potential. Initially, I was on an M1 architecture but then switched to a 24-core 64-bit Linux machine, and I still observe that not all of the cores are being used; in fact, just one of the cores reaches 100% during the training phase.
Since PGMORL trains several agents simultaneously, shouldn't it be possible to train them in parallel (on different cores)?
I looked up the documentation here (PGMORL) and found the reference to the device parameter. So does that mean GPU is supported?
Also, is it possible to do multi-core training on CPUs using raylib? (I could not find any doc references to this.)
Regards,
Arshad
I tried to run gpi_pd_hopper.py in the examples folder, but got the following error:
ERROR: The function received no value for the required argument: algo
Usage: gpi_pd_minecart.py ALGO GPI_PD G
optional flags: --timesteps_per_iter | --seed
For detailed information on this command, run:
gpi_pd_minecart.py --help
An exception has occurred, use %tb to see the full traceback.
It seems that I have to pass some value for algo. Could anyone give more instructions on how to run this code? Thank you very much!
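In case it helps anyone else, the usage line suggests the script takes positional arguments (python-fire style), so presumably an invocation along these lines; the last value is a placeholder for G that I am not sure about, mirroring the gpi_pd_highway.py command mentioned elsewhere on this page:

python gpi_pd_hopper.py gpi-ls False 10  # positional args ALGO GPI_PD G; 10 is a placeholder for G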
I discovered that the Pareto front generated by the minecart environment is not optimal. I found this out by comparing the hypervolume of the Pareto front generated by GPI-LS with the one included in the environment itself. If the Pareto front generated by the environment was optimal, its hypervolume would be greater than or equal to the one resulting from GPI-LS. Instead, I observed that the hypervolume obtained by GPI-LS is greater.
Here is some code to reproduce it. I downloaded the Pareto front generated by the following run of GPI-LS: https://wandb.ai/openrlbenchmark/MORL-Baselines/runs/y6s3uaty
Note that the same gamma is used in both solution sets, so the comparison is valid.
import numpy as np
import pygmo as pg
import pandas as pd
import mo_gymnasium as mo_gym
env = mo_gym.make("minecart-v0")
pf = np.array(env.unwrapped.pareto_front(gamma=0.98, symmetric=True))
ref_point = np.array([-1, -1, -200])
hv_pf = pg.hypervolume(-pf).compute(-ref_point)
print("Minecart")
print(f"True PF HV: {hv_pf}")
print("----------")
FILE_NAME = "GPI_LS_front.csv" # Put the filename of the CSV here.
found_vecs = pd.read_csv(FILE_NAME).to_numpy()
hv_pf = pg.hypervolume(-found_vecs).compute(-ref_point)
print(f'GPI-LS PF HV: {hv_pf}')
I think it would be worthwhile to also verify whether the convex hull generated by minecart is correct, but I did not check this myself.
The current tabular version of GPI-PD does not have all the steps mentioned in the paper.
Use linear decay in PQL
Hi,
I am fairly new to multi-objective RL, but I have quite a bit of experience working with single-objective tasks.
I was trying to train the GPILSContinuousAction model on one of my quadrotor tasks, but the model doesn't seem to converge completely.
The task includes two objectives, and single-objective PPO (ESR approach) had no problems converging, while the multi-objective RL seems to have a hard time. Initially, I tried PGMORL since I was familiar with the PPO algorithm, but the performance was lacklustre, with poor convergence. With the GPILSContinuousAction model, the reward seems to stagnate at
I have also tried setting the weight support prior to training to debug the convergence, using the weights [[1.0, 0.0], [1.0, 0.25]]. With the first weight, the problem essentially becomes single-objective; however, neither of the trained agents converges!
I would really appreciate your help!
Regards,
Arshad