ajlangley / cpo-pytorch

An implementation of Constrained Policy Optimization (Achiam 2017) in PyTorch

Python 100.00%

cpo-pytorch's Issues

Line 2 leads to imp_sampling = 1

log_action_probs = action_dists.log_prob(actions)
imp_sampling = torch.exp(log_action_probs - log_action_probs.detach())
# Change to torch.matmul
reward_loss = -torch.mean(imp_sampling * reward_advs)

Since log_action_probs - log_action_probs.detach() = 0, imp_sampling is an all-ones vector.
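For context, the value of this ratio is indeed 1 at the current parameters, but its gradient need not be zero, because detach() only blocks the gradient through the second term. The sketch below checks this numerically without PyTorch. It assumes a hypothetical one-parameter Gaussian policy with unit variance; log_prob, ratio, and all numbers are illustrative and not taken from the repository.

```python
import math

def log_prob(action, mean):
    """Log-density of a unit-variance Gaussian policy (illustrative assumption)."""
    return -0.5 * (action - mean) ** 2 - 0.5 * math.log(2.0 * math.pi)

def ratio(action, mean, old_mean):
    """exp(log pi(a; mean) - log pi(a; old_mean)); old_mean stands in for the detached term."""
    return math.exp(log_prob(action, mean) - log_prob(action, old_mean))

action, old_mean, h = 1.5, 0.2, 1e-5

# Value of the ratio when the live and detached parameters coincide.
value = ratio(action, old_mean, old_mean)

# Central finite difference with respect to the live parameter only.
grad = (ratio(action, old_mean + h, old_mean)
        - ratio(action, old_mean - h, old_mean)) / (2 * h)

# Analytic derivative of log_prob with respect to the mean for this Gaussian.
score = action - old_mean

print(value)
print(grad, score)
```

In this toy setting the ratio evaluates to 1 while its derivative matches the score function, which is the usual policy-gradient term once it is multiplied by the advantage.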

mj_loadXML error: b'Error: engine error: Could not allocate memory'

(cpo-pytorch-master) D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master>python train.py --model-name point_gather
train.py:32: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = load(open(config_filename, 'r'))[model_name]
Traceback (most recent call last):
  File "train.py", line 57, in <module>
    simulator = SinglePathSimulator(env_name, policy, n_trajectories, trajectory_len)
  File "D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master\simulators.py", line 37, in __init__
    **env_args)
  File "D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master\autoassign.py", line 52, in decorated
    return f(self, *args, **kwargs)
  File "D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master\simulators.py", line 24, in __init__
    self.env = np.asarray([make_env(env_name, **env_args) for i in range(n_trajectories)])
  File "D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master\simulators.py", line 24, in <listcomp>
    self.env = np.asarray([make_env(env_name, **env_args) for i in range(n_trajectories)])
  File "D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master\simulators.py", line 16, in make_env
    return PointGatherEnv(**env_args)
  File "D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master\envs\point_gather.py", line 13, in __init__
    sensor_span)
  File "D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master\autoassign.py", line 52, in decorated
    return f(self, *args, **kwargs)
  File "D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master\envs\gather_env.py", line 31, in __init__
    self.model = self.build_model(model_path)
  File "D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master\envs\gather_env.py", line 67, in build_model
    model = load_model_from_xml(model_xml)
  File "cymj.pyx", line 147, in mujoco_py.cymj.load_model_from_xml
Exception:
Failed to load XML from file: C:\Users\***\AppData\Local\Temp\tmpt9hpp_y5.xml. mj_loadXML error: b'Error: engine error: Could not allocate memory'

This error occurs when I try to run train.py. Can anybody help me with this?

Thanks!

Nice work

mean kl is always=0

Hi, I noticed that in your code mean_kl is always 0:
constraint_grad = flat_grad(constraint_loss, self.policy.parameters(), retain_graph=True) # (b)
mean_kl = mean_kl_first_fixed(actions_dists, actions_dists)
Fvp_fun = get_Hvp_fun(mean_kl, self.policy.parameters())

What is the meaning of the gradient of a constant?
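One way to look at this question: if mean_kl_first_fixed treats its first argument as a constant (as the name suggests; its body is not shown here), then the KL value and its gradient are both zero at the current parameters, but its second derivative is the Fisher information, which is what a Hessian-vector product would use. The sketch below checks this with finite differences on a hypothetical one-parameter Bernoulli policy; kl_first_fixed and the numbers are illustrative assumptions, not the repository's code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kl_first_fixed(fixed_logit, logit):
    """KL(p_fixed || p) between two Bernoulli distributions given by their logits."""
    p0, p = sigmoid(fixed_logit), sigmoid(logit)
    return p0 * math.log(p0 / p) + (1.0 - p0) * math.log((1.0 - p0) / (1.0 - p))

theta, h = 0.7, 1e-3

# Value at identical parameters.
value = kl_first_fixed(theta, theta)

# First and second derivatives with respect to the second (live) argument.
kl_plus = kl_first_fixed(theta, theta + h)
kl_minus = kl_first_fixed(theta, theta - h)
grad = (kl_plus - kl_minus) / (2 * h)
hess = (kl_plus - 2.0 * value + kl_minus) / h ** 2

# Fisher information of a Bernoulli distribution parameterized by its logit.
fisher = sigmoid(theta) * (1.0 - sigmoid(theta))

print(value, grad)
print(hess, fisher)
```

Under these assumptions the constant value is only the starting point: the curvature around it is nonzero and defines the trust-region metric.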

[question] How to turn my custom environment into an environment suitable for CPO?

I have a custom environment that I use to train PPO and SAC agents. Constraint violations are currently penalized through the reward function, and I would like to test how CPO performs in this setting. What steps should I follow to use your CPO implementation? For example, how can I integrate a cost function?
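One possible starting point, sketched under assumptions: CPO treats the constraint cost as a signal separate from the reward, so the penalty would be removed from the reward and reported on its own. The wrapper below does this for an environment with the older Gym-style 4-tuple step API. CostInfoWrapper, cost_fn, ToyEnv, and the "cost" key are hypothetical names; how this repository's simulator actually expects to receive costs is not shown here and would need to be checked against its code.

```python
class CostInfoWrapper:
    """Wraps a Gym-style env and reports the constraint cost separately from the reward."""

    def __init__(self, env, cost_fn):
        self.env = env
        self.cost_fn = cost_fn  # hypothetical: maps (obs, action, info) to a cost

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        info = dict(info)  # copy so the wrapped env's dict is not mutated
        info["cost"] = float(self.cost_fn(obs, action, info))
        return obs, reward, done, info


class ToyEnv:
    """Stand-in environment: the position moves by the action, reward is the step size."""

    def __init__(self):
        self.pos = 0.0

    def reset(self):
        self.pos = 0.0
        return self.pos

    def step(self, action):
        self.pos += action
        return self.pos, abs(action), False, {}


# Cost of 1 whenever the position leaves the allowed region pos <= 1.
env = CostInfoWrapper(ToyEnv(), cost_fn=lambda obs, action, info: obs > 1.0)
env.reset()
_, r1, _, info1 = env.step(0.8)
_, r2, _, info2 = env.step(0.8)
print(r1, info1["cost"])
print(r2, info2["cost"])
```

The reward stays unpenalized in this sketch; the cost is accumulated separately so that it can be compared against a constraint limit.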

Does it converge?

I ported this code to my own environment, but training did not converge.

Where can i find the AntGather env?

Hi, nice work. Some of the Python code refers to an AntGather environment, but I could not find AntGatherEnv in the envs directory. Was it left out when you uploaded the repository?

a "bug"? in the cpo method

reward_advs -= reward_advs.mean()
reward_advs /= reward_advs.std()
cost_advs -= reward_advs.mean()
cost_advs /= cost_advs.std()

I think line 3 should subtract the mean of cost_advs, not the mean of reward_advs.
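To illustrate the point: after the first two lines, reward_advs already has a mean of roughly zero, so subtracting it from cost_advs leaves the cost advantages uncentered. A minimal sketch of the normalization the reporter seems to have in mind, applied to each signal with its own statistics, is shown below. It uses plain Python lists instead of tensors, and normalize and the sample values are illustrative.

```python
import statistics

def normalize(values):
    """Center and scale a sequence using its own mean and standard deviation."""
    mean = statistics.mean(values)
    # Sample standard deviation (n - 1), the same convention as torch.Tensor.std's default.
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

reward_advs = normalize([1.0, 2.0, 4.0, 7.0])
cost_advs = normalize([0.0, 0.5, 0.5, 3.0])

print(statistics.mean(reward_advs), statistics.stdev(reward_advs))
print(statistics.mean(cost_advs), statistics.stdev(cost_advs))
```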

"from envs.ant_gather import AntGatherEnv"

There is no "ant_gather" module in the envs folder.

Some questions about the codes

Thanks for sharing. I have some questions from reading your code.

Q1. In line_search.py, line 39:
if constraints_satisfied(step_len * search_dir, step_len):

I think it should be constraints_satisfied(search_dir, step_len), because cpo.py line 190 computes test_policy = current_policy + step_len * search_dir. It looks like step_len is multiplied in twice.
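A minimal sketch of a backtracking line search in which step_len is applied exactly once, inside the constraint check, may make the concern in Q1 concrete. The function signature, the decay factor, and the toy cost are assumptions for illustration and are not the repository's actual line_search.py.

```python
def line_search(search_dir, max_step_len, constraints_satisfied,
                decay=0.5, max_iter=10):
    """Shrink step_len until the check passes; return 0.0 if no step is accepted."""
    step_len = max_step_len
    for _ in range(max_iter):
        # Pass the unscaled direction; the check applies step_len itself.
        if constraints_satisfied(search_dir, step_len):
            return step_len
        step_len *= decay
    return 0.0


current_param = 0.0  # toy one-dimensional "policy parameter"

def constraints_satisfied(search_dir, step_len):
    # The test point is formed once, mirroring current_policy + step_len * search_dir.
    test_param = current_param + step_len * search_dir
    toy_cost = test_param ** 2  # illustrative nonlinear cost
    return toy_cost <= 0.25

accepted = line_search(search_dir=1.0, max_step_len=1.0,
                       constraints_satisfied=constraints_satisfied)
print(accepted)
```

With these toy numbers the full step is rejected and the halved step is accepted.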

Q2. In cpo.py, line 205:
cost_cond = step_len * torch.matmul(constraint_grad, search_dir) <= max(-c, 0.0)

This appears to test the constraint of the linearized problem (Equation 11 in the CPO paper). But according to the last line of Algorithm 1 in the original paper, the line search should test the constraint of the non-linearized problem (Equation 10 in the CPO paper).
So the correct code may be
cost_cond = test_cost <= max(-c, 0.0)
(Also, I notice that test_cost is computed but never used in the code.)
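The difference between the two checks in Q2 can be seen on a toy example where the cost has curvature: the linearized condition accepts a step that the actual cost change rejects. Everything below (the quadratic cost, the value of c, the function names) is a hypothetical illustration and not the repository's code.

```python
def linearized_cost_change(step_len, constraint_grad, search_dir):
    """First-order estimate: step_len * <constraint_grad, search_dir>."""
    return step_len * sum(g * d for g, d in zip(constraint_grad, search_dir))

def actual_cost_change(step_len, search_dir):
    """Toy nonlinear cost change: the same linear term plus a curvature term."""
    delta = [step_len * d for d in search_dir]
    return 0.1 * delta[0] + sum(x * x for x in delta)

c = -0.2                      # hypothetical constraint slack
limit = max(-c, 0.0)
constraint_grad = [0.1, 0.0]  # gradient of the toy cost change at delta = 0
search_dir = [1.0, 0.0]
step_len = 1.0

linear_ok = linearized_cost_change(step_len, constraint_grad, search_dir) <= limit
actual_ok = actual_cost_change(step_len, search_dir) <= limit

print(linear_ok, actual_ok)
```

Here the linearized check passes while the check on the actual cost change fails, which is the situation a line search on the non-linearized constraint is meant to catch.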
