ajlangley / cpo-pytorch
An implementation of Constrained Policy Optimization (Achiam 2017) in PyTorch
Hi, nice work on this. I found some code in the Python files related to the AntGather env, but I didn't find AntGatherEnv in the envs directory.
Was it left out when you uploaded this repo?
Nice work
log_action_probs = action_dists.log_prob(actions)
imp_sampling = torch.exp(log_action_probs - log_action_probs.detach())
# Change to torch.matmul
reward_loss = -torch.mean(imp_sampling * reward_advs)
Since log_action_probs - log_action_probs.detach() = 0,
imp_sampling is an all-ones vector.
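For reference, the forward value of imp_sampling is indeed all ones, but autograd still differentiates through the non-detached term, so the gradient of the loss is not zero. A minimal sketch, assuming a unit-variance Gaussian policy whose mean is the only parameter (theta, the actions, and the advantages are made-up values, not taken from the repo):

```python
import torch

# Hypothetical Gaussian policy with a learnable mean (illustrative values only).
theta = torch.tensor([0.5, -1.0], requires_grad=True)
action_dists = torch.distributions.Normal(theta, torch.ones(2))
actions = torch.tensor([0.0, 0.0])
reward_advs = torch.tensor([2.0, -3.0])

log_action_probs = action_dists.log_prob(actions)
# Forward value is exp(0) = 1, but only the detached term is cut from the graph.
imp_sampling = torch.exp(log_action_probs - log_action_probs.detach())
reward_loss = -torch.mean(imp_sampling * reward_advs)
reward_loss.backward()

ratio_values = imp_sampling.detach()
loss_grad = theta.grad.clone()
# Analytic check: -mean(adv * d log_prob / d theta), where d log_prob / d theta = (a - theta).
expected_grad = -(reward_advs * (actions - theta.detach())) / 2
```

Here ratio_values is all ones while loss_grad matches the ordinary policy-gradient expression, which is what the surrogate loss needs at the current policy.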
There is no "ant_gather" in the envs folder.
Hi, I noticed that in your code mean_kl is always 0:
constraint_grad = flat_grad(constraint_loss, self.policy.parameters(), retain_graph=True) # (b)
mean_kl = mean_kl_first_fixed(actions_dists, actions_dists)
Fvp_fun = get_Hvp_fun(mean_kl, self.policy.parameters())
What is the meaning of the gradient of a constant?
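One way to look at this: if the first argument is detached, the KL of a distribution with itself has value 0 and gradient 0, yet it is still a function of the parameters, and its second derivative is the Fisher information used by the Hessian-vector product. The internals of mean_kl_first_fixed and get_Hvp_fun are not shown in this thread, so the sketch below uses plain torch.distributions and torch.autograd as stand-ins, with made-up parameter values:

```python
import torch
from torch.distributions import Normal, kl_divergence

theta = torch.tensor([0.5, -1.0], requires_grad=True)
new_dist = Normal(theta, torch.ones(2))
# "First fixed": the old distribution is a detached copy with identical values.
old_dist = Normal(theta.detach(), torch.ones(2))

mean_kl = kl_divergence(old_dist, new_dist).mean()

# First derivative, keeping the graph so it can be differentiated again.
(kl_grad,) = torch.autograd.grad(mean_kl, theta, create_graph=True)

# Hessian-vector product: differentiate (grad . v) with respect to theta.
v = torch.tensor([1.0, 2.0])
(hvp,) = torch.autograd.grad(torch.dot(kl_grad, v), theta)
```

In this toy case mean_kl and kl_grad are both zero, while hvp equals v / 2, so the quantity is not a constant as far as autograd is concerned.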
I have a custom environment that I use to train PPO and SAC agents. I currently penalize constraint violations through the reward function, and I would like to test how CPO performs in this setting. What steps should I follow to use your CPO implementation? For example, how can I integrate a cost function?
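A possible starting point is to stop folding the penalty into the reward and report the constraint cost as a separate signal. How this repo's simulator expects costs to be delivered is not stated in this thread, so the wrapper class, the cost_fn signature, and the "constraint_cost" info key below are all assumptions for illustration:

```python
class CostInfoWrapper:
    """Hypothetical wrapper: reports the constraint cost separately from the reward."""

    def __init__(self, env, cost_fn, cost_key="constraint_cost"):
        self.env = env
        self.cost_fn = cost_fn    # maps (obs, action, next_obs) -> float cost
        self.cost_key = cost_key  # assumed key name; adjust to what the trainer reads
        self._last_obs = None

    def reset(self):
        self._last_obs = self.env.reset()
        return self._last_obs

    def step(self, action):
        next_obs, reward, done, info = self.env.step(action)
        info = dict(info)
        info[self.cost_key] = self.cost_fn(self._last_obs, action, next_obs)
        self._last_obs = next_obs
        return next_obs, reward, done, info


class ToyEnv:
    """Stand-in 1-D environment used only to exercise the wrapper."""

    def __init__(self):
        self.x = 0.0

    def reset(self):
        self.x = 0.0
        return self.x

    def step(self, action):
        self.x += action
        return self.x, -abs(self.x - 1.0), False, {}


# Cost of 1.0 whenever the state leaves the (made-up) safe region |x| <= 1.5.
env = CostInfoWrapper(ToyEnv(), cost_fn=lambda obs, act, nxt: float(abs(nxt) > 1.5))
env.reset()
_, reward_1, _, info_1 = env.step(1.0)  # x = 1.0, inside the safe region
_, reward_2, _, info_2 = env.step(1.0)  # x = 2.0, outside the safe region
```

The reward stays unpenalized, and the per-step cost can then be accumulated into cost returns and cost advantages on the training side.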
I transplanted this code to my environment but it did not converge.
reward_advs -= reward_advs.mean()
reward_advs /= reward_advs.std()
cost_advs -= reward_advs.mean()
cost_advs /= cost_advs.std()
I guess line 3 should subtract the mean of the cost advantages, i.e. cost_advs.mean()?
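A small helper makes this kind of copy-paste slip harder, because each tensor is normalized with its own statistics. This is a sketch, not the repo's code; the function name and the eps term are additions for illustration:

```python
import torch


def normalize_advantages(advs, eps=1e-8):
    """Return a zero-mean, unit-std copy of advs using its own statistics."""
    # eps (an added safeguard, not in the original snippet) avoids division by zero.
    return (advs - advs.mean()) / (advs.std() + eps)


# Illustrative values only.
reward_advs = torch.tensor([1.0, 2.0, 3.0, 4.0])
cost_advs = torch.tensor([10.0, 20.0, 30.0, 40.0])

reward_advs = normalize_advantages(reward_advs)
cost_advs = normalize_advantages(cost_advs)
```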
(cpo-pytorch-master) D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master>python train.py --model-name point_gather
train.py:32: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = load(open(config_filename, 'r'))[model_name]
Traceback (most recent call last):
  File "train.py", line 57, in <module>
    simulator = SinglePathSimulator(env_name, policy, n_trajectories, trajectory_len)
  File "D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master\simulators.py", line 37, in __init__
    **env_args)
  File "D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master\autoassign.py", line 52, in decorated
    return f(self, *args, **kwargs)
  File "D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master\simulators.py", line 24, in __init__
    self.env = np.asarray([make_env(env_name, **env_args) for i in range(n_trajectories)])
  File "D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master\simulators.py", line 24, in <listcomp>
    self.env = np.asarray([make_env(env_name, **env_args) for i in range(n_trajectories)])
  File "D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master\simulators.py", line 16, in make_env
    return PointGatherEnv(**env_args)
  File "D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master\envs\point_gather.py", line 13, in __init__
    sensor_span)
  File "D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master\autoassign.py", line 52, in decorated
    return f(self, *args, **kwargs)
  File "D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master\envs\gather_env.py", line 31, in __init__
    self.model = self.build_model(model_path)
  File "D:\2021\ReinforcementLearning\cpo-pytorch-master\cpo-pytorch-master\envs\gather_env.py", line 67, in build_model
    model = load_model_from_xml(model_xml)
  File "cymj.pyx", line 147, in mujoco_py.cymj.load_model_from_xml
Exception:
I'm trying to run train.py and this error occurs. Can anybody help me with this?
Thanks!
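The YAMLLoadWarning at the top of the log is a separate matter from the MuJoCo exception (whose message is empty here), but it can be silenced by choosing an explicit loader. A minimal sketch, with an inline string standing in for the config file; the keys and values are made up:

```python
import yaml

# Stand-in for the contents of the config file (keys and values are invented).
config_text = """
point_gather:
  n_trajectories: 100
  trajectory_len: 15
"""
model_name = "point_gather"

# safe_load is shorthand for yaml.load(..., Loader=yaml.SafeLoader).
# With a real file: `with open(config_filename) as f: yaml.safe_load(f)[model_name]`.
config = yaml.safe_load(config_text)[model_name]
```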
Thanks for sharing. I have some questions from reading your code.
Q1: In line_search.py, line 39:
if constraints_satisfied(step_len * search_dir, step_len):
I think it should be constraints_satisfied(search_dir, step_len), because in cpo.py line 190, test_policy = current_policy + step_len * search_dir. It seems that step_len is multiplied in twice.
Q2: In cpo.py, line 205:
cost_cond = step_len * torch.matmul(constraint_grad, search_dir) <= max(-c, 0.0)
It seems that you want to test the constraint of the linearized problem (Equation 11 in the CPO paper). But according to the original paper (the last line of Algorithm 1), the non-linearized problem (Equation 10 in the CPO paper) should be tested.
So the correct code may be
cost_cond = test_cost <= max(-c, 0.0)
(Besides, I noticed that test_cost is calculated but not used in the code.)
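Both points can be illustrated with a toy backtracking line search. This is a sketch under stated assumptions rather than the repo's line_search.py: the function names, the decay factor, and the 1-D quadratic cost are invented, and only the cost condition is modelled (no KL or reward-improvement check). The callback receives the unscaled direction and applies step_len exactly once, and the two callbacks contrast the linearized check with a check on the actual cost:

```python
def backtracking_line_search(search_dir, max_step_len, constraints_satisfied,
                             line_decay=0.5, max_iters=10):
    """Return the first step_len * search_dir the callback accepts, else 0.0."""
    step_len = max_step_len
    for _ in range(max_iters):
        # Pass the unscaled direction so step_len is applied only once.
        if constraints_satisfied(search_dir, step_len):
            return step_len * search_dir
        step_len *= line_decay
    return 0.0


# Toy 1-D problem (all values illustrative): cost(theta) = theta**2, limit 0.5.
current_policy = 0.5
cost_limit = 0.5


def cost(theta):
    return theta ** 2


def actual_satisfied(search_dir, step_len):
    test_policy = current_policy + step_len * search_dir
    # Evaluate the real cost of the test policy.
    return cost(test_policy) <= cost_limit


def linearized_satisfied(search_dir, step_len):
    cost_grad = 2.0 * current_policy
    # First-order prediction of the cost after the step.
    return cost(current_policy) + step_len * cost_grad * search_dir <= cost_limit


actual_step = backtracking_line_search(1.0, 1.0, actual_satisfied)
linear_step = backtracking_line_search(1.0, 1.0, linearized_satisfied)
```

In this toy case the linearized check accepts a step of 0.25, whose actual cost (0.5625) exceeds the limit, while the check on the actual cost backs off to 0.125.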