ikostrikov / pytorch-trpo
PyTorch implementation of Trust Region Policy Optimization
License: MIT License
Dear author:
I found your code very helpful. However, I have trouble reading the following code:
Line 111 in eb26e29
I wonder about the usage of the volatile flag: when do you set volatile to True versus False?
Also, the KL is always a zero tensor. Is that a problem with the torch version?
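For readers hitting the same line: in PyTorch < 0.4, wrapping inputs with volatile=True turned off gradient tracking for the whole graph, which is why it appears on rollout-time forward passes; the zero KL is a separate matter (see the detached-copy discussion further down). A minimal sketch of the old flag's modern equivalent:

import torch

net = torch.nn.Linear(4, 2)        # stand-in for the policy network
state = torch.randn(1, 4)

# Pre-0.4: Variable(state, volatile=True) marked the graph inference-only,
# so action selection during rollouts paid no autograd cost. Since 0.4:
with torch.no_grad():
    action_mean = net(state)       # rollout-time pass, no gradients tracked

action_mean = net(state)           # update-time pass (old volatile=False), gradients tracked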
In your main.py, line 147: for t in range(10000): # Don't infinite loop while learning
But in practice t ends at about 50, because the env is done within 50 steps, so range(10000) is far larger than necessary.
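For reference, a minimal sketch of the loop pattern in question, using the classic Gym API (illustrative, not the repo's exact code): the bound is only a cap, and the loop exits as soon as the environment reports done, so a 50-step episode costs 50 iterations regardless of the cap.

import gym

env = gym.make("MountainCarContinuous-v0")
state = env.reset()
for t in range(10000):                    # upper bound only
    action = env.action_space.sample()    # stand-in for the learned policy
    state, reward, done, info = env.step(action)
    if done:                              # typically reached long before 10000
        break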
Hello, I want to ask about line 67 in your trpo.py: you get two terms there, and the TRPO paper says the second term vanishes. You also add v * damping; I guess its function is to ensure positive definiteness? Could you explain it in detail? Thank you very much!
And in line 117 of your main.py, could you explain in detail why this approximates the average KL? Thank you very much!
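For reference, a sketch of the damped Fisher-vector product pattern those lines implement (illustrative names, not necessarily the repo's exact code). Averaging the pointwise KL over the sampled states is the standard Monte Carlo estimate of the expected KL under the visitation distribution, which is what makes it "the average KL":

import torch

def fisher_vector_product(kl, params, v, damping=1e-1):
    # First backward: gradient of the mean KL w.r.t. the parameters.
    grads = torch.autograd.grad(kl, params, create_graph=True)
    flat_grad_kl = torch.cat([g.view(-1) for g in grads])
    # Second backward on the scalar (grad_kl . v) yields the
    # Hessian-vector product H v without ever forming H explicitly.
    grad_v = (flat_grad_kl * v).sum()
    hvp = torch.autograd.grad(grad_v, params)
    flat_hvp = torch.cat([g.contiguous().view(-1) for g in hvp]).detach()
    # At the old parameters the gradient of the KL is zero, so the term
    # the paper says vanishes drops out and H reduces to the Fisher matrix.
    # v * damping adds damping * I to H, keeping it positive definite and
    # stabilizing conjugate gradient at the cost of slightly smaller steps.
    return flat_hvp + v * damping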
It would be nice if the agent was an object (with methods "get_action" and "remember" or similar) so that it could be more easily reused.
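A hypothetical sketch of that interface (nothing like this exists in the repo; it assumes the repo's Gaussian policy, whose forward pass returns mean, log_std, std):

import torch

class TRPOAgent:
    def __init__(self, policy_net):
        self.policy_net = policy_net
        self.memory = []

    def get_action(self, state):
        # Sample from the current policy without tracking gradients.
        state = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            mean, _, std = self.policy_net(state)
        return torch.normal(mean, std).squeeze(0).numpy()

    def remember(self, state, action, reward, done):
        # Store one transition; a real agent would hand these to the update.
        self.memory.append((state, action, reward, done))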
Currently, the target for the value function is the discounted sum of all future rewards. This gives an unbiased estimate but results in higher variance. An alternative is to use a bootstrapped estimate, i.e. something like
target[i] = rewards[i] + gamma * prev_values * masks[i]
Bootstrapping is often preferred due to its lower variance, even though it results in a biased gradient estimate.
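For concreteness, a sketch of both targets side by side, assuming the repo's convention that masks[i] is 0 at the last step of an episode (names and code here are illustrative, not the repo's):

import torch

def monte_carlo_targets(rewards, masks, gamma=0.995):
    # Current scheme: target[i] = discounted sum of all future rewards
    # in the episode (unbiased for V, but high variance).
    targets = torch.zeros_like(rewards)
    running = 0.0
    for i in reversed(range(rewards.size(0))):
        running = rewards[i] + gamma * running * masks[i]
        targets[i] = running
    return targets

def bootstrapped_targets(rewards, values, masks, gamma=0.995):
    # Proposed scheme: one-step bootstrap, target[i] = r[i] + gamma * V(s[i+1]).
    # Lower variance, but biased whenever the value estimate is wrong;
    # masks zero out the bootstrap across episode boundaries.
    next_values = torch.cat([values[1:], values.new_zeros(1)])
    return rewards + gamma * next_values * masks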
I'm wondering if you verified the performance of the current implementation on various MuJoCo environments?
Lines 108 to 119 in e200eb8
The fixed-log-prob part and the "get_loss" function compute exactly the same quantity.
The two parts are executed consecutively, so the two values ("fixed_log_prob" and "log_prob") are exactly the same.
Is there a reason you wrote the code like this?
Thanks for your great code!
I notice that in the function get_kl(), you use the policy net to generate the mean, log_std and std, then copy these three parameters and calculate the KL divergence between the original parameters and the copies, which is obviously zero all the time. Is this a bug or an intended behavior?
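Since this question recurs throughout these threads, a self-contained sketch of why a KL taken against a detached copy of the current parameters is still useful: its value (and gradient) are zero at the current parameters, but its second derivatives are the Fisher information matrix, which is exactly what the conjugate-gradient solver needs. Names below are illustrative:

import torch

# Gaussian policy parameters as leaf tensors (stand-ins for network outputs).
mean = torch.zeros(3, requires_grad=True)
log_std = torch.zeros(3, requires_grad=True)

def get_kl():
    # KL( N(mean0, std0) || N(mean, std) ) with the first argument detached:
    # numerically zero at the current parameters, but still a function of
    # mean/log_std, so its second derivatives are not zero.
    mean0, log_std0 = mean.detach(), log_std.detach()
    std, std0 = log_std.exp(), log_std0.exp()
    kl = log_std - log_std0 + (std0.pow(2) + (mean0 - mean).pow(2)) / (2.0 * std.pow(2)) - 0.5
    return kl.sum()

kl = get_kl()
print(kl)                                        # prints a zero tensor, by construction

g, = torch.autograd.grad(kl, mean, create_graph=True)
v = torch.ones(3)
hv, = torch.autograd.grad((g * v).sum(), mean)   # H v, the Fisher-vector product
print(hv)                                        # tensor([1., 1., 1.]): nonzero curvature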
Hello, I notice your code targets MuJoCo, and I wonder how to modify it to fit other environments. I have tried but failed. Thanks a lot!
ikostrikov, thanks very much. I have tried one continuous-control task, "MountainCarContinuous-v0" from classic control, and it succeeds.
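For anyone trying the same, the entry point takes the environment id as a flag (assuming the script's --env-name argument and a Gym-registered id):

python main.py --env-name "MountainCarContinuous-v0"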
I'm new to PyTorch and am having a hard time getting used to handling variables properly on CPU and GPU. Since we are calculating our own losses here, I have trouble understanding what to send to the device (GPU) and when. I would really appreciate an explanation of how to go about this. The code is quite well written and easy to understand, by the way.
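A common pattern, sketched minimally: pick a device once, move the networks and every tensor you construct yourself onto it, and only come back to the CPU for numpy or plain Python values.

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the networks once, up front (a stand-in network here).
policy_net = torch.nn.Linear(4, 2).to(device)

# Move each batch you build yourself to the same device before any
# arithmetic; losses computed from device tensors live on that device too.
states = torch.randn(32, 4).to(device)
loss = policy_net(states).pow(2).mean()
loss.backward()        # gradients stay on the device automatically

value = loss.item()    # .item() / .cpu() only when you need plain values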
There is no torch.autograd.grad in 0.1.2, the newest release version.
Thank you very much for the code you provided! I learned a lot from it. I would like to ask what the function of these lines of code is; is there a mathematical proof or something similar? They seem different from the original paper. Thanks!
neggdotstepdir = (-loss_grad * stepdir).sum(0, keepdim=True)
expected_improve = expected_improve_rate * stepfrac
ratio = actual_improve / expected_improve
if ratio.item() > accept_ratio and actual_improve.item() > 0:
flat_grad_grad_kl = torch.cat([grad.contiguous().view(-1) for grad in grads]).data
return flat_grad_grad_kl + v * damping
stepdir = conjugate_gradients(Fvp, -loss_grad, 10)
shs = 0.5 * (stepdir * Fvp(stepdir)).sum(0, keepdim=True)
lm = torch.sqrt(shs / max_kl)
fullstep = stepdir / lm[0]
According to the TRPO formula, the update step is
theta_new = theta + sqrt(2 * max_kl / (stepdir^T H stepdir)) * stepdir
so the step size should be computed directly as sqrt(2 * max_kl / (stepdir^T H stepdir)),
but your coding is different from that. Why?
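For reference, the quoted lines reduce to exactly that step size. With s = stepdir = H^{-1} g from conjugate_gradients and delta = max_kl:

shs = 0.5 * s^T H s
lm = sqrt(shs / max_kl) = sqrt(s^T H s / (2 * delta))
fullstep = s / lm = sqrt(2 * delta / (s^T H s)) * s

so fullstep is the paper's maximal step beta * s with beta = sqrt(2 * delta / (s^T H s)), the largest step along s whose quadratic KL estimate stays within delta. neggdotstepdir = g^T s = g^T H^{-1} g is the predicted rate of improvement, which the line search scales by stepfrac to judge each backtracked candidate.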
For example, imagine I have my own policy that takes in a state and outputs an action (or perhaps a distribution over actions), and a world that takes an action and returns a reward and a new state. How would I plug these into this TRPO implementation?
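One possible shape for this, as a hypothetical sketch (world.reset and world.apply are assumed names for your own world's interface): the training loop only needs reset() and step(), so anything exposing those two methods can be plugged in, and your own policy would take the place of the policy network.

import numpy as np

class MyWorldEnv:
    # Hypothetical adapter that makes a custom world look like a Gym env.
    def __init__(self, world):
        self.world = world

    def reset(self):
        # Your world's initial state, as a flat float array.
        return np.asarray(self.world.reset(), dtype=np.float64)

    def step(self, action):
        # Your world's transition: action in, (reward, next_state, done) out.
        reward, next_state, done = self.world.apply(action)
        return np.asarray(next_state, dtype=np.float64), reward, done, {}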
The get_kl function always returns 0. What does this mean? Is this a bug or intended?
Hi, thanks once again for implementing a really interesting algorithm in PyTorch 👍.
I was wondering how to modify the code to be able to use it for environments that require discrete actions (say, CartPole
as in the other PyTorch TRPO implementation, or maybe even Atari games)?
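For reference, a sketch of the two pieces that change for discrete actions: the Gaussian head becomes action logits, and the Gaussian KL becomes the categorical KL (illustrative code, assuming the same detached-copy trick as the repo's get_kl):

import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    def __init__(self, num_inputs, num_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_inputs, 64), nn.Tanh(),
                                 nn.Linear(64, num_actions))

    def forward(self, x):
        # Log-probabilities over the discrete actions.
        return torch.log_softmax(self.net(x), dim=-1)

def get_kl(log_probs):
    # KL(old || new) with the old distribution detached, mirroring the
    # Gaussian version: zero in value but with nonzero curvature for Fvp.
    log_probs0 = log_probs.detach()
    return (log_probs0.exp() * (log_probs0 - log_probs)).sum(-1, keepdim=True)

policy = DiscretePolicy(4, 2)   # e.g. CartPole: 4 observation dims, 2 actions
log_p = policy(torch.randn(5, 4))
action = torch.distributions.Categorical(logits=log_p).sample()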
Hi, I am a newcomer to DRL. When I try to read trpo_step in trpo.py, I notice that you use a line-search method instead of a trust region for the numerical optimization. I want to know why you chose that method, and does it conflict with a "trust region" policy gradient algorithm?
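For context: the theory poses a trust-region subproblem, and a backtracking line search is the standard practical way to enforce it after the quadratic approximation, by starting from the maximal step the KL constraint allows and shrinking until the surrogate actually improves. A generic sketch (not the repo's exact linesearch):

import numpy as np

def backtracking_line_search(f, x, fullstep, expected_improve_rate,
                             max_backtracks=10, accept_ratio=0.1):
    # f(x) returns the surrogate loss at parameters x (lower is better).
    fval = f(x)
    for stepfrac in 0.5 ** np.arange(max_backtracks):   # 1, 1/2, 1/4, ...
        xnew = x + stepfrac * fullstep
        actual_improve = fval - f(xnew)
        expected_improve = expected_improve_rate * stepfrac
        # Accept only if we realize a reasonable fraction of the predicted gain.
        if actual_improve > 0 and actual_improve / expected_improve > accept_ratio:
            return xnew
    return x   # every candidate step failed; keep the old parameters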
Hi. Thanks for publishing an implementation of TRPO.
I have a question about get_kl().
I thought get_kl() is supposed to calculate the KL divergence between the old policy and the new policy, but it seems to always return 0.
Also, I do not see the KL-constraining part in the parameter-updating process.
Is this code a modification of TRPO, or do I have some misunderstanding?
Thanks,