Comments (5)

nikhilbarhate99 commented on August 25, 2024

First, this repository does NOT use Generalized Advantage Estimation. It uses a Monte Carlo estimate to calculate rewards_to_go (the rewards variable in the code), and advantages = rewards_to_go - V(s_t).
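Written out, and assuming a single finished episode of length T with discount factor \gamma (self.gamma in the code), the two quantities above would look roughly like this. The hats are notation added here for the estimates, not names from the repository:

```latex
\hat{R}_t = \sum_{k=0}^{T-1-t} \gamma^{k} \, r_{t+k}
          = r_t + \gamma \, \hat{R}_{t+1}, \qquad \hat{R}_T = 0,
\qquad
\hat{A}_t = \hat{R}_t - V(s_t)
```

The recursive form is what a backward loop over the buffer computes one step at a time.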

The only time we will get an unfinished trajectory is at the end of the buffer. So an accurate version would be:

# Monte Carlo estimate of returns

rewards = []

# Bootstrap from the critic only if the buffer ends mid-episode
if self.buffer.is_terminals[-1]:
    discounted_reward = 0
else:
    discounted_reward = self.policy_old.critic(self.buffer.states[-1]).item()

for reward, is_terminal in zip(reversed(self.buffer.rewards), reversed(self.buffer.is_terminals)):
    if is_terminal:
        # Episode boundary: nothing carries over from later steps
        discounted_reward = 0
    discounted_reward = reward + (self.gamma * discounted_reward)
    rewards.insert(0, discounted_reward)
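To try the idea outside the class, here is a minimal standalone sketch of the same loop on plain Python lists. The function name discounted_returns and the bootstrap_value parameter are made up for illustration; bootstrap_value stands in for the critic's estimate and is not part of the repository.

```python
from typing import List, Sequence


def discounted_returns(
    rewards: Sequence[float],
    is_terminals: Sequence[bool],
    gamma: float,
    bootstrap_value: float = 0.0,
) -> List[float]:
    """Rewards-to-go, reset at every terminal step.

    bootstrap_value is a hypothetical stand-in for the critic's estimate at
    the end of the buffer; it is used only if the last step is not terminal.
    """
    running = 0.0 if is_terminals[-1] else bootstrap_value
    returns = []
    # Walk the buffer backwards so each step sees the return of the next one.
    for reward, is_terminal in zip(reversed(rewards), reversed(is_terminals)):
        if is_terminal:
            running = 0.0
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Buffer that ends on a terminal step: no bootstrap is applied.
finished = discounted_returns([1.0, 1.0, 1.0], [False, False, True], gamma=0.5)
print(finished)    # [1.75, 1.5, 1.0]

# Buffer that ends mid-episode: the last step is seeded with the estimate.
unfinished = discounted_returns(
    [1.0, 1.0, 1.0], [False, False, False], gamma=0.5, bootstrap_value=4.0
)
print(unfinished)  # [2.25, 2.5, 3.0]
```

Appending and reversing once at the end avoids the repeated insert(0, ...) calls, which is only a convenience and does not change the result.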

Also, the rewards-to-go calculation introduced in issue #8 seems to be wrong. I am a little busy now and I will look into it later.

So the correct version, as used in other implementations, might be just:

# Monte Carlo estimate of returns

rewards = []

if self.buffer.is_terminals[-1]:
    discounted_reward = 0
else:
    discounted_reward = self.policy_old.critic(self.buffer.states[-1]).item()

for reward in reversed(self.buffer.rewards):
    discounted_reward = reward + (self.gamma * discounted_reward)
    rewards.insert(0, discounted_reward)


gianlucadest commented on August 25, 2024

You are completely right. If the episode didn't end, you use the critic network (or the critic head of your two-headed actor-critic network) to approximate V(final_state).


gianlucadest commented on August 25, 2024

Your adjusted implementation is fine. I use the same semantics for my A2C rollouts, where unfinished episodes are handled by calling critic(last_state). Otherwise, the code only works with finished episodes.


Boltzmachine commented on August 25, 2024

This inaccuracy (maybe I should call it a bug) troubles me a lot! Thanks @nikhilbarhate99.
Also, I guess there is another issue.

If the loop exits with t >= max_ep_len, a False is appended to buffer.is_terminals ([False, False, ..., False]). When the next episode begins, the buffer keeps appending that episode's steps without being cleared, unless update was called in between ([False, False, ..., False] (first episode) + [False, False, ..., True] (next episode)). So when the accumulated rewards are calculated, they are accumulated across the two episodes, which is not what we expect.
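The effect can be reproduced with a toy buffer. The sketch below is not the repository's code: rewards_to_go is a hypothetical helper that mirrors the backward loop, and the reward values and gamma are made up to make the leak visible.

```python
def rewards_to_go(rewards, is_terminals, gamma):
    """Backward discounted sum that resets only where is_terminals is True."""
    running = 0.0
    out = []
    for reward, done in zip(reversed(rewards), reversed(is_terminals)):
        if done:
            running = 0.0
        running = reward + gamma * running
        out.append(running)
    return out[::-1]


# Episode 1 hits the step limit after two steps, so no True is stored for it.
# Episode 2 ends normally after two steps.
rewards = [1.0, 1.0, 10.0, 10.0]
flags_as_stored = [False, False, False, True]

# Hypothetical alternative: also mark the time-limit step as a boundary.
flags_with_boundary = [False, True, False, True]

leaky = rewards_to_go(rewards, flags_as_stored, gamma=0.5)
separated = rewards_to_go(rewards, flags_with_boundary, gamma=0.5)

print(leaky)      # [5.25, 8.5, 15.0, 10.0]  episode 2 leaks into episode 1
print(separated)  # [1.5, 1.0, 15.0, 10.0]
```

Marking the time-limit step as terminal stops the leak, but it also treats the cut-off episode as if nothing followed it, so it is a blunt fix rather than an accurate one.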


11chens commented on August 25, 2024

I think we can follow the practice of the latest versions of gym and add a boolean variable truncated that indicates whether the episode was still unfinished when t = max_ep_len. Then, in the discounted_reward loop, add:

if truncated:
    discounted_reward = self.policy_old.critic(self.buffer.states[-1]).item()
elif is_terminal:
    discounted_reward = 0
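One way this could look as a standalone function is sketched below. It assumes the buffer also stores a per-step truncated flag and a per-step value estimate for the state that follows each step; the names truncateds and next_values are hypothetical and do not exist in the repository. Reading the estimate per step, rather than from states[-1], is a choice made here so that a truncation in the middle of the buffer bootstraps from its own state.

```python
def returns_with_truncation(rewards, is_terminals, truncateds, next_values, gamma):
    """Rewards-to-go with separate handling of termination and truncation.

    next_values[t] is an assumed critic estimate for the state reached after
    step t; it is read only where truncateds[t] is True.
    """
    running = 0.0
    out = []
    for reward, done, truncated, next_value in zip(
        reversed(rewards),
        reversed(is_terminals),
        reversed(truncateds),
        reversed(next_values),
    ):
        if truncated:
            running = next_value  # episode cut off: bootstrap from the critic
        elif done:
            running = 0.0         # true terminal state: nothing follows
        running = reward + gamma * running
        out.append(running)
    out.reverse()
    return out


# Episode 1 is truncated after two steps, episode 2 terminates normally.
returns = returns_with_truncation(
    rewards=[1.0, 1.0, 10.0, 10.0],
    is_terminals=[False, False, False, True],
    truncateds=[False, True, False, False],
    next_values=[0.0, 4.0, 0.0, 0.0],
    gamma=0.5,
)
print(returns)  # [2.5, 3.0, 15.0, 10.0]
```

A buffer that is flushed mid-episode could be handled the same way by setting the truncated flag on its last step.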
