
bentrevett / pytorch-rl

256 stars · 6 watchers · 74 forks · 56.99 MB

Tutorials for reinforcement learning in PyTorch and Gym by implementing a few of the popular algorithms. [IN PROGRESS]

License: MIT License

Languages: Jupyter Notebook 98.84%, Python 1.16%
pytorch pytorch-tutorial pytorch-implmention pytorch-implementation reinforcement-learning reinforcement-learning-algorithms rl pytorch-tutorials pytorch-rl policy-gradient

pytorch-rl's Introduction

PyTorch Reinforcement Learning

This repo contains tutorials covering reinforcement learning using PyTorch 1.3 and Gym 0.15.4 using Python 3.7.

If you find any mistakes or disagree with any of the explanations, please do not hesitate to submit an issue. I welcome any feedback, positive or negative!

Getting Started

To install PyTorch, see installation instructions on the PyTorch website.

To install Gym, see installation instructions on the Gym GitHub repo.

Tutorials

All tutorials use Monte Carlo methods to train an agent on the CartPole-v1 environment, with the goal of reaching a total episode reward of 475 averaged over the last 25 episodes. There are also alternate versions of some algorithms to show how to use them with other environments.
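
As an illustration of that stopping criterion, a minimal sketch (the exact variable names and training loop differ per notebook; train_one_episode below is a hypothetical stand-in for each notebook's own training step):

    import numpy as np

    REWARD_THRESHOLD = 475
    N_TRIALS = 25
    MAX_EPISODES = 500

    episode_rewards = []
    for episode in range(1, MAX_EPISODES + 1):
        # train_one_episode is hypothetical; each notebook defines its own loop
        episode_rewards.append(train_one_episode())
        mean_reward = np.mean(episode_rewards[-N_TRIALS:])
        if mean_reward >= REWARD_THRESHOLD:
            print(f'Solved after {episode} episodes, mean reward {mean_reward:.1f}')
            break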

Potential algorithms covered in future tutorials: DQN, ACER, ACKTR.

pytorch-rl's People

Contributors

bentrevett

pytorch-rl's Issues

3 - Advantage Actor Critic (A2C) [CartPole].ipynb - Returns do not need to be detached

Hi Ben

Thanks for the interesting notebooks. Upon studying the "3 - Advantage Actor Critic (A2C) [CartPole].ipynb" notebook, I came to the conclusion that detaching the returns in the update_policy() function is not necessary. The returns are calculated only from the rewards, which are environment outputs and therefore not part of the computational graph, so even leaving out the .detach() call should not affect the model. Would you agree?
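
A quick way to check this claim (a minimal sketch, not code from the notebook): a returns tensor built from raw reward floats has requires_grad set to False, so calling .detach() on it has no effect on the graph.

    import torch

    # Rewards come straight from the environment as plain Python floats,
    # so the returns tensor is a leaf with no gradient history.
    rewards = [1.0, 1.0, 1.0]
    returns, R = [], 0
    for r in reversed(rewards):
        R = r + 0.99 * R
        returns.insert(0, R)

    returns = torch.tensor(returns)
    print(returns.requires_grad)  # False: detaching such a tensor changes nothing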

Taking 'done' into consideration while calculating returns

Hello, thank you for making this repo.
I think that while calculating the returns you should take done into consideration, as follows:


    def calculate_returns(self, rewards, dones, normalize=True):
        # Walk backwards through the collected trajectory, resetting the
        # running return whenever an episode boundary (done) is reached.
        returns = []
        R = 0
        for r, d in zip(reversed(rewards), reversed(dones)):
            if d:
                R = 0
            R = r + R * self.gamma
            returns.insert(0, R)

        # device is defined elsewhere in the notebook
        returns = torch.tensor(returns).to(device)

        if normalize:
            returns = (returns - returns.mean()) / returns.std()

        return returns

Also, can you please briefly describe Generalized Advantage Estimation (GAE) as used when calculating the advantages?
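
For reference, the standard GAE computation looks roughly like the sketch below (this is the textbook formulation, not necessarily the exact code in the PPO notebook; rewards, values and dones are assumed to be per-step lists, with values holding one extra bootstrap entry for the final state):

    import torch

    def calculate_gae(rewards, values, dones, gamma=0.99, lam=0.95):
        # values[t + 1] is the bootstrap value of the next state;
        # the mask zeroes it out at episode boundaries.
        advantages = []
        gae = 0.0
        for t in reversed(range(len(rewards))):
            mask = 1.0 - float(dones[t])
            delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
            gae = delta + gamma * lam * mask * gae
            advantages.insert(0, gae)
        return torch.tensor(advantages)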

Adding a sample_action method for ActorCritic

Hello! I've been learning how to code RL from your repo. I've replaced the duplicated code lines from
def train
def update_policy

with an agent method, self.sample_action(). It seems that the agent now solves the CartPole problem about 2x slower (in number of episodes), and this happens every time. I have no idea what is going on with torch and haven't found anything on the Internet.
Can you please help me?

https://github.com/lemikhovalex/pytorch-rl
5_tr - Proximal Policy Optimization (PPO) [CartPole]-Copy1.ipynb
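
For context, a sample_action method of the kind described might look roughly like this (a hypothetical sketch of the refactor, not code from the linked repo; the ActorCritic network and Categorical distribution are used as in the tutorials):

    import torch.nn.functional as F
    from torch.distributions import Categorical

    class Agent:
        def __init__(self, policy):
            self.policy = policy  # the ActorCritic network from the tutorials

        def sample_action(self, state):
            # Run the policy, sample an action from the resulting categorical
            # distribution, and return what the update step needs.
            action_pred, value_pred = self.policy(state)
            action_prob = F.softmax(action_pred, dim=-1)
            dist = Categorical(action_prob)
            action = dist.sample()
            log_prob_action = dist.log_prob(action)
            return action, log_prob_action, value_pred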

actor critic possible error

Hello, Ben!
Thank you for a great tutorial series. I have a question regarding your actor-critic notebook.
In the function update_policy:

def update_policy(returns, log_prob_actions, values, optimizer):

    # Monte Carlo returns are treated as fixed targets
    returns = returns.detach()

    # REINFORCE-style policy gradient loss, weighted by the returns
    policy_loss = - (returns * log_prob_actions).sum()

    # critic regression loss towards the same returns
    value_loss = F.smooth_l1_loss(returns, values).sum()

    optimizer.zero_grad()

    policy_loss.backward()
    value_loss.backward()

    optimizer.step()

    return policy_loss.item(), value_loss.item()

In the policy loss you calculate the usual policy gradient for the agent, and in the value loss you calculate the loss for the critic. They seem to be independent: the critic does not affect the actor at all. Shouldn't the returns for the policy loss be calculated with the values from the critic, or something like that?
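
For comparison, the variant this question is pointing at, where the critic enters the policy loss through an advantage term, might look roughly like the sketch below (a standard advantage actor-critic update, named update_policy_with_advantage here purely for illustration, not a statement about what the notebook intends):

    def update_policy_with_advantage(returns, log_prob_actions, values, optimizer):
        # Advantage = return - baseline; detach so the policy gradient
        # does not flow into the critic through the baseline term.
        advantages = (returns - values).detach()

        policy_loss = - (advantages * log_prob_actions).sum()
        value_loss = F.smooth_l1_loss(returns.detach(), values).sum()

        optimizer.zero_grad()
        (policy_loss + value_loss).backward()
        optimizer.step()

        return policy_loss.item(), value_loss.item()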
