
Comments (3)

DanielTakeshi commented on August 24, 2024

One more thing: the examples script has code like this:

https://github.com/vitchyr/rlkit/blob/5565dd589c54f3ee5add28183dd28f0e9663130f/examples/ddpg.py#L22-L24

and we are using Tanh policies:

https://github.com/vitchyr/rlkit/blob/5565dd589c54f3ee5add28183dd28f0e9663130f/examples/ddpg.py#L35-L39

Just wondering: is the NormalizedBoxEnv needed in this case? Perhaps it was just added to show what we could do with it later? By default it seems we are not normalizing observations or returns, so NormalizedBoxEnv would only serve to clip each action component to [-1,1]. But the tanh will naturally force actions into that range anyway.

The only other possibility I can think of for the NormalizedBoxEnv is that the extra noise injected into the exploration policy causes some action components to exceed the [-1,1] range. But after inserting some print and assertion checks into the NormalizedBoxEnv stepping method and running python examples/ddpg.py, I found that no actions fall outside the range, so presumably the action-plus-noise for exploration is clipped somewhere before that.
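
To illustrate what I mean, here is a minimal sketch of the clipping behavior (a hypothetical wrapper I wrote for this comment, not the actual NormalizedBoxEnv code), together with the kind of assertion check I added:

import numpy as np

class ClipActionWrapper:
    """Illustrative stand-in for the action-clipping part of NormalizedBoxEnv."""

    def __init__(self, env, low=-1.0, high=1.0):
        self.env = env
        self.low, self.high = low, high

    def step(self, action):
        # A tanh policy already emits actions in [-1, 1], so this clip is a
        # no-op unless exploration noise pushes a component out of bounds.
        # This assertion never fired in my runs of examples/ddpg.py.
        assert np.all(action >= self.low - 1e-6), action
        assert np.all(action <= self.high + 1e-6), action
        return self.env.step(np.clip(action, self.low, self.high))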


vitchyr commented on August 24, 2024


ZhenhuiTang commented on August 24, 2024

Hi, I was wondering what the difference is between the exploration policy and the evaluation policy. Which one is commonly used in RL papers? I mean, is the training curve in the SAC paper based on the exploration policy, which corresponds to 'expl/Average Returns'? And why do returns from the evaluation policy tend to be better than those from the exploration policy?

I really look forward to your reply!
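
To make sure I am asking about the right thing, here is how I currently picture the distinction (a minimal sketch with a hypothetical Gaussian-noise wrapper, not rlkit's actual classes): the exploration policy adds action noise on top of the policy that evaluation runs noise-free.

import numpy as np

class GaussianNoisePolicy:
    """Hypothetical exploration wrapper: the evaluation policy plus Gaussian noise."""

    def __init__(self, policy, sigma=0.1):
        self.policy = policy
        self.sigma = sigma

    def get_action(self, obs):
        action = self.policy.get_action(obs)  # noise-free evaluation action
        noisy = action + self.sigma * np.random.randn(*action.shape)
        return np.clip(noisy, -1.0, 1.0)  # keep the noisy action in bounds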

Hi @vitchyr

Thanks for the great code base. I was recently benchmarking some results here, in search of DDPG/TD3 implementations after failing to get baselines working. I thought I'd share some results in case they would be useful to you or others.

For installation, I actually didn't entirely follow the installation instructions, but here's what I did:

  • I used a Python 3.6.7 pip virtualenv, and just manually installed the packages I saw in your installation yml file. I used torch 0.4.1 as recommended.
  • I actually used MuJoCo 2.0, so I was using the -v2 instances of the environments.
  • I used gym 0.12.5 and mujoco-py 2.0.2.2.

I took the master branch at 5565dd5 and adjusted examples/td3.py and examples/ddpg.py so that they also import other MuJoCo environments. In addition, for TD3 only, I adjusted the hyperparameters in "algorithm_kwargs" in the main method so that they matched DDPG. To be clear, DDPG uses this:

https://github.com/vitchyr/rlkit/blob/5565dd589c54f3ee5add28183dd28f0e9663130f/examples/ddpg.py#L71-L79

And TD3 uses this:

https://github.com/vitchyr/rlkit/blob/5565dd589c54f3ee5add28183dd28f0e9663130f/examples/td3.py#L104-L111

I simply modified the td3.py script so that all the hyperparameters above match DDPG; in particular, I changed the number of epochs to 1000, eval steps to 1000, min steps before training to 10k, and the batch size to 128.
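
Concretely, my adjusted TD3 settings looked roughly like this (a sketch from memory, assuming the algorithm_kwargs key names from the refactored examples; values without a comment keep the TD3 defaults at that commit):

algorithm_kwargs=dict(
    num_epochs=1000,                      # changed to match DDPG
    num_eval_steps_per_epoch=1000,        # changed to match DDPG
    num_trains_per_train_loop=1000,
    num_expl_steps_per_train_loop=1000,
    max_path_length=1000,
    min_num_steps_before_training=10000,  # changed to match DDPG
    batch_size=128,                       # changed to match DDPG
),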

If I am not mistaken, this means both the exploration and evaluation policies experience 1 million total steps over the course of training (1000 epochs × 1000 steps per epoch). Though, because evaluation by default discards incomplete trajectories, the actual number of evaluation steps reported in the logs is sometimes less than 1 million.

I ran DDPG and TD3 on six MuJoCo-v2 environments, for four random seeds each. I adjusted the code so my directory structure looks like this:

$ ls -lh data/
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-HalfCheetah-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Hopper-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-InvertedPendulum-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Reacher-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Walker2d-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Ant-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-HalfCheetah-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Hopper-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-InvertedPendulum-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Reacher-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Walker2d-v2
$ ls -lh data/rlkit-ddpg-Ant-v2/
drwxrwxr-x 2 daniel daniel 4.0K Jun 20 20:49 rlkit-ddpg-Ant-v2_2019_06_20_20_49_44_0000--s-0
drwxrwxr-x 2 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2_2019_06_20_20_53_49_0000--s-0
drwxrwxr-x 2 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2_2019_06_20_21_44_22_0000--s-0
drwxrwxr-x 2 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2_2019_06_20_21_49_37_0000--s-0
$ 

// other env results presented in a similar manner

For this I used the following plotting script, which I call like python [script].py Ant-v2 (and similarly for the other environments):

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')
import argparse
import csv
import pandas as pd
import os
import numpy as np
from os.path import join

# matplotlib
titlesize = 33
xsize = 30
ysize = 30
ticksize = 25
legendsize = 25
error_region_alpha = 0.25


def smoothed(x, w):
    """Smooth x by averaging over sliding windows of w, assuming sufficient length.
    """
    if len(x) <= w:
        return x
    smooth = []
    for i in range(1, w):
        smooth.append( np.mean(x[0:i]) )
    for i in range(w, len(x)+1):
        smooth.append( np.mean(x[i-w:i]) )
    assert len(x) == len(smooth), "lengths: {}, {}".format(len(x), len(smooth))
    return np.array(smooth)
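# Example (sanity check of the sliding window):
#   smoothed(np.array([1, 2, 3, 4, 5]), w=2)
#   -> array([1. , 1.5, 2.5, 3.5, 4.5]); output length equals input length.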


def plot(args):
    """Load the progress csv file, and plot.

    Plot:
      'exploration/Returns Mean',
      'exploration/num steps total',
      'evaluation/Returns Mean',
      'evaluation/num steps total',
    """
    nrows, ncols = 1, 2
    fig, ax = plt.subplots(nrows, ncols, squeeze=False, sharey='row',
                           figsize=(11*ncols,6*nrows))

    algorithms = sorted([x for x in os.listdir('data/') if args.env in x])
    assert len(algorithms) == 2
    colors = ['blue', 'red']

    for idx,alg in enumerate(algorithms):
        print('Currently on algorithm: ', alg)
        alg_dir = join('data', alg)
        progfiles = sorted([
                join(alg_dir, x, 'progress.csv') for x in os.listdir(alg_dir)
        ])
        expl_returns = []
        eval_returns = []
        expl_steps = []
        eval_steps = []

        for prog in progfiles:
            df = pd.read_csv(prog, delimiter = ',')

            expl_ret = df['exploration/Returns Mean'].tolist()
            expl_returns.append(expl_ret)
            eval_ret = df['evaluation/Returns Mean'].tolist()
            eval_returns.append(eval_ret)

            expl_sp = df['exploration/num steps total'].tolist()
            expl_steps.append(expl_sp)
            eval_sp = df['evaluation/num steps total'].tolist()
            eval_steps.append(eval_sp)

        expl_returns = np.array(expl_returns)
        eval_returns = np.array(eval_returns)
        xs = expl_returns.shape[1]
        expl_ret_mean = np.mean(expl_returns, axis=0)
        eval_ret_mean = np.mean(eval_returns, axis=0)
        expl_ret_std = np.std(expl_returns, axis=0)
        eval_ret_std = np.std(eval_returns, axis=0)

        w = 10
        label0 = '{} (w={}), lastavg {:.1f}'.format(
                    (alg).replace('rlkit-',''), w, np.mean(expl_ret_mean[-w:]))
        label1 = '{} (w={}), lastavg {:.1f}'.format(
                    (alg).replace('rlkit-',''), w, np.mean(eval_ret_mean[-w:]))
        ax[0,0].plot(np.arange(xs), smoothed(expl_ret_mean, w=w),
                     color=colors[idx], label=label0)
        ax[0,1].plot(np.arange(xs), smoothed(eval_ret_mean, w=w),
                     color=colors[idx], label=label1)

        # Plot mean +/- std bands; disabled since the bands are quite noisy.
        if False:
            ax[0,0].fill_between(np.arange(xs),
                                 expl_ret_mean-expl_ret_std,
                                 expl_ret_mean+expl_ret_std,
                                 alpha=0.3,
                                 facecolor=colors[idx])
            ax[0,1].fill_between(np.arange(xs),
                                 eval_ret_mean-eval_ret_std,
                                 eval_ret_mean+eval_ret_std,
                                 alpha=0.3,
                                 facecolor=colors[idx])

    for i in range(2):
        ax[0,i].tick_params(axis='x', labelsize=ticksize)
        ax[0,i].tick_params(axis='y', labelsize=ticksize)
        leg = ax[0,i].legend(loc="best", ncol=1, prop={'size':legendsize})
        for legobj in leg.legendHandles:
            legobj.set_linewidth(5.0)
    ax[0,0].set_title('{} (Exploration)'.format(args.env), fontsize=ysize)
    ax[0,1].set_title('{} (Evaluation)'.format(args.env), fontsize=ysize)

    plt.tight_layout()
    figname = 'fig-{}.png'.format(args.env)
    plt.savefig(figname)
    print("\nJust saved: {}".format(figname))


if __name__ == "__main__":
    pp = argparse.ArgumentParser()
    pp.add_argument('env', type=str)
    args = pp.parse_args()
    plot(args)

Here are the curves. Left is the exploration policy, and right is the evaluation policy.

[Figures: fig-Ant-v2, fig-HalfCheetah-v2, fig-Hopper-v2, fig-InvertedPendulum-v2, fig-Reacher-v2, fig-Walker2d-v2]

The TL;DR is that TD3 wins on four of the environments and DDPG wins on the other two. One of the environments TD3 doesn't win is InvertedPendulum, but that should be easy to get to 1000 with tuned hyperparameters. Also, to reiterate the code comments, I do not report standard deviations since they would make the plots quite hard to read.

I thought this might be useful, if you want to point people towards some baselines. (I didn't see any upon a quick glance, but maybe you have them somewhere else?) Anyway, I hope this is useful or at least remotely interesting!

