
spinningup's Introduction

Status: Maintenance (expect bug fixes and minor updates)

Welcome to Spinning Up in Deep RL!

This is an educational resource produced by OpenAI that makes it easier to learn about deep reinforcement learning (deep RL).

For the unfamiliar: reinforcement learning (RL) is a machine learning approach for teaching agents how to solve tasks by trial and error. Deep RL refers to the combination of RL with deep learning.

This module contains a variety of helpful resources, including:

  • a short introduction to RL terminology, kinds of algorithms, and basic theory,
  • an essay about how to grow into an RL research role,
  • a curated list of important papers organized by topic,
  • a well-documented code repo of short, standalone implementations of key algorithms,
  • and a few exercises to serve as warm-ups.

Get started at spinningup.openai.com!

Citing Spinning Up

If you reference or use Spinning Up in your research, please cite:

@article{SpinningUp2018,
    author = {Achiam, Joshua},
    title = {{Spinning Up in Deep Reinforcement Learning}},
    year = {2018}
}


spinningup's Issues

No module named "tensorflow"

I followed the installation process on this page (https://spinningup.openai.com/en/latest/user/installation.html). When I run pythonw -m spinup.run plot data/installtest/installtest_s0, I get the following error message. I'm working in a conda environment.

Traceback (most recent call last):
File "/anaconda3/lib/python3.6/runpy.py", line 183, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/anaconda3/lib/python3.6/runpy.py", line 109, in _get_module_details
__import__(pkg_name)
File "/Users/tf/Documents/reinforcement_learning/spinningup/spinup/__init__.py", line 2, in <module>
from spinup.algos.ddpg.ddpg import ddpg
File "/Users/tf/Documents/reinforcement_learning/spinningup/spinup/algos/ddpg/ddpg.py", line 2, in <module>
import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'

Any idea why tensorflow is not available? I also tried a conda install and got the same error. The other testing commands seem to run OK.

The random seed doesn't work

Even if I set the same random seed, the results differ between runs; you can test this with ddpg. I think tf.set_random_seed(seed) isn't taking effect, but I don't know how to fix it.
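For anyone debugging this, here is a minimal sketch of the seeding steps TF1-era code needs; note that even with every seed set, multi-threaded ops and GPU reductions can still introduce nondeterminism:

import gym
import numpy as np
import tensorflow as tf

seed = 0
tf.set_random_seed(seed)   # graph-level seed; must be called before any ops are built
np.random.seed(seed)

env = gym.make('Pendulum-v0')
env.seed(seed)             # the environment keeps its own RNG and must be seeded separately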

Missing Infinitesimals

I noticed in many (maybe all?) places the dx of \int_x f(x) dx was missing. Is this by choice?

failed to check the installation with mujoco envs

I'm using Ubuntu 14.04 + Python 3.6.7 (Anaconda3-5.3.0-Linux-x86_64).

I have successfully installed mujoco, mujoco_py, gym and spinningup.

mujoco                    150
mujoco-py                 1.50.1.68
gym                       0.10.9 
spinup                    0.1

And the following code runs fine:

env = gym.make('Humanoid-v2') 
env.reset() 
env.render() 

But when I checked the spinningup installation by:

python -m spinup.run ppo --hid [32,32] --env Walker2d-v2 --exp_name installtest

I got the following errors:

Import error. Trying to rebuild mujoco_py.
running build_ext
building 'mujoco_py.cymj' extension
gcc -pthread -B /home/pxlong/anaconda3/envs/spinningup/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py -I/home/pxlong/.mujoco/mjpro150/include -I/home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/numpy/core/include -I/home/pxlong/anaconda3/envs/spinningup/include/python3.6m -c /home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/cymj.c -o /home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/generated/_pyxbld_1.50.1.68_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/cymj.o -fopenmp -w
gcc -pthread -B /home/pxlong/anaconda3/envs/spinningup/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py -I/home/pxlong/.mujoco/mjpro150/include -I/home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/numpy/core/include -I/home/pxlong/anaconda3/envs/spinningup/include/python3.6m -c /home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/gl/osmesashim.c -o /home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/generated/_pyxbld_1.50.1.68_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/gl/osmesashim.o -fopenmp -w
gcc -pthread -shared -B /home/pxlong/anaconda3/envs/spinningup/compiler_compat -L/home/pxlong/anaconda3/envs/spinningup/lib -Wl,-rpath=/home/pxlong/anaconda3/envs/spinningup/lib -Wl,--no-as-needed -Wl,--sysroot=/ /home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/generated/_pyxbld_1.50.1.68_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/cymj.o /home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/generated/_pyxbld_1.50.1.68_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/gl/osmesashim.o -L/home/pxlong/.mujoco/mjpro150/bin -Wl,-R/home/pxlong/.mujoco/mjpro150/bin -lmujoco150 -lglewosmesa -lOSMesa -lGL -o /home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/generated/_pyxbld_1.50.1.68_36_linuxcpuextensionbuilder/lib.linux-x86_64-3.6/mujoco_py/cymj.cpython-36m-x86_64-linux-gnu.so -fopenmp
Traceback (most recent call last):
File "/home/pxlong/Dropbox/git/gym/gym/envs/mujoco/mujoco_env.py", line 11, in
import mujoco_py
File "/home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/init.py", line 3, in
from mujoco_py.builder import cymj, ignore_mujoco_warnings, functions, MujocoException
File "/home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/builder.py", line 503, in
cymj = load_cython_ext(mjpro_path)
File "/home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/builder.py", line 106, in load_cython_ext
mod = load_dynamic_ext('cymj', cext_so_path)
File "/home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/builder.py", line 124, in load_dynamic_ext
return loader.load_module()
ImportError: dlopen: cannot load any more object with static TLS

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/pxlong/Dropbox/git/spinningup/spinup/utils/run_entrypoint.py", line 11, in
thunk()
File "/home/pxlong/Dropbox/git/spinningup/spinup/utils/run_utils.py", line 162, in thunk_plus
thunk(**kwargs)
File "/home/pxlong/Dropbox/git/spinningup/spinup/algos/ppo/ppo.py", line 175, in ppo
env = env_fn()
File "/home/pxlong/Dropbox/git/spinningup/spinup/utils/run_utils.py", line 155, in
kwargs['env_fn'] = lambda : gym.make(env_name)
File "/home/pxlong/Dropbox/git/gym/gym/envs/registration.py", line 167, in make
return registry.make(id)
File "/home/pxlong/Dropbox/git/gym/gym/envs/registration.py", line 119, in make
env = spec.make()
File "/home/pxlong/Dropbox/git/gym/gym/envs/registration.py", line 85, in make
cls = load(self._entry_point)
File "/home/pxlong/Dropbox/git/gym/gym/envs/registration.py", line 14, in load
result = entry_point.load(False)
File "/home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/pkg_resources/init.py", line 2343, in load
return self.resolve()
File "/home/pxlong/anaconda3/envs/spinningup/lib/python3.6/site-packages/pkg_resources/init.py", line 2349, in resolve
module = import(self.module_name, fromlist=['name'], level=0)
File "/home/pxlong/Dropbox/git/gym/gym/envs/mujoco/init.py", line 1, in
from gym.envs.mujoco.mujoco_env import MujocoEnv
File "/home/pxlong/Dropbox/git/gym/gym/envs/mujoco/mujoco_env.py", line 13, in
raise error.DependencyNotInstalled("{}. (HINT: you need to install mujoco_py, and also perform the setup instructions here: https://github.com/openai/mujoco-py/.)".format(e))
gym.error.DependencyNotInstalled: dlopen: cannot load any more object with static TLS. (HINT: you need to install mujoco_py, and also perform the setup instructions here: https://github.com/openai/mujoco-py/.)

================================================================================

There appears to have been an error in your experiment.
Check the traceback above to see what actually went wrong. The
traceback below, included for completeness (but probably not useful
for diagnosing the error), shows the stack leading up to the
experiment launch.
================================================================================

Traceback (most recent call last):
File "/home/pxlong/anaconda3/envs/spinningup/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/pxlong/anaconda3/envs/spinningup/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/pxlong/Dropbox/git/spinningup/spinup/run.py", line 230, in
parse_and_execute_grid_search(cmd, args)
File "/home/pxlong/Dropbox/git/spinningup/spinup/run.py", line 162, in parse_and_execute_grid_search
eg.run(algo, **run_kwargs)
File "/home/pxlong/Dropbox/git/spinningup/spinup/utils/run_utils.py", line 546, in run
data_dir=data_dir, datestamp=datestamp, **var)
File "/home/pxlong/Dropbox/git/spinningup/spinup/utils/run_utils.py", line 171, in call_experiment
subprocess.check_call(cmd, env=os.environ)
File "/home/pxlong/anaconda3/envs/spinningup/lib/python3.6/subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['python', '/home/pxlong/Dropbox/git/spinningup/spinup/utils/run_entrypoint.py', 'eJyNUs9P1EAUnm6XpYCgEaORE4mX5eCWkHgDLmu8NOGAJp7MpNuZtmOnM7Uzg2BCYoLAkky88PTiP+vrLgLe6KTT977X7/3+3v81Dcns8S8yqR1rRFZJProng1+juZCS5k5lVmgF1zAsiX9M67Ti1FR8bkIY0VU6cUJaoag9aThiPhprxj90yjkcwjAhsxMm6+N3LJj2qkG7zXqMSHKFGgtekitySS6DvMdC1v8RIbawSzrLHrHBNDwLAjLt54j8RImR9wSGBz7i6oiqtOaQkHLpX5AeBlmeBp/JlJwF+OPBOfh+lzWmsuX34lLXPG6OpVZF/LbVzUQfx4WwsWmEUkIVrpmJ+HFWSBO3TtGZNGpOsLBdmdYTlu5D8mdMoFzxYXFSwwVsWSzUv8lSbBo/bngraq7saFdqhMz+yJZOVbSRztxht778oCslV2ChXPdR3Qia67bq3JZP/PIdF5Lf4zXSi4LuPA0GYRQiufqatoUBv6hcTbPGgV+YUeASEyzXL6DLLQlPUT+EUxj6xULqCaaASrnhn8/rHc2r7IJbrdFYblyD8RHjeeqkNXDg+0xkFkl+UGvmujX5n3vbK9wf3CbjWk6PUum4gU8wxMjYr5WPqax4u8NeH+2AX0ozepM+un1UCsa4okZ8m1OSzWSTG983nDOcsl+Vuih4e0cZ+mXtbOMsZaIFP37IeFlq01goY3FWlht7X6ZmG4eMA5wvll+5ZwPnksA/u6k4lYXGnWh094IPu/sauI++uFTOya8esA3g7OHoL/bmT1k=']' returned non-zero exit status 1.

How to install without Mujoco

I cannot afford a MuJoCo license and cannot seem to install without it. Is it possible to install without the MuJoCo dependencies?

Installation issue

I get a weird error when I test my installation using

python -m spinup.run ppo --hid [32,32] --env Walker2d-v2 --exp_name installtest

Error :

================================================================================
ExperimentGrid [installtest] runs over parameters:

env_name [env]

Walker2d-v2

ac_kwargs:hidden_sizes [ac-hid]

[32, 32]

Variants, counting seeds: 1
Variants, not counting seeds: 1

================================================================================

Preparing to run the following experiments...

installtest

================================================================================

Launch delayed to give you a few seconds to review your experiments.

To customize or disable this behavior, change WAIT_BEFORE_LAUNCH in
spinup/user_config.py.

================================================================================
Running experiment:

installtest

with kwargs:

{
"ac_kwargs": {
"hidden_sizes": [
32,
32
]
},
"env_name": "Walker2d-v2",
"seed": 0
}

Traceback (most recent call last):
File "/Users/haresh/Documents/spinningup/spinup/utils/run_entrypoint.py", line 10, in
thunk = pickle.loads(zlib.decompress(base64.b64decode(args.encoded_thunk)))
File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1388, in loads
return Unpickler(file).load()
File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 864, in load
dispatchkey
File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 892, in load_proto
raise ValueError, "unsupported pickle protocol: %d" % proto
ValueError: unsupported pickle protocol: 4

================================================================================

There appears to have been an error in your experiment.

Check the traceback above to see what actually went wrong. The
traceback below, included for completeness (but probably not useful
for diagnosing the error), shows the stack leading up to the
experiment launch.

================================================================================

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/haresh/Documents/spinningup/spinup/run.py", line 230, in
parse_and_execute_grid_search(cmd, args)
File "/Users/haresh/Documents/spinningup/spinup/run.py", line 162, in parse_and_execute_grid_search
eg.run(algo, **run_kwargs)
File "/Users/haresh/Documents/spinningup/spinup/utils/run_utils.py", line 546, in run
data_dir=data_dir, datestamp=datestamp, **var)
File "/Users/haresh/Documents/spinningup/spinup/utils/run_utils.py", line 171, in call_experiment
subprocess.check_call(cmd, env=os.environ)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['python', '/Users/haresh/Documents/spinningup/spinup/utils/run_entrypoint.py', 'eJyNUs9P1EAUnm6XpYCgEaORE4mX5eAuIfG2cnCNlyYcUOPJTIbOdDt2OlM7MygmJCYILMnEC08v/rO+7iLgjU46fe97/d7v791fZzGZPeFJpozntcxKJQa3ZAhrNJdK0dzrzEmj4RL6BQn3acVKQW0p5iaEEV2lB14qJzV1R7VALCRjw8W7VjmFfeinZHbidH38hkfTTtlrtnmHE0UuUOPRU3JBzsl5lHd4zLs/EsQWRqS1vCQumsYnUUSm3RyRnyhx8pZAfy8kQh9SzSoBKSmW/gXpYJDlafSJTMlJhD/unULotlljKlthNHxvRWOHBWuELYavTeYroZ0d2lpqLfXE1zMRP95JZYeN13QmDeojrGukWHXA2S6kf8YEipUQT44qOIMth3WGFxnDnomvtWhk63YwUgYhuztwhdclrZW3N9i1r9BrK8k1OCjWQ1LVkuamKVu3xYOwfMOF9Pd4jXSSqD0Po16cxEguv7BmYiEsal/RrPYQFmYUOMcEi/UzaHNL42PU9+EY+mFxoswBpoBKsREez+sdzKtsgztj0FhsXIINCRc588pZ2AtdLjOHpNCrDPftlvzPve4Vrg8uk/WNoIdMeWHhI/QxMvZr5QNTpWh2+PPDHQhLLKNX6aPbe4XkXGhq5bc5Jd1MN4UNXSsExyGHVWUmE9HcUPph2XhXe0e5bCC8usN0OXNsKLV1OConrLstU7uNM8b5zdcqrNyygfdpFB5dFczUxOBK1KZ9IcTtfQkiJJ89U3PyszssA3i3P/gLgUhOrw==']' returned non-zero exit status 1.

vpg: possible redundant code

In vpg.py, the function mlp_actor_critic

receives observations and actions

x, a

and returns 4 items:

pi, logp, logp_pi, v
pi: the action chosen by the policy
logp_pi: the log probability of pi
logp: the log probability of the input action a
v: the value of x according to the value network.

When running a trajectory, vpg evaluates pi and logp_pi from the computation graph and saves the returned action and its log probability. pi is the action we take in the environment; logp_pi is its log probability.

When updating the policy, vpg evaluates the following item:

pi_loss = -tf.reduce_mean(logp * adv_ph)

That is to say, vpg feeds the stored action back into the computation graph to get its log probability. But since the policy is only updated once, that log probability is the same as when we ran the trajectory, so we could just use our saved log probability.

Can anyone clarify this if I am wrong?

Use matplotlib's non-interactive mode as default

The python -m spinup.run plot command exits for me without showing plots. This is because matplotlib's interactive mode is turned on by default (from my matplotlibrc). I think the plot command assumes interactive mode is turned off, so that plt.show is blocking. The fix is to add plt.ioff() to the plot.py script before plotting anything.
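A minimal sketch of the suggested fix, assuming plot.py uses pyplot directly:

import matplotlib.pyplot as plt

plt.ioff()    # force non-interactive mode so plt.show() blocks
# ... build the plots exactly as plot.py already does ...
plt.show()    # now blocks until the window is closed, even with an interactive matplotlibrc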

Roboschool support

Hi!
Any chance Spinning Up is going to support Roboschool as an alternative to MuJoCo?

Issue with the order of update operations in SAC?

I think there is an issue with the order of update operations in SAC.

In sac.py L191

with tf.control_dependencies([train_pi_op]):
        train_value_op = value_optimizer.minimize(value_loss, var_list=value_params)

We update the policy parameters before we update the value parameters.
In the paper, https://spinningup.openai.com/en/latest/_images/math/7ff0536d2e479ae833174c13ccff07b043d7cb55.svg,
it is the opposite order.

If I understand the paper's update scheme (an application of Gibbs sampling), it intends to compute Q using the old P weights, then compute the update of P using the updated Q weights.

Mathematically, you can probably do it in the opposite order, but you must use an old weight and then an updated weight.
This means that if you update first P and then Q, the target computations must use the updated weights for P when updating Q (invoke the actor_critic function and compute the targets inside the tf.control_dependencies block, specifically the term logp_pi inside yv).

Control dependencies do not recompute dependent operations, as the following snippet demonstrates.

import tensorflow as tf

x = tf.Variable(0.0)

targetx = 3 * x
updatex = tf.assign(x, x + 1)

# Assigning res = targetx creates no new op inside the block, so the control
# dependency never attaches and targetx keeps its pre-update value.
with tf.control_dependencies([updatex]):
    res = targetx

sess = tf.Session()
sess.run(tf.global_variables_initializer())
print(sess.run(res))  # prints 0.0: targetx is not recomputed

DDPG training error

When attempting to train a DDPG agent in a Pendulum-v0 environment, I ran into this error:

Traceback (most recent call last):
File "/path/to/spinningup/spinup/utils/run_entrypoint.py", line 11, in <module>
thunk()
File "/path/to/spinningup/spinup/utils/run_utils.py", line 162, in thunk_plus
thunk(**kwargs)
File "/path/to/spinningup/spinup/algos/ddpg/ddpg.py", line 224, in ddpg
replay_buffer.store(o, a, r, o2, d)
File "/path/to/spinningup/spinup/algos/ddpg/ddpg.py", line 25, in store
self.obs2_buf[self.ptr] = next_obs
ValueError: could not broadcast input array from shape (3,1) into shape (3)

The command I used was

python -m spinup.run ddpg --env Pendulum-v0 --exp_name ddpgtest

Python version is Python 3.6.7.

The installation link for Mujoco points to a wrong location

The text reads as follows:

First, go to the mujoco-py github page. Follow the installation instructions in the README, 
which describe how to install the MuJoCo physics engine and the mujoco-py 
package (which allows the use of MuJoCo from Python).

However, clicking the link brings you here: https://github.com/openai/mujoco-py/tree/master/mujoco_py

whereas I believe it should bring you here: https://github.com/openai/mujoco-py
(where the README is)

Plotting error after first installation

As a heads-up, some users might run into the following error when attempting to plot the performance of their agent while checking their installation for the first time:

ImportError: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends. If you are using (Ana)Conda please install python.app and replace the use of 'python' with 'pythonw'. See 'Working with Matplotlib on OSX' in the Matplotlib FAQ for more information.

To overcome this, they can follow the steps outlined here.

I have no clue whether this only happens in a virtual environment (I was personally using a conda environment) and didn't know where else to put this. Hope it helps!

Episode Length appears capped at 1000

I'm having trouble setting the episode length above 1000. I am running the command
python -m spinup.run ppo --hid "[8,8]" --env HalfCheetah-v2 --exp_name train_doggo --gamma 0.999 --max_ep_len 2000 --steps_per_epoch 4000
to use PPO on the HalfCheetah environment with max_ep_len set to 2000. However, when I run the algorithm, the episode length is always 1000. This is an example output:

---------------------------------------
|             Epoch |              48 |
|      AverageEpRet |        7.02e+03 |
|          StdEpRet |        1.76e+04 |
|          MaxEpRet |         3.1e+04 |
|          MinEpRet |       -1.74e+04 |
|             EpLen |           1e+03 |
|      AverageVVals |            19.1 |
|          StdVVals |            31.2 |
|          MaxVVals |              40 |
|          MinVVals |           -33.1 |
| TotalEnvInteracts |        1.96e+05 |
|            LossPi |        2.81e-08 |
|             LossV |        1.09e+08 |
|       DeltaLossPi |         -0.0105 |
|        DeltaLossV |       -1.07e+04 |
|           Entropy |            3.52 |
|                KL |          0.0121 |
|          ClipFrac |           0.103 |
|          StopIter |              79 |
|              Time |             127 |
---------------------------------------

If I set max_ep_len to less than 1000, say 500, the EpLen does in fact match max_ep_len. Additionally, the HalfCheetah environment always returns done=False, and I am not getting the "Trajectory cut off by epoch" warning message.

Any help would be appreciated in solving this issue!
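One likely culprit (an assumption, not confirmed in this thread): Gym registers HalfCheetah-v2 with a TimeLimit wrapper whose max_episode_steps is 1000, which caps episodes regardless of --max_ep_len. A sketch of a workaround via re-registration (HalfCheetahLong-v2 is a made-up id):

import gym
from gym.envs.registration import register

# HalfCheetah-v2 ships with max_episode_steps=1000; register a variant with a
# longer limit so the TimeLimit wrapper stops cutting episodes at 1000 steps.
register(
    id='HalfCheetahLong-v2',
    entry_point='gym.envs.mujoco:HalfCheetahEnv',
    max_episode_steps=2000,
)

env = gym.make('HalfCheetahLong-v2')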

Installation issue with Python 3.7

There is an issue installing tensorflow with python 3.7 as discussed here

I installed Anaconda with Python 3.7 (macOS) and it gave me this error:

Could not find a version that satisfies the requirement tensorflow>=1.8.0 (from spinup==0.1.1) (from versions: )
No matching distribution found for tensorflow>=1.8.0 (from spinup==0.1.1)

The way I resolved it was by changing the version of Python in Anaconda to 3.6 using this command:
conda install python=3.6

I understand there is a solution to this issue here, but since Spinning Up is focused on helping beginners get started, I suppose the README should advise users to install Anaconda with Python 3.6 until the bug is fixed for Python 3.7.

Cannot run the first example on Mac OS.

I am installing Spinning Up on Mac OS. I followed this link.
After that, to check that you've successfully installed Spinning Up, I tried running PPO in the LunarLander-v2 environment with
python -m spinup.run ppo --hid "[32,32]" --env LunarLander-v2 --exp_name installtest --gamma 0.999
Then an error pops up
(screenshots of the error not included)

I have no idea about this error. Can anyone help?

Using tf.keras instead of tf.layers breaks exercise 1.2

Starting from the correct solution for exercise 1.2, changing mlp from

    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(x, units=h, activation=activation)
    return tf.layers.dense(x, units=hidden_sizes[-1], activation=output_activation)

to

    for h in hidden_sizes[:-1]:
        x = tf.keras.layers.Dense(units=h, activation=activation)(x)
    return tf.keras.layers.Dense(units=hidden_sizes[-1], activation=output_activation)(x)

resulted in the solution being recognized as incorrect consistently over 10 runs. It was correct 10/10 runs before the change.

Controlling initialization with kernel_initializer=tf.glorot_uniform_initializer(), bias_initializer=tf.zeros_initializer() in either of the two variants, for all layers, does not influence the result.

Comparing tf.keras.layers.Dense and tf.layers.dense in an isolated test resulted in the same empirical initialization statistics and, when starting from the same initialization, the same optimization behaviour.

Comparing the code for tf.keras.layers.Dense and tf.layers.dense, the only difference is that tf.layers.dense returns a Dense object that inherits not only from tf.keras.layers.Dense but also from the legacy base.Layer, which is therefore very likely related to the cause.

Increasing episodes from 20 to 25 seems to solve the issue, suggesting that for some reason tf.keras results in slightly slower learning. Since the Keras API will become standard in TensorFlow 2.0, I am keen to understand why this is happening.

16 undefined names

flake8 testing of https://github.com/openai/spinningup on Python 3.7.1

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./spinup/exercises/problem_set_1/exercise1_3.py:242:41: F821 undefined name 'pi_loss'
    train_pi_op = pi_optimizer.minimize(pi_loss, var_list=get_vars('main/pi'))
                                        ^
./spinup/exercises/problem_set_1/exercise1_3.py:243:39: F821 undefined name 'q_loss'
    train_q_op = q_optimizer.minimize(q_loss, var_list=get_vars('main/q'))
                                      ^
./spinup/exercises/problem_set_1/exercise1_3.py:258:79: F821 undefined name 'pi'
    logger.setup_tf_saver(sess, inputs={'x': x_ph, 'a': a_ph}, outputs={'pi': pi, 'q1': q1, 'q2': q2})
                                                                              ^
./spinup/exercises/problem_set_1/exercise1_3.py:258:89: F821 undefined name 'q1'
    logger.setup_tf_saver(sess, inputs={'x': x_ph, 'a': a_ph}, outputs={'pi': pi, 'q1': q1, 'q2': q2})
                                                                                        ^
./spinup/exercises/problem_set_1/exercise1_3.py:258:99: F821 undefined name 'q2'
    logger.setup_tf_saver(sess, inputs={'x': x_ph, 'a': a_ph}, outputs={'pi': pi, 'q1': q1, 'q2': q2})
                                                                                                  ^
./spinup/exercises/problem_set_1/exercise1_3.py:261:22: F821 undefined name 'pi'
        a = sess.run(pi, feed_dict={x_ph: o.reshape(1,-1)})
                     ^
./spinup/exercises/problem_set_1/exercise1_3.py:323:31: F821 undefined name 'q_loss'
                q_step_ops = [q_loss, q1, q2, train_q_op]
                              ^
./spinup/exercises/problem_set_1/exercise1_3.py:323:39: F821 undefined name 'q1'
                q_step_ops = [q_loss, q1, q2, train_q_op]
                                      ^
./spinup/exercises/problem_set_1/exercise1_3.py:323:43: F821 undefined name 'q2'
                q_step_ops = [q_loss, q1, q2, train_q_op]
                                          ^
./spinup/exercises/problem_set_1/exercise1_3.py:329:38: F821 undefined name 'pi_loss'
                    outs = sess.run([pi_loss, train_pi_op, target_update], feed_dict)
                                     ^
./spinup/exercises/problem_set_1/exercise1_2.py:84:47: F821 undefined name 'mu'
    logp = exercise1_1.gaussian_likelihood(a, mu, log_std)
                                              ^
./spinup/exercises/problem_set_1/exercise1_2.py:84:51: F821 undefined name 'log_std'
    logp = exercise1_1.gaussian_likelihood(a, mu, log_std)
                                                  ^
./spinup/exercises/problem_set_1/exercise1_2.py:85:47: F821 undefined name 'pi'
    logp_pi = exercise1_1.gaussian_likelihood(pi, mu, log_std)
                                              ^
./spinup/exercises/problem_set_1/exercise1_2.py:85:51: F821 undefined name 'mu'
    logp_pi = exercise1_1.gaussian_likelihood(pi, mu, log_std)
                                                  ^
./spinup/exercises/problem_set_1/exercise1_2.py:85:55: F821 undefined name 'log_std'
    logp_pi = exercise1_1.gaussian_likelihood(pi, mu, log_std)
                                                      ^
./spinup/exercises/problem_set_1/exercise1_2.py:86:12: F821 undefined name 'pi'
    return pi, logp, logp_pi
           ^
16    F821 undefined name 'mu'
16

Test script failed in FetchPush-v1 and HandManipulatePen-v0

python -m spinup.run ppo --hid [32,32] --env FetchPush-v1 --exp_name installtestFetchPushv0
python -m spinup.run ppo --hid [32,32] --env HandManipulatePen-v0 --exp_name installtestFetchPushv0

Traceback (most recent call last):
File "/Users/mjphieli/Documents/dataScience/opait/spinningup/spinup/utils/run_entrypoint.py", line 11, in <module>
thunk()
File "/Users/mjphieli/Documents/dataScience/opait/spinningup/spinup/utils/run_utils.py", line 162, in thunk_plus
thunk(**kwargs)
File "/Users/mjphieli/Documents/dataScience/opait/spinningup/spinup/algos/ppo/ppo.py", line 183, in ppo
x_ph, a_ph = core.placeholders_from_spaces(env.observation_space, env.action_space)
File "/Users/mjphieli/Documents/dataScience/opait/spinningup/spinup/algos/ppo/core.py", line 27, in placeholders_from_spaces
return [placeholder_from_space(space) for space in args]
File "/Users/mjphieli/Documents/dataScience/opait/spinningup/spinup/algos/ppo/core.py", line 27, in <listcomp>
return [placeholder_from_space(space) for space in args]
File "/Users/mjphieli/Documents/dataScience/opait/spinningup/spinup/algos/ppo/core.py", line 24, in placeholder_from_space
raise NotImplementedError
NotImplementedError

Problem passing list of integers as command line argument

For some weird reason, when I spin up the Walker environment passing in the sizes of the hidden layers as a list of integers, as shown in the installation documentation:

python -m spinup.run ppo --hid [32,32] --env Walker2d-v2 --exp_name installtest

I get the following error:

zsh: no matches found: [32,32]

If I remove the square brackets it works. Can anyone demystify this for me please? Thank you!
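(For anyone else hitting this: zsh expands square brackets as glob patterns, so the argument needs to be quoted, as some of the repo's other examples do:)

python -m spinup.run ppo --hid "[32,32]" --env Walker2d-v2 --exp_name installtest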

Reason for training value and action heads separately?

More a question of why things are done a certain way than an issue; forgive me if this is the wrong forum.

for i in range(train_pi_iters):

We see that first the policy is updated train_pi_iters times (or until the max KL is reached), and then the value head is updated train_v_iters times. Why are these done separately and not jointly? Especially if parameters are shared between the policy and value networks, training only one objective at a time for a significant number of steps would surely cause the other objective to worsen. Why not train jointly (until the max KL is reached, and then just train the value head for the remaining steps)?

Simple policy gradient: mean over episodes?

In the simple policy gradient implementation here, all of the observations, actions, and rewards for one "epoch" (potentially consisting of multiple episodes) are gathered into the same lists. These are then all passed into the MLP that parametrizes the policy. Then we have this line

loss = -tf.reduce_mean(weights_ph * log_probs)

where log_probs is just a 1D vector of the log probabilities of all actions in this epoch (regardless of which episode they were from). But on the introduction page, we have this equation:

∇_θ J(π_θ) ≈ (1/|D|) Σ_{τ∈D} Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) R(τ)

It seems like the implementation takes the mean over the number of elements in both sums (the total number of actions) instead of just over the number of elements in the first sum (the number of episodes). Am I misunderstanding this, or is this an error? Is it implemented like this for simplicity, since scaling the loss by a constant factor only scales the gradient without changing its direction? It's also clearly nicer to store all observations etc. in single vectors, because different episodes will have different numbers of steps, so we can't have, e.g., a matrix of per-episode action vectors without padding. But we could pass the number of episodes directly to the model, and even with everything in single vectors we could still get the result shown in the equation above by summing across the vector and dividing by the number of episodes.
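To make the two normalizations concrete (a sketch: log_probs here stands in for the tensor the policy network actually computes, and n_eps_ph is a hypothetical placeholder that is not in the repo):

import tensorflow as tf

weights_ph = tf.placeholder(tf.float32, shape=(None,))  # R(tau), repeated for each step
log_probs  = tf.placeholder(tf.float32, shape=(None,))  # stand-in for log pi(a_t|s_t)
n_eps_ph   = tf.placeholder(tf.float32, shape=())       # hypothetical: episodes per epoch

# What the implementation computes: mean over every step in the batch.
loss_mean_over_steps = -tf.reduce_mean(weights_ph * log_probs)

# What the equation literally says: sum over steps, divided by the number of
# episodes. The two differ only by a positive constant (total_steps / n_eps),
# so the gradient direction is identical.
loss_mean_over_episodes = -tf.reduce_sum(weights_ph * log_probs) / n_eps_ph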

Thanks for your help understanding this!

Using ExperimentGrid: --cpu greater than 1 can lead to extra algorithm runs

Code to reproduce the error:

from spinup.utils.run_utils import ExperimentGrid
from spinup import ppo
import tensorflow as tf

if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--cpu', type=int, default=4)
    parser.add_argument('--num_runs', type=int, default=1)
    args = parser.parse_args()

    eg = ExperimentGrid(name='ppo-bench')
    eg.add('env_name', 'CartPole-v0', '', True)
    eg.add('seed', [10*i for i in range(args.num_runs)])
    eg.add('epochs', 10)
    eg.add('steps_per_epoch', 4000)
    eg.add('ac_kwargs:hidden_sizes', [(32,)], 'hid')
    eg.run(ppo, num_cpu=args.cpu)

Note that --cpu is 4, and --num_runs (number of seeds) is 1.
Simply run this code, and you will find that it first logs:

Variants, counting seeds: 1
Variants, not counting seeds: 1

This is correct, since for each hyperparameter and the seed we only give one entry. But then the whole ppo algorithm actually gets run 4 times: you will see 4 different logs interleaved on stdout (if you also seed the environment, you will see 4 identical logs). Since we only have one variant here, shouldn't the program run only one ppo instance?
Many thanks!

[Question] MuJoCo-specific: is it not recommended to constrain actions within valid bounds?

It seems the Spinning Up implementations do not constrain actions according to env.action_space.low and env.action_space.high. I guess it is even discouraged to clip actions after sampling from the learned Gaussian distribution, and this might be specific to MuJoCo environments, where the reward function contains a penalty on the magnitude of actions (reward_ctrl), so it automatically penalizes actions outside the [-1, 1] bounds.
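For comparison, the clipping pattern under discussion would look like this (a sketch; the repo's off-policy algorithms do clip their deterministic or squashed actions, while the on-policy Gaussian policies step with raw samples):

import gym
import numpy as np

env = gym.make('HalfCheetah-v2')
o = env.reset()

a = env.action_space.sample()   # stand-in for a sample from the learned Gaussian
a = np.clip(a, env.action_space.low, env.action_space.high)  # clip into the valid box
o2, r, d, _ = env.step(a)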

Silent bug in ppo implementation - rewards shifted

Looking at the PPO code, I noticed that each buffer slot stores the reward from before the corresponding observation instead of the reward received after it (see the OpenAI Baselines implementation as a reference for the correct ordering: https://github.com/openai/baselines/blob/master/baselines/ppo2/runner.py). This leads to a bug in the finish_path method, where deltas are calculated as
deltas = rews[:-1] + self.gamma * vals[1:] - vals[:-1]
which translates to r_{t-1} + gamma * v_{t+1} - v_t, i.e., the rewards are shifted and the value function has to approximate the rewards-to-go from t-1 instead of the rewards-to-go from t. This bug is silent in most control problems (where rewards are dense), but it still might hinder performance. Can you confirm that this is indeed a bug?
(Also, I haven't taken as close a look at the other algorithms yet; maybe the same bug is present there as well?)

A problem in VPG

(screenshot of the VPG code in question not included)
Why is the experience stored before executing the action, with a fake reward, in the first step of each episode?

multi-cpu problem in experiment grid

@jachiam Hi! It's me again! Two days ago I posted an issue about using multiple CPUs with ExperimentGrid, which seemed to give wrong logs only when run in PyCharm but was fine in a terminal. I did some more experiments today and found that although the logs look correct when run from a terminal, there could still be a bug: a multi-CPU run always takes more time than a single-CPU run to solve the same number of jobs.

Here is the code I tried; when you have time, could you run it and see if you can reproduce the bug?
My machine has 4 cores, and using 2 cores to run 4 jobs should be faster than using 1 core to run 4 jobs.
The result I get is: single-CPU about 15 seconds, multi-CPU about 20 seconds.
I ran this code from a terminal; the result indicates that with the same number of jobs, multi-CPU takes more time than single-CPU, which seems like a bug to me.
I believe what the code might be doing is running every variant num_cpu times.

from spinup.utils.run_utils import ExperimentGrid
from spinup import ppo

import time
if __name__ == '__main__':
    total_num_runs = 4
    import argparse

    ## try multi-cpu case
    parser = argparse.ArgumentParser()
    parser.add_argument('--cpu', type=int, default=2)
    parser.add_argument('--num_runs', type=int, default=total_num_runs)
    args = parser.parse_args()

    ## reset timing
    starttime = time.time()
    eg = ExperimentGrid(name='ppo-bench')
    eg.add('env_name', 'CartPole-v0', '', True)
    eg.add('seed', [10 * i for i in range(args.num_runs)])
    eg.add('epochs', 1)
    eg.add('steps_per_epoch', 200)
    eg.run(ppo, num_cpu=args.cpu)
    multi_cpu_time = time.time() - starttime

    ## try single-cpu case
    args.cpu = 1

    ## reset timing
    starttime = time.time()
    eg = ExperimentGrid(name='ppo-bench')
    eg.add('env_name', 'CartPole-v0', '', True)
    eg.add('seed', [10 * i for i in range(args.num_runs)])
    eg.add('epochs', 1)
    eg.add('steps_per_epoch', 200)
    eg.run(ppo, num_cpu=args.cpu)
    single_cpu_time = time.time() - starttime

    print('single-cpu',single_cpu_time,'multi-cpu',multi_cpu_time)

Many thanks!

Running spinningup in Linux Subsystem on Windows (Success)

For people who are on Windows 10 and do not have Linux but want to make things work:

  1. Enable WSL in Windows 10 by following this.
  2. Install the Xming X window server for Windows from here, and make sure it is running.
  3. Once WSL is working: open cmd and type "bash"; this switches cmd to a WSL terminal. Then run the following to enable GUI support for WSL (copied from this Stack Overflow answer):
    sudo apt-get install x11-apps
    export DISPLAY=localhost:0.0
    nano ~/.bashrc  # (add export DISPLAY=localhost:0.0 at the end; Ctrl+X to exit/save)
    sudo apt-get install gnome-calculator  # will get you GTK
  4. Download Miniconda for Linux from here. It will be an ".sh" file.
  5. From the terminal, go to the folder you downloaded the file to and run "bash <name_of_downloaded_file>"; this installs conda.
  6. Follow the Spinning Up tutorial for the rest of the installation.

'--cpu' flag causes IndexError: list index out of range

Run:
python -m spinup.run ppo --hid [32,32] --env LunarLander-v2 --exp_name installtest --gamma 0.999 --cpu 12 --seed 42

After a random number of epochs, 'IndexError: list index out of range' occurs:
File "/home/steve/spinningup/spinup/utils/logx.py", line 321, in log_tabular
vals = np.concatenate(v) if isinstance(v[0], np.ndarray) and len(v[0].shape)>0 else v
IndexError: list index out of range

Despite passing --seed, this is not deterministic, but it always seems to happen within the first ~20 epochs.
The problem appears to be that v is [], hence the attempt to access v[0] fails.

--cpu auto also has the problem.
Only --cpu 1 seems to be safe.

Failed to load the native TensorFlow runtime.

python -m spinup.run test_policy data/installtest/installtest_s0
Traceback (most recent call last):
File "/Users/pritee/anaconda3/envs/spinningup/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in
from tensorflow.python.pywrap_tensorflow_internal import *
File "/Users/pritee/anaconda3/envs/spinningup/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in
_pywrap_tensorflow_internal = swig_import_helper()
File "/Users/pritee/anaconda3/envs/spinningup/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "/Users/pritee/anaconda3/envs/spinningup/lib/python3.6/imp.py", line 243, in load_module
return load_dynamic(name, filename, file)
File "/Users/pritee/anaconda3/envs/spinningup/lib/python3.6/imp.py", line 343, in load_dynamic
return _load(spec)
ImportError: dlopen(/Users/pritee/anaconda3/envs/spinningup/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so, 6): Symbol not found: _clock_gettime
Referenced from: /Users/pritee/anaconda3/envs/spinningup/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so (which was built for Mac OS X 10.12)
Expected in: /usr/lib/libSystem.B.dylib
in /Users/pritee/anaconda3/envs/spinningup/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/pritee/anaconda3/envs/spinningup/lib/python3.6/runpy.py", line 183, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/Users/pritee/anaconda3/envs/spinningup/lib/python3.6/runpy.py", line 109, in _get_module_details
import(pkg_name)
File "/Users/pritee/spinningup/spinup/init.py", line 2, in
from spinup.algos.ddpg.ddpg import ddpg
File "/Users/pritee/spinningup/spinup/algos/ddpg/ddpg.py", line 2, in
import tensorflow as tf
File "/Users/pritee/anaconda3/envs/spinningup/lib/python3.6/site-packages/tensorflow/init.py", line 24, in
from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import
File "/Users/pritee/anaconda3/envs/spinningup/lib/python3.6/site-packages/tensorflow/python/init.py", line 49, in
from tensorflow.python import pywrap_tensorflow
File "/Users/pritee/anaconda3/envs/spinningup/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in
raise ImportError(msg)
ImportError: Traceback (most recent call last):
File "/Users/pritee/anaconda3/envs/spinningup/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in
from tensorflow.python.pywrap_tensorflow_internal import *
File "/Users/pritee/anaconda3/envs/spinningup/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in
_pywrap_tensorflow_internal = swig_import_helper()
File "/Users/pritee/anaconda3/envs/spinningup/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "/Users/pritee/anaconda3/envs/spinningup/lib/python3.6/imp.py", line 243, in load_module
return load_dynamic(name, filename, file)
File "/Users/pritee/anaconda3/envs/spinningup/lib/python3.6/imp.py", line 343, in load_dynamic
return _load(spec)
ImportError: dlopen(/Users/pritee/anaconda3/envs/spinningup/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so, 6): Symbol not found: _clock_gettime
Referenced from: /Users/pritee/anaconda3/envs/spinningup/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so (which was built for Mac OS X 10.12)
Expected in: /usr/lib/libSystem.B.dylib
in /Users/pritee/anaconda3/envs/spinningup/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so

Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.

Passing the training flag for batch normalization?

Hello,

I am wondering how you would provide the training flag for batch normalization layers in an architecture specified by an actor-critic function.

During inference, when generating actions, you want the training flag set to false, but when training you want it set to true.

This means that you need a different "op" for generating actions during an episode than the one used when training from the experience replay buffer, but this action-generating "op" must use exactly the same weights as the "main" network.

So you must either carefully name your ops and use a variable scope with reuse=tf.AUTO_REUSE to make them share the weights, or recreate an op and copy the "main" weights into this new network before running a new episode.

How would you do it? (A sketch of the first option is below.)
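A sketch of the AUTO_REUSE route (illustrative layer sizes and names; this is not Spinning Up code):

import tensorflow as tf

def actor(x, training):
    # Both calls below create/reuse the same variables thanks to AUTO_REUSE;
    # explicit layer names keep the variable names stable across calls.
    with tf.variable_scope('main/pi', reuse=tf.AUTO_REUSE):
        h = tf.layers.dense(x, 64, activation=tf.nn.relu, name='h1')
        h = tf.layers.batch_normalization(h, training=training, name='bn1')
        return tf.layers.dense(h, 1, activation=tf.tanh, name='out')

x_ph = tf.placeholder(tf.float32, shape=(None, 3))
pi_act = actor(x_ph, training=False)   # op for collecting experience
pi_train = actor(x_ph, training=True)  # op for updates from the replay buffer

# batch_normalization registers its moving-average updates in UPDATE_OPS;
# run them together with the training op.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)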

OMP: Error #15: Initializing libiomp5.dylib

Followed Mujoco install, followed by full Gym installation.

Used Miniconda, py 3.6

On fresh OSX Mojave (right out of the box!)

Ran the following code:

import gym
import tensorflow as tf
from spinup import ddpg

env_name = 'Pendulum-v0'
env_fn       = lambda : gym.make(env_name)

ac_kwargs = {
        'hidden_sizes':[64,64], 
        'activation'  :tf.nn.relu
    }

logger_kwargs = {
    'output_dir'  : 'logs', 
    'exp_name'  :'pendulum_test'
    }

addl_kwargs = {
    'seed' : 42
}

ddpg(env_fn, ac_kwargs=ac_kwargs, logger_kwargs=logger_kwargs, **addl_kwargs)

Received the following error on the first run:

INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: logs/simple_save/saved_model.pb

OMP: Error #15: Initializing libiomp5.dylib, but found libiomp5.dylib already initialized.

OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.

[MBP:05587] *** Process received signal ***
[MBP:05587] Signal: Abort trap: 6 (6)
[MBP:05587] Signal code: (0)
[MBP:05587] [ 0] 0 libsystem_platform.dylib 0x00007fff6ad79b3d _sigtramp + 29
[MBP:05587] [ 1] 0 libiomp5.dylib 0x0000000110b7b018 __kmp_openmp_version + 88572
[MBP:05587] [ 2] 0 libsystem_c.dylib 0x00007fff6ac381c9 abort + 127
[MBP:05587] [ 3] 0 libiomp5.dylib 0x0000000110b24df3 __kmp_abort_process + 35
[MBP:05587] *** End of error message ***
Abort trap: 6

The workaround is to include:

import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

However, is there a more permanent fix? Why might there be multiple instances of OpenMP?

Is there a good reason why policy & value networks are often separated for MuJoCo and shared for Atari?

I am wondering if there is a rule of thumb for this. Maybe it is related to the following reasoning?

  • Atari: shared parameters between the policy and value networks. Because Atari uses pixel inputs, the shared layers act as a feature extractor.

  • MuJoCo: separate parameters for the policy and value networks. Because MuJoCo uses raw state configurations and has relatively short horizons, separating them allows the value network to be trained with more gradient updates or an independent learning rate, making it learn faster, since it is basically a regression task.

Running with Python 3.7 error

When I tried the installation test from the example

python -m spinup.run ppo --hid "[32,32]" --env LunarLander-v2 --exp_name installtest --gamma 0.999

I'm receiving this error:

Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "/Users/eax/worx/deeplearning/openai/spinningup/spinup/__init__.py", line 2, in <module>
    from spinup.algos.ddpg.ddpg import ddpg
  File "/Users/eax/worx/deeplearning/openai/spinningup/spinup/algos/ddpg/ddpg.py", line 2, in <module>
    import tensorflow as tf
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow/__init__.py", line 22, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 114
    def TFE_ContextOptionsSetAsync(arg1, async):
                                             ^
SyntaxError: invalid syntax

I've installed Spinning Up into a brand-new virtualenv with Python 3.7.

GPU Installation conflict

Hello,

Boring install problem here, but when I install Spinning Up with the site instructions

git clone https://github.com/openai/spinningup.git
cd spinningup
pip install -e .

It installs tensorflow without GPU support, which may conflict with an existing tensorflow-gpu installation. (Everything works, but it no longer uses the GPU.)

Maybe remove tensorflow from the dependencies and let the user manually install the desired version.

The speed-up from a GPU isn't necessarily great when working with small layer sizes.

I sorted it out by manually uninstalling and reinstalling tensorflow-gpu with --force.

Thanks

VPG: standardize advantages + standardize returns?

In the VPG code, the advantage estimate is standardized; I guess the reason is that it uses GAE? I am wondering whether it would still benefit from standardizing returns, which is also common in VPG implementations (when not using any bootstrapping). Standardizing returns can be viewed as making the value function learn faster by fitting a standardized 'dataset'.

  1. Is it correct that when using bootstrapping, it is better not to standardize returns anymore? Because in the next iteration, when calculating bootstrapped returns, the reward at each time step has raw magnitude, but the last state value is scaled due to the training in the previous iteration.

  2. In VecNormalize, it turns out to be very helpful for accelerating training to standardize observations with a running average. It also has an option to standardize rewards by a running average of episodic returns; am I right in thinking that using this option has, in principle, the same effect as standardizing returns?
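For reference, the advantage standardization in question amounts to the following (a single-process sketch; the repo computes the mean/std with MPI helpers):

import numpy as np

def normalize_advantages(adv, eps=1e-8):
    # mean-zero / unit-std advantages, applied once per epoch in the buffer
    return (adv - adv.mean()) / (adv.std() + eps)

adv = np.random.randn(4000) * 5.0 + 2.0   # stand-in for GAE-lambda estimates
adv = normalize_advantages(adv)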

Exercise 1.1: adding epsilon to std

Hello,

the provided solution spinup/exercises/problem_set_1_solutions/exercise1_1_soln.py includes the following snippet of code:

...((x-mu)/tf.exp(log_std)+EPS) ** 2...

which I'm a little curious about. The test passes without including the EPS term in my own solution (so I figure it doesn't matter), but I don't seem to find any mention of this epsilon parameter in the docs: https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#stochastic-policies.

Is there any reasoning behind this term?
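For context, the diagonal-Gaussian log-likelihood in question is, per dimension, -0.5 * (((x - mu)/sigma)^2 + 2*log_std + log(2*pi)). One plausible reading of the EPS term is as a guard on the denominator in case exp(log_std) underflows to zero; a sketch consistent with the quoted snippet:

import numpy as np
import tensorflow as tf

EPS = 1e-8

def gaussian_likelihood(x, mu, log_std):
    # log N(x | mu, exp(log_std)^2), summed over dimensions; EPS keeps the
    # division well-defined if exp(log_std) underflows to 0.
    pre_sum = -0.5 * (((x - mu) / (tf.exp(log_std) + EPS))**2
                      + 2 * log_std + np.log(2 * np.pi))
    return tf.reduce_sum(pre_sum, axis=1)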

Use masks to compute `discount_cumsum` with lfilter

This is a rather quick question: if we want to use a mask when calculating discounted returns, is there a trick so that we can still use lfilter? Its speedup is significant compared with a for-loop implementation.
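For reference, the lfilter trick being referred to is (as implemented in the repo's core.py files):

import numpy as np
import scipy.signal

def discount_cumsum(x, discount):
    # [x0 + d*x1 + d^2*x2, x1 + d*x2, x2] for input [x0, x1, x2]
    return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1], axis=0)[::-1]

print(discount_cumsum(np.array([1., 1., 1.]), 0.99))  # [2.9701 1.99 1.]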

Updates before start_steps

Hello,

In sac.py,

        if t > start_steps:
            a = get_action(o)
        else:
            a = env.action_space.sample()

You use a random policy before start_steps, but you nevertheless start updating the model parameters immediately, using a small replay-memory dataset.
It seems that a cautious approach would only update the model parameters once a sufficient dataset has been collected.

Currently we do start_steps model updates with a small dataset, which means we risk initially over-fitting the parameters to this small dataset, which may take a long time to recover from.

It is particularly insidious: when you have a slow network architecture you won't see a problem, but once you try a faster architecture you will overfit to the small dataset and take a long time to recover. It is also environment-dependent and may depend on the luck of the first few episodes.
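A sketch of the cautious variant being suggested (update_after is a hypothetical hyperparameter; it is not in the sac.py under discussion):

import gym

start_steps, update_after, total_steps = 1000, 1000, 5000

env = gym.make('Pendulum-v0')
o = env.reset()
buffer = []

for t in range(total_steps):
    a = env.action_space.sample()   # stands in for the random/policy action choice above
    o2, r, d, _ = env.step(a)
    buffer.append((o, a, r, o2, d))
    o = env.reset() if d else o2

    if t >= update_after:
        # only begin gradient updates once the buffer holds enough data;
        # before this point, the sac.py under discussion would already be
        # updating on a small buffer
        pass  # ... sample a batch from buffer and run the usual SAC update ops ...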

shape error

Hi everyone,

When I use the PPO algorithm with the environment set to "AirRaid-v0", I get this error:
ValueError: Shape must be rank 2 but is rank 4 for 'pi/multinomial/Multinomial' (op: 'Multinomial') with input shapes: [?,250,160,6], [].
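The rank-4 shape is the pixel observation: the MLP actor-critics in this repo build a categorical policy over a flat feature vector, and there is no CNN variant. One stopgap (a sketch, not a recommendation; pixel tasks really want a CNN policy) is to flatten observations with a wrapper:

import gym
import numpy as np

class FlattenObs(gym.ObservationWrapper):
    """Flatten image observations so the MLP policy sees a rank-1 vector."""
    def __init__(self, env):
        super(FlattenObs, self).__init__(env)
        size = int(np.prod(env.observation_space.shape))
        self.observation_space = gym.spaces.Box(low=0, high=255,
                                                shape=(size,), dtype=np.uint8)

    def observation(self, obs):
        return np.asarray(obs).ravel()

env = FlattenObs(gym.make('AirRaid-v0'))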
