hill-a / stable-baselines

This project forked from openai/baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms

Home Page: http://stable-baselines.readthedocs.io/

License: MIT License

Python 99.51% Shell 0.28% Dockerfile 0.13% Makefile 0.07%

baselines data-science gym machine-learning openai python reinforcement-learning reinforcement-learning-algorithms toolbox

stable-baselines's Issues

Wrong paper linked in docs

Hi, sorry if this isn't the right place to report this but I've noticed that the PPO1 page in the docs links to the TRPO paper instead of the PPO paper.

so here
https://stable-baselines.readthedocs.io/en/master/modules/ppo1.html
"original paper" links to
https://arxiv.org/abs/1502.05477
whereas it should be
https://arxiv.org/abs/1707.06347

Action not clipped for A2C

When using A2C with continuous actions (e.g., LunarLanderContinuous-v2), actions are not clipped, which leads to an out-of-bounds error.
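
A possible workaround until clipping is handled inside the library is to clip the action yourself before calling env.step(). The sketch below is only an illustration: the bounds are hard-coded, hypothetical stand-ins for env.action_space.low and env.action_space.high, and clip_action is not a stable-baselines function.

```python
import numpy as np

# Hypothetical bounds standing in for env.action_space.low / .high.
ACTION_LOW = np.array([-1.0, -1.0], dtype=np.float32)
ACTION_HIGH = np.array([1.0, 1.0], dtype=np.float32)


def clip_action(action, low=ACTION_LOW, high=ACTION_HIGH):
    """Return a copy of `action` with every component forced into [low, high]."""
    return np.clip(action, low, high)


raw_action = np.array([0.87, 1.86], dtype=np.float32)  # second component is too large
clipped_action = clip_action(raw_action)
```

Note that clipping only changes what the environment receives; the policy still produced the unclipped value.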

Step function in LstmPolicy is called without masks
I am using PPO1 with LstmPolicy in a gym-based environment. After the model is set up in pposgd_simple.py, learn() calls trpo_mpi.utils.traj_segment_generator(), which in turn calls LstmPolicy.step() without masks. LstmPolicy.step() needs the masks to be fed, so the error occurs there.
I also found that a2c.py calls step() as well, but there the mask comes from the runner, so I am trying to write some code modeled on a2c.py. Is there an easier way to fix this?

Related code

Train function:

def train(env, args):

    env.reset()
    model = PPO1(
        LstmPolicy,
        env,
        timesteps_per_actorbatch=int(args.actor_batch),
        clip_param=0.2,
        entcoeff=0.01,
        optim_epochs=5,
        optim_stepsize=args.learning_rate,
        optim_batchsize=int(args.optim_batch),
        gamma=0.99,
        lam=0.95,
        schedule='linear',
    )
    model.learn(
        total_timesteps=int(args.num_timesteps),
    )
    model.save(args.save_filename)
    return model

In ppo1.pposgd_simple learn()

from stable_baselines.trpo_mpi.utils import traj_segment_generator, add_vtarg_and_adv, flatten_lists

    def learn(self, total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name="PPO1"):
        with SetVerbosity(self.verbose), TensorboardWriter(self.graph, self.tensorboard_log, tb_log_name) as writer:
            self._setup_learn(seed)

            assert issubclass(self.policy, ActorCriticPolicy), "Error: the input policy for the PPO1 model must be " \
                                                               "an instance of common.policies.ActorCriticPolicy."

            with self.sess.as_default():
                self.adam.sync()

                # Prepare for rollouts
                seg_gen = traj_segment_generator(self.policy_pi, self.env, self.timesteps_per_actorbatch)

In trpo_mpi.utils traj_segment_generator():

    while True:
        prevac = action
        action, vpred, states, _ = policy.step(observation.reshape(-1, *observation.shape), states, done)
        # Slight weirdness here because we need value function at time T
        # before returning segment [0, T-1] so we get the correct
        # terminal value
        if step > 0 and step % horizon == 0:
            # Fix to avoid "mean of empty slice" warning when there is only one episode
            if len(ep_rets) == 0:
                ep_rets = [cur_ep_ret]
                ep_lens = [cur_ep_len]
                ep_true_rets = [cur_ep_true_ret]
                total_timesteps = cur_ep_len
            else:
                total_timesteps = sum(ep_lens) + cur_ep_len

In common.policies LstmPolicy.step()

    def step(self, obs, state=None, mask=None, deterministic=False):
        if deterministic:
            return self.sess.run([self.deterministic_action, self._value, self.snew, self.neglogp],
                                 {self.obs_ph: obs, self.states_ph: state, self.masks_ph: mask})
        else:
            return self.sess.run([self.action, self._value, self.snew, self.neglogp],
                                 {self.obs_ph: obs, self.states_ph: state, self.masks_ph: mask})

Error information

Traceback (most recent call last):
  File "/opt/conda/envs/tensorflow-py36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/envs/tensorflow-py36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/workspace/rl/action_detection_reinforcement_learning/main/openai_stable_baseline_ppo1.py", line 142, in <module>
    main()
  File "/workspace/rl/action_detection_reinforcement_learning/main/openai_stable_baseline_ppo1.py", line 136, in main
    train(env, args)
  File "/workspace/rl/action_detection_reinforcement_learning/main/openai_stable_baseline_ppo1.py", line 74, in train
    total_timesteps=int(args.num_timesteps),
  File "/workspace/rl/stable-baselines/stable_baselines/ppo1/pposgd_simple.py", line 215, in learn
    seg = seg_gen.__next__()
  File "/workspace/rl/stable-baselines/stable_baselines/trpo_mpi/utils.py", line 58, in traj_segment_generator
    action, vpred, states, _ = policy.step(observation.reshape(-1, *observation.shape), states, done)
  File "/workspace/rl/stable-baselines/stable_baselines/common/policies.py", line 226, in step
    {self.obs_ph: obs, self.states_ph: state, self.masks_ph: mask})
  File "/opt/conda/envs/tensorflow-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/opt/conda/envs/tensorflow-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1111, in _run
    str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape () for Tensor 'input/masks_ph:0', which has shape '(1,)'
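
The ValueError says that a scalar (shape ()) was fed into masks_ph, which expects shape (1,). One possible workaround, which is an assumption on my part and not necessarily how the maintainers would fix it, is to turn the scalar done flag into a one-element array before it reaches policy.step(). The helper name as_mask_batch is hypothetical.

```python
import numpy as np


def as_mask_batch(done, n_envs=1):
    """Turn a scalar done flag (or a sequence of flags) into a float array of shape (n_envs,)."""
    mask = np.asarray(done, dtype=np.float32).reshape(-1)
    if mask.shape != (n_envs,):
        raise ValueError("expected %d done flags, got shape %s" % (n_envs, mask.shape))
    return mask


mask = as_mask_batch(False)  # a scalar flag becomes an array of shape (1,)
```

In traj_segment_generator() this would correspond to passing something like as_mask_batch(done) instead of the bare done value.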

Render issue with vecEnv

Describe the bug
Error when calling render on vecenv, the bug was not present in version 1.0.7.

Code example

from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines.common.vec_env import VecFrameStack
from stable_baselines import A2C

env = make_atari_env('BreakoutNoFrameskip-v4', num_env=9, seed=0)
# Stack 4 frames
env = VecFrameStack(env, n_stack=4)

model = A2C('CnnPolicy', env)
# model = A2C.load("breakout_a2c.pkl")

obs = env.reset()
while True:          
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

Traceback

    env.render()
  File "/home/antonin/anaconda3/lib/python3.6/site-packages/stable_baselines/common/vec_env/base_vec_env.py", line 147, in render
    return self.venv.render(**kwargs)
  File "/home/antonin/anaconda3/lib/python3.6/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 92, in render
    bigimg = tile_images(imgs)
  File "/home/antonin/anaconda3/lib/python3.6/site-packages/stable_baselines/common/tile_images.py", line 15, in tile_images
    n_images, height, width, n_channels = img_nhwc.shape
ValueError: not enough values to unpack (expected 4, got 1)

System Info
Describe the characteristic of your environment:

Installed via pip (version 1.0.8.rc1)
cpu
python 3.6
tf v 1.8.0
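
The unpacking error suggests that tile_images() received an array with a single dimension (for example an array of objects) instead of a batch of shape (n, h, w, c). The sketch below is not the library's implementation; it is a hypothetical tile_images_checked() that tiles frames into a grid and fails with a clearer message when the input does not have four dimensions.

```python
import numpy as np


def tile_images_checked(images):
    """Tile N images of shape (h, w, c) into one roughly square (rows*h, cols*w, c) image."""
    batch = np.asarray(images)
    if batch.ndim != 4:
        raise ValueError("expected images with shape (n, h, w, c), got %s" % (batch.shape,))
    n_images, height, width, n_channels = batch.shape
    rows = int(np.ceil(np.sqrt(n_images)))
    cols = int(np.ceil(n_images / rows))
    # Pad with black frames so the grid is completely filled.
    padding = np.zeros((rows * cols - n_images, height, width, n_channels), dtype=batch.dtype)
    batch = np.concatenate([batch, padding], axis=0)
    grid = batch.reshape(rows, cols, height, width, n_channels)
    grid = grid.transpose(0, 2, 1, 3, 4)  # -> (rows, h, cols, w, c)
    return grid.reshape(rows * height, cols * width, n_channels)


# Nine tiny dummy frames, one per environment, each filled with its own index.
frames = [np.full((2, 3, 3), i, dtype=np.uint8) for i in range(9)]
tiled = tile_images_checked(frames)
```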

[Question] Why does DDPG not allow Discrete action spaces?

Is there a particular reason why this is not possible or is it just an issue of the current implementation?
And does anyone know of a way to "bypass" (or adapt to) this restriction when your actions are a discrete set?

Example with CnnLstmPolicy

I tried to run with "Breakout-v0" and got this error.

ValueError: Cannot reshape a tensor with 16384 elements to shape [0,128,?] (0 elements) for 'model_1/Reshape_1' (op: 'Reshape') with input shapes: [32,512], [3] and with input tensors computed as partial shapes: input[1] = [0,128,?].

Dueling missing for DQN

It seems that the new refactoring of DQN policies deleted the 'dueling' option.
See https://github.com/openai/baselines/blob/master/baselines/deepq/models.py#L53.

Action Probability broken in DQN

Describe the bug
Action Probability is broken in DQN.

@hill-a I think we should add a test for action_proba for all models.

Code example

from stable_baselines import DQN
model = DQN('MlpPolicy', 'CartPole-v1')
obs = model.env.reset()
model.action_probability(obs)

Stack trace:

----> 1 model.action_probability(obs)

~/Documents/stable-baselines/stable_baselines/deepq/dqn.py in action_probability(self, observation, state, mask)
    261         # Get the tensor just before the softmax function in the TensorFlow graph,
    262         # then execute the graph from the input observation to this tensor.
--> 263         tensor = self.graph.get_tensor_by_name('deepq/q_func/fully_connected_2/BiasAdd:0')
    264         if vectorized_env:
    265             return self._softmax(self.sess.run(tensor, feed_dict={'deepq/observation:0': observation}))

~/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py in get_tensor_by_name(self, name)
   3764       raise TypeError("Tensor names are strings (or similar), not %s." %
   3765                       type(name).__name__)
-> 3766     return self.as_graph_element(name, allow_tensor=True, allow_operation=False)
   3767 
   3768   def _get_tensor_by_tf_output(self, tf_output):

~/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py in as_graph_element(self, obj, allow_tensor, allow_operation)
   3588 
   3589     with self._lock:
-> 3590       return self._as_graph_element_locked(obj, allow_tensor, allow_operation)
   3591 
   3592   def _as_graph_element_locked(self, obj, allow_tensor, allow_operation):

~/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py in _as_graph_element_locked(self, obj, allow_tensor, allow_operation)
   3630           raise KeyError("The name %s refers to a Tensor which does not "
   3631                          "exist. The operation, %s, does not exist in the "
-> 3632                          "graph." % (repr(name), repr(op_name)))
   3633         try:
   3634           return op.outputs[out_n]

KeyError: "The name 'deepq/q_func/fully_connected_2/BiasAdd:0' refers to a Tensor which does not exist. The operation, 'deepq/q_func/fully_connected_2/BiasAdd', does not exist in the graph."

System Info
Describe the characteristic of your environment:

Pip and master version should be affected
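
For context, the code in the trace looks up the pre-softmax tensor by a hard-coded name and then passes its value through self._softmax(). The lookup is what fails; the softmax step itself can be sketched independently of any tensor name. The Q-values below are made up for illustration.

```python
import numpy as np


def softmax(q_values):
    """Numerically stable softmax over the last axis."""
    q_values = np.asarray(q_values, dtype=np.float64)
    shifted = q_values - q_values.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


q_values = np.array([[1.0, 2.0]])  # hypothetical Q-values for CartPole's two actions
probs = softmax(q_values)
```

A test that calls action_probability() on every model, as suggested above, could assert that each row of the result sums to 1.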

[Question] Reset to a Pickled Snapshot instead of the Initial State

Hello,

I am trying to extend stable-baselines to allow resetting to arbitrary "snapshot points" instead of the start to increase exploratory capabilities of an agent.

I wanted to ask you the best way to go about doing this. Here is my current approach:

I created a SnapshotVecEnv inheriting from VecEnv. In addition to everything in the DummyVecEnv (I hope to multi thread it at some point) it has several additional functions mostly based on this PR:

def save_snapshot(self):
    # Use pickle to return a snapshot of the whole environment.
    # This will be called, for now, at random points within training.
    # I plan on keeping the saved snapshots within the SnapshotVecEnv per environment.
    ...

def load_snapshot(self, snapshot):
    # Set the current environment to the loaded snapshot.
    # This will be called whenever the environment resets.
    ...
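
To make the idea concrete, here is a minimal sketch of the two methods on a toy environment. Everything in it is hypothetical (CounterEnv, SnapshotEnv), and it uses copy.deepcopy instead of pickle to keep the example self-contained; pickle.dumps / pickle.loads would play the same role if snapshots need to be written to disk. Real environments that hold native handles (emulators, physics engines) may not be copyable this way.

```python
import copy


class CounterEnv:
    """Hypothetical toy environment whose whole state is a step counter."""

    def __init__(self):
        self.steps = 0

    def reset(self):
        self.steps = 0
        return self.steps

    def step(self, action):
        self.steps += 1
        return self.steps, 0.0, False, {}


class SnapshotEnv:
    """Hypothetical wrapper that can save and restore the wrapped environment."""

    def __init__(self, env):
        self.env = env

    def save_snapshot(self):
        # Copy the whole wrapped environment, including its internal state.
        return copy.deepcopy(self.env)

    def load_snapshot(self, snapshot):
        # Restore from a copy so the stored snapshot can be reused later.
        self.env = copy.deepcopy(snapshot)


wrapped = SnapshotEnv(CounterEnv())
wrapped.env.step(0)
snapshot = wrapped.save_snapshot()  # taken after one step
wrapped.env.step(0)
wrapped.load_snapshot(snapshot)     # back to the state after one step
```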

The main question I have is about how resetting is handled. I could not figure it out yet. If you have time, can you give me an overview of it?

Also, I would be glad if you could point out any possible flaws you see.

Thank you very much.

Fork means unsearchable on github

Trying to search the codebase on github returns "Sorry, forked repositories are not currently searchable"

I think you can maintain openai/baselines as a remote, yet not have this repo marked as a "fork" in github, to allow for search.

For now I just grep my clone of your repo in bash. But this repo is clearly substantial enough to be a first-class repo in github, and apparently forks are not treated as first class repos by github.

[Request] Record video in Colab

First of all, excellent work, just what I have been looking for!
Love it that I can experiment and have well documented code.

I work mainly in Google Colab, mostly because I got tired of pip install and because it removes the hassle of infrastructure.

I see you published very nice Colabs, many thanks.

I figured out a way, using wrappers, to have Gym create videos that also works in Colab (with a virtual monitor). It is nice to have a graph showing that the score is increasing, but I also want to see the games with my own eyes. Unfortunately, that no longer works with Stable Baselines (at least I have not spotted how yet).

Any ideas how to solve this for a Colab environment with Stable Baselines?

Below is an example of how it works for me in a standard gym environment:

# @title Set up the virtual display environment

!apt-get update
!apt-get install python-opengl -y
!apt install xvfb -y
!pip install pyvirtualdisplay
!pip install pyglet
!apt-get install ffmpeg

# @title Start the virtual monitor

from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

# @title Play a random game and create a video

import gym
from gym import wrappers

env = gym.make("PongNoFrameskip-v4")
monitor_dir = '/content/test5'

# Set up a wrapper to be able to record a video of the game
record_video = True
should_record = lambda i: record_video
env = wrappers.Monitor(env, monitor_dir, video_callable=should_record, force=True)

# Play a game
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random action, replace by the prediction of the model
    state, reward, done, _ = env.step(action)

record_video = False
env.close()

# @title Download videos

import glob
import os
from google.colab import files

os.chdir(monitor_dir)  # change directory to get the files
!pwd  # show file path
!ls  # show directory content

for file in glob.glob("*.mp4"):
    print(file)
    files.download(file)

os.chdir('..')  # back out of the video directory

Wrapping the whole stable-baseline env does not work.

I also tried something like:

monitor_dir = '/content/test7'
env_show = gym.make('LunarLander-v2')
record_video = True
should_record = lambda i: record_video
env_show = wrappers.Monitor(env_show, monitor_dir, video_callable=should_record, force=True)

obs = env_show.reset()
dones = False
while not dones:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env_show.step(action)
    env_show.render()

record_video = False
env_show.close()

This works for Lunar Lander but not for an Atari game:

Error: Unexpected observation shape (210, 160, 3) for Box environment, please use (84, 84, 4) or (n_env, 84, 84, 4) for the observation shape.

Any pragmatic elegant solution or small enhancement for Colab users?

Thank you

Loaded model from file does not consider its action space continuous: error when running in `LunarLanderContinuous-v2` environment

I train a model with PPO2 on the LunarLanderContinuous-v2 environment, then save it to a file.

Then I load the file, just like in the example, and run it again with the same environment to render the process, again just like in the example.

But an error occurs which indicates that the loaded/inferred model has as an attribute self.continuous=False or similar.

The equivalent process works just fine without error if env_name='LunarLander-v2'.

MWE of bug:

import gym

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2

env_name = 'LunarLanderContinuous-v2'
policy = MlpPolicy
rl_alg = PPO2
rl_name = 'ppo2'

env = gym.make(env_name)
env = DummyVecEnv([lambda:env])

model = rl_alg.load(rl_name+"_"+env_name)

obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

Stack trace:

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Loading a model without an environment, this model cannot be trained until it has a valid environment.

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-3-e900d1547adb> in <module>()
     12 while True:
     13     action, _states = model.predict(obs)
---> 14     obs, rewards, dones, info = env.step(action)
     15     env.render()

/usr/local/lib/python3.5/dist-packages/stable_baselines/common/vec_env/base_vec_env.py in step(self, actions)
     92         """
     93         self.step_async(actions)
---> 94         return self.step_wait()
     95 
     96     def get_images(self):

/usr/local/lib/python3.5/dist-packages/stable_baselines/common/vec_env/dummy_vec_env.py in step_wait(self)
     45         for env_idx in range(self.num_envs):
     46             obs, self.buf_rews[env_idx], self.buf_dones[env_idx], self.buf_infos[env_idx] =\
---> 47                 self.envs[env_idx].step(self.actions[env_idx])
     48             if self.buf_dones[env_idx]:
     49                 obs = self.envs[env_idx].reset()

/usr/local/lib/python3.5/dist-packages/gym/wrappers/time_limit.py in step(self, action)
     29     def step(self, action):
     30         assert self._episode_started_at is not None, "Cannot call env.step() before calling reset()"
---> 31         observation, reward, done, info = self.env.step(action)
     32         self._elapsed_steps += 1
     33 

/usr/local/lib/python3.5/dist-packages/gym/envs/box2d/lunar_lander.py in step(self, action)
    236 
    237     def step(self, action):
--> 238         assert self.action_space.contains(action), "%r (%s) invalid " % (action,type(action))
    239 
    240         # Engines

AssertionError: array([0.87050533, 1.862139  ], dtype=float32) (<class 'numpy.ndarray'>) invalid

System Info
Model was trained/saved on Python 3.6 and loaded on Python 3.5
Library was installed via pip install stable-baselines
Trained on GPU, loaded and run on CPU only
All packages are up-to-date except I must run Tensorflow 1.5.0

Additional context
The model was trained and saved on a Google Compute Engine instance and run locally on an old Thinkpad laptop.
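
One observation on the trace: the rejected action has two float components, so the loaded model does appear to produce continuous actions. What fails is the bounds check in action_space.contains(). Assuming LunarLanderContinuous-v2 limits both components to [-1, 1], the second value from the AssertionError is out of range, which the hypothetical helper below makes visible.

```python
import numpy as np


def out_of_bounds(action, low, high):
    """Boolean mask of the components of `action` that lie outside [low, high]."""
    action = np.asarray(action)
    return (action < low) | (action > high)


# Values copied from the AssertionError message above.
action = np.array([0.87050533, 1.862139], dtype=np.float32)
violations = out_of_bounds(action, low=-1.0, high=1.0)
```

If that reading is right, this may be the same missing-clipping problem reported for A2C rather than a problem with loading.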

Images not scaled by default for DQN (cnn policy)

Designating Epsilon (Exploration rate)

Where can one establish the epsilon (exploration rate on greedy policies) and its decay function?
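
As far as I understand the DQN code inherited from OpenAI Baselines, epsilon is annealed linearly and is controlled by constructor arguments such as exploration_fraction and exploration_final_eps; please treat those names as assumptions and check them against your installed version. The schedule itself can be sketched in plain Python (linear_epsilon is a hypothetical helper, not a library function):

```python
def linear_epsilon(step, total_steps, exploration_fraction=0.1, initial_eps=1.0, final_eps=0.02):
    """Anneal epsilon linearly from initial_eps to final_eps over the first
    exploration_fraction * total_steps steps, then hold it at final_eps."""
    decay_steps = max(1, int(exploration_fraction * total_steps))
    progress = min(step / decay_steps, 1.0)
    return initial_eps + progress * (final_eps - initial_eps)


eps_start = linear_epsilon(0, total_steps=100000)     # fully random at the beginning
eps_mid = linear_epsilon(5000, total_steps=100000)    # halfway through the decay phase
eps_end = linear_epsilon(50000, total_steps=100000)   # decay finished, held at final_eps
```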

PPO2 - network diverges, outputs become NaNs

Installed via pip, running python3.6, TensorFlow 1.9

The environment is HalfCheetah-v2 (but I have observed this in other environments as well).

env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.)
model = PPO2(MlpPolicy, env, gamma=0.99, n_steps=2048, ent_coef=0.0, learning_rate=3e-4, max_grad_norm=0.5, lam=0.95, nminibatches=32, noptepochs=10, cliprange=0.2, vf_coef=1.0, verbose=2)
model.learn(total_timesteps=int(1e6))

I also tried the default parameters, with the same result.

TF Error with ACKTR: FIFOQueue '_0_fifo_queue' is closed

To reproduce the bug: call model.learn twice with an acktr model

Tensorboard: Continue training curves

Currently each time we .learn() it starts a new curve on tensorboard. This makes continuing training (in a loop, or reloading later) difficult to visualize.

I was able to change logic in TensorboardWriter (_get_latest_run_id) to avoid starting a new curve with numbered postfix.

However the global_step is still reset each time, resulting in jumbled curves.

I would like to avoid starting the timeline from zero. It appears acktr is the only agent type that mentions global_step. Is that the solution for other agent types?

Fix Deepq warning

baselines/common/tests/test_atari.py::test_deepq
 /root/venv/lib/python3.5/site-packages/numpy/core/fromnumeric.py:2957: RuntimeWarning: Mean of empty slice.
   out=out, **kwargs)
 /root/venv/lib/python3.5/site-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
   ret = ret.dtype.type(ret / rcount)
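
Both warnings are what NumPy emits when np.mean() is called on an empty sequence, for example when no episode has finished yet. A small guard, sketched here as an assumption about where the warning comes from, avoids them:

```python
import numpy as np


def safe_mean(values):
    """Mean of `values`, or NaN (without a RuntimeWarning) when `values` is empty."""
    return np.nan if len(values) == 0 else float(np.mean(values))


empty_result = safe_mean([])            # no episode finished yet
normal_result = safe_mean([1.0, 3.0])
```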

Inconsistent Behavior in automatic env creation

Describe the bug
When creating an env passing a string, using model.env for testing won't work for DeepQ, PPO1 and TRPO.

Code example

import gym

from stable_baselines import DeepQ
from stable_baselines.common.vec_env import DummyVecEnv

model = DeepQ(policy="MlpPolicy", env='CartPole-v1')

# env = DummyVecEnv([lambda: gym.make('CartPole-v1')])
env = model.env
obs = env.reset()
for _ in range(100):
    action, _ = model.predict(obs)
    obs, reward, done, _ = env.step(action)

raises

~/anaconda3/lib/python3.6/site-packages/gym/envs/classic_control/cartpole.py in step(self, action)
     52 
     53     def step(self, action):
---> 54         assert self.action_space.contains(action), "%r (%s) invalid"%(action, type(action))
     55         state = self.state
     56         x, x_dot, theta, theta_dot = state

AssertionError: array([1]) (<class 'numpy.ndarray'>) invalid

because the action is an array instead of an int.
However, replacing env = model.env by env = DummyVecEnv(...) makes things work.
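
If the unwrapped model.env has to be used, another stopgap is to convert the one-element action array into a plain int before stepping. to_discrete_action is a hypothetical helper written for this illustration:

```python
import numpy as np


def to_discrete_action(action):
    """Convert a single-element array such as array([1]) into a plain int."""
    action = np.asarray(action)
    if action.size != 1:
        raise ValueError("expected exactly one action, got shape %s" % (action.shape,))
    return int(action.reshape(-1)[0])


scalar_action = to_discrete_action(np.array([1]))  # what predict() returned in the trace
```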

System Info
Describe the characteristic of your environment:

Installed via pip or from source

Cannot load model with custom policy

Cannot load custom policy.

Loading a model without an environment, this model cannot be trained until it has a valid environment.
Traceback (most recent call last):
  File "/snap/pycharm-professional/89/helpers/pydev/pydevd.py", line 1664, in <module>
    main()
  File "/snap/pycharm-professional/89/helpers/pydev/pydevd.py", line 1658, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/snap/pycharm-professional/89/helpers/pydev/pydevd.py", line 1068, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/snap/pycharm-professional/89/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/SRC/pathway/rl-EXTERNAL/stable-baselines/train_test_load_custompolicy.py", line 58, in <module>
    model = A2C.load("a2c_lunar")
  File "/usr/local/HAMMER/DYNAMIC/SRC/pathway/rl-EXTERNAL/stable-baselines/stable_baselines/common/base_class.py", line 361, in load
    model.setup_model()
  File "/usr/local/HAMMER/DYNAMIC/SRC/pathway/rl-EXTERNAL/stable-baselines/stable_baselines/a2c/a2c.py", line 102, in setup_model
    n_batch_step, reuse=False)
  File "/SRC/pathway/rl-EXTERNAL/stable-baselines/train_test_load_custompolicy.py", line 16, in __init__
    super(CustomPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=0,
TypeError: super(type, obj): obj must be an instance or subtype of type

Code example

import gym
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C
from stable_baselines.common.policies import ActorCriticPolicy
import tensorflow as tf

# Create and wrap the environment
env = gym.make('LunarLander-v2')
env = DummyVecEnv([lambda: env])


class CustomPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(CustomPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=0,
                                           reuse=reuse, scale=True)

        with tf.variable_scope("model", reuse=reuse):
            activ = tf.nn.relu

            extracted_features = tf.layers.flatten(self.processed_x)
            latent = activ(tf.layers.dense(extracted_features, 64, name='latent_fc'))
            pi_latent = activ(tf.layers.dense(latent, 64, name='pi_fc'))
            vf_latent = activ(tf.layers.dense(latent, 64, name='vf_fc'))

            value_fn = tf.layers.dense(vf_latent, 1, name='vf')

            self.proba_distribution, self.policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self.value_fn = value_fn
        self.initial_state = None
        self._setup_init()

    def step(self, obs, state=None, mask=None, deterministic=True):
        action, value, neglogp = self.sess.run([self.action, self._value, self.neglogp], {self.obs_ph: obs})
        return action, value, self.initial_state, neglogp

    def proba_step(self, obs, state=None, mask=None):
        return self.sess.run(self.policy_proba, {self.obs_ph: obs})

    def value(self, obs, state=None, mask=None):
        return self.sess.run(self._value, {self.obs_ph: obs})


model = A2C(CustomPolicy, env, ent_coef=0.1, verbose=1)
# Train the agent
model.learn(total_timesteps=500)

# Save the agent
model.save("a2c_lunar")
del model  # delete trained model to demonstrate loading

# Load the trained agent
model = A2C.load("a2c_lunar")


# Enjoy trained agent
obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

System Info

stable-baselines commit dafd3d8 (2 after tag 2.1.0)
installed via github clone
1070 gpu, gpustat working fine
Python 3.6
Tensorflow 1.10.0
ubuntu 18.04

predict() selects random actions

trained_model.predict() selects random actions for the same input. This is probably the expected behavior, but I need deterministic actions in some cases (I am using this library for algorithmic trading). For now I have implemented this with action_probability(), but it would be nice to have a method that returns deterministic actions for test runs.
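
For discrete action spaces, the action_probability() workaround mentioned above amounts to taking the most probable action. A minimal sketch, with made-up probabilities standing in for the model output:

```python
import numpy as np


def greedy_action(action_probabilities):
    """Pick the most probable action for each row of a (n_envs, n_actions) array."""
    probs = np.asarray(action_probabilities)
    return np.argmax(probs, axis=-1)


probs = np.array([[0.1, 0.7, 0.2]])  # hypothetical output of action_probability()
action = greedy_action(probs)
```

This does not carry over directly to continuous action spaces, where the deterministic choice would be the mean of the action distribution.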

[question] Adding replay buffer to DDPG and TD error question

Hello. I want to add prioritization to the replay buffer (similar to the one in deepq).

As far as I can see, I can extend the existing Memory class. That seems quite straightforward.

The second thing I want to do is to compute priorities based on the TD error (like here). Unfortunately, I see no explicit TD error definition in ddpg.

Can you please point out how I can get the TD error for DDPG?

Thanks
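
For reference, the usual one-step TD error for an actor-critic method like DDPG is the critic target minus the current critic estimate. The sketch below uses plain NumPy arrays and is not taken from the ddpg module; the argument names are hypothetical, and next_target_q_values stands for the target critic evaluated at the next state and the target actor's action.

```python
import numpy as np


def td_errors(rewards, dones, q_values, next_target_q_values, gamma=0.99):
    """One-step TD error: r + gamma * (1 - done) * Q_target(s', mu_target(s')) - Q(s, a)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    next_q = np.asarray(next_target_q_values, dtype=np.float64)
    targets = rewards + gamma * (1.0 - dones) * next_q
    return targets - np.asarray(q_values, dtype=np.float64)


def priorities_from_td(errors, eps=1e-6):
    """Proportional prioritization: |delta| plus a small constant so no priority is zero."""
    return np.abs(errors) + eps


errors = td_errors(rewards=[1.0, 0.0], dones=[0.0, 1.0],
                   q_values=[2.0, 0.5], next_target_q_values=[3.0, 10.0], gamma=0.9)
priorities = priorities_from_td(errors)
```

In the second transition the episode has ended, so the bootstrap term is dropped and the error is simply the reward minus the current estimate.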

“async” is a reserved word in Python 3.7 and greater

Using async as a function parameter is a syntax error in Python >= 3.7. Other modules like PyTorch and CUDA have shifted to using non_blocking instead of async.

flake8 testing of https://github.com/hill-a/stable-baselines on Python 3.7.0

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./stable_baselines/acktr/acktr_cont.py:83:68: E999 SyntaxError: invalid syntax
                               epsilon=1e-2, stats_decay=0.99, async=1, cold_iter=1,
                                                                   ^
./stable_baselines/acktr/value_functions.py:39:40: E999 SyntaxError: invalid syntax
                                   async=1, kfac_update=2, cold_iter=50,
                                       ^
./stable_baselines/acktr/acktr_disc.py:174:87: E999 SyntaxError: invalid syntax
                                                                stats_decay=0.99, async=1, cold_iter=10,
                                                                                      ^
./stable_baselines/acktr/kfac.py:15:74: E999 SyntaxError: invalid syntax
                 full_stats_init=False, cold_iter=100, cold_lr=None, async=False, async_stats=False, epsilon=1e-2,
                                                                         ^
4     E999 SyntaxError: invalid syntax
4
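
The problem can be reproduced without the repository. The sketch below checks whether a signature that uses async compiles on the running interpreter, and shows one possible rename (appending an underscore); the name async_ is only an illustration, and non_blocking or any other non-reserved name would work equally well.

```python
import keyword


def safe_parameter_name(name):
    """Append an underscore when `name` is a reserved word in the running interpreter."""
    return name + "_" if keyword.iskeyword(name) else name


def compiles(source):
    """Return True if `source` is syntactically valid for the running interpreter."""
    try:
        compile(source, "<sketch>", "exec")
        return True
    except SyntaxError:
        return False


# On Python 3.7+ the first signature is rejected; the renamed one is accepted.
old_signature_ok = compiles("def optimizer(async=1): pass")
new_signature_ok = compiles("def optimizer(%s=1): pass" % safe_parameter_name("async"))
```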

ACKTR Policies

During the refactoring, it seems that the old ACKTR policy was abandoned in favor of common.policies. I think we should double-check, because this may affect performance.

Results not replicating for Atari Environments

I tried training different models (ACER and A2C) on a couple of atari games like Space Invaders and Seaquest but the models don't seem to train on them -- they get rewards between 5-10 even after training for 1M+ steps. Has anyone had any success in training on Atari games?

For reference, I am using the code example that is similar to the atari example in the documentation

Setting up custom environments

Is there a way to set up a custom environment, so that one can use the framework for custom environments (apart from OpenAI-Gym/Mujoco) as well? Specifically, an example like https://rllab.readthedocs.io/en/latest/user/implement_env.html for the framework would be great.
Thanks!

Question about masks for predict method

Are masks (dones) only useful when using recurrent policies?

Choosing a machine for Running This Library

I wish to train and, at the same time, display the results of moderately recent training sessions. I'm new to multithreading and am wondering what kind of physical setup to get. I'd like it to be as low-power as possible, so I was thinking of using something like a Rock64Pro or a LattePanda Alpha, but I'm not sure how important it is to have an NVIDIA GPU for training speed, or to be on an Intel CPU architecture. I don't need training to be extremely fast, but I'd also like it not to take months. Will the multithreading capabilities built into this library automatically tie up all cores, so that I will not be able to simultaneously train and emulate a game under training? Thank you for the help. I'm sorry if this is not the right place for this.

From Multi Env Training to Single Env Production: PPO2 and LSTM[question]

First of all, great work! I got some good results from this library.

I trained the model on 10 parallel environments with SubprocVecEnv. Now, in production, I would like to use it with just one environment. When I try to load the model file, it states that it is not possible with the LSTM policy to load the model with a different number of environments.

Did I miss something?
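
One workaround that does not require changing the saved model is to keep the batch size at 10 and pad the single production observation with zeros. The sketch below only shows the padding step; pad_observation is a hypothetical helper, and how the padded batch and the LSTM state are passed to predict() should be checked against the documentation for your version.

```python
import numpy as np


def pad_observation(obs, n_envs):
    """Place a single observation in row 0 of a zero-filled (n_envs, *obs.shape) batch."""
    obs = np.asarray(obs)
    batch = np.zeros((n_envs,) + obs.shape, dtype=obs.dtype)
    batch[0] = obs
    return batch


single_obs = np.array([0.1, 0.2, 0.3], dtype=np.float32)  # hypothetical observation
batch = pad_observation(single_obs, n_envs=10)
```

Only the first entry of the predicted actions would then be sent to the real environment; the other nine rows are dummies.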

MPIAdam synchronization error in PPO1

Describe the bug
A simple run of PPO1 crashes. The assertion thetaroot == thetalocal fails, and it's not due to NaNs as the floats differ. This doesn't happen in baselines.

Code example
Minimal reproducible example:

import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines import PPO1

env = gym.make("CartPole-v1")
model = PPO1(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)

System Info

Installed from source in virtual environment
No GPU
Python 3.6.5
mpi4py==3.0.0
tensorflow==1.8.0
Open MPI 3.1.1
commit 4983566

Stdout + Traceback

(venv) petersen33md:runs petersen33md$ mpirun -n 2 python ppo1_test.py 
********** Iteration 0 ************

...

********** Iteration 6 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
     -0.00082 |      -0.00627 |     117.56442 |      8.56e-05 |       0.62709
     -0.00030 |      -0.00630 |     128.11664 |      7.79e-05 |       0.63015
Traceback (most recent call last):
  File "ppo1_test.py", line 160, in <module>
    model.learn(total_timesteps=10000, callback=callback)
  File "/Users/petersen33/repositories/stable-baselines/stable_baselines/ppo1/pposgd_simple.py", line 272, in learn
    self.adam.update(grad, self.optim_stepsize * cur_lrmult)
  File "/Users/petersen33/repositories/stable-baselines/stable_baselines/common/mpi_adam.py", line 48, in update
    self.check_synced()
  File "/Users/petersen33/repositories/stable-baselines/stable_baselines/common/mpi_adam.py", line 83, in check_synced
    assert (thetaroot == thetalocal).all(), (thetaroot, thetalocal)
AssertionError: (array([ 0.04382617, -0.0679653 , -0.11690815, ...,  0.00065254,
        0.        ,  0.        ], dtype=float32), array([ 0.04383327, -0.06797152, -0.11691316, ...,  0.00065254,
        0.        ,  0.        ], dtype=float32))

Bug when learning DQN agent

Describe the bug
I am trying to run the DQN agent on a custom-made environment (which follows the OpenAI template). The environment does not use images: the state has dimension (1,8) and there are 4 possible actions. I am following this example template https://colab.research.google.com/drive/1L_IMo6v0a0ALK8nefZm6PqPSy0vZIWBT in my code. The only differences between that code and mine are the environment and that I am using DQN instead of DDPG (I have adapted the configuration accordingly). Namely:

model = DQN(policy=MlpPolicy, env=env, verbose=1, tensorboard_log="/tmp/gym/logs").learn(total_timesteps=200000, callback=callback)

When I run the program I get this error:

InvalidArgumentError (see above for traceback): Tensor must be 4-D with last dim 1, 3, or 4, not [32,1,8]
	 [[{{node input_info/observation}} = ImageSummary[T=DT_FLOAT, bad_color=Tensor<type: uint8 shape: [4] values: 255 0 0...>, max_images=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](input_info/observation/tag, _arg_deepq/input/Ob_0_1)]]

The same happens even when I create a simple custom policy such as this:

class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           layers=[32, 32, 32],
                                           feature_extraction="mlp")

Why does this happen and how can I fix it? Does the whole library operate under an implicit assumption the environment/input is always going to be images?
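
One plausible reading of the error, not confirmed in this thread: the [32,1,8] tensor is a batch of 32 observations of shape (1,8), and a two-dimensional observation is being logged as if it were an image. A possible workaround is to have the environment expose a flat observation of shape (8,) instead. The helper flatten_observation below is a hypothetical sketch, not library code.

```python
import numpy as np


def flatten_observation(obs):
    """Return the observation as a flat 1-D float32 vector, e.g. (1, 8) -> (8,)."""
    return np.asarray(obs, dtype=np.float32).reshape(-1)


raw_obs = np.zeros((1, 8))  # observation shape described in the report
flat_obs = flatten_observation(raw_obs)
print(flat_obs.shape)
```

Under this assumption, the environment's observation_space would also be declared with the matching flat shape.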

Stacking ACER

We should either make n_stack=1 the default or warn users about frame stacking for ACER (also in the predict method).

Also, in the predict function, stacking is not done properly (the input is valid, but it contains zeros instead of correctly stacked frames).

acer_simple, line 477:

stacked_obs = np.array(observation).reshape((-1,) + (obs_shape[-1] * self.n_stack,))

Why? If stacking is not needed, the observation should not be modified, should it? Or is this only valid for MlpPolicies? (It breaks things with images.)

env.render() does not work in A2C code from stable_baselines documentation

[help wanted]
Describe the bug
My issue/question is regarding the example code in the documentation for A2C.
When I try to run it, it seems to work fine. Model is trained, saved and restored.
But when env.render() is called, I get an EOFError.

Also, when I try to call env.close() afterwards I get another error.
I understand it has something to do with the multiprocessing module.

But why is it not possible for me to render in the first place?
Without render() it seems to work fine and the model plays CartPole. I tested with print(obs, rewards, dones), which worked.

Code example
Stacktrace for EOFError when calling env.render():

File "<ipython-input-2-6953c62eb323>", line 1, in <module>
    runfile('/Users/erik/Documents/Thesis/03 Arbeitsordner/01 Code/ETA-Brain/erik/easy_agents/minitest.py', wdir='/Users/erik/Documents/Thesis/03 Arbeitsordner/01 Code/ETA-Brain/erik/easy_agents')

  File "/Users/erik/anaconda3/envs/tensorflow/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "/Users/erik/anaconda3/envs/tensorflow/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/Users/erik/Documents/Thesis/03 Arbeitsordner/01 Code/ETA-Brain/erik/easy_agents/minitest.py", line 29, in <module>
    env.render()

  File "/Users/erik/anaconda3/envs/tensorflow/lib/python3.6/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 93, in render
    imgs = [pipe.recv() for pipe in self.remotes]

  File "/Users/erik/anaconda3/envs/tensorflow/lib/python3.6/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 93, in <listcomp>
    imgs = [pipe.recv() for pipe in self.remotes]

  File "/Users/erik/anaconda3/envs/tensorflow/lib/python3.6/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()

  File "/Users/erik/anaconda3/envs/tensorflow/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)

  File "/Users/erik/anaconda3/envs/tensorflow/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError

EOFError

Stacktrace for env.close()

env.close()
Traceback (most recent call last):

  File "<ipython-input-18-1baceacf4cb1>", line 1, in <module>
    env.close()

  File "/Users/erik/anaconda3/envs/tensorflow/lib/python3.6/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 83, in close
    remote.send(('close', None))

  File "/Users/erik/anaconda3/envs/tensorflow/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))

  File "/Users/erik/anaconda3/envs/tensorflow/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)

  File "/Users/erik/anaconda3/envs/tensorflow/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)

BrokenPipeError: [Errno 32] Broken pipe

System Info
Describe the characteristic of your environment:

library was installed using pip
working on Mac, no GPU
Python version = 3.6.5, installed using anaconda
Tensorflow version = 1.7.0, installed in an Anaconda virtual env

Test performance is different when reload the trained model

Hi hill-a,
Thank you for your elegant work. Following your examples (such as PPO2 and ACER), when I train a model, save it, and then load it for testing in the same script, it works well. But if I reload the saved .pkl file without training first, it doesn't perform as well as before. Did I reload it correctly? Thanks again.
This is what I did to test the learned model:
import gym

from stable_baselines.common.policies import (MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy,
                                              CnnPolicy, CnnLstmPolicy, CnnLnLstmPolicy)
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import ACER

n_cpu = 4
env = SubprocVecEnv([lambda: gym.make('CartPole-v1') for i in range(n_cpu)])
model = ACER.load("acer_cartpole")
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

Cannot run default implementation of GAIL

Hi all,

Thanks so much for putting this together, I'm very excited about it!

I am trying to run the default implementation of GAIL, but it is throwing an error about the input policy being the wrong type as shown below. Note that I already downloaded the default expert data for the deterministic Hopper expert, and that it is logging out the return stats on the data.

AssertionError: Error: the input policy for the TRPO model must be an instance of common.policies.ActorCriticPolicy.

How to reproduce:
In command line: `python -m stable_baselines.gail.run_mujoco`

Please let me know if I'm missing something, thanks again!

AssertionError on load() when using MultiDiscrete

Describe the bug

stable-baselines/stable_baselines/common/base_class.py

Lines 89 to 90 in 70898bd

assert self.action_space == env.action_space, \
    "Error: the environment passed must have at least the same action space as the model was trained on."

When using gym.spaces.MultiDiscrete as an action space, an AssertionError occurs at the above line because MultiDiscrete does not define an __eq__ method. I think this is a problem in gym :(

Code example

from gym import spaces

# pass
assert spaces.Discrete(4) == spaces.Discrete(4)
# assertion error
assert spaces.MultiDiscrete([1, 2, 3]) == spaces.MultiDiscrete([1, 2, 3])
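
Until the space class defines equality, a comparison based on contents rather than identity could stand in for it. The sketch below uses a stand-in class (FakeMultiDiscrete, hypothetical) instead of gym so that it runs on its own; it assumes the space stores its sizes in an nvec attribute, as gym's MultiDiscrete does. The helper same_multidiscrete is likewise illustrative.

```python
class FakeMultiDiscrete:
    """Stand-in for a space class that does not define __eq__ (hypothetical)."""

    def __init__(self, nvec):
        self.nvec = list(nvec)


def same_multidiscrete(space_a, space_b):
    """Compare two spaces by type and by the contents of their nvec attribute."""
    return type(space_a) is type(space_b) and list(space_a.nvec) == list(space_b.nvec)


a = FakeMultiDiscrete([1, 2, 3])
b = FakeMultiDiscrete([1, 2, 3])

print(a == b)                    # default object equality compares identity
print(same_multidiscrete(a, b))  # content-based comparison
```

The first comparison falls back to identity because no __eq__ is defined, which is the behaviour the assertion in load() runs into.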

Documentation on what stable-baselines is/hopes to be

I love some of the edits and cleaning up of the openai/baselines code that you guys have done.

Do you hope to get these edits merged back into the main baselines code, or do you think this project will continue to persist as a separate fork?

If you are planning on remaining separate, you might want to update the README e.g. so that it doesn't include the instructions for installing from https://github.com/openai/baselines.git

[suggestion] Remove `gym[mujoco,atari,classic_control,robotics]` from setup.py

Keeping gym[mujoco,atari,classic_control,robotics] in the setup.py script causes it to install a lot of unnecessary things, some of which may not be available. For example, in Colab I was unable to install mujoco. Letting the user explicitly decide what to include might be a good way to keep the install lightweight.

Thanks!

Easily port models for Unity ml-agents [feature request]

I am trying to create reinforcement learning agents for the Unity ml-agents simulation environment. While Unity ml-agents has some built in RL model trainers, I've found that stable-baselines runs faster.

I would like to be able to pass the trained policy from stable-baselines back into Unity. To do this, Unity ml-agents requires a .bytes file. This file is a byte code protocol buffer that has both the frozen Tensorflow graph along with the latest session checkpoint containing the TensorFlow Variable values.

The Unity ml-agents documentation for creating this .bytes file is under the 'Using your own trained graphs' section. In the documentation they refer to this type of .bytes file as an 'Internal Brain.'

By the way, I love stable-baselines, it's great!

[question] Cartpole PPO1 example and alternate policies

From the provided example it appears as if you should be able to swap in different policy implementations for MlpPolicy and have the example code run. This does not appear to be the case, so I suspect I'm misunderstanding something. To use something other than MlpPolicy what should a user know? I haven't read all the docs thoroughly so I apologize if this is clearly spelled out somewhere!

System Info

Installed with pip
CPU only
Python 3.6.4
Tensorflow 1.8

Additional context
An example traceback of trying to use one of the other policies

/Users/iandanforth/.pyenv/versions/3.6.4/lib/python3.6/site-packages/stable_baselines/common/input.py:30: RuntimeWarning: overflow encountered in subtract
  np.any((ob_space.high - ob_space.low) != 0)):
/Users/iandanforth/.pyenv/versions/3.6.4/lib/python3.6/site-packages/stable_baselines/common/input.py:33: RuntimeWarning: overflow encountered in subtract
  processed_x = ((processed_x - ob_space.low) / (ob_space.high - ob_space.low))
Traceback (most recent call last):
  File "agents/ppo.py", line 9, in <module>
    model = PPO1(CnnLnLstmPolicy, env, verbose=1)
  File "/Users/iandanforth/.pyenv/versions/3.6.4/lib/python3.6/site-packages/stable_baselines/ppo1/pposgd_simple.py", line 77, in __init__
    self.setup_model()
  File "/Users/iandanforth/.pyenv/versions/3.6.4/lib/python3.6/site-packages/stable_baselines/ppo1/pposgd_simple.py", line 88, in setup_model
    None, reuse=False)
  File "/Users/iandanforth/.pyenv/versions/3.6.4/lib/python3.6/site-packages/stable_baselines/common/policies.py", line 349, in __init__
    layer_norm=True, feature_extraction="cnn", **_kwargs)
  File "/Users/iandanforth/.pyenv/versions/3.6.4/lib/python3.6/site-packages/stable_baselines/common/policies.py", line 192, in __init__
    extracted_features = cnn_extractor(self.processed_x, **kwargs)
  File "/Users/iandanforth/.pyenv/versions/3.6.4/lib/python3.6/site-packages/stable_baselines/common/policies.py", line 21, in nature_cnn
    layer_1 = activ(conv(scaled_images, 'c1', n_filters=32, filter_size=8, stride=4, init_scale=np.sqrt(2), **kwargs))
  File "/Users/iandanforth/.pyenv/versions/3.6.4/lib/python3.6/site-packages/stable_baselines/a2c/utils.py", line 122, in conv
    n_input = input_tensor.get_shape()[channel_ax].value
  File "/Users/iandanforth/.pyenv/versions/3.6.4/lib/python3.6/site-packages/tensorflow/python/framework/tensor_shape.py", line 612, in __getitem__
    return self._dims[key]
IndexError: list index out of range

ACKTR models crash when using MlpLnLstmPolicy

Describe the bug
ACKTR example code crashes when modified to use MlpLnLstmPolicy

Code example

import gym
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import ACKTR
n_cpu = 4
env = SubprocVecEnv([lambda: gym.make('CartPole-v1') for i in range(n_cpu)])
model = ACKTR(MlpLnLstmPolicy, env, verbose=1)
model.learn(total_timesteps=10)

results in:

  File "/home/*****/lib/python3.6/site-packages/stable_baselines/acktr/acktr_disc.py", line 94, in __init__
    self.setup_model()
  File "/home/*****/lib/python3.6/site-packages/stable_baselines/acktr/acktr_disc.py", line 178, in setup_model
    optim.compute_and_apply_stats(self.joint_fisher, var_list=params)
  File "/home/*****/lib/python3.6/site-packages/stable_baselines/acktr/kfac.py", line 333, in compute_and_apply_stats
    stats = self.compute_stats(loss_sampled, var_list=varlist)
  File "/home/*****/lib/python3.6/site-packages/stable_baselines/acktr/kfac.py", line 355, in compute_stats
    factors = self.get_factors(gradient_sampled, varlist)
  File "/home/*****/lib/python3.6/site-packages/stable_baselines/acktr/kfac.py", line 168, in get_factors
    found_factors = _search_factors(_grad, default_graph)
  File "/home/*****/lib/python3.6/site-packages/stable_baselines/acktr/kfac.py", line 122, in _search_factors
    print(len(np.unique(op_names)))
  File "/home/*****/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 233, in unique
    ret = _unique1d(ar, return_index, return_inverse, return_counts)
  File "/home/*****/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 281, in _unique1d
    ar.sort()
TypeError: '<' not supported between instances of 'NoneType' and 'NoneType'

System Info

Installed with pip install stable_baselines
GPU: nvidia P6000
Python 3.6.6
tensorflow==1.5

(edited to remove sensitive info and tawdry Arrested Development allusion)

Compatibility with MultiDiscrete spaces as observation_space

Using an observation_space of type MultiDiscrete throws an error with PPO (also tested with ACKTR). The error is not present if a Discrete observation_space is used.
For a very basic test environment, the broken env uses:
self.observation_space = spaces.MultiDiscrete([[0, 12], [0, 250]])
With returned observation:

def _get_obs(self):
    return np.array([self.variable1, self.variable2])

Whereas a Discrete environment, used for testing purposes, works:
self.observation_space = spaces.Discrete(3000)
With returned observation:

def _get_obs(self):
    return np.array([self.variable1])

When using a MultiDiscrete observation_space, the thrown error is:

Traceback (most recent call last):

  File "", line 1, in <module>
    runfile('C:/Users/MainUser/Desktop/untitled0.py', wdir='C:/Users/X/Desktop')

  File "D:\Programs\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "D:\Programs\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/MainUser/Desktop/untitled0.py", line 44, in <module>
    env = DummyVecEnv([lambda: env]) # The algorithms require a vectorized environment to run

  File "D:\Programs\Anaconda3\lib\site-packages\stable_baselines\common\vec_env\dummy_vec_env.py", line 35, in __init__
    self.buf_obs = {k: np.zeros((self.num_envs,) + tuple(shapes[k]), dtype=dtypes[k]) for k in self.keys}

  File "D:\Programs\Anaconda3\lib\site-packages\stable_baselines\common\vec_env\dummy_vec_env.py", line 35, in <dictcomp>
    self.buf_obs = {k: np.zeros((self.num_envs,) + tuple(shapes[k]), dtype=dtypes[k]) for k in self.keys}

TypeError: 'tuple' object cannot be interpreted as an integer

The code that triggers the problem was taken directly from the examples page, to make sure the problem wasn't in my code (the evaluate function, i.e. the random agent, works fine in both cases; the problem is in using the models directly):

def evaluate(model, num_steps=1000):
  """
  Evaluate a RL agent
  :param model: (BaseRLModel object) the RL Agent
  :param num_steps: (int) number of timesteps to evaluate it
  :return: (float) Mean reward for the last 100 episodes
  """
  episode_rewards = [0.0]
  obs = env.reset()
  for i in range(num_steps):
      # _states are only useful when using LSTM policies
      action, _states = model.predict(obs)
      # here, action, rewards and dones are arrays
      # because we are using vectorized env
      obs, rewards, dones, info = env.step(action)
      
      # Stats
      episode_rewards[-1] += rewards[0]
      if dones[0]:
          obs = env.reset()
          episode_rewards.append(0.0)
  # Compute mean reward for the last 100 episodes
  mean_100ep_reward = round(np.mean(episode_rewards[-100:]), 1)
  print("Mean reward:", mean_100ep_reward, "Num episodes:", len(episode_rewards))
  
  return mean_100ep_reward

env = gym.make('Test-v0')
# vectorized environments allow to easily multiprocess training
# we demonstrate its usefulness in the next examples
env = DummyVecEnv([lambda: env])  # The algorithms require a vectorized environment to run

model = ACKTR(MlpPolicy, env, verbose=0)

# Random Agent, before training
mean_reward_before_train = evaluate(model, num_steps=10000)

# Train the agent for 10000 steps
model.learn(total_timesteps=10000)

# Evaluate the trained agent
mean_reward = evaluate(model, num_steps=10000)

Multithreading broken pipeline on custom Env

First of all, thank you for this wonderful project. I can't stress enough how badly baselines was in need of such a project.

Now, the Multiprocessing Tutorial created by stable-baselines states that the following is to be used to generate multiple envs (as an example, of course):

def make_env(env_id, rank, seed=0):
    """
    Utility function for multiprocessed env.
    
    :param env_id: (str) the environment ID
    :param rank: (int) index of the subprocess
    :param seed: (int) the initial seed for RNG
    """
    def _init():
        env = gym.make(env_id)
        env.seed(seed + rank)
        return env
    set_global_seeds(seed)
    return _init

However, Python never calls _init: even though it takes no arguments, it is still a function and has to be called. Please replace the last line with 'return _init()'.

Secondly, even doing so results in an error when building the SubprocVecEnv([make_env(env_id, i) for i in range(numenvs)]), namely:

Traceback (most recent call last):

  File "", line 1, in <module>
    runfile('C:/Users/X/Desktop/thesis.py', wdir='C:/Users/X/Desktop')

  File "D:\Programs\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "D:\Programs\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/X/Desktop/thesis.py", line 133, in <module>
    env = SubprocVecEnv([make_env(env_id, i) for i in range(numenvs)])

  File "D:\Programs\Anaconda3\lib\site-packages\stable_baselines\common\vec_env\subproc_vec_env.py", line 52, in __init__
    process.start()

  File "D:\Programs\Anaconda3\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)

  File "D:\Programs\Anaconda3\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)

  File "D:\Programs\Anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)

  File "D:\Programs\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)

  File "D:\Programs\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)

BrokenPipeError: [Errno 32] Broken pipe

Any ideas on how to fix this? I have implemented a simple Gym env; does it need to extend/implement SubprocVecEnv?
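
For context, a sketch of why the tutorial's factory returns _init without calling it, assuming the vectorized environment receives a list of zero-argument functions and calls each one itself (for SubprocVecEnv, inside the worker process). FakeEnv and build_envs are hypothetical stand-ins so the example runs without gym.

```python
class FakeEnv:
    """Hypothetical stand-in for a gym environment."""

    def __init__(self, env_id):
        self.env_id = env_id
        self.seed_value = None

    def seed(self, seed):
        self.seed_value = seed


def make_env(env_id, rank, seed=0):
    """Return a function that builds one seeded environment when called."""
    def _init():
        env = FakeEnv(env_id)
        env.seed(seed + rank)
        return env
    return _init  # returned uncalled: construction is deferred to the consumer


def build_envs(env_fns):
    """Hypothetical consumer: calls each factory, as a vectorized env would."""
    return [fn() for fn in env_fns]


env_fns = [make_env("Test-v0", i) for i in range(3)]
envs = build_envs(env_fns)
print([env.seed_value for env in envs])
```

On Windows, where subprocesses are spawned rather than forked, multiprocessing code generally also has to be launched from under an if __name__ == '__main__': guard; whether that is the cause of the BrokenPipeError above is not confirmed here.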

[enhancement] Consistent frequency and bookkeeping variables among callbacks

The callbacks feature is one of the major reasons to use stable-baselines. However, as is, it is difficult to create an algorithm-agnostic callback function (one that works similarly across all algorithms). A simple use case would be a callback that performs custom evaluation rollouts every 1000 completed training episodes. I'm working on a generic Callback class that I can use for all algorithms, similarly to how stable-baselines makes it easy to create a launcher that works the same for all algorithms, given algorithm name/kwargs. (I can provide an example of this, if helpful.)

There are currently two issues that make a model-agnostic callback function difficult: 1) callback frequency differs among algorithms, and 2) inconsistent bookkeeping variables among algorithms.

1. Callbacks have different frequency/timing. For example, DDPG calls back every step of every rollout, whereas PPO1 calls back only after every rollout. More subtly, DDPG calls after taking the actual step, whereas PPO1 calls before taking the actual rollout. In the use case described above, this can be handled by finding how many episodes have been completed thus far and only evaluating each time this passes a 1000-episode mark. Finding this information isn't obvious (see issue 2 below), but if using Monitor wrappers it can be obtained from len(env.episode_rewards).
2. It's difficult to access common bookkeeping information (e.g. the policy, env, or number of completed episodes) in a way that works for all algorithms. The callback function, say callback(_locals, _globals), can be used to access most information, but it's not consistent among algorithms. For example, most algorithms track the total number of completed episodes, but under different names, e.g. _locals['episodes'] for DDPG or _locals['episodes_so_far'] for PPO1. More important is accessing and stepping the policy itself. For DDPG the policy object is stepped with _locals['self'].policy_tf.step, whereas for PPO1 the policy is stepped with _locals['self'].policy_pi.step (or _locals['self'].step, which is set to the previous variable).

Most of these issues have straightforward fixes. For example, accessing the policy can be made consistent by either defining a step function for the algorithm (e.g. self.step = self.policy_pi.step in PPO1) or making the policy object variable names consistent. Things like counting the number of episodes could be made part of BaseRLModel in case the user isn't using the Monitor wrapper.
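
A sketch of the kind of adapter described above, assuming only the variable names quoted in this issue (_locals['episodes'] for DDPG, _locals['episodes_so_far'] for PPO1). EPISODE_KEYS, completed_episodes, and EveryNEpisodes are hypothetical helpers, not library API, and the calls at the bottom are simulated rather than produced by real training.

```python
EPISODE_KEYS = ("episodes", "episodes_so_far")  # names quoted in the issue


def completed_episodes(_locals):
    """Return the episode counter under whichever known name is present."""
    for key in EPISODE_KEYS:
        if key in _locals:
            return _locals[key]
    raise KeyError("no known episode counter in callback locals")


class EveryNEpisodes:
    """Callback that fires `action` once each time another n episodes complete."""

    def __init__(self, n, action):
        self.n = n
        self.action = action
        self.last_mark = 0

    def __call__(self, _locals, _globals):
        mark = completed_episodes(_locals) // self.n
        if mark > self.last_mark:
            self.last_mark = mark
            self.action(_locals)
        return True  # assumption: a truthy return lets training continue


fired = []
callback = EveryNEpisodes(1000, lambda _locals: fired.append(completed_episodes(_locals)))

# Simulated calls: per-step style (DDPG-like key), then per-rollout style (PPO1-like key)
for count in (0, 500, 999, 1000, 1000, 1500):
    callback({"episodes": count}, {})
callback({"episodes_so_far": 2300}, {})
print(fired)
```

Because the trigger depends on crossing a multiple of n rather than on the call count, the same object behaves the same way whether it is called every step or once per rollout.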

Atari Performance for PPO2

Describe the bug
PPO2 has bad performance on Atari games. For example, it does not converge on Pong.

Code example
Using https://github.com/araffin/rl-baselines-zoo:

python train.py --algo ppo2 --env PongNoFrameskip-v4

System Info
Describe the characteristic of your environment:

Stable-Baselines v2.1.2
GPU
Python 3.6
Tensorflow 1.8

Training the same model after loading.

Hello again,

I was looking into continuing training after loading a model, simply using:
model.load("path-to-model")
model.learn(total_timesteps=500000)
model.save("path-to-model")

However the training seems to reset as seen here:

I don't think learn is inherently resetting the parameters. Do you know why this might be the case?
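
A possible explanation, stated as an assumption: if load is a class-level constructor that returns a new model (the usage model = ACER.load("acer_cartpole") elsewhere on this page points that way), then calling model.load(...) and discarding the result leaves model untouched. The toy class below (ToyModel, hypothetical) illustrates the difference without the library.

```python
class ToyModel:
    """Hypothetical model whose load() returns a new instance."""

    saved = {}  # stands in for files on disk

    def __init__(self, weights=0):
        self.weights = weights

    def save(self, path):
        ToyModel.saved[path] = self.weights

    @classmethod
    def load(cls, path):
        return cls(weights=cls.saved[path])


trained = ToyModel(weights=42)
trained.save("path-to-model")

fresh = ToyModel()
fresh.load("path-to-model")  # return value discarded: fresh is unchanged
print(fresh.weights)

restored = ToyModel.load("path-to-model")  # keep the returned model instead
print(restored.weights)
```

Under that assumption, the snippet in the question would need to assign the result of load to a variable before calling learn on it.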

Thank you!

HTML Documentation

It would be nice to have proper documentation, like the PyTorch or PiCamera documentation.

[Question] DQN vs Open AI Baseline's Rainbow agent

The Rainbow agent achieved by far the best baseline result on Sonic for the OpenAI team, if you exclude the extremely resource-intensive parallel PPO training:

https://arxiv.org/pdf/1804.03720.pdf

Is the DQN agent provided by stable-baselines the rainbow model?

Mean Episode Reward and Length showing NaN in PPO2 training.

Hello,

First of all, I would like to thank you for this incredibly polished package. Seeing good documentation just makes me feel happy.

While training PPO2 on Pong-ram-v0 (and several other envs) I realized that the episode reward doesn't print correctly to the console. It is seen as follows:

I tried looking into it but couldn't figure out why just yet. If you give me some pointers I would like to work on fixing it.

I am using python 3.6.5 on an Ubuntu 16.04.

Training always restarts with very short episode lengths

I noticed that after training, reloading the model, and then training again, the first few printouts of eplenmean are very low (in an env where long episodes correspond to good performance).

For a long time, this made me suspect that training was being lost or destroyed, and I wondered if stable-baselines had a bug...

