trpo's Issues

Is it TRPO?

I reviewed the source code and did not find a surrogate function or importance sampling. The source code looks like PPO. Am I right or not?
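
For reference, the importance-sampled surrogate objective that TRPO maximizes, subject to a KL trust-region constraint, is (in LaTeX notation):

\[
\max_\theta \; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, A_t\right]
\quad \text{s.t.} \quad
\mathbb{E}_t\!\left[ D_{\mathrm{KL}}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\right) \right] \le \delta
\]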

Able to run FetchPickAndPlace-v1?

I'm trying to get some of the OpenAI Gym Robotics environments working. I've added the appropriate imports so it pulls in FetchPickAndPlace-v1, but the script fails on this line:

obs_dim = env.observation_space.shape[0]

I've had a look and shape is None -- although observation_space looks like this:

Dict(achieved_goal:Box(3,), desired_goal:Box(3,), observation:Box(25,))

Any ideas how I could modify the code to get past this?

Thanks!
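
Not an official fix, but one workaround (a sketch, assuming a goal-based Dict observation like the one above) is to compute obs_dim from the Dict's sub-spaces and flatten each observation before feeding it to the networks:

import gym
import numpy as np

env = gym.make('FetchPickAndPlace-v1')

# The observation space is a Dict, so .shape is None; sum the sizes of the sub-spaces instead.
spaces = env.observation_space.spaces
obs_dim = sum(int(np.prod(space.shape)) for space in spaces.values())

def flatten_obs(obs_dict):
    # Concatenate the Dict observation into a single flat array (fixed key order).
    return np.concatenate([np.ravel(obs_dict[key]) for key in sorted(obs_dict)])

Every place that consumes a raw observation (run_episode, the scaler, etc.) would then need to go through flatten_obs first.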

Mistake in KL divergence formula

Hi,

There is a small mistake in the policy.py file where you calculate the KL divergence between two multivariate normal distributions:

self.kl = 0.5 * tf.reduce_mean(log_det_cov_new - log_det_cov_old + tr_old_new + tf.reduce_sum(tf.square(self.means - self.old_means_ph) / tf.exp(self.log_vars), axis=1) - self.act_dim)

The ratio of the covariances, i.e. tr_old_new, should be squared in the KL divergence, i.e. tr_old_new just needs to be replaced with tr_old_new**2.
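
For reference, the closed-form KL divergence between two multivariate Gaussians p_old = N(mu_old, Sigma_old) and p_new = N(mu_new, Sigma_new) in d dimensions is (in LaTeX notation):

\[
D_{\mathrm{KL}}(p_{\text{old}} \,\|\, p_{\text{new}}) = \frac{1}{2}\left[\log\frac{|\Sigma_{\text{new}}|}{|\Sigma_{\text{old}}|} + \mathrm{tr}\!\left(\Sigma_{\text{new}}^{-1}\Sigma_{\text{old}}\right) + (\mu_{\text{new}} - \mu_{\text{old}})^{\top}\Sigma_{\text{new}}^{-1}(\mu_{\text{new}} - \mu_{\text{old}}) - d\right]
\]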

Change of the loss function

Hello! Congratulations on the excellent implementation.
I noticed some differences between your policy NN loss function and the one in the original paper. What criteria did you follow to make these changes?

Using train_on_batch

Ok,

I'm sorry again for so many questions, but why does train_on_batch only have one input? In the Keras documentation it sounds like the format is supposed to be

train_on_batch(object, x, y, class_weight = NULL, sample_weight = NULL).

So why does this file's implementation look like

train_on_batch([arrays])?
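
The signature quoted above is from the R keras package; the Python method is model.train_on_batch(x, y=None, ...). A minimal sketch of why a single list argument can work (assuming TF 2.x): when the loss is attached inside the model with add_loss, train_on_batch only needs x, which may be a list of arrays for a multi-input model, and no targets y. The layer and array names below are illustrative, not the repo's:

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Two-input model whose loss is defined internally via add_loss,
# so no separate y targets are passed to train_on_batch.
obs_in = Input(shape=(4,))
adv_in = Input(shape=(1,))
pred = Dense(1)(obs_in)
model = Model(inputs=[obs_in, adv_in], outputs=pred)
model.add_loss(tf.reduce_mean(adv_in * tf.square(pred)))  # toy loss using both inputs
model.compile(optimizer='adam')

obs = np.random.randn(32, 4).astype(np.float32)
adv = np.random.randn(32, 1).astype(np.float32)
loss = model.train_on_batch([obs, adv])  # x is a list of arrays, no y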

Can the `Data cardinality is ambiguous` error in TensorFlow 2.4 or 2.5 be solved as follows?

Hi, thanks very much for your work. I use Docker to build an environment to study your work. When I use FROM tensorflow/tensorflow:2.3.3-gpu-jupyter to create a container and test the examples

python train.py InvertedPendulumBulletEnv-v0
python train.py InvertedDoublePendulumBulletEnv-v0 -n 5000
python train.py HalfCheetahBulletEnv-v0 -n 5000 -b 5

all the tests pass. But when I use newer images, for instance FROM tensorflow/tensorflow:2.4.2-gpu-jupyter, I get the ValueError: Data cardinality is ambiguous error shown below.

$ python train.py InvertedPendulumBulletEnv-v0
['/home/wezardlza/workspace/trpo', '/usr/lib/python36.zip', '/usr/lib/python3.6', '/usr/lib/python3.6/lib-dynload', '/home/wezardlza/.local/lib/python3.6/site-packages', '/usr/local/lib/python3.6/dist-packages', '/usr/lib/python3/dist-packages', '/home/wezardlza/workspace']
pybullet build time: Jun 22 2021 23:31:53
2021-06-22 23:42:07.575098: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
/home/wezardlza/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Value Params -- h1: 60, h2: 17, h3: 5, lr: 0.00243
2021-06-22 23:42:08.572333: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-06-22 23:42:08.572860: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-06-22 23:42:08.603929: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-22 23:42:08.604218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.815GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2021-06-22 23:42:08.604237: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-06-22 23:42:08.605660: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-06-22 23:42:08.605710: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-06-22 23:42:08.606300: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-06-22 23:42:08.606448: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-06-22 23:42:08.608003: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-06-22 23:42:08.608401: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-06-22 23:42:08.608521: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-06-22 23:42:08.608599: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-22 23:42:08.608890: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-22 23:42:08.609111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-06-22 23:42:08.609301: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-22 23:42:08.609478: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-06-22 23:42:08.609563: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-22 23:42:08.609800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.815GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2021-06-22 23:42:08.609823: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-06-22 23:42:08.609840: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-06-22 23:42:08.609850: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-06-22 23:42:08.609860: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-06-22 23:42:08.609870: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-06-22 23:42:08.609880: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-06-22 23:42:08.609891: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-06-22 23:42:08.609901: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-06-22 23:42:08.609946: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-22 23:42:08.610192: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-22 23:42:08.610404: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-06-22 23:42:08.610424: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-06-22 23:42:08.934643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-22 23:42:08.934667: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2021-06-22 23:42:08.934672: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2021-06-22 23:42:08.934802: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-22 23:42:08.935063: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-22 23:42:08.935289: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-22 23:42:08.935494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6638 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
Policy Params -- h1: 60, h2: 24, h3: 10, lr: 0.000184, logvar_speed: 2
argv[0]=
argv[0]=
2021-06-22 23:42:09.103904: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-06-22 23:42:09.375022: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-06-22 23:42:10.302274: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-06-22 23:42:10.322790: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3600000000 Hz
Traceback (most recent call last):
  File "train.py", line 351, in <module>
    main(**vars(args))
  File "train.py", line 317, in main
    policy.update(observes, actions, advantages, logger)  # update policy
  File "/home/wezardlza/workspace/trpo/policy.py", line 61, in update
    old_means, old_logvars, old_logp])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 1725, in train_on_batch
    class_weight)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/data_adapter.py", line 1513, in single_batch_iterator
    _check_data_cardinality(data)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/data_adapter.py", line 1529, in _check_data_cardinality
    raise ValueError(msg)
ValueError: Data cardinality is ambiguous:
  x sizes: 369, 369, 369, 369, 1, 369
Make sure all arrays contain the same number of samples.

After some checks, I found that in the file ./trpo/policy.py the code below causes the mismatched batch size:

class PolicyNN(Layer):
    """ Neural net for policy approximation function.

    Policy parameterized by Gaussian means and variances. NN outputs mean
     action based on observation. Trainable variables hold log-variances
     for each action dimension (i.e. variances not determined by NN).
    """
    def build(self, input_shape):
        self.batch_sz = input_shape[0]
        
    def call(self, inputs, **kwargs):
        y = self.dense1(inputs)
        y = self.dense2(y)
        y = self.dense3(y)
        means = self.dense4(y)
        logvars = K.sum(self.logvars, axis=0, keepdims=True) + self.init_logvar
        logvars = K.tile(logvars, (self.batch_sz, 1))
        return [means, logvars]

which constantly sets the first dimension of logvars to one at runtime, while the first dimension of inputs varies. Thus, based on the code above, the first dimension of means also differs from that of logvars, which causes the error

  File "/home/wezardlza/workspace/trpo/policy.py", line 61, in update
    old_means, old_logvars, old_logp])

Thus, I did the following: in the file ./trpo/policy.py, add

from tensorflow import shape

and change logvars = K.tile(logvars, (self.batch_sz, 1)) to logvars = K.tile(logvars, (shape(inputs)[0], 1)). This helped me pass the example

python train.py InvertedPendulumBulletEnv-v0

but it seems self.batch_sz is no longer used. Perhaps we can just change logvars = K.tile(logvars, (self.batch_sz, 1)) to logvars = K.tile(logvars, (shape(inputs)[0], 1)) and remove the build() method above? I am new to TensorFlow and would like to know whether my changes will cause any problems, or even errors, in the TRPO results. Thanks for the help!
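
For what it's worth, a sketch of the modified call() with the dynamic batch size, assuming everything else in the layer stays the same:

from tensorflow import shape  # runtime (dynamic) tensor shape

def call(self, inputs, **kwargs):
    y = self.dense1(inputs)
    y = self.dense2(y)
    y = self.dense3(y)
    means = self.dense4(y)
    logvars = K.sum(self.logvars, axis=0, keepdims=True) + self.init_logvar
    # Tile to the runtime batch size so means and logvars always agree.
    logvars = K.tile(logvars, (shape(inputs)[0], 1))
    return [means, logvars]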

Some questions about the code

Hi pat,

Good job! This code has excellent performance in continuous action-space environments, but I have some questions about the code.

  1. I can't understand the meaning of logvar_speed. Could you explain why it makes updates faster, or point me to a paper or blog post that explains it?

  2. Why does the value function update use 2 batches, i.e. why is every batch used for 2 updates?

  3. Why does checking the variance before and after the value function update diagnose over-fitting?

I am looking forward to your reply. :)

Occasional NaNs

Hi! Thanks for the great job you did with the implementation.
I was playing a bit with your code and, in some runs, the "kl" that KLEntropy's call outputs is nan. I am not able to reproduce this error; it only happens sometimes.

Have you experienced this? Can you guess any cause for it?

CalledProcessError: Command '['avconv', '-version']' returned non-zero exit status 1

Hi all,

I am excited about the repo but hit this error after running train.py. Any ideas or thoughts would be appreciated.
I'm running on kernel 3.16.0-4-amd64 with Python 3.5.2.
Thanks,

[2017-09-29 14:28:00,861] Starting new video recorder writing to /tmp/Walker2d-v1/Sep-29_21:27:59/openaigym.video.1.29235.video000000.mp4
Traceback (most recent call last):
  File "", line 1, in
    runfile('/netapp/cnl/home/kqian/anaconda3/envs/mujoco_new/trpo/src/train.py', wdir='/netapp/cnl/home/kqian/anaconda3/envs/mujoco_new/trpo/src')
  File "/home/kqian/anaconda3/envs/mujoco_new/lib/python3.5/site-packages/spyder/utils/site/sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)
  File "/home/kqian/anaconda3/envs/mujoco_new/lib/python3.5/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "/netapp/cnl/home/kqian/anaconda3/envs/mujoco_new/trpo/src/train.py", line 330, in
    main(**vars(args))
  File "/netapp/cnl/home/kqian/anaconda3/envs/mujoco_new/trpo/src/train.py", line 290, in main
    run_policy(env, policy, scaler, logger, episodes=5)
  File "/netapp/cnl/home/kqian/anaconda3/envs/mujoco_new/trpo/src/train.py", line 137, in run_policy
    observes, actions, rewards, unscaled_obs = run_episode(env, policy, scaler)
  File "/netapp/cnl/home/kqian/anaconda3/envs/mujoco_new/trpo/src/train.py", line 90, in run_episode
    obs = env.reset()
  File "/home/kqian/anaconda3/envs/mujoco_new/lib/python3.5/site-packages/gym/core.py", line 104, in reset
    return self._reset()
  File "/home/kqian/anaconda3/envs/mujoco_new/lib/python3.5/site-packages/gym/wrappers/monitoring.py", line 41, in _reset
    self._after_reset(observation)
  File "/home/kqian/anaconda3/envs/mujoco_new/lib/python3.5/site-packages/gym/wrappers/monitoring.py", line 198, in _after_reset
    self._reset_video_recorder()
  File "/home/kqian/anaconda3/envs/mujoco_new/lib/python3.5/site-packages/gym/wrappers/monitoring.py", line 219, in _reset_video_recorder
    self.video_recorder.capture_frame()
  File "/home/kqian/anaconda3/envs/mujoco_new/lib/python3.5/site-packages/gym/monitoring/video_recorder.py", line 121, in capture_frame
    self._encode_image_frame(frame)
  File "/home/kqian/anaconda3/envs/mujoco_new/lib/python3.5/site-packages/gym/monitoring/video_recorder.py", line 168, in _encode_image_frame
    self.metadata['encoder_version'] = self.encoder.version_info
  File "/home/kqian/anaconda3/envs/mujoco_new/lib/python3.5/site-packages/gym/monitoring/video_recorder.py", line 268, in version_info
    stderr=subprocess.STDOUT)),
  File "/home/kqian/anaconda3/envs/mujoco_new/lib/python3.5/subprocess.py", line 626, in check_output
    **kwargs).stdout
  File "/home/kqian/anaconda3/envs/mujoco_new/lib/python3.5/subprocess.py", line 708, in run
    output=stdout, stderr=stderr)
CalledProcessError: Command '['avconv', '-version']' returned non-zero exit status 1

Roboschool issue (dimensionality of `action` in train.py:105)

Hi! I love your repo (and your blog, and your suggestions for a ML intro curriculum of MOOCs) -- thank you!

Submitting this as an issue rather than a PR because I'm not sure if I fixed the issue in the best way.

I am having an issue trying to run train.py on a roboschool environment. I added "import roboschool" to the top of train.py (which registers the Roboschool environments) and had the following result:

$ python3 train.py RoboschoolReacher-v1 -n 60000 -b 50
[2017-07-30 16:39:35,372] Making new env: RoboschoolReacher-v1
Value Params -- h1: 100, h2: 22, h3: 5, lr: 0.00213
[bunch of TF initialization...]
Traceback (most recent call last):
  File "train.py", line 325, in <module>
    main(**vars(args))
  File "train.py", line 285, in main
    run_policy(env, policy, scaler, logger, episodes=5)
  File "train.py", line 135, in run_policy
    observes, actions, rewards, unscaled_obs = run_episode(env, policy, scaler)
  File "train.py", line 105, in run_episode
    obs, reward, done, _ = env.step(action)
  File "/mnt/brian/gym/gym/core.py", line 99, in step
    return self._step(action)
  File "/mnt/brian/gym/gym/wrappers/time_limit.py", line 36, in _step
    observation, reward, done, info = self.env.step(action)
  File "/mnt/brian/gym/gym/core.py", line 99, in step
    return self._step(action)
  File "/home/pender/roboschool/roboschool/gym_reacher.py", line 53, in _step
    self.apply_action(a)
  File "/home/pender/roboschool/roboschool/gym_reacher.py", line 27, in apply_action
    self.central_joint.set_motor_torque( 0.05*float(np.clip(a[0], -1, +1)) )
TypeError: only length-1 arrays can be converted to Python scalars

I used some debug statements to determine that line 105 of train.py is calling env.step(action) when the value of action is [[-0.70904064 -0.71731383]] -- i.e. a list of shape [1, 2] rather than a one-dimensional list of length 2. The action space for the environment is Box(2,) so I think it should just be a list of two floats.

I tried changing line 105 to obs, reward, done, _ = env.step(action[0]) to eliminate the degenerate dimension and it seems to work at that point.

I'm on Ubuntu 16.04.2 LTS, TF v1.2.1, gym v0.9.1, fresh install of roboschool as of 5 minutes ago.
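
For reference, dropping the leading batch dimension with np.squeeze is equivalent to the action[0] workaround above, and it appears to be how later versions of the script call env.step (judging from the tracebacks in other issues here):

import numpy as np

# action has shape (1, act_dim); squeeze the batch dimension before stepping the env
obs, reward, done, _ = env.step(np.squeeze(action, axis=0))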

Help Getting Cart Pole to Run

When I try to run the CartPole environment, I run into this error:

Traceback (most recent call last):
  File "train.py", line 349, in
    main(**vars(args))
  File "train.py", line 289, in main
    env, obs_dim, act_dim = init_gym(env_name)
  File "train.py", line 72, in init_gym
    act_dim = env.action_space.shape[0]
IndexError: tuple index out of range

After a little bit of digging, I had to add self.action_space = np.array([1]) to line 53 in cartpole_bullet.py, which does resolve the error so that the environment can run, but I'm not sure whether it is causing training issues, because it won't train after 1000 episodes. Is there an official fix for this problem?
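
Not an official fix, but the underlying issue is that CartPole uses a Discrete action space, whose shape is the empty tuple, while init_gym assumes a continuous Box space. A sketch of a type check (this only avoids the crash; the Gaussian policy itself still assumes continuous actions):

import gym

def init_gym(env_name):
    env = gym.make(env_name)
    obs_dim = env.observation_space.shape[0]
    if isinstance(env.action_space, gym.spaces.Box):
        act_dim = env.action_space.shape[0]   # continuous action dimensions
    elif isinstance(env.action_space, gym.spaces.Discrete):
        act_dim = env.action_space.n          # number of discrete actions
    else:
        raise NotImplementedError(type(env.action_space))
    return env, obs_dim, act_dim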

Enjoy a pre-trained model after training is done?

Is there a way to enjoy the pre-trained model (weights/checkpoint) after training is completed?

Usually (TF Agents, OpenAI Baselines, ES Tool, etc.) there is a 'train' script that saves the weights of the neural network, and an 'enjoy' script that reads those weights and runs the pretrained network.

If it is not available, do you have any hints on how it could be added?
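
It doesn't appear to be built in, but one possible approach is sketched below. All names are assumptions: policy_model is a hypothetical handle to the Keras model inside the Policy object, and the scaler.get() / policy.sample() calls mirror what the training loop appears to use.

import pickle
import gym
import numpy as np

# After training: persist the policy weights and the observation scaler.
policy_model.save_weights('policy_weights.h5')   # hypothetical handle to the Keras model
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

# "Enjoy" script: rebuild the Policy, load the weights, then roll out with rendering.
env = gym.make('Walker2d-v1')                    # illustrative environment
obs, done = env.reset(), False
while not done:
    obs = obs.astype(np.float64).reshape((1, -1))
    scale, offset = scaler.get()                 # assumed scaler API, as in the training loop
    action = policy.sample((obs - offset) * scale)   # assumed Policy.sample, as in run_episode
    obs, reward, done, _ = env.step(np.squeeze(action, axis=0))
    env.render()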

Graphing NN Model

How would you go about graphing the NN model for the policy? I keep trying with TensorBoard but it keeps failing.
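
Not sure why TensorBoard is failing for you, but two things that sometimes work for Keras models (a sketch, assuming TF 2.x; model and sample_obs are placeholders for the policy model and a batch of observations):

import tensorflow as tf

# Option 1: static diagram of the model (needs pydot and graphviz installed)
tf.keras.utils.plot_model(model, to_file='policy_nn.png', show_shapes=True)

# Option 2: trace one forward pass and export the graph for TensorBoard
writer = tf.summary.create_file_writer('logs/policy_graph')
tf.summary.trace_on(graph=True)
_ = tf.function(model)(sample_obs)
with writer.as_default():
    tf.summary.trace_export(name='policy_nn', step=0)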

Does this code use TRPO?

The folder says TRPO but the description includes PPO. If this repository runs PPO and not TRPO then the repository should be renamed to PPO.

Help understanding how to read the code

Hello,

Just a quick question. In policy.py, class Policy uses the Keras package to call "get_layer". This is the output layer, correct? Also, I sent an email, so feel free to ignore this part if you already answered it, but I see from the TRPO paper that the NN is supposed to calculate only the mean and somehow uses another set of parameters, a vector of the same size as the number of actions. The paper is not clear to me on how the stdev is actually computed or updated, and in this code all of it is computed under the hood in Keras.

Any help on this would be greatly appreciated!

Ryan
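
On the stdev question: the docstring in policy.py (quoted in another issue above) says the NN outputs only the mean action, while trainable variables hold a log-variance per action dimension. A minimal sketch of that pattern, not the repo's exact code:

import numpy as np
import tensorflow as tf

act_dim = 6  # illustrative

# The network predicts only the means; log-variances are free trainable parameters,
# one per action dimension, updated by the same optimizer as the network weights.
mean_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='tanh'),
    tf.keras.layers.Dense(act_dim),
])
log_vars = tf.Variable(np.full((act_dim,), np.log(0.1), dtype=np.float32))

def sample_action(obs_batch):
    mean = mean_net(obs_batch)
    std = tf.exp(0.5 * log_vars)        # stdev = sqrt(variance)
    return mean + std * tf.random.normal(tf.shape(mean))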

Scaler vs. BatchNorm

Hi @pat-coady -- I was wondering why you use a custom Scaler Python object instead of standard batchnorm (e.g. the TensorFlow kind)? Wouldn't sticking a batchnorm layer onto the front of the policy net achieve the same thing, require less code, and be compatible with TF savers? Sorry if I am misunderstanding!
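
For concreteness, a sketch of the alternative being suggested (a BatchNormalization layer in front of the policy net instead of the hand-rolled Scaler; whether it behaves identically to a running mean/variance scaler during RL training is a separate question). obs_dim and act_dim are placeholders:

import tensorflow as tf

policy_net = tf.keras.Sequential([
    tf.keras.layers.BatchNormalization(input_shape=(obs_dim,)),  # normalizes observations
    tf.keras.layers.Dense(64, activation='tanh'),
    tf.keras.layers.Dense(act_dim),
])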

training issue

Hello!
When I load a saved Humanoid-v1 training model, it takes several minutes before it reaches its best result. E.g., when I load the model trained for 200,000 episodes, at the start the humanoid does not walk well, and after several minutes it walks better.
Why does this happen?
Thanks!

KL, PolicyEntropy, PolicyLoss go to NaN after 31,455 episodes

Hi there,
I have created a variant of the HumanoidStandup-v2 environment in gym which has a much simpler simulated robot, represented as a MuJoCo-format XML file. I have tested this model both in MuJoCo and in gym and it seems to work fine.
I tested the HumanoidStandup-v2 training on my hw/sw configuration and it worked well up to 50,000 episodes. I then ran the identical setup with our robot model, using the same reward function as the standard HumanoidStandup-v2. The only substantive difference between the two is the MuJoCo model.
When I ran the training on our model I got:

***** Episode 31455, Mean R = 28911.0 *****
Beta: 6.91
ExplainedVarNew: 0.913
ExplainedVarOld: 0.812
KL: nan
PolicyEntropy: nan
PolicyLoss: nan
Steps: 672
ValFuncLoss: 114

Traceback (most recent call last):
  File "./train.py", line 334, in
    main(**vars(args))
  File "./train.py", line 290, in main
    trajectories = run_policy(env, policy, scaler, logger, episodes=batch_size)
  File "./train.py", line 135, in run_policy
    observes, actions, rewards, unscaled_obs = run_episode(env, policy, scaler)
  File "./train.py", line 105, in run_episode
    obs, reward, done, _ = env.step(np.squeeze(action, axis=0))
  File "/home/david/source/gym/gym/wrappers/monitor.py", line 31, in step
    observation, reward, done, info = self.env.step(action)
  File "/home/david/source/gym/gym/wrappers/time_limit.py", line 31, in step
    observation, reward, done, info = self.env.step(action)
  File "/home/david/source/gym/gym/envs/Senbionic/ballbotEnv.py", line 28, in step
    self.do_simulation(a, self.frame_skip)
  File "/home/david/source/gym/gym/envs/mujoco/mujoco_env.py", line 100, in do_simulation
    self.sim.step()
  File "source/mujoco-py/mujoco_py/mjsim.pyx", line 119, in mujoco_py.cymj.MjSim.step
  File "source/mujoco-py/mujoco_py/cymj.pyx", line 115, in mujoco_py.cymj.wrap_mujoco_warning.exit
  File "source/mujoco-py/mujoco_py/cymj.pyx", line 75, in mujoco_py.cymj.c_warning_callback
  File "/home/david/.conda/envs/gym35/lib/python3.5/site-packages/mujoco_py-1.50.1.53-py3.5.egg/mujoco_py/builder.py", line 319, in user_warning_raise_exception
    raise MujocoException('Got MuJoCo Warning: {}'.format(warn))
mujoco_py.builder.MujocoException: Got MuJoCo Warning: Unknown warning type Time = 0.0000.

I ran it again and it did the same thing at Episode 1280.

Any suggestions on how to approach overcoming this?

Many thanks for any advice.

Rendering doesn't work. Window goes black

Hi!
Our model is training properly and we are getting appropriate results, but the simulation window goes black. mujoco_py version 0.5, MuJoCo Pro 1.31. Tried on both Ubuntu 16.04 and Arch with GNOME.

Doesn't work on CartPole-v1

Hi,

I only changed a small amount of code to make this project work on the CartPole-v1 env, but the result was not good.
The mean reward is always about 9.3 and doesn't go up.

Have you tested the performance on some easy environments?

Thank you

Trouble using pybullet and roboschool envs

Hello Pat,

Thanks for your great repo!

I read some of the closed issues but didn't find the one I'm encountering.

I tried to launch your train.py code on RoboschoolInvertedPendulum-v1, which is supposed to be the same as the MuJoCo environment, but I got the following error:

File "/home/lea/roboschool/roboschool/gym_pendulums.py", line 88, in calc_state assert( np.isfinite(x) ) AssertionError

The error occurs on line 108: obs, reward, done, _ = env.step(np.squeeze(action, axis=0)). This seems to be an error with the dimensions, but I can't figure out what exactly the problem is.

I tried to work around it by using the corresponding pybullet environment InvertedPendulumBullet-v0. The program runs, but the pendulum doesn't seem to learn a thing. The reward stays around 30.0 on average and doesn't make any progress. Do you have an idea why?

I'm looking forward to your reply!

Léa

How many episodes should we do?

Hi!
Thanks for sharing the code!
I am training the Humanoid.
I have a question: how many episodes should I run? When the number of episodes is more than 20,000, the humanoid does not walk any better.

System Reboots

My system just reboots when I run the code, normally after 30,000 to 50,000 episodes. I tried lowering the GPU power but it does not help; it still reboots.

Nice code! But much nicer if parallelized

Pat,

Love your code and algo here. But I'd really like to see it running in parallel.

You might want to take Schulman's trpo_mpi code from OpenAI Baselines and put your algo in it. I'm trying that with my algo and it's working out well. But your algo is better.

I'm looking forward to using it parallelized.

error

Hi, I am getting the following error:

raise error.DeprecatedEnv('Env {} not found (valid versions include {})'.format(id, matching_envs))
gym.error.DeprecatedEnv: Env Reacher-v1 not found (valid versions include ['Reacher-v2'])
Can you please help me solve this problem?

Unusual replay buffer

In your replay buffer you are recording states and value functions {s,V}. Shouldn't the replay buffer record the states, actions, rewards, next states {s,a,r,s'} instead?

Is there information on what actions and observations really are?

Hi!
Is there information on what the actions and observations really are?
I can't seem to find any info on what the supplied action and observation arrays actually represent.
For example, for the Humanoid, what do the 17 and 376 dimensions represent? The joint angles or joint velocities?
I looked through the docs and cannot find the answer. How can I figure this out?
Thanks a lot!

add command line arguments for network sizing and initial policy variance

  1. Make the hidden layer 1 size adjustable from the command line. It will be specified as a multiple of the observation dimension size. The present code has it hard-coded as 10x the observation dimension. The same size will be used for the value function NN and the policy NN.

  2. Make the initial policy variance configurable. Presently each action dimension starts with a variance of 0.1. (A rough sketch of the command-line interface follows below.)
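
A possible sketch of the command-line interface (argument names and defaults here are illustrative, not final):

import argparse

parser = argparse.ArgumentParser(description='Train policy on an OpenAI Gym environment')
parser.add_argument('env_name', help='OpenAI Gym environment name')
parser.add_argument('-m', '--hid1_mult', type=int, default=10,
                    help='hidden layer 1 size as a multiple of the observation dimension '
                         '(used for both the policy NN and the value function NN)')
parser.add_argument('-v', '--init_policy_var', type=float, default=0.1,
                    help='initial variance for each action dimension')
args = parser.parse_args()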

Saving and loading the TRPO model (PolicyNN)

Hello,

I'm sorry for asking so many questions, but would you know how to go about saving and loading the subclassed model? I have tried everything I can think of, including trying to change the subclassed model to a functional or sequential one, but I just keep running into issues.

Thanks,

Ryan
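
Not a full answer, but subclassed Keras models can usually at least have their weights saved and restored. A sketch, assuming the policy is wrapped in a tf.keras.Model that has been built by calling it once; build_policy_model and sample_obs_batch are hypothetical names:

# Save only the weights (the architecture is rebuilt from the Python code).
model.save_weights('policy_checkpoint')        # TensorFlow checkpoint format

# Later: rebuild the same model, run one forward pass to create its variables, then load.
new_model = build_policy_model()               # hypothetical factory for the subclassed model
_ = new_model(sample_obs_batch)                # builds the variables
new_model.load_weights('policy_checkpoint')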
