
alf's People

Contributors

7gao, bayesian, breakds, emailweixu, haichao-zhang, hnyu, jesbu1, jialn, le-horizon, neale, pd-perry, pinzhang, quantumope, resuscitated, ruizhaogit, runjerry, witwolf, www2171668

alf's Issues

ParallelPyEnvironment will cause sys.exit() to hang after main() finishes

I've spent quite some time pinning down an issue (reason unknown at the moment): once we create tf_agents' ParallelPyEnvironment in main(), sys.exit() hangs forever after the code finishes, without releasing the GPU memory (running in CPU mode has the same hanging issue). This makes grid_search.py fail. A minimal reproducible example is to replace the train_eval(root_dir) function with the following:

@gin.configurable
def train_eval(root_dir):
    env = create_environment()
    env.pyenv.close()

and run any training job. The script will never exit.

I have located the issue exactly at sys.exit(main(argv)) in absl's app.py. It looks more like an issue inside sys.exit() itself rather than in absl's code.
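
A possible workaround to try (an assumption, not a confirmed fix) is to skip the normal interpreter shutdown once training has finished, for example in main() of alf/bin/train.py:

import os
import sys

def main(argv):
    train_eval(FLAGS.root_dir)  # existing entry point in alf/bin/train.py
    sys.stdout.flush()          # os._exit() skips atexit handlers and IO flushes
    os._exit(0)                 # never reaches the hanging sys.exit() path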

@witwolf When you ran grid_search.py, did you ever have such a problem?

Make sure environments in SocialRobot work with tf_agents

We want to be able to run SocialRobot environments with tf_agents. There are two issues:

  1. Since tf_agents runs step() of an environment in a thread different from the one in which the environment was created, SocialRobot somehow crashes when it's used this way.

  2. We want to run multiple SocialRobot environments, and each environment needs to be in a separate process because of Gazebo. Hence we need an environment wrapper that wraps an environment in a separate process while letting us use it as if it were in the same process (see the sketch after this list). This is similar to https://github.com/deepmind/scalable_agent/blob/master/py_process.py, which is used here: https://github.com/deepmind/scalable_agent/blob/6c0c8a701990fab9053fb338ede9c915c18fa2b1/experiment.py#L437
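
A minimal sketch of such a wrapper (an illustration only, not the py_process.py or ALF implementation; it assumes the wrapped environment only needs reset()/step()):

import multiprocessing as mp

def _worker(conn, env_ctor):
    # Runs in the child process: construct the environment there and serve method calls.
    env = env_ctor()
    while True:
        name, args = conn.recv()
        if name == 'close':
            conn.close()
            break
        conn.send(getattr(env, name)(*args))

class ProcessEnvironment(object):
    """Runs an environment in its own process behind a same-process-like API."""

    def __init__(self, env_ctor):
        self._conn, child_conn = mp.Pipe()
        self._process = mp.Process(target=_worker, args=(child_conn, env_ctor))
        self._process.start()

    def reset(self):
        self._conn.send(('reset', ()))
        return self._conn.recv()

    def step(self, action):
        self._conn.send(('step', (action,)))
        return self._conn.recv()

    def close(self):
        self._conn.send(('close', ()))
        self._process.join()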

_cached_opt_and_var_sets inconsistent between initial and sequential calls

I tried printing out how many variables each optimizer is responsible for optimizing. I printed in two places. The first is inside algorithm._get_opt_and_var_sets(), before self._cached_opt_and_var_sets is set, and I got something like the following:

{'amsgrad': False,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'decay': 0.0,
 'epsilon': 1e-07,
 'learning_rate': 3e-05,
 'name': 'Adam'} 24
{'amsgrad': False,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'decay': 0.0,
 'epsilon': 1e-07,
 'learning_rate': 1e-05,
 'name': 'Adam'} 26
{'amsgrad': False,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'decay': 0.0,
 'epsilon': 1e-07,
 'learning_rate': 0.001,
 'name': 'Adam'} 1

For each row, the first part is the optimizer config and the second is len(vars), which makes sense given my job. However, if I print inside train_complete, after calling self._get_cached_opt_and_var_sets(), the output is:

{'amsgrad': False,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'decay': 0.0,
 'epsilon': 1e-07,
 'learning_rate': 3e-05,
 'name': 'Adam'} 75
{'amsgrad': False,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'decay': 0.0,
 'epsilon': 1e-07,
 'learning_rate': 1e-05,
 'name': 'Adam'} 26
{'amsgrad': False,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'decay': 0.0,
 'epsilon': 1e-07,
 'learning_rate': 0.001,
 'name': 'Adam'} 1

A simple calculation suggests that somewhere in the code the parent algorithm (which holds the first optimizer above) also takes into account all leaf trainable variables (24 + 24 + 26 + 1 = 75). So some variables will also be optimized by the parent optimizer even though I have specified child optimizers for them (our tape is "persistent", so grads can be computed multiple times).

I verified that this also happens in eager mode.

TrainTest fails sometimes

======================================================================
FAIL: test_ppo_cart_pole (bin.train_test.TrainTest)
test_ppo_cart_pole (bin.train_test.TrainTest)

Traceback (most recent call last):
File "/ALF/alf/bin/train_test.py", line 100, in test_ppo_cart_pole
self._test_train('ppo_cart_pole.gin', _test_func)
File "/ALF/alf/bin/train_test.py", line 126, in _test_train
assert_func(episode_returns, episode_lengths)
File "/ALF/alf/bin/train_test.py", line 97, in _test_func
self.assertGreater(np.mean(returns[-2:]), 198)
AssertionError: 197.8499984741211 not greater than 198

@witwolf It seems that the determinism isn't working as expected? Some other cases also fail occasionally, and I have to manually restart the testing job each time.
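
For reference, full determinism usually requires seeding every RNG in play; whether train_test.py already does all of this is something to verify (a hedged sketch, not the current test code):

import random

import numpy as np
import tensorflow as tf

def seed_everything(seed=0):
    random.seed(seed)         # Python-level RNG
    np.random.seed(seed)      # NumPy RNG (also used by gym environments)
    tf.random.set_seed(seed)  # TF graph-level and op-level seeds
    # Per-environment seeds and parallel-environment scheduling can still
    # introduce nondeterminism on top of this.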

async train sometimes fails

When testing with ppo_async_icm_super_mario_intrinsic_only:

rm -rf tmp && python3 -m alf.bin.train \
 --root_dir=tmp \
 --gin_file=ppo_async_icm_super_mario_intrinsic_only.gin \
 --gin_param=TrainerConfig.random_seed=0 \
 --gin_param=create_environment.num_parallel_environments=1 \
 --gin_param=TrainerConfig.num_iterations=2 \
 --gin_param=TrainerConfig.num_steps_per_iter=1 \
 --gin_param=TrainerConfig.num_updates_per_train_step=1 \
 --gin_param=TrainerConfig.mini_batch_length=2 \
 --gin_param=TrainerConfig.mini_batch_size=4 \
 --gin_param=TrainerConfig.num_envs=2 \
 --gin_param=ReplayBuffer.max_length=64 \
 --gin_param=TrainerConfig.unroll_length=2 \
 --gin_param=TrainerConfig.num_updates_per_train_step=2 \
 --gin_param=TrainerConfig.use_tf_functions=False

I get the error message:

  ...
  File "/home/hongyingxiang/FLA/alf/drivers/threads.py", line 410, in _step
    self._env.step(action), action, first_env_id=self._first_env_id)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/tf_environment.py", line 232, in step
    return self._step(action)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 292, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/tf_py_environment.py", line 319, in _step
    name='step_py_func')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py", line 591, in numpy_function
    return py_func_common(func, inp, Tout, stateful=True, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py", line 488, in py_func_common
    result = func(*[x.numpy() for x in inp])
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/tf_py_environment.py", line 302, in _isolated_step_py
    return self._execute(_step_py, *flattened_actions)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/tf_py_environment.py", line 195, in _execute
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/tf_py_environment.py", line 298, in _step_py
    self._time_step = self._env.step(packed)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/py_environment.py", line 174, in step
    self._current_time_step = self._step(action)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/parallel_py_environment.py", line 136, in _step
    time_steps = [promise() for promise in time_steps]
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/parallel_py_environment.py", line 136, in <listcomp>
    time_steps = [promise() for promise in time_steps]
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/parallel_py_environment.py", line 338, in _receive
    raise Exception(stacktrace)
Exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/parallel_py_environment.py", line 377, in _worker
    result = getattr(env, name)(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/py_environment.py", line 174, in step
    self._current_time_step = self._step(action)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/wrappers.py", line 105, in _step
    time_step = self._env.step(action)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/py_environment.py", line 174, in step
    self._current_time_step = self._step(action)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/gym_wrapper.py", line 197, in _step
    observation, reward, self._done, self._info = self._gym_env.step(action)
  File "/usr/local/lib/python3.6/dist-packages/gym/core.py", line 282, in step
    return self.env.step(self.action(action))
  File "/home/hongyingxiang/FLA/alf/environments/mario_wrappers.py", line 121, in action
    for i in self._actions[a]:
IndexError: list index out of range

With the following diff applied:

diff --git a/alf/algorithms/actor_critic_algorithm.py b/alf/algorithms/actor_critic_algorithm.py
index 20216fa..836e9dc 100644
--- a/alf/algorithms/actor_critic_algorithm.py
+++ b/alf/algorithms/actor_critic_algorithm.py
@@ -110,6 +110,8 @@ class ActorCriticAlgorithm(OnPolicyAlgorithm):
             step_type=time_step.step_type,
             network_state=state.actor)

+        import threading
+        print(action_distribution.logits[0][:4], threading.current_thread().ident)
         action = common.sample_action_distribution(action_distribution)
         return PolicyStep(
             action=action,

I get:

  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/py_environment.py", line 174, in step
    self._current_time_step = self._step(action)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/gym_wrapper.py", line 197, in _step
    observation, reward, self._done, self._info = self._gym_env.step(action)
  File "/usr/local/lib/python3.6/dist-packages/gym/core.py", line 282, in step
    return self.env.step(self.action(action))
  File "/home/hongyingxiang/FLA/alf/environments/mario_wrappers.py", line 121, in action
    for i in self._actions[a]:
IndexError: list index out of range

tf.Tensor([nan nan nan nan], shape=(4,), dtype=float32) 140210483619648
tf.Tensor([nan nan nan nan], shape=(4,), dtype=float32) 140210483619648
tf.Tensor([nan nan nan nan], shape=(4,), dtype=float32) 140210483619648

The logits of distributions.Categorical can become NaN (this issue is very easy to reproduce).
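
One possible way to surface this earlier (an assumption, not code from the repo) is to guard the sampling step with tf.debugging.check_numerics, so the failure happens at the NaN logits instead of later inside the environment step:

import tensorflow as tf
import tensorflow_probability as tfp

def sample_checked(logits):
    # Raises InvalidArgumentError as soon as the logits contain NaN/Inf.
    logits = tf.debugging.check_numerics(logits, message='NaN/Inf in action logits')
    return tfp.distributions.Categorical(logits=logits).sample()

print(sample_checked(tf.constant([[0.1, 0.2, 0.3, 0.4]])))  # valid logits sample fine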

Could you help take a look at this issue? @emailweixu @hnyu

Figures showing metric against time

Right now we have figures showing metrics against the global counter, environment steps, etc. In order to compare wall-clock speed between algorithms/settings, we also need figures with time as the x-axis and the metric as the y-axis.
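
TensorBoard's relative/wall-time x-axis options partially cover this; another option is to record metrics keyed by elapsed seconds (a minimal sketch assuming TF2 summary writers; the directory name is hypothetical):

import time

import tensorflow as tf

_writer = tf.summary.create_file_writer('/tmp/alf_time_metrics')
_start_time = time.time()

def record_by_time(name, value):
    # Use elapsed wall-clock seconds as the summary step.
    elapsed_s = int(time.time() - _start_time)
    with _writer.as_default():
        tf.summary.scalar(name, value, step=elapsed_s)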

Actor critic RNN policy unit test fails

There is a shape error when running the tests alf/algorithms/test_actor_critic_rnn_policy or alf/drivers/on_policy_driver_test.py:

File "alf/algorithms/actor_critic_algorithm_test.py", line 165, in test_actor_critic_rnn_policy policy_step = policy.action(time_step, policy_state) ValueError: Incompatible shape for value ((100, 100)), expected ((100, 1))

can't init submodule with current master

Current master (e5e903b) reports an error on git submodule update:

03:10:03 (py3env) immars@immars-brick alf ±|master ✗|→ git submodule update
error: Server does not allow request for unadvertised object ba9cf75787469554483874efcf07a4f00306443c
Fetched in submodule path 'tf_agents', but it did not contain ba9cf75787469554483874efcf07a4f00306443c. Direct fetching of that commit failed.

It seems that #22 updated tf_agents to commit ba9cf75787, which no longer exists?

grocery_ground_goal_task training taking 3x more memory than expected?

4 bytes (float) * 80 * 80 (image size) * 3 (channels) * 100 (unroll length) * 12 (input + two conv layers + backprop + framestack) * 30 (parallel envs) / 1,000,000,000
≈ 2.7 GB

Currently CUDA seems to be taking ~9 GB of GPU memory (rendering is taking another ~4 GB):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:17:00.0 On | N/A |
| 18% 59C P5 46W / 250W | 5274MiB / 10988MiB | 31% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:65:00.0 Off | N/A |
| 30% 66C P2 81W / 250W | 8856MiB / 10989MiB | 3% Default |
+-------------------------------+----------------------+----------------------+

This seems to suggest it is taking about 3x the memory we expect?

Did we miss anything in the calculation?

Thanks,
Le
----- some details -----
conv_layer_params = ((16, 3, 2), (32, 3, 2))
1st conv layer: 40*40*16; 2nd conv layer: 20*20*32; together roughly 2x the input layer,
then *2 for the actor and critic networks, and *2 again for forward and backprop = 8x,
plus 4x the input layer for FrameStack
= 8 + 4 = 12x the size of the input layer
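
For reference, the back-of-the-envelope estimate above as code (all factors come from the breakdown above; this is just the arithmetic):

bytes_per_float = 4
input_elements = 80 * 80 * 3   # H * W * channels
unroll_length = 100
activation_factor = 12         # ~2x for convs, *2 actor/critic, *2 backprop = 8, plus 4x FrameStack
parallel_envs = 30

estimate_gb = (bytes_per_float * input_elements * unroll_length
               * activation_factor * parallel_envs) / 1e9
print('%.2f GB' % estimate_gb)  # ~2.76 GB, vs ~9 GB observed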

Support rendering on multiple GPUs

Currently, for suite_socialbot, when using ParallelPyEnvironment all the environments perform rendering on the same GPU (GPU 0). We should look into whether allowing rendering on multiple GPUs can speed up training.

images are not stored as tf.uint8 in replay buffers

Due to the tf_agents implementation of _spec_from_gym_space() in gym_wrapper.py, all gym.spaces.Box inputs are mapped to the same dtype specified in dtype_map. This causes issues when the inputs are a mixture of uint8 and float32, or when the action is float32, so the default dtype for now is always float32. As a result, we are spending 4x the memory storing image inputs right now.
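
A minimal sketch of one possible fix (an assumption, not the actual tf_agents change): derive each spec's dtype from its own Box space instead of a single dtype_map entry:

from tf_agents.specs import array_spec

def spec_from_box(space, name=None):
    # Keeps uint8 for image observations and float32 for continuous values.
    return array_spec.BoundedArraySpec(
        shape=space.shape,
        dtype=space.dtype,
        minimum=space.low,
        maximum=space.high,
        name=name)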

Unittests for running all examples

Even though it's not reasonable to train all the examples until they reach the desired performance, we should at least make sure they can run a few iterations and play correctly.
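
A hedged sketch of such a smoke test (the glob pattern and flags are assumptions; TrainerConfig.num_iterations is the knob used elsewhere in this repo to shorten runs):

import glob
import os
import subprocess
import unittest

class ExamplesSmokeTest(unittest.TestCase):
    def test_examples_run_a_few_iterations(self):
        for gin_file in glob.glob('alf/examples/*.gin'):
            subprocess.check_call([
                'python3', '-m', 'alf.bin.train',
                '--root_dir', os.path.join('/tmp/smoke', os.path.basename(gin_file)),
                '--gin_file', gin_file,
                '--gin_param', 'TrainerConfig.num_iterations=2',
            ])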

inoperative config info seems incorrect

I included "atari.gin" in my gin file to run a job. In TensorBoard "text", I see the following inoperative info:

[Screenshot from 2019-09-26 16-54-34]

However, when I check, these configs are indeed used by my job. Is there a bug here?

extend grid search

In two directions:

  1. Early stopping. For some obviously inferior hyperparameter combinations, it's possible to decide to stop early in training. For that, we could periodically compare a run's performance with the top-k runs at the same training iteration, via some communication channel like a Queue (see the sketch after this list). Basically, this eventually becomes an evolution-like search over hyperparameters. Of course, more advanced methods can be employed:
    https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f

  2. If each training run needs a GPU, then it's impractical to launch all runs on a single machine, so a distributed search is needed (after our cluster is ready?).
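
A minimal sketch of the early-stopping idea in item 1 (an illustration, not existing grid_search.py code): each run periodically puts (run_id, iteration, metric) on a shared Queue, and a monitor terminates runs that fall well behind the current top-k (the per-iteration alignment from the proposal is simplified away here):

import heapq

def monitor(queue, processes, k=3, tolerance=0.5):
    # `processes` maps run_id -> multiprocessing.Process;
    # `queue` carries (run_id, iteration, metric) tuples.
    top_k = []  # min-heap of the best k metrics seen so far
    while any(p.is_alive() for p in processes.values()):
        run_id, iteration, metric = queue.get()
        heapq.heappush(top_k, metric)
        if len(top_k) > k:
            heapq.heappop(top_k)
        # Terminate a run whose metric falls far below the worst of the current top-k.
        if len(top_k) == k and metric < tolerance * top_k[0]:
            processes[run_id].terminate()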

I noticed that there are existing Python-based libraries for hyperparameter search. One example is Hyperopt:
https://github.com/hyperopt/hyperopt

We should investigate these libraries further. If they are ready for our use case, we might not want to reinvent the wheel.

remove all specs

I believe that specs make our framework strongly typed, which is good for avoiding mistakes, but they also make the code much less flexible and more verbose (not Pythonic). There are two main reasons for having specs:

  1. tf_agents networks require specs to initialize
  2. a replay buffer requires specs to allocate memory in __init__

For 1), we can probably rewrite some networks (Keras doesn't require specs); for 2), when we create a replay buffer, we can pass in an example experience and let the replay buffer class extract the specs internally.
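
A hedged sketch of point 2 (illustrative only): derive the specs from one example experience with tf.nest, so callers never construct specs by hand:

import tensorflow as tf

def specs_from_example(example):
    # Map every tensor in the (possibly nested) experience to a TensorSpec.
    return tf.nest.map_structure(
        lambda t: tf.TensorSpec(shape=t.shape, dtype=t.dtype), example)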

Am I missing other reasons for specs?

alf build failing

This is the current status on the GitHub page:

[image: build status badge showing the failure]

Looking into the Travis logs, the following job fails:
[image: excerpt of the failing Travis log]

Weird Tensorboard behaviors

Sometimes when I train intrinsic-reward-only agents, I observe non-zero "AverageReturn" points but at the same time all-zero points in "extrinsic/mean". See the two figures below.

[figure: Metrics_AverageReturn]
[figure: reward_extrinsic_mean]

Without any extrinsic rewards, it's basically impossible for "AverageReturn" to be nonzero.
Note that this has nothing to do with averaging, because the second curve is exactly 0, and I didn't downsample either curve.

Does anyone know why this happens?

Inoperative gin config

Sometimes the configuration provided in a gin config file may be overridden by Python code. One example is alf/examples/ppo_pr2.sh in PR #20, where the command line specifies --gin_param='train_eval.num_epochs=10' but it is unintentionally overridden by a FLAG '--num_epochs' with a default value of 10.

It would be nice if we could find a way to write all the unused configs from gin_file or gin_params to TensorBoard.
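
A minimal sketch of writing the gin config to TensorBoard (assuming gin.operative_config_str() and tf.summary.text; the unused/inoperative part would still have to be derived separately, e.g. by diffing against the parsed config files):

import gin
import tensorflow as tf

def summarize_gin_config(summary_dir, step=0):
    writer = tf.summary.create_file_writer(summary_dir)
    with writer.as_default():
        # Only the configs actually consumed by the run are included here.
        tf.summary.text('gin/operative_config', gin.operative_config_str(), step=step)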

Better histogram for the summary of discrete action

In alf.utils.common.add_action_summaries, tf.summary.histogram does not generate a good histogram for discrete actions. Since we know the min and max of a discrete action, we should generate a histogram with the known number of buckets and with the right min and max.
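
A hedged sketch (not the ALF implementation) of building such a histogram with one bucket per action value, using tf.histogram_fixed_width so the bucket edges come from the known action range rather than from the sampled data:

import tensorflow as tf

def discrete_action_histogram(actions, action_min, action_max):
    # One bucket per integer action in [action_min, action_max].
    num_buckets = int(action_max - action_min + 1)
    return tf.histogram_fixed_width(
        tf.cast(actions, tf.float32),
        value_range=[float(action_min) - 0.5, float(action_max) + 0.5],
        nbins=num_buckets)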

Refactoring: remove create_???_algorithm

We should make it possible to directly configure an Algorithm from gin instead of relying on create_???_algorithm.
Since we need to pass observation_spec, action_spec, time_step_spec, etc. to many of the constructors, we can implement several functions, observation_spec(), action_spec(), and time_step_spec(), to get those specs from gin.
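
A hedged sketch of the idea (the names mirror the proposal; this is not existing ALF code): the trainer sets the specs once, and gin-configurable accessors expose them so gin files can pass @observation_spec() etc. to Algorithm constructors:

import gin

_specs = {}

def set_specs(observation_spec, action_spec, time_step_spec):
    # Called once by the trainer after the environment is created.
    _specs.update(observation_spec=observation_spec,
                  action_spec=action_spec,
                  time_step_spec=time_step_spec)

@gin.configurable
def observation_spec():
    return _specs['observation_spec']

@gin.configurable
def action_spec():
    return _specs['action_spec']

@gin.configurable
def time_step_spec():
    return _specs['time_step_spec']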

trac_ddpg_pendulum failed

When testing with trac_ddpg_pendulum:

python -m alf.bin.train --root_dir=tdp --gin_file=trac_ddpg_pendulum

I get the error message below; still investigating it.

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/hongyingxiang/FLA/alf/bin/train.py", line 88, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/hongyingxiang/FLA/alf/bin/train.py", line 79, in main
    train_eval(FLAGS.root_dir)
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1032, in wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/usr/local/lib/python3.6/dist-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
    six.raise_from(proxy.with_traceback(exception.__traceback__), None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1009, in wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/hongyingxiang/FLA/alf/bin/train.py", line 73, in train_eval
    trainer.train()
  File "/home/hongyingxiang/FLA/alf/trainers/policy_trainer.py", line 315, in train
    summary_max_queue=self._summary_max_queue)
  File "/home/hongyingxiang/FLA/alf/utils/common.py", line 265, in run_under_record_context
    func()
  File "/home/hongyingxiang/FLA/alf/trainers/policy_trainer.py", line 345, in _train
    time_step=time_step)
  File "/home/hongyingxiang/FLA/alf/trainers/off_policy_trainer.py", line 74, in _train_iter
    update_counter_every_mini_batch=self._config.
  File "/home/hongyingxiang/FLA/alf/algorithms/off_policy_algorithm.py", line 105, in train
    mini_batch_length, update_counter_every_mini_batch)
  File "/home/hongyingxiang/FLA/alf/utils/common.py", line 985, in __call__
    return tf_func_instance(get_current_scope(), *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 696, in _call
    return function_lib.defun(fn_with_cond)(*canon_args, **canon_kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError:  assertion failed: [100]
   [[{{node cond/else/_1/StatefulPartitionedCall/while/body/_694/while/body/_1961/StatefulPartitionedCall/Assert/AssertGuard/else/_7599/Assert}}]] [Op:__inference_fn_with_cond_12063]

Function call stack:
fn_with_cond

  In call to configurable 'train_eval' (<function train_eval at 0x7f35b018a840>)

cannot work with tf.function

1) Code patch (examples/actor_critic.py):

    # driver = PyDriver(
    #     tf_env,
    #     policy,
    #     observers=train_metrics,
    #     max_steps=num_steps_per_iter)

    driver = DynamicStepDriver(
        tf_env,
        policy,
        observers=train_metrics,
        num_steps=num_steps_per_iter)
    # _ = algorithm.variables  # build networks
    driver.run = tfa_common.function(driver.run)

It runs successfully with use_icm=0: python actor_critic.py --root_dir=~/tmp/icm/ --num_parallel_environments=1 --gin_param=train_eval.use_icm=0

but fails with use_icm=1: python actor_critic.py --root_dir=~/tmp/icm/ --num_parallel_environments=1 --gin_param=train_eval.use_icm=1

TypeError: An op outside of the function building code is being passed
a "Graph" tensor. It is possible to have Graph tensors
leak out of the function building context by including a
tf.init_scope in your function building code.
For example, the following function will fail:
@tf.function
def has_init_scope():
  my_constant = tf.constant(1.)
  with tf.init_scope():
    added = my_constant * 2
The graph tensor has name: ActorDistributionNetwork/CategoricalProjectionNetwork/Categorical/sample/Reshape_1:0
In call to configurable 'train_eval' (<function train_eval at 0x134018950>)

The behavior is strange since there's nothing special about the ICM module (ICMAlgorithm).

The experiments above run successfully when we build the networks before running the graph:

_ = algorithm.variables  # build networks

2) The training procedure is incorrect in the successful cases with tf.function.

Refactoring: Move OffPolicyDriver.train to RLAlgorithm

And also move the replay buffer to RLAlgorithm.

The exact training procedure should be decided by the algorithm. Moving it into the algorithm will make algorithms more flexible, so that we can minimize the need to change or write a policy driver for future algorithms.

Speeding up the loading of SocialRobot environment

When using 30 or 60 parallel environments, the time for loading the environments becomes quite long. It seems that the environments are loaded sequentially. We might be able to speed up loading by loading the environments in parallel.
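
One thing worth checking (an assumption about the installed tf_agents version, not verified here) is whether ParallelPyEnvironment's start_serially flag already lets the worker processes start concurrently:

from tf_agents.environments import parallel_py_environment

# env_constructors is a list of callables, one per parallel environment.
# env = parallel_py_environment.ParallelPyEnvironment(
#     env_constructors, start_serially=False)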

Incorporating long-term entropy bonus reward in on-policy AC

For on-policy AC, we have the entropy regularizer E_{\pi}[-\log\pi(a_t|s_t)] at every step. As an unbiased estimate, we can simply use the negative log-likelihood -\log\pi(a_t|s_t) as an additional reward added to the computed advantage for the policy gradient loss computation. This is essentially a one-step entropy bonus reward, where the policy only cares about the single-step bonus.

We can extend this to long-term entropy maximization by adding -\log\pi(a_t|s_t) as an intrinsic reward (just like the ICM rewards), which will be absorbed into the advantage and return computations. This could potentially further improve our on-policy AC performance on top of the current single-step entropy bonus.
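
A hedged sketch of the proposed intrinsic reward (illustrative names, not ALF code): the per-step negative log-likelihood of the sampled action, scaled and added to the reward before the advantage/return computation:

import tensorflow_probability as tfp

def entropy_bonus_reward(logits, action, entropy_coef=0.01):
    # -log pi(a_t|s_t) is an unbiased single-sample estimate of the entropy.
    dist = tfp.distributions.Categorical(logits=logits)
    return -entropy_coef * dist.log_prob(action)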

Note that this simple treatment only applies to on-policy algorithms. For off-policy algorithms, we then need SAC or soft-Q's formulation.

missing curves and gifs in README

We are missing training curves and GIFs showing the trained agent playing in the following two environments in the README file:

"Simple navigation with visual input. Follow the instruction at SocialRobot to install the environment."
and
"PR2 grasping state only. Follow the instruction at SocialRobot to install the environment."

@witwolf and @Jialn Maybe these are easy for you guys to add?

Add evaluation to on_policy_trainer

Similar to tf_agents' train_eval, we need to periodically evaluate the policy during training. The evaluation usually uses greedy_predict and may differ significantly from non-greedy predict.
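
A minimal sketch of such a periodic evaluation (greedy_predict comes from the description above; get_initial_predict_state, the non-batched eval environment, and the TimeStep interface are assumptions about the surrounding code):

def evaluate(algorithm, eval_env, num_episodes=10):
    returns = []
    for _ in range(num_episodes):
        time_step = eval_env.reset()
        state = algorithm.get_initial_predict_state()  # assumed helper
        episode_return = 0.0
        while not time_step.is_last():
            policy_step = algorithm.greedy_predict(time_step, state)
            state = policy_step.state
            time_step = eval_env.step(policy_step.action)
            episode_return += float(time_step.reward)
        returns.append(episode_return)
    return sum(returns) / len(returns)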
