
alf's People

Contributors

7gao, bayesian, breakds, emailweixu, haichao-zhang, hnyu, jesbu1, jialn, le-horizon, neale, pd-perry, pinzhang, quantumope, resuscitated, ruizhaogit, runjerry, witwolf, www2171668

alf's Issues

ParallelPyEnvironment will cause sys.exit() to hang after main() finishes

I've spent quite some time pinning down an issue (reason unknown at the moment): once we create tf_agents' ParallelPyEnvironment in main(), sys.exit() hangs forever after the code finishes, without releasing the GPU memory (running in CPU mode has the same hanging issue). This makes grid_search.py fail. A minimal reproducible example is to replace the train_eval(root_dir) function with the following:

@gin.configurable
def train_eval(root_dir):
    env = create_environment()
    env.pyenv.close()

and run any training job. The script will never exit.

I have located the issue exactly at sys.exit(main(argv)) in absl's app.py. It looks more like an issue inside sys.exit() itself rather than in absl's code.
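
A possible workaround to try (an assumption, not a confirmed fix) is to skip the normal interpreter shutdown once training has finished, for example in main() of alf/bin/train.py:

import os
import sys

def main(argv):
    train_eval(FLAGS.root_dir)  # existing entry point in alf/bin/train.py
    sys.stdout.flush()          # os._exit() skips atexit handlers and IO flushes
    os._exit(0)                 # never reaches the hanging sys.exit() path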

@witwolf When you ran grid_search.py, did you ever have such a problem?

Make sure environments in SocialRobot work with tf_agents

We want to be able to run SocialRobot environments with tf_agents. There are two issues:

  1. Since tf_agents runs step() of an environment in a thread different from the one in which the environment was created, SocialRobot somehow crashes when it's used this way.

  2. We want to run multiple SocialRobot environments, and each environment needs to be in a separate process because of Gazebo. Hence we need an environment wrapper that wraps an environment in a separate process while letting us use it as if it were in the same process (see the sketch after this list). This is similar to https://github.com/deepmind/scalable_agent/blob/master/py_process.py, which is used here: https://github.com/deepmind/scalable_agent/blob/6c0c8a701990fab9053fb338ede9c915c18fa2b1/experiment.py#L437
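
A minimal sketch of such a wrapper (an illustration only, not the py_process.py or ALF implementation; it assumes the wrapped environment only needs reset()/step()):

import multiprocessing as mp

def _worker(conn, env_ctor):
    # Runs in the child process: construct the environment there and serve method calls.
    env = env_ctor()
    while True:
        name, args = conn.recv()
        if name == 'close':
            conn.close()
            break
        conn.send(getattr(env, name)(*args))

class ProcessEnvironment(object):
    """Runs an environment in its own process behind a same-process-like API."""

    def __init__(self, env_ctor):
        self._conn, child_conn = mp.Pipe()
        self._process = mp.Process(target=_worker, args=(child_conn, env_ctor))
        self._process.start()

    def reset(self):
        self._conn.send(('reset', ()))
        return self._conn.recv()

    def step(self, action):
        self._conn.send(('step', (action,)))
        return self._conn.recv()

    def close(self):
        self._conn.send(('close', ()))
        self._process.join()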

_cached_opt_and_var_sets inconsistent between initial and sequential calls

I tried printing out how many variables each optimizer is responsible for optimizing. I printed in two places. The first is inside algorithm._get_opt_and_var_sets(), before self._cached_opt_and_var_sets is set, and I got something like the following:

{'amsgrad': False,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'decay': 0.0,
 'epsilon': 1e-07,
 'learning_rate': 3e-05,
 'name': 'Adam'} 24
{'amsgrad': False,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'decay': 0.0,
 'epsilon': 1e-07,
 'learning_rate': 1e-05,
 'name': 'Adam'} 26
{'amsgrad': False,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'decay': 0.0,
 'epsilon': 1e-07,
 'learning_rate': 0.001,
 'name': 'Adam'} 1

For each row, the first part is the optimizer config and the second is len(vars), which makes sense given my job. However, if I print inside train_complete, after calling self._get_cached_opt_and_var_sets(), the output is:

{'amsgrad': False,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'decay': 0.0,
 'epsilon': 1e-07,
 'learning_rate': 3e-05,
 'name': 'Adam'} 75
{'amsgrad': False,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'decay': 0.0,
 'epsilon': 1e-07,
 'learning_rate': 1e-05,
 'name': 'Adam'} 26
{'amsgrad': False,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'decay': 0.0,
 'epsilon': 1e-07,
 'learning_rate': 0.001,
 'name': 'Adam'} 1

A simple calculation suggests that somewhere in the code the parent algorithm (which holds the first optimizer above) also takes into account all leaf trainable variables (24 + 24 + 26 + 1 = 75). So some variables will also be optimized by the parent optimizer even though I have specified child optimizers for them (our tape is "persistent", so grads can be computed multiple times).

I verified that this also happens in eager mode.

TrainTest fails sometimes

======================================================================
FAIL: test_ppo_cart_pole (bin.train_test.TrainTest)
test_ppo_cart_pole (bin.train_test.TrainTest)

Traceback (most recent call last):
File "/ALF/alf/bin/train_test.py", line 100, in test_ppo_cart_pole
self._test_train('ppo_cart_pole.gin', _test_func)
File "/ALF/alf/bin/train_test.py", line 126, in _test_train
assert_func(episode_returns, episode_lengths)
File "/ALF/alf/bin/train_test.py", line 97, in _test_func
self.assertGreater(np.mean(returns[-2:]), 198)
AssertionError: 197.8499984741211 not greater than 198

@witwolf It seems that the determinism isn't working as expected? Some other cases also fail occasionally, and I have to manually restart the testing job each time.
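
For reference, full determinism usually requires seeding every RNG in play; whether train_test.py already does all of this is something to verify (a hedged sketch, not the current test code):

import random

import numpy as np
import tensorflow as tf

def seed_everything(seed=0):
    random.seed(seed)         # Python-level RNG
    np.random.seed(seed)      # NumPy RNG (also used by gym environments)
    tf.random.set_seed(seed)  # TF graph-level and op-level seeds
    # Per-environment seeds and parallel-environment scheduling can still
    # introduce nondeterminism on top of this.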

async train sometimes fails

When testing with ppo_async_icm_super_mario_intrinsic_only:

rm -rf tmp && python3 -m alf.bin.train \
 --root_dir=tmp \
 --gin_file=ppo_async_icm_super_mario_intrinsic_only.gin \
 --gin_param=TrainerConfig.random_seed=0 \
 --gin_param=create_environment.num_parallel_environments=1 \
 --gin_param=TrainerConfig.num_iterations=2 \
 --gin_param=TrainerConfig.num_steps_per_iter=1 \
 --gin_param=TrainerConfig.num_updates_per_train_step=1 \
 --gin_param=TrainerConfig.mini_batch_length=2 \
 --gin_param=TrainerConfig.mini_batch_size=4 \
 --gin_param=TrainerConfig.num_envs=2 \
 --gin_param=ReplayBuffer.max_length=64 \
 --gin_param=TrainerConfig.unroll_length=2 \
 --gin_param=TrainerConfig.num_updates_per_train_step=2 \
 --gin_param=TrainerConfig.use_tf_functions=False

I get the error message:

  ...
  File "/home/hongyingxiang/FLA/alf/drivers/threads.py", line 410, in _step
    self._env.step(action), action, first_env_id=self._first_env_id)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/tf_environment.py", line 232, in step
    return self._step(action)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 292, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/tf_py_environment.py", line 319, in _step
    name='step_py_func')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py", line 591, in numpy_function
    return py_func_common(func, inp, Tout, stateful=True, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py", line 488, in py_func_common
    result = func(*[x.numpy() for x in inp])
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/tf_py_environment.py", line 302, in _isolated_step_py
    return self._execute(_step_py, *flattened_actions)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/tf_py_environment.py", line 195, in _execute
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/tf_py_environment.py", line 298, in _step_py
    self._time_step = self._env.step(packed)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/py_environment.py", line 174, in step
    self._current_time_step = self._step(action)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/parallel_py_environment.py", line 136, in _step
    time_steps = [promise() for promise in time_steps]
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/parallel_py_environment.py", line 136, in <listcomp>
    time_steps = [promise() for promise in time_steps]
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/parallel_py_environment.py", line 338, in _receive
    raise Exception(stacktrace)
Exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/parallel_py_environment.py", line 377, in _worker
    result = getattr(env, name)(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/py_environment.py", line 174, in step
    self._current_time_step = self._step(action)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/wrappers.py", line 105, in _step
    time_step = self._env.step(action)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/py_environment.py", line 174, in step
    self._current_time_step = self._step(action)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/gym_wrapper.py", line 197, in _step
    observation, reward, self._done, self._info = self._gym_env.step(action)
  File "/usr/local/lib/python3.6/dist-packages/gym/core.py", line 282, in step
    return self.env.step(self.action(action))
  File "/home/hongyingxiang/FLA/alf/environments/mario_wrappers.py", line 121, in action
    for i in self._actions[a]:
IndexError: list index out of range

With the following diff applied:

diff --git a/alf/algorithms/actor_critic_algorithm.py b/alf/algorithms/actor_critic_algorithm.py
index 20216fa..836e9dc 100644
--- a/alf/algorithms/actor_critic_algorithm.py
+++ b/alf/algorithms/actor_critic_algorithm.py
@@ -110,6 +110,8 @@ class ActorCriticAlgorithm(OnPolicyAlgorithm):
             step_type=time_step.step_type,
             network_state=state.actor)

+        import threading
+        print(action_distribution.logits[0][:4], threading.current_thread().ident)
         action = common.sample_action_distribution(action_distribution)
         return PolicyStep(
             action=action,

I get:

  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/py_environment.py", line 174, in step
    self._current_time_step = self._step(action)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/gym_wrapper.py", line 197, in _step
    observation, reward, self._done, self._info = self._gym_env.step(action)
  File "/usr/local/lib/python3.6/dist-packages/gym/core.py", line 282, in step
    return self.env.step(self.action(action))
  File "/home/hongyingxiang/FLA/alf/environments/mario_wrappers.py", line 121, in action
    for i in self._actions[a]:
IndexError: list index out of range

tf.Tensor([nan nan nan nan], shape=(4,), dtype=float32) 140210483619648
tf.Tensor([nan nan nan nan], shape=(4,), dtype=float32) 140210483619648
tf.Tensor([nan nan nan nan], shape=(4,), dtype=float32) 140210483619648

The logits of distributions.Categorical can become NaN (this issue is very easy to reproduce).
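
One possible way to surface this earlier (an assumption, not code from the repo) is to guard the sampling step with tf.debugging.check_numerics, so the failure happens at the NaN logits instead of later inside the environment step:

import tensorflow as tf
import tensorflow_probability as tfp

def sample_checked(logits):
    # Raises InvalidArgumentError as soon as the logits contain NaN/Inf.
    logits = tf.debugging.check_numerics(logits, message='NaN/Inf in action logits')
    return tfp.distributions.Categorical(logits=logits).sample()

print(sample_checked(tf.constant([[0.1, 0.2, 0.3, 0.4]])))  # valid logits sample fine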

Could you help take a look at this issue? @emailweixu @hnyu

Figures showing metric against time

Right now we have figures showing metrics against the global counter, environment steps, etc. In order to compare wall-clock speed between algorithms/settings, we also need figures with time as the x-axis and the metric as the y-axis.
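
TensorBoard's relative/wall-time x-axis options partially cover this; another option is to record metrics keyed by elapsed seconds (a minimal sketch assuming TF2 summary writers; the directory name is hypothetical):

import time

import tensorflow as tf

_writer = tf.summary.create_file_writer('/tmp/alf_time_metrics')
_start_time = time.time()

def record_by_time(name, value):
    # Use elapsed wall-clock seconds as the summary step.
    elapsed_s = int(time.time() - _start_time)
    with _writer.as_default():
        tf.summary.scalar(name, value, step=elapsed_s)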

Actor critic RNN policy unit test fails

There is a shape error when running the tests alf/algorithms/test_actor_critic_rnn_policy or alf/drivers/on_policy_driver_test.py:

File "alf/algorithms/actor_critic_algorithm_test.py", line 165, in test_actor_critic_rnn_policy policy_step = policy.action(time_step, policy_state) ValueError: Incompatible shape for value ((100, 100)), expected ((100, 1))

can't init submodule with current master

Current master (e5e903b) reports an error on git submodule update:

03:10:03 (py3env) immars@immars-brick alf ±|master ✗|→ git submodule update
error: Server does not allow request for unadvertised object ba9cf75787469554483874efcf07a4f00306443c
Fetched in submodule path 'tf_agents', but it did not contain ba9cf75787469554483874efcf07a4f00306443c. Direct fetching of that commit failed.

It seems that #22 updated tf_agents to commit ba9cf75787, which no longer exists?

grocery_ground_goal_task training taking 3x more memory than expected?

4 bytes (float) * 80 * 80 (image size) * 3 (channels) * 100 (unroll length) * 12 (input + two conv layers + backprop + framestack) * 30 (parallel envs) / 1,000,000,000
≈ 2.7 GB

Currently CUDA seems to be taking ~9 GB of GPU memory (rendering is taking another ~4 GB):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:17:00.0 On | N/A |
| 18% 59C P5 46W / 250W | 5274MiB / 10988MiB | 31% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:65:00.0 Off | N/A |
| 30% 66C P2 81W / 250W | 8856MiB / 10989MiB | 3% Default |
+-------------------------------+----------------------+----------------------+

This seems to suggest it is taking about 3x the memory we expect?

Did we miss anything in the calculation?

Thanks,
Le
----- some details -----
conv_layer_params = ((16, 3, 2), (32, 3, 2))
1st conv layer: 40*40*16; 2nd conv layer: 20*20*32; together roughly 2x the input layer,
then *2 for the actor and critic networks, and *2 again for forward and backprop = 8x,
plus 4x the input layer for FrameStack
= 8 + 4 = 12x the size of the input layer
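
For reference, the back-of-the-envelope estimate above as code (all factors come from the breakdown above; this is just the arithmetic):

bytes_per_float = 4
input_elements = 80 * 80 * 3   # H * W * channels
unroll_length = 100
activation_factor = 12         # ~2x for convs, *2 actor/critic, *2 backprop = 8, plus 4x FrameStack
parallel_envs = 30

estimate_gb = (bytes_per_float * input_elements * unroll_length
               * activation_factor * parallel_envs) / 1e9
print('%.2f GB' % estimate_gb)  # ~2.76 GB, vs ~9 GB observed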

Support rendering on multiple GPUs

Currently, for suite_socialbot, when using ParallelPyEnvironment all the environments perform rendering on the same GPU (GPU 0). We should look into whether allowing rendering on multiple GPUs can speed up training.

images are not stored as tf.uint8 in replay buffers

Due to the tf_agents implementation of _spec_from_gym_space() in gym_wrapper.py, all gym.spaces.Box inputs are mapped to the same dtype specified in dtype_map. This causes issues when the inputs are a mixture of uint8 and float32, or when the action is float32, so the default dtype for now is always float32. As a result, we are spending 4x the memory storing image inputs right now.
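
A minimal sketch of one possible fix (an assumption, not the actual tf_agents change): derive each spec's dtype from its own Box space instead of a single dtype_map entry:

from tf_agents.specs import array_spec

def spec_from_box(space, name=None):
    # Keeps uint8 for image observations and float32 for continuous values.
    return array_spec.BoundedArraySpec(
        shape=space.shape,
        dtype=space.dtype,
        minimum=space.low,
        maximum=space.high,
        name=name)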

Unittests for running all examples

Even though it's not reasonable to train all the examples until they reach the desired performance, we should at least make sure they can run a few iterations and play correctly.
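
A hedged sketch of such a smoke test (the glob pattern and flags are assumptions; TrainerConfig.num_iterations is the knob used elsewhere in this repo to shorten runs):

import glob
import os
import subprocess
import unittest

class ExamplesSmokeTest(unittest.TestCase):
    def test_examples_run_a_few_iterations(self):
        for gin_file in glob.glob('alf/examples/*.gin'):
            subprocess.check_call([
                'python3', '-m', 'alf.bin.train',
                '--root_dir', os.path.join('/tmp/smoke', os.path.basename(gin_file)),
                '--gin_file', gin_file,
                '--gin_param', 'TrainerConfig.num_iterations=2',
            ])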

inoperative config info seems incorrect

I included "atari.gin" in my gin file to run a job. In TensorBoard "text", I see the following inoperative info:

[Screenshot from 2019-09-26 16-54-34]

However, when I check, these configs are indeed used by my job. Is there a bug here?

extend grid search

In two directions:

  1. Early stopping. For some obviously inferior hyperparameter combinations, it's possible to decide to stop early in training. For that, we could periodically compare a run's performance with the top-k runs at the same training iteration, via some communication channel like a Queue (see the sketch after this list). Basically, this eventually becomes an evolution-like search over hyperparameters. Of course, more advanced methods can be employed:
    https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f

  2. If each training run needs a GPU, then it's impractical to launch all runs on a single machine, so a distributed search is needed (after our cluster is ready?).
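
A minimal sketch of the early-stopping idea in item 1 (an illustration, not existing grid_search.py code): each run periodically puts (run_id, iteration, metric) on a shared Queue, and a monitor terminates runs that fall well behind the current top-k (the per-iteration alignment from the proposal is simplified away here):

import heapq

def monitor(queue, processes, k=3, tolerance=0.5):
    # `processes` maps run_id -> multiprocessing.Process;
    # `queue` carries (run_id, iteration, metric) tuples.
    top_k = []  # min-heap of the best k metrics seen so far
    while any(p.is_alive() for p in processes.values()):
        run_id, iteration, metric = queue.get()
        heapq.heappush(top_k, metric)
        if len(top_k) > k:
            heapq.heappop(top_k)
        # Terminate a run whose metric falls far below the worst of the current top-k.
        if len(top_k) == k and metric < tolerance * top_k[0]:
            processes[run_id].terminate()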

I noticed that there are existing Python-based libraries for hyperparameter search. One example is Hyperopt:
https://github.com/hyperopt/hyperopt

We should investigate these libraries further. If they are ready for our use case, we might not want to reinvent the wheel.

remove all specs

I believe that specs make our framework strongly typed, which is good for avoiding mistakes, but they also make the code much less flexible and more verbose (not Pythonic). There are two main reasons for having specs:

  1. tf_agents networks require specs to initialize
  2. a replay buffer requires specs to allocate memory in __init__

For 1), we can probably rewrite some networks (Keras doesn't require specs); for 2), when we create a replay buffer, we can pass in an example experience and let the replay buffer class extract the specs internally.
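
A hedged sketch of point 2 (illustrative only): derive the specs from one example experience with tf.nest, so callers never construct specs by hand:

import tensorflow as tf

def specs_from_example(example):
    # Map every tensor in the (possibly nested) experience to a TensorSpec.
    return tf.nest.map_structure(
        lambda t: tf.TensorSpec(shape=t.shape, dtype=t.dtype), example)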

Am I missing other reasons for specs?

alf build failing

This is the current status on the GitHub page:

[image: build status badge showing the failure]

Looking into the Travis logs, the following job fails:
[image: excerpt of the failing Travis log]

Weird Tensorboard behaviors

Sometimes when I train intrinsic-reward-only agents, I observe non-zero "AverageReturn" points but at the same time all-zero points in "extrinsic/mean". See the two figures below.

[figure: Metrics_AverageReturn]
[figure: reward_extrinsic_mean]

Without any extrinsic rewards, it's basically impossible for "AverageReturn" to be nonzero.
Note that this has nothing to do with averaging, because the second curve is exactly 0, and I didn't downsample either curve.

Does anyone know why this happens?

Inoperative gin config

Sometimes the configuration provided in a gin config file may be overridden by Python code. One example is alf/examples/ppo_pr2.sh in PR #20, where the command line specifies --gin_param='train_eval.num_epochs=10' but it is unintentionally overridden by a FLAG '--num_epochs' with a default value of 10.

It would be nice if we could find a way to write all the unused configs from gin_file or gin_params to TensorBoard.
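
A minimal sketch of writing the gin config to TensorBoard (assuming gin.operative_config_str() and tf.summary.text; the unused/inoperative part would still have to be derived separately, e.g. by diffing against the parsed config files):

import gin
import tensorflow as tf

def summarize_gin_config(summary_dir, step=0):
    writer = tf.summary.create_file_writer(summary_dir)
    with writer.as_default():
        # Only the configs actually consumed by the run are included here.
        tf.summary.text('gin/operative_config', gin.operative_config_str(), step=step)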

Better histogram for the summary of discrete action

In alf.utils.common.add_action_summaries, tf.summary.histogram does not generate a good histogram for discrete actions. Since we know the min and max of a discrete action, we should generate a histogram with the known number of buckets and with the right min and max.
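
A hedged sketch (not the ALF implementation) of building such a histogram with one bucket per action value, using tf.histogram_fixed_width so the bucket edges come from the known action range rather than from the sampled data:

import tensorflow as tf

def discrete_action_histogram(actions, action_min, action_max):
    # One bucket per integer action in [action_min, action_max].
    num_buckets = int(action_max - action_min + 1)
    return tf.histogram_fixed_width(
        tf.cast(actions, tf.float32),
        value_range=[float(action_min) - 0.5, float(action_max) + 0.5],
        nbins=num_buckets)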

Refactoring: remove create_???_algorithm

We should make it possible to directly configure an Algorithm from gin instead of relying on create_???_algorithm.
Since we need to pass observation_spec, action_spec, time_step_spec, etc. to many of the constructors, we can implement several functions, observation_spec(), action_spec(), and time_step_spec(), to get those specs from gin.
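
A hedged sketch of the idea (the names mirror the proposal; this is not existing ALF code): the trainer sets the specs once, and gin-configurable accessors expose them so gin files can pass @observation_spec() etc. to Algorithm constructors:

import gin

_specs = {}

def set_specs(observation_spec, action_spec, time_step_spec):
    # Called once by the trainer after the environment is created.
    _specs.update(observation_spec=observation_spec,
                  action_spec=action_spec,
                  time_step_spec=time_step_spec)

@gin.configurable
def observation_spec():
    return _specs['observation_spec']

@gin.configurable
def action_spec():
    return _specs['action_spec']

@gin.configurable
def time_step_spec():
    return _specs['time_step_spec']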

trac_ddpg_pendulum failed

When testing with trac_ddpg_pendulum:

python -m alf.bin.train --root_dir=tdp --gin_file=trac_ddpg_pendulum

I get the error message below; still investigating it.

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/hongyingxiang/FLA/alf/bin/train.py", line 88, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/hongyingxiang/FLA/alf/bin/train.py", line 79, in main
    train_eval(FLAGS.root_dir)
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1032, in wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/usr/local/lib/python3.6/dist-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
    six.raise_from(proxy.with_traceback(exception.__traceback__), None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1009, in wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/hongyingxiang/FLA/alf/bin/train.py", line 73, in train_eval
    trainer.train()
  File "/home/hongyingxiang/FLA/alf/trainers/policy_trainer.py", line 315, in train
    summary_max_queue=self._summary_max_queue)
  File "/home/hongyingxiang/FLA/alf/utils/common.py", line 265, in run_under_record_context
    func()
  File "/home/hongyingxiang/FLA/alf/trainers/policy_trainer.py", line 345, in _train
    time_step=time_step)
  File "/home/hongyingxiang/FLA/alf/trainers/off_policy_trainer.py", line 74, in _train_iter
    update_counter_every_mini_batch=self._config.
  File "/home/hongyingxiang/FLA/alf/algorithms/off_policy_algorithm.py", line 105, in train
    mini_batch_length, update_counter_every_mini_batch)
  File "/home/hongyingxiang/FLA/alf/utils/common.py", line 985, in __call__
    return tf_func_instance(get_current_scope(), *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 696, in _call
    return function_lib.defun(fn_with_cond)(*canon_args, **canon_kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError:  assertion failed: [100]
   [[{{node cond/else/_1/StatefulPartitionedCall/while/body/_694/while/body/_1961/StatefulPartitionedCall/Assert/AssertGuard/else/_7599/Assert}}]] [Op:__inference_fn_with_cond_12063]

Function call stack:
fn_with_cond

  In call to configurable 'train_eval' (<function train_eval at 0x7f35b018a840>)

cannot work with tf.function

1) Code patch (examples/actor_critic.py):

    # driver = PyDriver(
    #     tf_env,
    #     policy,
    #     observers=train_metrics,
    #     max_steps=num_steps_per_iter)

    driver = DynamicStepDriver(
        tf_env,
        policy,
        observers=train_metrics,
        num_steps=num_steps_per_iter)
    # _ = algorithm.variables  # build networks
    driver.run = tfa_common.function(driver.run)

It runs successfully with use_icm=0: python actor_critic.py --root_dir=~/tmp/icm/ --num_parallel_environments=1 --gin_param=train_eval.use_icm=0

but fails with use_icm=1: python actor_critic.py --root_dir=~/tmp/icm/ --num_parallel_environments=1 --gin_param=train_eval.use_icm=1

TypeError: An op outside of the function building code is being passed
a "Graph" tensor. It is possible to have Graph tensors
leak out of the function building context by including a
tf.init_scope in your function building code.
For example, the following function will fail:
@tf.function
def has_init_scope():
  my_constant = tf.constant(1.)
  with tf.init_scope():
    added = my_constant * 2
The graph tensor has name: ActorDistributionNetwork/CategoricalProjectionNetwork/Categorical/sample/Reshape_1:0
In call to configurable 'train_eval' (<function train_eval at 0x134018950>)

The behavior is strange since there's nothing special about the ICM module (ICMAlgorithm).

The experiments above run successfully when we build the networks before running the graph:

_ = algorithm.variables  # build networks

2) The training procedure is incorrect in the successful cases with tf.function.

Refactoring: Move OffPolicyDriver.train to RLAlgorithm

And also move the replay buffer to RLAlgorithm.

The exact training procedure should be decided by the algorithm. Moving it into the algorithm will make algorithms more flexible, so that we can minimize the need to change or write a policy driver for future algorithms.

Speeding up the loading of SocialRobot environment

When using 30 or 60 parallel environments, the time for loading the environments becomes quite long. It seems that the environments are loaded sequentially. We might be able to speed up loading by loading the environments in parallel.
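
One thing worth checking (an assumption about the installed tf_agents version, not verified here) is whether ParallelPyEnvironment's start_serially flag already lets the worker processes start concurrently:

from tf_agents.environments import parallel_py_environment

# env_constructors is a list of callables, one per parallel environment.
# env = parallel_py_environment.ParallelPyEnvironment(
#     env_constructors, start_serially=False)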

Incorporating long-term entropy bonus reward in on-policy AC

For on-policy AC, we have the entropy regularizer E_{\pi}[-\log\pi(a_t|s_t)] at every step. As an unbiased estimate, we can simply use the negative log-likelihood -\log\pi(a_t|s_t) as an additional reward added to the computed advantage for the policy gradient loss computation. This is essentially a one-step entropy bonus reward, where the policy only cares about the single-step bonus.

We can extend this to long-term entropy maximization by adding -\log\pi(a_t|s_t) as an intrinsic reward (just like the ICM rewards), which will be absorbed into the advantage and return computations. This could potentially further improve our on-policy AC performance on top of the current single-step entropy bonus.
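
A hedged sketch of the proposed intrinsic reward (illustrative names, not ALF code): the per-step negative log-likelihood of the sampled action, scaled and added to the reward before the advantage/return computation:

import tensorflow_probability as tfp

def entropy_bonus_reward(logits, action, entropy_coef=0.01):
    # -log pi(a_t|s_t) is an unbiased single-sample estimate of the entropy.
    dist = tfp.distributions.Categorical(logits=logits)
    return -entropy_coef * dist.log_prob(action)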

Note that this simple treatment only applies to on-policy algorithms. For off-policy algorithms, we then need SAC or soft-Q's formulation.

missing curves and gifs in README

We are missing training curves and GIFs showing the trained agent playing in the following two environments in the README file:

"Simple navigation with visual input. Follow the instruction at SocialRobot to install the environment."
and
"PR2 grasping state only. Follow the instruction at SocialRobot to install the environment."

@witwolf and @Jialn Maybe these are easy for you guys to add?

Add evaluation to on_policy_trainer

Similar to tf_agents' train_eval, we need to periodically evaluate the policy during training. The evaluation usually uses greedy_predict and may differ significantly from non-greedy predict.
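
A minimal sketch of such a periodic evaluation (greedy_predict comes from the description above; get_initial_predict_state, the non-batched eval environment, and the TimeStep interface are assumptions about the surrounding code):

def evaluate(algorithm, eval_env, num_episodes=10):
    returns = []
    for _ in range(num_episodes):
        time_step = eval_env.reset()
        state = algorithm.get_initial_predict_state()  # assumed helper
        episode_return = 0.0
        while not time_step.is_last():
            policy_step = algorithm.greedy_predict(time_step, state)
            state = policy_step.state
            time_step = eval_env.step(policy_step.action)
            episode_return += float(time_step.reward)
        returns.append(episode_return)
    return sum(returns) / len(returns)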
