keiohta / tf2rl Goto Github PK

TensorFlow2 Reinforcement Learning

License: MIT License

Python 99.26% Shell 0.60% Dockerfile 0.13%

reinforcement-learning tensorflow2 inverse-reinforcement-learning tensorflow imitation-learning deep-reinforcement-learning

tf2rl's People

Contributors

Stargazers

Watchers

Forkers

ymd-h wshoway likeucode xunyiljg wwxfromtju bcc2xp fcdtc hititan huangyumeng laudehenri mamo3gr zhb0318 misoknisky nyk510 chris-512 mrdadaguy gaohunter dr-tuski marymijin jeffhere brightenzl yubo-huang dshwei eijoac benquick123 mariokcy illyakaynov zcdliuwei arastogi1997 zhangtning knakane arqam-ai liuuuu54 baojialiustc locke637 cambel chicken-point junjie2008v ai-hub-deep-learning-fundamental ellenyan123 seungjunha ruixueqingyang robotemperor jacklinquan eswarraop cmwlh gail-4-bark dapenggg ruixianzhang666 xanderyin harveyphm ykawamura96 lvzw1895 ruanribeiros joomladigger mengsicode wangyu1997 otreewen2020 utschie big-data-ai renzhenxuexidemaimai shiyujin0 jaysoncho92 aphelion17 staminatang jc-bao cloth41 hesnobi weihancool sff1019 safarzadeh-reza naji-s nikunj-gupta chupeng24 cocobar mczhi mk788 pgkang xioaxiao005 shengrenhou cristinagomezsantamaria liang813 alirezashamsoshoara bmabsout liguoyu1 kmh8667 jasonj99 zhengnx0906 waikikilick peteris-racinskis estherz-ele qinghuachen007 nathinal ponder-lab cqychen ronliu014 dragon-ai mysticmountain chopper0126 mrahman2-vt

tf2rl's Issues

Replace custom distribution classes with TensorFlow Probability

Replace custom distribution classes under tf2rl/distributions with TensorFlow Probability

Implement SQIL

SQIL: Imitation Learning via Regularized Behavioral Cloning

Enable to specify result folder name

Labels of TensorBoard output depend on names of results directories, so readability improves when user can specify the names.

ApeX: Maximize GPU usage by parallelizing environments

Generally, RL collects experiences by interacting with environment, and it needs to generate an action using current policy networks.
However, making an action for each transition is not computationally efficient because the input to the neural network uses only one batch. So, there could be room to improve efficiency by changing batch size.
In this issue, try to make batch size bigger for one actions by preparing multiple environments and forward all environments with one actions.
Note that this method is only valid for no-GIL environments, because python only works on one process.

Implement discrete version of SAC

Soft Actor-Critic for Discrete Action Settings

Fix random seed values to reproduce results

Using TensorFlow global time step makes main loop slower

Measure time spent in main loop using line_profiler
Spend 16% for computing tf.train.create_global_step() related operations
- 4.0 while total_steps < self._max_steps:
- 4.8 if total_steps >= self._policy.n_warmup:
- 4.8 if total_steps >= self._policy.n_warmup:
- 2.6 if int(total_steps) % self._model_save_interval == 0:

$ git checkout cff2d42ae73b7ddaa050853b3359a78ada06929a
$ python examples/run_dqn_line_profiler.py --max-steps=10000
...
File: /Users/keiohta/workspace/rl/tf2rl/tf2rl/experiments/trainer.py
Function: call at line 50

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    50                                               def call(self):
    51         1       3666.0   3666.0      0.0          total_steps = tf.train.create_global_step()
    52         1          2.0      2.0      0.0          episode_steps = 0
    53         1          0.0      0.0      0.0          episode_return = 0
    54         1          2.0      2.0      0.0          episode_start_time = time.time()
    55         1          1.0      1.0      0.0          n_episode = 0
    56                                           
    57         1          1.0      1.0      0.0          replay_buffer = get_replay_buffer(
    58         1          1.0      1.0      0.0              self._policy, self._env, self._use_prioritized_rb,
    59         1        762.0    762.0      0.0              self._use_nstep_rb, self._n_step)
    60                                           
    61         1         72.0     72.0      0.0          obs = self._env.reset()
    62                                           
    63         1       1049.0   1049.0      0.0          with tf.contrib.summary.record_summaries_every_n_global_steps(1000):
    64     10001     670392.0     67.0      4.0              while total_steps < self._max_steps:
    65     10000     629217.0     62.9      3.8                  if total_steps < self._policy.n_warmup:
    66       500       3332.0      6.7      0.0                      action = self._env.action_space.sample()
    67                                                           else:
    68      9500    2394183.0    252.0     14.4                      action = self._policy.get_action(obs)
    69                                           
    70     10000     285089.0     28.5      1.7                  next_obs, reward, done, _ = self._env.step(action)
    71     10000       9598.0      1.0      0.1                  if self._show_progress:
    72                                                               self._env.render()
    73     10000       7443.0      0.7      0.0                  episode_steps += 1
    74     10000       7514.0      0.8      0.0                  episode_return += reward
    75     10000     768571.0     76.9      4.6                  total_steps.assign_add(1)
    76                                           
    77     10000      10090.0      1.0      0.1                  done_flag = done
    78     10000      12436.0      1.2      0.1                  if hasattr(self._env, "_max_episode_steps") and \
    79     10000       8542.0      0.9      0.1                          episode_steps == self._env._max_episode_steps:
    80         6          5.0      0.8      0.0                      done_flag = False
    81     10000     310069.0     31.0      1.9                  replay_buffer.add(obs=obs, act=action, next_obs=next_obs, rew=reward, done=done_flag)
    82     10000       9355.0      0.9      0.1                  obs = next_obs
    83                                           
    84     10000       9105.0      0.9      0.1                  if done or episode_steps == self._episode_max_steps:
    85       179       2326.0     13.0      0.0                      obs = self._env.reset()
    86                                           
    87       179        199.0      1.1      0.0                      n_episode += 1
    88       179        268.0      1.5      0.0                      fps = episode_steps / (time.time() - episode_start_time)
    89       179        266.0      1.5      0.0                      self.logger.info("Total Epi: {0: 5} Steps: {1: 7} Episode Steps: {2: 5} Return: {3: 5.4f} FPS: {4:5.2f}".format(
    90       179      30602.0    171.0      0.2                          n_episode, int(total_steps), episode_steps, episode_return, fps))
    91                                           
    92       179        211.0      1.2      0.0                      episode_steps = 0
    93       179        122.0      0.7      0.0                      episode_return = 0
    94       179        173.0      1.0      0.0                      episode_start_time = time.time()
    95                                           
    96     10000     804974.0     80.5      4.8                  if total_steps >= self._policy.n_warmup:
    97      9501     371540.0     39.1      2.2                      samples = replay_buffer.sample(self._policy.batch_size)
    98      9501      11071.0      1.2      0.1                      td_error = self._policy.train(
    99      9501       8588.0      0.9      0.1                          samples["obs"], samples["act"], samples["next_obs"],
   100      9501      29581.0      3.1      0.2                          samples["rew"], np.array(samples["done"], dtype=np.float64),
   101      9501    8322550.0    876.0     49.9                          None if not self._use_prioritized_rb else samples["weights"])
   102      9501      14172.0      1.5      0.1                      if self._use_prioritized_rb:
   103                                                                   replay_buffer.update_priorities(samples["indexes"], np.abs(td_error) + 1e-6)
   104      9501     487314.0     51.3      2.9                      if int(total_steps) % self._test_interval == 0:
   105         5         86.0     17.2      0.0                          with tf.contrib.summary.always_record_summaries():
   106         5     989351.0 197870.2      5.9                              avg_test_return = self.evaluate_policy(int(total_steps))
   107         5         20.0      4.0      0.0                              self.logger.info("Evaluation Total Steps: {0: 7} Average Reward {1: 5.4f} over {2: 2} episodes".format(
   108         5       1098.0    219.6      0.0                                  int(total_steps), avg_test_return, self._test_episodes))
   109         5       1452.0    290.4      0.0                              tf.contrib.summary.scalar(name="AverageTestReturn", tensor=avg_test_return, family="loss")
   110         5       1288.0    257.6      0.0                              tf.contrib.summary.scalar(name="FPS", tensor=fps, family="loss")
   111                                           
   112         5        214.0     42.8      0.0                          self.writer.flush()
   113                                           
   114     10000     433524.0     43.4      2.6                  if int(total_steps) % self._model_save_interval == 0:
   115         1      16610.0  16610.0      0.1                      self.checkpoint_manager.save()
   116                                           
   117         1         44.0     44.0      0.0              tf.contrib.summary.flush()

Add test_examples.sh that ensures example scripts work

Which branch is stable?

Implement Rainbow

Rainbow: Combining Improvements in Deep Reinforcement Learning

Replace all float64 operations with float32 operations

So far, all operations are done with float64 (np.float64 and tf.float64) for compatibility with cpprb, but cpprb now experimentally supports arbitrary data type, so replace all float64 operations with float32 operations to accelerate computation.

Implement GAE

High-Dimensional Continuous Control Using Generalized Advantage Estimation

[bug] ApeX does not copy all the local data to global buffer

tf2rl/tf2rl/algos/apex.py

Line 140 in c86d944

samples = local_rb.sample(local_rb.get_stored_size())

ApeX samples randomly from local replay buffer instead of copying all the data.

This looks like unwilling bug.

[Feature] Support new feature of cpprb

tf2rl does not keep up to date with recent cpprb development.

cpprb >= 7.14 obtained N-step feature in the new experimental package.
cpprb >= 8.0 replace stable code with experimental one, finally.

Record test scores with raw scores for Atari

Although training with clipped rewards, test scores should be recorded with raw scores as many papers do

Visualize game learning process on TensorBoard

Visualize final game capture of an episode so that user can know the game progress.
Call following function to visualize the capture to TensorBoard.

tf.contrib.summary.image('train/input_img', tf.cast(image * 255.0, tf.uint8))

Save and restore models

インストールに必要なパッケージを記載することが必要

Requirement の説明がないため、インストール方法を試してExampleを実行しても、エラーになる。

必要な依存関係を記載するもしくは、setup.py によってインストールすることが望ましい。

Implement Spectral Normalization

Use spectral normalization to expect to increase performance in GAIL
- Spectral Normalization for Generative Adversarial Networks

Implement QT-Opt

QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation

Automatic temperature parameter tuning on SAC

Implement automatic tuning of temperature parameter of entropy and reproduce results from Soft Actor-Critic Algorithms and Applications.

cpprb will break its api in version 8

A depending library cpprb is scheduled to break its api in version 8.
(ReplayBuffers in cpprb namespaces are replaced with those in cpprb.experimental, filnally.)

To prepare migration, the current version of tf2rl is recommended to ensure to use version 7 by specifying in setup.py

Use huber loss

which version is this Discrete SAC based upon

hi, may I know which paper(s)/method the discrete SAC is based upon?

from my understanding there are 3 main implementations of discrete SAC:

Gumbel Softmax
KL Divergence
Petros Christodoulou's

but they all include the auto entropy tuning/temperature term. I may have missed it but I dont see it in your version of the code. Thanks for your time!

Implement R2D2

Recurrent Experience Replay in Distributed Reinforcement Learning

Add unit tests

Adding noise to action in DDPG implementation

Hi,

I noticed another thing. In DDPG implementation, there is a method get_action() that I think by accident doesn't add noise to action in training phase, but adds it during testing. Here's the exact line that I think is problematic:

tf2rl/tf2rl/algos/ddpg.py

Line 100 in 1badc2a

 tf.constant(state), self.sigma * test, tf.constant(self.actor.max_action, dtype=tf.float32)) 

As per original pseudocode in original paper DDPG, page 5, it is explicitly stated that noise is added during training. I'm assuming actor's action is then directly used during testing.

Reproduce ApeX paper results for continuous action

Distributed Prioritized Experience Replay
Hyper parameters are shown in appendix D

Reproduce DQN paper results for Atari

Reproduce following paper results (at least 1 game, because @keiohta does not have enough computation resource)
- Human-level control through deep reinforcement learning

Write detail agent type to README

Agent type can be classified into discrete or continuous.
Also, some detail information should be written, such as recurrent output.

Show movies of atari and roboschool results solved using tf2rl

Implement A3C

Asynchronous Methods for Deep Reinforcement Learning

Implement ApeX DQN

Distributed Prioritized Experience Replay
Current implementation works only for DDPG variants, so extends it to work with DQN like agent

Implement Noisy Networks for DQN

Noisy Networks for Exploration

Implement WGAIL and WGAIL-GP

Modelling human driving behaviour using Generative Adversarial Networks
REF: DCGAN-LSGAN-WGAN-GP-DRAGAN-Tensorflow-2

Implement MPC

Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning

Fix coverage badge is not shown

Coverage badge is unknown for some reasons. Fix this.

Schedule epsilon for epsilon-greedy policy

Currently just use the specified value, such as epsilon=0.1.

Implement VPG

Policy Gradient Methods for Reinforcement Learning with Function Approximation

Set CI

Support gin-config

Currently users need to specify hyper parameters by passing command line arguments or set_defaults in examples/run_*.py, but it's a bother to do for each algorithm/environment. So, use gin to specify initial values especially for reproducing the results of papers.
https://github.com/google/gin-config

TensorFlow の contrib は TF 2.0 で削除されるので、利用しないことが好ましい

以下、実行時にコンソールに表示された警告メッセージ

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

Implement PPO

Proximal Policy Optimization Algorithms

Run multiple times for one experiment

We need to evaluate an agent with multiple random seeds (more than 5) to statistically show the performance of the agent, so enable to run multiple times with different random seeds.
- REF: How Many Random Seeds? Statistical Power Analysis in Deep Reinforcement Learning Experiments

Implement DQfD

Deep Q-learning from Demonstrations