mimoralea / gdrl

Grokking Deep Reinforcement Learning

Home Page: https://www.manning.com/books/grokking-deep-reinforcement-learning

License: BSD 3-Clause "New" or "Revised" License

Languages: Jupyter Notebook 99.99%, Dockerfile 0.01%, Shell 0.01%, Python 0.01%
Topics: deep-learning deep-reinforcement-learning reinforcement-learning machine-learning algorithms artificial-intelligence neural-networks pytorch pytorch-tutorials numpy numpy-tutorial docker gpu nvidia-docker

gdrl's Introduction

Grokking Deep Reinforcement Learning

Note: At the moment, only running the code from the docker container (below) is supported. Docker allows for creating a single environment that is more likely to work on all systems. Basically, I install and configure all packages for you, except docker itself, and you just run the code in a tested environment.

To install docker, I recommend a web search for "installing docker on <your os here>". To run the code on a GPU, you additionally have to install nvidia-docker, which allows containers to use the host's GPUs. After you have docker (and nvidia-docker if using a GPU) installed, follow the four steps below.

Running the code

  1. Clone this repo:
    git clone --depth 1 https://github.com/mimoralea/gdrl.git && cd gdrl
  2. Pull the gdrl image with:
    docker pull mimoralea/gdrl:v0.14
  3. Spin up a container:
    • On Mac or Linux:
      docker run -it --rm -p 8888:8888 -v "$PWD"/notebooks/:/mnt/notebooks/ mimoralea/gdrl:v0.14
    • On Windows:
      docker run -it --rm -p 8888:8888 -v %CD%/notebooks/:/mnt/notebooks/ mimoralea/gdrl:v0.14
    • NOTE: If you are using a GPU, use nvidia-docker or add --gpus all right after --rm in the command (see the example after this list).
  4. Open a browser and go to the URL shown in the terminal (likely to be: http://localhost:8888). The password is: gdrl
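
For example, on Mac or Linux with nvidia-docker installed, the GPU-enabled command would look like this (a sketch of step 3 with the flag added; adjust it to your setup):
      docker run -it --rm --gpus all -p 8888:8888 -v "$PWD"/notebooks/:/mnt/notebooks/ mimoralea/gdrl:v0.14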

About the book

Book's website

https://www.manning.com/books/grokking-deep-reinforcement-learning

Table of contents

  1. Introduction to deep reinforcement learning
  2. Mathematical foundations of reinforcement learning
  3. Balancing immediate and long-term goals
  4. Balancing the gathering and utilization of information
  5. Evaluating agents' behaviors
  6. Improving agents' behaviors
  7. Achieving goals more effectively and efficiently
  8. Introduction to value-based deep reinforcement learning
  9. More stable value-based methods
  10. Sample-efficient value-based methods
  11. Policy-gradient and actor-critic methods
  12. Advanced actor-critic methods
  13. Towards artificial general intelligence

Detailed table of contents

1. Introduction to deep reinforcement learning

2. Mathematical foundations of reinforcement learning

  • (Livebook)
  • (Notebook)
    • Implementations of several MDPs:
      • Bandit Walk
      • Bandit Slippery Walk
      • Slippery Walk Three
      • Random Walk
      • Russell and Norvig's Gridworld from AIMA
      • FrozenLake
      • FrozenLake8x8

3. Balancing immediate and long-term goals

  • (Livebook)
  • (Notebook)
    • Implementations of methods for finding optimal policies:
      • Policy Evaluation
      • Policy Improvement
      • Policy Iteration
      • Value Iteration

4. Balancing the gathering and utilization of information

  • (Livebook)
  • (Notebook)
    • Implementations of exploration strategies for bandit problems:
      • Random
      • Greedy
      • E-greedy
      • E-greedy with linearly decaying epsilon
      • E-greedy with exponentially decaying epsilon
      • Optimistic initialization
      • SoftMax
      • Upper Confidence Bound
      • Bayesian

5. Evaluating agents' behaviors

  • (Livebook)
  • (Notebook)
    • Implementation of algorithms that solve the prediction problem (policy estimation):
      • On-policy first-visit Monte-Carlo prediction
      • On-policy every-visit Monte-Carlo prediction
      • Temporal-Difference prediction (TD)
      • n-step Temporal-Difference prediction (n-step TD)
      • TD(λ)

6. Improving agents' behaviors

  • (Livebook)
  • (Notebook)
    • Implementation of algorithms that solve the control problem (policy improvement):
      • On-policy first-visit Monte-Carlo control
      • On-policy every-visit Monte-Carlo control
      • On-policy TD control: SARSA
      • Off-policy TD control: Q-Learning
      • Double Q-Learning

7. Achieving goals more effectively and efficiently

  • (Livebook)
  • (Notebook)
    • Implementation of more effective and efficient reinforcement learning algorithms:
      • SARSA(λ) with replacing traces
      • SARSA(λ) with accumulating traces
      • Q(λ) with replacing traces
      • Q(λ) with accumulating traces
      • Dyna-Q
      • Trajectory Sampling

8. Introduction to value-based deep reinforcement learning

  • (Livebook)
  • (Notebook)
    • Implementation of a value-based deep reinforcement learning baseline:
      • Neural Fitted Q-iteration (NFQ)

9. More stable value-based methods

  • (Livebook)
  • (Notebook)
    • Implementation of "classic" value-based deep reinforcement learning methods:
      • Deep Q-Networks (DQN)
      • Double Deep Q-Networks (DDQN)

10. Sample-efficient value-based methods

  • (Livebook)
  • (Notebook)
    • Implementation of main improvements for value-based deep reinforcement learning methods:
      • Dueling Deep Q-Networks (Dueling DQN)
      • Prioritized Experience Replay (PER)

11. Policy-gradient and actor-critic methods

  • (Livebook)
  • (Notebook)
    • Implementation of classic policy-based and actor-critic deep reinforcement learning methods:
      • Policy Gradients without value function and Monte-Carlo returns (REINFORCE)
      • Policy Gradients with value function baseline trained with Monte-Carlo returns (VPG)
      • Asynchronous Advantage Actor-Critic (A3C)
      • Generalized Advantage Estimation (GAE)
      • [Synchronous] Advantage Actor-Critic (A2C)

12. Advanced actor-critic methods

  • (Livebook)
  • (Notebook)
    • Implementation of advanced actor-critic methods:
      • Deep Deterministic Policy Gradient (DDPG)
      • Twin Delayed Deep Deterministic Policy Gradient (TD3)
      • Soft Actor-Critic (SAC)
      • Proximal Policy Optimization (PPO)

13. Towards artificial general intelligence

gdrl's People

Contributors

mimoralea


gdrl's Issues

Cannot start the notebooks from the container

Hi! This is my first time opening an issue; I hope I'm doing it right.
I'm not an expert with Docker or the Windows command line in general, but I followed the instructions in the README.
Basically, if I try to run the container with the recommended command, I get this error:

...\gdrl>docker run -it --rm -p 8888:8888 -v %CD%/notebooks/:/mnt/notebooks/ mimoralea/gdrl:v0.14
docker: invalid reference format: repository name must be lowercase.
See 'docker run --help'.

The same happens if I use %cd% instead of %CD%.

I really don't understand where the issue is, since apart from %CD% there is not a single upper-case letter.

A possible error in chapter-11.ipynb/SharedAdam and SharedRMSprop

Maybe we need to change
self.state[p]['steps'] = self.state[p]['shared_step'].item() to
self.state[p]['step'] = self.state[p]['shared_step'].item() in both step functions of SharedAdam and SharedRMSprop in chapter-11.ipynb, because in Adam's step function, state['step'] is used rather than state['steps'].

To ensure the copy of gradients in chapter-11.ipynb/A3C

In the optimize_model function of A3C, gradients from the local model are copied to the shared model when shared_param.grad is None. However, it seems that shared_param.grad would never be None after the first copy operation. Maybe we need to replace self.shared_value_optimizer.zero_grad() with self.shared_value_optimizer.zero_grad(set_to_none=True). The same change should also be applied to self.shared_policy_optimizer.
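
A minimal standalone sketch of the point above (toy modules, not the notebook's A3C classes):

import torch
import torch.nn as nn

local_model = nn.Linear(4, 2)
shared_model = nn.Linear(4, 2)
shared_optimizer = torch.optim.Adam(shared_model.parameters(), lr=1e-3)

# compute local gradients
local_model(torch.randn(8, 4)).pow(2).mean().backward()

# copy local gradients into the shared model, guarded as in the notebook
for param, shared_param in zip(local_model.parameters(), shared_model.parameters()):
    if shared_param.grad is None:
        shared_param.grad = param.grad
shared_optimizer.step()

# zero_grad() leaves zero-filled gradient tensors behind, so the 'is None' guard
# above would never trigger on later updates; set_to_none=True resets grads to None.
shared_optimizer.zero_grad(set_to_none=True)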

Hello, I get this error, no space left on device

At the last moment of sudo docker pull mimoralea/gdrl:v0.14, I get this error:

failed to register layer: Error processing tar file(exit status 1): write /opt/conda/lib/libnvvm.so.3.3.0: no space left on device

Any way to solve this?

> sudo docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 20.10.5
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: nvidia runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 4.4.0-176-generic
 Operating System: Ubuntu 16.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 177.1GiB
 Name: aoaiie
 ID: 6FTS:LYUI:2W7H:I3H5:DBKG:6GI4:47XT:I3J6:2ZVE:7ET7:5H74:DFU7
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support
> df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             89G     0   89G   0% /dev
tmpfs            18G  8.8M   18G   1% /run
/dev/xvda1       48G   37G  8.9G  81% /
tmpfs            89G     0   89G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            89G     0   89G   0% /sys/fs/cgroup
/dev/xvdb       2.0T  616G  1.4T  31% /home
tmpfs            18G     0   18G   0% /run/user/1000
tmpfs            18G     0   18G   0% /run/user/1001

Docker run on windows fails, wrong syntax in readme?

I ran your docker instructions for windows given in readme.md:
docker run -it --rm -p 8888:8888 -v %CD%/notebooks/:/mnt/notebooks/ mimoralea/gdrl:v0.14
After downloading all the files, it says:
Status: Downloaded newer image for mimoralea/gdrl:v0.14 docker: Error response from daemon: create %CD%/notebooks: "%CD%/notebooks" includes invalid characters for a local volume name, only "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed. If you intended to pass a host directory, use absolute path. See 'docker run --help'.
It appears there is something wrong with the syntax.

The command runs well when '%CD%' is replaced with '.'; however, the notebook list is empty.

Code for GIF in chapter 8 is not running

I tried to run the code from Chapter 8 with no success. Not only was I forced to downgrade the gym version to 0.22.0 due to the deprecation of gym.wrappers.Monitor(), but the best_agent.demo_progression() function, which is supposed to generate GIFs, also got stuck in the subprocess.Popen() part with the following error. Is there any solution to this?

FileNotFoundError                         Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_12908/1650199322.py in <module>
----> 1 best_agent.demo_progression()

~\AppData\Local\Temp/ipykernel_12908/1002205623.py in demo_progression(self, title, max_n_videos)
    216 
    217         env.close()
--> 218         data = get_gif_html(env_videos=env.videos, 
    219                             title=title.format(self.__class__.__name__),
    220                             subtitle_eps=sorted(checkpoint_paths.keys()),

~\AppData\Local\Temp/ipykernel_12908/3674493635.py in get_gif_html(env_videos, title, subtitle_eps, max_n_videos)
     13         gif_path = basename + '.gif'
     14         if not os.path.exists(gif_path):
---> 15             ps = subprocess.Popen(
     16                 ('ffmpeg', 
     17                  '-i', video_path,

~\anaconda3\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, user, group, extra_groups, encoding, errors, text, umask)
    949                             encoding=encoding, errors=errors)
    950 
--> 951             self._execute_child(args, executable, preexec_fn, close_fds,
    952                                 pass_fds, cwd, env,
    953                                 startupinfo, creationflags, shell,

~\anaconda3\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_gid, unused_gids, unused_uid, unused_umask, unused_start_new_session)
   1418             # Start the process
   1419             try:
-> 1420                 hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
   1421                                          # no special security
   1422                                          None, None,

FileNotFoundError: [WinError 2] The system cannot find the file specified

How to enter the docker's command line mode?

Hi.
I have set up the docker environment and pulled the mimoralea/gdrl image. When I run the image, it starts a notebook server, which I can access via a browser. But I want to use the container's command line directly. How can I access the command line of this docker image?

Understanding the Double Discounting in the REINFORCE and VPG Algorithms

I noticed a double discounting in both the REINFORCE algorithm and VPG that is not in the original REINFORCE/VPG papers:

def optimize_model(self):
    T = len(self.rewards)
    discounts = np.logspace(0, T, num=T, base=self.gamma, endpoint=False)

    # returns are the discounted cumulative sum of rewards in the trajectory
    returns = np.array([np.sum(discounts[:T-t] * self.rewards[t:]) for t in range(T)])

    # < ... >
    # the loss here multiplies the discounted cumulative sum return by discounts again
    policy_loss = -(discounts * returns * self.logpas).mean()

In the original REINFORCE algorithm paper, the algorithm is described as follows (pseudocode screenshot omitted), where the loss is simply the negative mean of returns * log_probs.


The author explains in his book, "Grokking DRL",

Notice here that we're using the mathematically correct policy-gradient update, which isn't what you commonly find out there. The extra discount assumes that we're trying to optimize the expect discounted return [sic] from the initial state, so returns later in an episode get discounted.

I'm not sure I understand how discounting a second time is what sets the target to the expected discounted return - I'd have thought that we are already trying to optimize for the expected discounted return by discounting the return to begin with.

Is there a paper or article on re-discounting the parameter update?

Thank you so much! Also, I'm loving the book!

A2C algorithm doesn't work

Hey,
I tried the chapter-11 notebook and ran the A2C algorithm for training. The agent doesn't learn anything.
I am pasting the logs below:

el 00:00:36, ep 1427, ts 013537, ar 10 001.0±000.0, 100 001.0±000.0, ex 100 0.0±0.0, ev 009.5±000.7
el 00:01:06, ep 3335, ts 031432, ar 10 001.0±000.0, 100 001.0±000.0, ex 100 0.0±0.0, ev 009.5±000.6
el 00:01:37, ep 5295, ts 049762, ar 10 001.0±000.0, 100 001.0±000.0, ex 100 0.0±0.0, ev 009.3±000.8
el 00:02:07, ep 7101, ts 066745, ar 10 001.0±000.0, 100 001.0±000.0, ex 100 0.0±0.0, ev 009.5±000.7
el 00:02:37, ep 8996, ts 084474, ar 10 001.0±000.0, 100 001.0±000.0, ex 100 0.0±0.0, ev 009.2±000.8
el 00:03:07, ep 10812, ts 101495, ar 10 001.0±000.0, 100 001.0±000.0, ex 100 0.0±0.0, ev 009.3±000.7
el 00:03:37, ep 12534, ts 117600, ar 10 001.0±000.0, 100 001.0±000.0, ex 100 0.0±0.0, ev 009.4±000.9
el 00:04:07, ep 14259, ts 133695, ar 10 001.0±000.0, 100 001.0±000.0, ex 100 0.0±0.0, ev 009.4±000.8
el 00:04:37, ep 15967, ts 149650, ar 10 001.0±000.0, 100 001.0±000.0, ex 100 0.0±0.0, ev 009.4±000.7
el 00:05:07, ep 17651, ts 165403, ar 10 001.0±000.0, 100 001.0±000.0, ex 100 0.0±0.0, ev 009.4±000.7
el 00:05:37, ep 19366, ts 181470, ar 10 001.0±000.0, 100 001.0±000.0, ex 100 0.0±0.0, ev 009.4±000.8
el 00:06:07, ep 21014, ts 197042, ar 10 001.0±000.0, 100 001.0±000.0, ex 100 0.0±0.0, ev 009.3±000.8
el 00:06:37, ep 22672, ts 212565, ar 10 001.0±000.0, 100 001.0±000.0, ex 100 0.0±0.0, ev 009.5±000.8
el 00:07:07, ep 24350, ts 228283, ar 10 001.0±000.0, 100 001.0±000.0, ex 100 0.0±0.0, ev 009.5±000.8
el 00:07:37, ep 25916, ts 242947, ar 10 001.0±000.0, 100 001.0±000.0, ex 100 0.0±0.0, ev 009.3±000.8

....

I tried experimenting with entropy_weight but the result remains the same.

Can anyone point out the mistake?
Thanks

Container must be run with group "root" to update passwd file

Hi Miguel,

Thanks for today's update with the new fresh chapter 5. I am trying to install everything via Docker.

When I run docker run -it --rm -p 8888:8888 -v "$PWD"/notebooks/:/mnt/notebooks/ mimoralea/gdrl:v0.7 and then access Jupyter, the password you provided appears to be invalid.

Furthermore, the first line in the console after running the container is:

Container must be run with group "root" to update passwd file

Am I missing something?

Thanks again

No seed method error.

Slippery Walk Five MDP and sample policy in the Chapter 3 Jupyter notebook.

Why am I getting this error and how do I fix this?

env = gym.make('SlipperyWalkFive-v0')
P = env.env.P
init_state = env.reset()
goal_state = 6

LEFT, RIGHT = range(2)
pi = lambda s: {
    0: LEFT, 1: LEFT, 2: LEFT, 3: LEFT, 4: LEFT, 5: LEFT, 6: LEFT
}[s]
print_policy(pi, P, action_symbols=('<', '>'), n_cols=7)
print('Reaches goal {:.2f}%. Obtains an average undiscounted return of {:.4f}.'.format(
    probability_success(env, pi, goal_state=goal_state)*100,
    mean_return(env, pi)))

AttributeError Traceback (most recent call last)
C:\Temp\ipykernel_11836\386057658.py in <cell line: 11>()
10 print_policy(pi, P, action_symbols=('<', '>'), n_cols=7)
11 print('Reaches goal {:.2f}%. Obtains an average undiscounted return of {:.4f}.'.format(
---> 12 probability_success(env, pi, goal_state=goal_state)*100,
13 mean_return(env, pi)))

C:\Temp\ipykernel_11836\4014631173.py in probability_success(env, pi, goal_state, n_episodes, max_steps)
1 def probability_success(env, pi, goal_state, n_episodes=100, max_steps=200):
----> 2 random.seed(123); np.random.seed(123) ; env.seed(123)
3 results = []
4 for _ in range(n_episodes):
5 state, done, steps = env.reset(), False, 0

~\anaconda3\lib\site-packages\gym\core.py in __getattr__(self, name)
239 if name.startswith("_"):
240 raise AttributeError(f"accessing private attribute '{name}' is prohibited")
--> 241 return getattr(self.env, name)
242
243 @property

~\anaconda3\lib\site-packages\gym\core.py in __getattr__(self, name)
239 if name.startswith("_"):
240 raise AttributeError(f"accessing private attribute '{name}' is prohibited")
--> 241 return getattr(self.env, name)
242
243 @property

~\anaconda3\lib\site-packages\gym\core.py in __getattr__(self, name)
239 if name.startswith("_"):
240 raise AttributeError(f"accessing private attribute '{name}' is prohibited")
--> 241 return getattr(self.env, name)
242
243 @property

AttributeError: 'WalkEnv' object has no attribute 'seed'
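
For context, a minimal sketch of the seeding API in newer gym releases (Env.seed() was removed, so seeding goes through reset(); FrozenLake-v1 is used here only for illustration, not the book's custom environment):

import random
import numpy as np
import gym

random.seed(123)
np.random.seed(123)
env = gym.make('FrozenLake-v1')
state, info = env.reset(seed=123)   # gym >= 0.26: replaces env.seed(123); env.reset()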

"Spin up a container" correction

In your main document (README.md) / "Running the code" / "Spin up a container" / on Windows:

docker run -it --rm -p 8888:8888 -v %cd%/notebooks/:/mnt/notebooks/ mimoralea/gdrl:v0.14

The %cd% should be capitalized. So, ...

docker run -it --rm -p 8888:8888 -v %CD%/notebooks/:/mnt/notebooks/ mimoralea/gdrl:v0.14

Ch04 - strategies

This book is awesome. I'm just leaving a note for anyone who wants to play with the code of Chapter 4 and test the simple strategies progressively with the environment ("BanditSlipperyWalk-v0"):
within each loop, you need to reset the environment each time if you plan on inspecting the values of Q and Qe.

Formula/code discrepancy in Chapter 4?

Really loving this book!! Doing plenty of reading and re-reading not to miss a beat.

I noticed a formula vs. code discrepancy in Chapter 4's Upper Confidence Bound (UCB) equation: the hyperparameter 'c' is outside the square root in the equation but inside the square root in the code (screenshots omitted). A sketch of the two variants is below.
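
A standalone sketch of the two bonus terms being compared (toy numbers; Q, N, c, and e are illustrative, not the notebook's actual variables):

import numpy as np

Q = np.array([0.2, 0.5, 0.1])   # value estimates per arm
N = np.array([3, 10, 1])        # times each arm was selected
c = 2.0                         # exploration hyperparameter
e = 14                          # current episode

bonus_equation = c * np.sqrt(np.log(e) / N)   # 'c' outside the square root (book's formula)
bonus_code     = np.sqrt(c * np.log(e) / N)   # 'c' inside the square root (notebook's code)
print(np.argmax(Q + bonus_equation), np.argmax(Q + bonus_code))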

Chapter 9 load experiences issue

In your DQN class's optimize_model method, you read the experiences in the order below:

states, actions, rewards, next_states, is_terminals = experiences

However, in the FCQ class's load method, you load the experiences in the order below:

states, actions, new_states, rewards, is_terminals = experiences

So you have mixed up new_states and rewards. My advice is to swap new_states and rewards in the FCQ class's load method to keep it consistent with the rest of the code. I found this when trying to look at the new_states, only to discover that my new_states were actually rewards.
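
A small illustration of keeping one field order everywhere (toy arrays standing in for a replay-buffer sample; field names follow the issue):

import numpy as np

experiences = (np.zeros((2, 4)),   # states
               np.zeros((2, 1)),   # actions
               np.zeros((2, 1)),   # rewards
               np.zeros((2, 4)),   # next_states
               np.zeros((2, 1)))   # is_terminals

# DQN.optimize_model unpacks:
states, actions, rewards, next_states, is_terminals = experiences
# FCQ.load should unpack in the same order instead of swapping rewards and new_states:
states, actions, rewards, new_states, is_terminals = experiences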

The code of the SAC algorithm does not work

Hello,
when I execute the code of the SAC algorithm, I get the following error message:

/usr/local/lib/python3.10/dist-packages/gym/core.py:317: DeprecationWarning: WARN: Initializing wrapper in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.
  deprecation(
/usr/local/lib/python3.10/dist-packages/gym/wrappers/step_api_compatibility.py:39: DeprecationWarning: WARN: Initializing environment in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.
  deprecation(
/usr/local/lib/python3.10/dist-packages/gym/utils/passive_env_checker.py:174: UserWarning: WARN: Future gym versions will require that `Env.reset` can be passed a `seed` instead of using `Env.seed` for resetting the environment random number generator.
  logger.warn(
/usr/local/lib/python3.10/dist-packages/gym/utils/passive_env_checker.py:190: UserWarning: WARN: Future gym versions will require that `Env.reset` can be passed `return_info` to return information from the environment resetting.
  logger.warn(
/usr/local/lib/python3.10/dist-packages/gym/utils/passive_env_checker.py:195: UserWarning: WARN: Future gym versions will require that `Env.reset` can be passed `options` to allow the environment initialisation to be passed additional information.
  logger.warn(
/usr/local/lib/python3.10/dist-packages/gym/utils/passive_env_checker.py:227: DeprecationWarning: WARN: Core environment is written in old step API which returns one bool instead of two. It is recommended to rewrite the environment with new step API. 
  logger.deprecation(
<ipython-input-11-34a0d7821828>:106: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with torch.autograd.detect_anomaly():
/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py:200: UserWarning: Error detected in AddmmBackward0. Traceback of forward call that caused the error:
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python3.10/dist-packages/traitlets/config/application.py", line 992, in launch_instance
    app.start()
  File "/usr/local/lib/python3.10/dist-packages/ipykernel/kernelapp.py", line 619, in start
    self.io_loop.start()
  File "/usr/local/lib/python3.10/dist-packages/tornado/platform/asyncio.py", line 195, in start
    self.asyncio_loop.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 685, in <lambda>
    lambda f: self._run_callback(functools.partial(callback, future))
  File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 738, in _run_callback
    ret = callback()
  File "/usr/local/lib/python3.10/dist-packages/tornado/gen.py", line 825, in inner
    self.ctx_run(self.run)
  File "/usr/local/lib/python3.10/dist-packages/tornado/gen.py", line 786, in run
    yielded = self.gen.send(value)
  File "/usr/local/lib/python3.10/dist-packages/ipykernel/kernelbase.py", line 361, in process_one
    yield gen.maybe_future(dispatch(*args))
  File "/usr/local/lib/python3.10/dist-packages/tornado/gen.py", line 234, in wrapper
    yielded = ctx_run(next, result)
  File "/usr/local/lib/python3.10/dist-packages/ipykernel/kernelbase.py", line 261, in dispatch_shell
    yield gen.maybe_future(handler(stream, idents, msg))
  File "/usr/local/lib/python3.10/dist-packages/tornado/gen.py", line 234, in wrapper
    yielded = ctx_run(next, result)
  File "/usr/local/lib/python3.10/dist-packages/ipykernel/kernelbase.py", line 539, in execute_request
    self.do_execute(
  File "/usr/local/lib/python3.10/dist-packages/tornado/gen.py", line 234, in wrapper
    yielded = ctx_run(next, result)
  File "/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py", line 302, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python3.10/dist-packages/ipykernel/zmqshell.py", line 539, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 2975, in run_cell
    result = self._run_cell(
  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3030, in _run_cell
    return runner(coro)
  File "/usr/local/lib/python3.10/dist-packages/IPython/core/async_helpers.py", line 78, in _pseudo_sync_runner
    coro.send(None)
  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3257, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3473, in run_ast_nodes
    if (await self.run_code(code, result,  async_=asy)):
  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-12-ebd8c3a16d78>", line 72, in <cell line: 72>
    sac_agent.train()
  File "<ipython-input-11-34a0d7821828>", line 180, in train
    self.optimize(experiences)
  File "<ipython-input-11-34a0d7821828>", line 129, in optimize
    current_q_vals_b = self.online_value_model_b(states, current_actions)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "<ipython-input-8-c490369f540d>", line 54, in forward
    x = self.output_layer(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
 (Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
[<ipython-input-12-ebd8c3a16d78>](https://localhost:8080/#) in <cell line: 72>()
     70 
     71 # start training
---> 72 sac_agent.train()

3 frames
[/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py](https://localhost:8080/#) in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    198     # some Python versions print out the first line of a multi-line function
    199     # calls in the traceback and some print out the last line
--> 200     Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    201         tensors, grad_tensors_, retain_graph, create_graph, inputs,
    202         allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [256, 1]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Is there an error in the code? Can somebody provide a solution for this?

Jupyter notebook read-only

Hello,

I am just testing your code.

And I see there are no errors that I have to intervene with, which is great.

But I see that the scripts are all read-only. Is there any way I can edit and save?

I'm new to docker, so please be kind :)

Why is the value loss multiplied by 0.5?

In the VPG implementation, the value loss is calculated as:

value_loss = value_error.pow(2).mul(0.5).mean()

Isn't the value loss simply the MSE, so just value_error.pow(2).mean()? Why the additional multiplication by 0.5?

Thank you!

The formula and code of SAC are inconsistent

On page 393, the objective for alpha shows a product between alpha and the sum of the target entropy heuristic and a likelihood term. However, the corresponding code contains the line
alpha_loss = -(self.policy_model.logalpha * target_alpha).mean()
They are inconsistent.
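
A minimal standalone sketch of the two variants being contrasted (toy tensors; target_alpha stands in for the detached (log-probability + target entropy) term):

import torch

logalpha = torch.zeros(1, requires_grad=True)
target_alpha = torch.tensor([[-1.5], [0.4]])   # detached (log pi + target entropy) term

alpha_loss_equation = -(logalpha.exp() * target_alpha).mean()  # alpha itself, as in the book's formula
alpha_loss_code     = -(logalpha * target_alpha).mean()        # log(alpha), as in the notebook's code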

The use of 'discounts' in REINFORCE() class

This is just an enquiry about the REINFORCE() class in chapter 11.

class REINFORCE():
    # ......
    def optimize_model(self):
        T = len(self.rewards)
        discounts = np.logspace(0, T, num=T, base=self.gamma, endpoint=False)
        returns = np.array([np.sum(discounts[:T-t] * self.rewards[t:]) for t in range(T)])

        discounts = torch.FloatTensor(discounts).unsqueeze(1)
        returns = torch.FloatTensor(returns).unsqueeze(1)
        self.logpas = torch.cat(self.logpas)

        policy_loss = -(discounts * returns * self.logpas).mean()

In the code above, 'returns' already takes 'discounts' into consideration. So, why do we multiply by 'discounts' again when working out 'policy_loss'? I am not clear on this.

Creating Notebook Failed

I entered the password "gdrl" at the Jupyter login page in my browser. After that, the notebook list shows empty, and I can't even create a new Python notebook (.ipynb) file; it returns a pop-up with the error "Permission denied: Untitled.ipynb".
Docker error

For the record, for those who faced the inplace operation error:

Some people may execute the example code on a local machine. If you use PyTorch >= 1.5.0, you will face the inplace operation error; I got it while executing the SAC example (specifically in optimize_model). Based on the comments in this thread, the optimization process needs to be corrected like this:

    def optimize_model(self, experiences):
        states, actions, rewards, next_states, is_terminals = experiences
        batch_size = len(is_terminals)

        # policy loss
        current_actions, logpi_s, _ = self.policy_model.full_pass(states)

        target_alpha = (logpi_s + self.policy_model.target_entropy).detach()
        alpha_loss = -(self.policy_model.logalpha * target_alpha).mean()

        self.policy_model.alpha_optimizer.zero_grad()
        alpha_loss.backward()
        self.policy_model.alpha_optimizer.step()
        alpha = self.policy_model.logalpha.exp()

        current_q_sa_a = self.online_value_model_a(states, current_actions)
        current_q_sa_b = self.online_value_model_b(states, current_actions)
        current_q_sa = torch.min(current_q_sa_a, current_q_sa_b)
        policy_loss = (alpha * logpi_s - current_q_sa).mean()
        
        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.policy_model.parameters(), 
                                       self.policy_max_grad_norm)        
        self.policy_optimizer.step()

        # Q loss
        ap, logpi_sp, _ = self.policy_model.full_pass(next_states)
        q_spap_a = self.target_value_model_a(next_states, ap)
        q_spap_b = self.target_value_model_b(next_states, ap)
        q_spap = torch.min(q_spap_a, q_spap_b) - alpha * logpi_sp
        target_q_sa = (rewards + self.gamma * q_spap * (1 - is_terminals)).detach()

        q_sa_a = self.online_value_model_a(states, actions)
        q_sa_b = self.online_value_model_b(states, actions)
        qa_loss = (q_sa_a - target_q_sa).pow(2).mul(0.5).mean()
        
        self.value_optimizer_a.zero_grad()
        qa_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.online_value_model_a.parameters(), 
                                       self.value_max_grad_norm)
        self.value_optimizer_a.step()
        
        qb_loss = (q_sa_b - target_q_sa).pow(2).mul(0.5).mean()
    

        self.value_optimizer_b.zero_grad()
        qb_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.online_value_model_b.parameters(),
                                       self.value_max_grad_norm)
        self.value_optimizer_b.step()

The thing I've changed is to move the policy loss optimization in front of the Q-value loss optimization.
Please refer to this issue for the reference.

Hope this helps!

Empty notebooks folder

Hello,

I'm trying to execute the notebooks from the docker image, but when I run the container and log into Jupyter Notebook, the notebook/folder list is empty.


I cloned the repository and cd'd into the gdrl directory. I'm using Docker Toolbox with Oracle VirtualBox to execute either:

  • On Windows cmd:
    docker run -it --rm -p 8888:8888 -v %CD%/notebooks/:/mnt/notebooks/ mimoralea/gdrl:v0.14
  • On Linux emulator:
    docker run -it --rm -p 8888:8888 -v "$PWD"/notebooks/:/mnt/notebooks/ mimoralea/gdrl:v0.14

The logs when spinning up the docker container are:

$ docker run -it --rm -p 8888:8888 -v "$PWD"/notebooks/:/mnt/notebooks/ mimoralea/gdrl:v0.14
Container must be run with group "root" to update passwd file
Executing the command: jupyter notebook
[I 13:28:52.426 NotebookApp] Writing notebook server cookie secret to /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret
[I 13:28:58.160 NotebookApp] JupyterLab extension loaded from /opt/conda/lib/python3.7/site-packages/jupyterlab
[I 13:28:58.166 NotebookApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 13:29:02.874 NotebookApp] Serving notebooks from local directory: /mnt/notebooks
[I 13:29:02.880 NotebookApp] The Jupyter Notebook is running at:
[I 13:29:02.883 NotebookApp] http://ba85e00e685a:8888/
[I 13:29:02.884 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

When I log into Jupyter Notebook (see the error on line 10 of the log below):

[W 13:31:50.797 NotebookApp] Clearing invalid/expired login cookie username-192-168-99-101-8888
[W 13:31:50.806 NotebookApp] Forbidden
[W 13:31:50.816 NotebookApp] 403 GET /api/sessions?_=1608384680945 (192.168.99.1) 28.33ms referer=http://192.168.99.101:8888/tree?
[W 13:31:50.836 NotebookApp] Clearing invalid/expired login cookie username-192-168-99-101-8888
[W 13:31:50.838 NotebookApp] Forbidden
[W 13:31:50.845 NotebookApp] 403 GET /api/terminals?_=1608384680946 (192.168.99.1) 8.98ms referer=http://192.168.99.101:8888/tree?
[W 13:31:50.856 NotebookApp] Clearing invalid/expired login cookie username-192-168-99-101-8888
[W 13:31:50.861 NotebookApp] Clearing invalid/expired login cookie username-192-168-99-101-8888
[I 13:31:50.864 NotebookApp] 302 GET /tree? (192.168.99.1) 10.72ms
[E 13:31:50.973 NotebookApp] Could not open static file ''
[W 13:31:51.152 NotebookApp] 404 GET /static/components/react/react-dom.production.min.js (192.168.99.1) 90.56ms referer=http://192.168.99.101:8888/login?next=%2Ftree%3F
[I 13:31:53.770 NotebookApp] 302 POST /login?next=%2Ftree%3F (192.168.99.1) 3.23ms
[W 13:31:54.074 NotebookApp] 404 GET /static/components/react/react-dom.production.min.js (192.168.99.1) 2.61ms referer=http://192.168.99.101:8888/tree?
[W 13:31:54.131 NotebookApp] 404 GET /static/components/react/react-dom.production.min.js (192.168.99.1) 13.17ms referer=http://192.168.99.101:8888/tree?

Thanks for the support

Facing issue when running docker

I tried docker and got the following; it looks like it's suspended and I'm stuck.

The Docker container is running, but when pointing to localhost:8888 nothing happens.
I get the Docker warning "Image may have poor performance or fail, if run via emulation".
(BTW, I'm using a Mac Mini M2-Pro Apple Silicon running macOS Sonoma 14.2.1 and Docker 4.26.)

> docker run -it --rm -p 8888:8888 -v "$PWD"/notebooks/:/mnt/notebooks/ mimoralea/gdrl:v0.14

WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
[TrustNotebookApp] Writing notebook-signing key to /home/jovyan/.local/share/jupyter/notebook_secret
Signing notebook: /mnt/notebooks/chapter_08/chapter-08.ipynb
Signing notebook: /mnt/notebooks/chapter_06/chapter-06.ipynb
Signing notebook: /mnt/notebooks/chapter_07/chapter-07.ipynb
Signing notebook: /mnt/notebooks/chapter_09/chapter-09.ipynb
Signing notebook: /mnt/notebooks/chapter_12/chapter-12.ipynb
Signing notebook: /mnt/notebooks/chapter_05/chapter-05.ipynb
Signing notebook: /mnt/notebooks/chapter_02/chapter-02.ipynb
Signing notebook: /mnt/notebooks/chapter_03/chapter-03.ipynb
Signing notebook: /mnt/notebooks/chapter_04/chapter-04.ipynb
Signing notebook: /mnt/notebooks/chapter_10/chapter-10.ipynb
Signing notebook: /mnt/notebooks/chapter_11/chapter-11.ipynb

custom gym environments are not compatible with current gym API

I think you are depending on an older version of gym in some of the environments you made; a common error is:

/usr/local/lib/python3.9/site-packages/gym/wrappers/time_limit.py in step(self, action)
     48 
     49         """
---> 50         observation, reward, terminated, truncated, info = self.env.step(action)
     51         self._elapsed_steps += 1
     52 

ValueError: not enough values to unpack (expected 5, got 4)

walk_env.py's step function returns (int(s), r, d, {"prob": p}), but the version of gym I'm on (0.26.0) seems to wrap your environment in a time limit, and gym.make('RandomWalk-v0') returns something like <TimeLimit<OrderEnforcing<PassiveEnvChecker<WalkEnv<RandomWalk-v0>>>>>.

Your Dockerfile doesn't pin gym to any specific version, so I don't know what the right version is.
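
One hedged workaround sketch (not from the repo; pinning gym to an older release in the Dockerfile is the other option): a thin wrapper that converts the old 4-tuple step return into the 5-tuple newer gym expects.

import gym

class OldToNewStepAPI(gym.Wrapper):
    """Adapt an old-API env returning (obs, reward, done, info) to the 5-tuple API."""
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs, reward, done, False, info   # terminated, truncated, info

    def reset(self, **kwargs):
        return self.env.reset(), {}             # old reset returns obs only

# Usage sketch (wrapping the unwrapped env skips the incompatible built-in wrappers):
# env = OldToNewStepAPI(gym.make('RandomWalk-v0', disable_env_checker=True).unwrapped)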

a consistent requirements.txt needed

Where can I find a consistent requirements.txt for this Python code?
I'm running into numerous compatibility issues between libraries when I try to run the code.
