deepreinforcementlearning / deepreinforcementlearninginaction
Code from the Deep Reinforcement Learning in Action book from Manning, Inc
License: MIT License
Unless I'm mistaken, there is something odd about the main training loop (Listing 8.13) for the Super Mario game in Chapter 8. The way the current x-position is checked against the min_progress parameter makes no sense to me.
More precisely: in line 23 of the main training loop, the environment step is taken (6 times) and last_x_pos is set to the current x-position:
state2, e_reward_, done, info = env.step(action)
last_x_pos = info['x_pos']
In the following lines of code, neither last_x_pos nor info['x_pos'] is changed. Then in line 33 the two are compared to one another:
if episode_length > params['max_episode_len']:
    if (info['x_pos'] - last_x_pos) < params['min_progress']:
        done = True
    else:
        last_x_pos = info['x_pos']
Isn't info['x_pos'] - last_x_pos always going to be zero here? This would always reset the environment as soon as episode_length > params['max_episode_len'].
What is the min_progress parameter meant to be, intuitively? The progress from the beginning to the end of one episode? The progress from time 0 until max_episode_len? Or the progress against a certain checkpoint in a certain amount of time? If so, how are these checkpoints chosen?
This has not become clear to me yet, neither from the book nor from the code.
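One possible reading, offered as a sketch and not as the authors' intent: min_progress is the minimum x-distance Mario must cover between periodic checkpoints, and the episode is reset when he covers less. For that to work, last_x_pos has to be updated only at a checkpoint, not on every step. The helper name check_progress and the choice to place a checkpoint every max_episode_len steps are assumptions made for illustration.

```python
def check_progress(x_pos, last_x_pos, steps_since_checkpoint,
                   max_episode_len, min_progress):
    """Return (done, last_x_pos, steps_since_checkpoint) after one env step.

    Hypothetical helper: a checkpoint is evaluated once more than
    `max_episode_len` steps have passed since the previous one.
    """
    steps_since_checkpoint += 1
    if steps_since_checkpoint <= max_episode_len:
        # Not at a checkpoint yet: keep the old reference position.
        return False, last_x_pos, steps_since_checkpoint
    if x_pos - last_x_pos < min_progress:
        # Too little progress since the previous checkpoint: reset.
        return True, last_x_pos, 0
    # Enough progress: move the reference position and restart the counter.
    return False, x_pos, 0


# Simulated x positions: Mario advances, then gets stuck at x = 60.
positions = [10, 20, 30, 40, 50, 60, 60, 60, 60, 60, 60, 60]
last_x, steps, done = 0, 0, False
for t, x in enumerate(positions):
    done, last_x, steps = check_progress(x, last_x, steps,
                                         max_episode_len=3, min_progress=15)
    if done:
        print(f"reset at step {t}, x_pos={x}")
        break
```

With this structure the difference x_pos - last_x_pos is measured against a position recorded several steps earlier, so it is no longer zero by construction.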
I recently tried Listing 3.3 with epochs set to 1. The reward value is always -1, so it seems to be stuck in an endless loop. How long should this example take to run?
Thanks
Is there an error in the training loop code for playing Atari-Freeway: specifically generating the predictions?
pred2_batch = dist_dqn(state2_batch.detach(),theta_2,aspace=aspace)
Should the state2_batch be state_batch?
state2 is already reshaped in
state2_ = game.board.render_np().reshape(1,64) + np.random.rand(1,64)/10.0
state2 = torch.from_numpy(state2_).float() #L
Therefore,
with torch.no_grad():
    newQ = model(state2.reshape(1,64))
maxQ = torch.max(newQ) #M
might be fixed as:
with torch.no_grad():
    newQ = model(state2)
maxQ = torch.max(newQ) #M
When I change the size to 12 (with mode = "player"), the agent no longer learns. It always moves towards the borders, i.e. it keeps taking the action that moves it towards a border even when it is already at the border.
Is it because there is no penalty for such an action?
The code for the model is as below
model = torch.nn.Sequential(
    torch.nn.Linear(l1, l2),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(l2, l3),
    torch.nn.Softmax(dim=0) #C
)
But the softmax operation with dim=0 is only OK when the input is a 1-dimensional array. When you give it a batch input, the probabilities are normalized along dimension 0, i.e. across the states in the batch and not across the actions of each state.
You can check it by printing pred_batch of Listing 4.8.
pred_batch = model(state_batch) #N
print(pred_batch)
One way to fix this is to change it to:
torch.nn.Softmax(dim=1) #C
and to apply unsqueeze(0) and squeeze(0) when computing the output for a single state vector:
state1 = env.reset()
pred = model(torch.from_numpy(state1).float().unsqueeze(0)) #G
action = np.random.choice(np.array([0,1]), p=pred.data.numpy().squeeze(0)) #H
state2, reward, done, info = env.step(action) #I
I like this book very much, since it gives some intuition for RL instead of only presenting the theory ^^
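The difference between the two settings can be checked on a small random batch. This is a minimal sketch: the batch of 3 states with 2 action scores each is made up for illustration and does not come from the book's CartPole code.

```python
import torch

torch.manual_seed(0)
batch = torch.randn(3, 2)  # 3 states in a batch, 2 action scores each

over_batch = torch.nn.Softmax(dim=0)(batch)    # normalizes down each column
over_actions = torch.nn.Softmax(dim=1)(batch)  # normalizes across each row

print(over_batch.sum(dim=0))    # each column sums to 1 (mixes different states)
print(over_actions.sum(dim=1))  # each row sums to 1 (one distribution per state)

# A single state vector needs a batch dimension when the model uses dim=1.
single = torch.randn(2)
probs = torch.nn.Softmax(dim=1)(single.unsqueeze(0)).squeeze(0)

# dim=-1 normalizes over the last dimension, so it handles both shapes.
probs_last = torch.nn.Softmax(dim=-1)(single)
```

Using dim=-1 is an alternative that avoids the unsqueeze/squeeze round trip, because the action dimension is last for both a single state and a batch.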
How to resolve this warning:
UserWarning: Using a target size (torch.Size([1])) that is different to the input size (torch.Size([])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
caused by this code part:
Y = torch.Tensor([Y]).detach()
X = qval.squeeze()[action_]
loss = loss_fn(X, Y)
The script still works fine, but I would like to get rid of the warning. Thanks
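The warning says that X is a 0-dimensional tensor while Y has shape [1]. One way to silence it, sketched here with stand-in values for qval, action_ and the target (none of these numbers come from the book), is to give X the same shape as Y:

```python
import torch

loss_fn = torch.nn.MSELoss()
qval = torch.randn(1, 4, requires_grad=True)  # stand-in for the model output
action_ = 2                                   # stand-in for the chosen action index
target = 0.5                                  # stand-in for the scalar target

Y = torch.Tensor([target]).detach()           # shape [1], as in the original code
X = qval.squeeze()[action_].unsqueeze(0)      # shape [1] instead of a 0-dim tensor
loss = loss_fn(X, Y)
loss.backward()
```

The opposite direction should work as well: build Y as a 0-dimensional tensor so that it matches the original X.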
I get an error from the code below and have no idea how to fix it.
I am using Spyder 5.4.3 with Python 3.11.
Code:
import torch
import torchvision as TV
import numpy as np
from matplotlib import pyplot as plt
def nn(x,w1,w2):
    l1 = x @ w1
    l1 = torch.relu(l1)
    l2 = l1 @ w2
    return l2
w1 = torch.randn(784,200,requires_grad=True)
w2 = torch.randn(200,10,requires_grad=True)
mnist_data = TV.datasets.MNIST("MNIST", train=True, download=False)
plt.figure(figsize=(10,7))
plt.imshow(mnist_data.train_data[0])
plt.axis('off')
lr = 0.0001
epochs = 2500
batch_size = 1000
losses = []
lossfn = torch.nn.CrossEntropyLoss()
for i in range(epochs):
    rid = np.random.randint(0,mnist_data.train_data.shape[0],size=batch_size)
    x = mnist_data.train_data[rid].float().flatten(start_dim=1)
    x /= x.max()
    pred = nn(x,w1,w2)
    target = mnist_data.train_labels[rid]
    loss = lossfn(pred,target)
    losses.append(loss)
    loss.backward()
    with torch.no_grad():
        w1 -= lr * w1.grad
        w2 -= lr * w2.grad
plt.figure(figsize=(10,7))
plt.xlabel("Training Time", fontsize=22)
plt.ylabel("Loss", fontsize=22)
plt.plot(losses)
console return:
File ~/anaconda3/lib/python3.11/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
exec(code, globals, locals)
File ~/.spyder-py3/temp.py:49
plt.plot(losses)
File ~/anaconda3/lib/python3.11/site-packages/matplotlib/pyplot.py:2812 in plot
return gca().plot(
File ~/anaconda3/lib/python3.11/site-packages/matplotlib/axes/_axes.py:1688 in plot
lines = [*self._get_lines(*args, data=data, **kwargs)]
File ~/anaconda3/lib/python3.11/site-packages/matplotlib/axes/_base.py:311 in __call__
yield from self._plot_args(
File ~/anaconda3/lib/python3.11/site-packages/matplotlib/axes/_base.py:496 in _plot_args
x, y = index_of(xy[-1])
File ~/anaconda3/lib/python3.11/site-packages/matplotlib/cbook/__init__.py:1661 in index_of
y = _check_1d(y)
File ~/anaconda3/lib/python3.11/site-packages/matplotlib/cbook/__init__.py:1353 in _check_1d
return np.atleast_1d(x)
File <__array_function__ internals>:200 in atleast_1d
File ~/anaconda3/lib/python3.11/site-packages/numpy/core/shape_base.py:65 in atleast_1d
ary = asanyarray(ary)
File ~/anaconda3/lib/python3.11/site-packages/torch/_tensor.py:956 in __array__
return self.numpy()
RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
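The RuntimeError arises because losses holds tensors that still require grad, and matplotlib tries to convert them with numpy(). A minimal sketch of one way around it stores loss.item(), which is a plain Python float. It uses a toy quadratic loss in place of the MNIST network, and the variable w and the learning rate are placeholders. The sketch also zeroes the gradients after each update, since backward() accumulates into .grad; whether the original loop intends that accumulation is not stated.

```python
import torch

w = torch.randn(3, requires_grad=True)  # placeholder parameters
losses = []
for _ in range(5):
    loss = (w ** 2).sum()               # toy loss standing in for CrossEntropyLoss
    losses.append(loss.item())          # plain float, detached from the graph
    loss.backward()
    with torch.no_grad():
        w -= 0.1 * w.grad               # manual gradient descent step
    w.grad.zero_()                      # clear accumulated gradients

# `losses` is now a list of floats and can be passed to plt.plot(losses).
```

Storing floats also avoids keeping every iteration's computation graph alive in memory.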
When I copy the code from Listing 5.1 and run it in Jupyter, I always get this error:
Can't get attribute 'square' on <module '__main__' (built-in)>
From what I have found on Google, it seems the code needs to be placed under:
if __name__ == '__main__':
but this only works in Spyder and PyCharm.
So I would like to know how others handle this.
I would be grateful for any suggestions!
According to what the authors say in Chapter 4, a longer episode duration should allow the model to keep the game going longer.
I downloaded the Chapter 4 code and ran it locally with MAX_EPISODES = 250.
Surprisingly, this made the model worse at the task: only 22 episodes exceeded 180s, whereas the original model managed 90.
I also reset the model and tried higher values of MAX_EPISODES, but none of them beat the original setting.
What might explain this?
Running the cell in notebook produces the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-9-448853d32d49> in <module>
34 optimizer.zero_grad()
35 loss.backward()
---> 36 losses.append(loss.data[0])
37 optimizer.step()
38 state = new_state
IndexError: invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number
Python 3.7.2 with:
[('Jinja2', '2.10.1'), ('Mako', '1.0.7'), ('Markdown', '3.0.1'), ('MarkupSafe', '1.1.0'), ('Pillow', '6.0.0'), ('Pygments', '2.3.1'), ('Send2Trash', '1.5.0'), ('appnope', '0.1.0'), ('attrs', '19.1.0'), ('backcall', '0.1.0'), ('bleach', '3.1.0'), ('cycler', '0.10.0'), ('decorator', '4.4.0'), ('defusedxml', '0.6.0'), ('entrypoints', '0.3'), ('ipykernel', '5.1.0'), ('ipython', '7.4.0'), ('ipython-genutils', '0.2.0'), ('ipywidgets', '7.4.2'), ('jedi', '0.13.3'), ('jsonschema', '3.0.1'), ('jupyter', '1.0.0'), ('jupyter-client', '5.2.4'), ('jupyter-console', '6.0.0'), ('jupyter-core', '4.4.0'), ('kiwisolver', '1.0.1'), ('matplotlib', '3.0.3'), ('mistune', '0.8.4'), ('nbconvert', '5.4.1'), ('nbformat', '4.4.0'), ('notebook', '5.7.8'), ('numpy', '1.16.3'), ('pandocfilters', '1.4.2'), ('parso', '0.4.0'), ('pdoc3', '0.5.2'), ('pexpect', '4.7.0'), ('pickleshare', '0.7.5'), ('pip', '19.0.3'), ('prometheus-client', '0.6.0'), ('prompt-toolkit', '2.0.9'), ('ptyprocess', '0.6.0'), ('pyparsing', '2.4.0'), ('pyrsistent', '0.14.11'), ('python-dateutil', '2.8.0'), ('pyzmq', '18.0.1'), ('qtconsole', '4.4.3'), ('setuptools', '40.8.0'), ('six', '1.12.0'), ('snap', '5.0.0-64-dev-macosx10.14.3-x64-py3.7'), ('terminado', '0.8.2'), ('testpath', '0.4.2'), ('torch', '1.0.1.post2'), ('torchvision', '0.2.2.post3'), ('tornado', '6.0.2'), ('traitlets', '4.3.2'), ('wcwidth', '0.1.7'), ('webencodings', '0.5.1'), ('wheel', '0.33.0'), ('widgetsnbextension', '3.4.2')]
state1 = torch.from_numpy(state_).float() #E
I noticed that when calling team_step() we are using the same parameter vector params[0] for both teams:
acts_1, act_means1, qvals1, obs_small_1, ids_1 = \
    team_step(team1,params[0],acts_1,layers) #B
env.set_action(team1, acts_1.detach().numpy().astype(np.int32)) #C
acts_2, act_means2, qvals2, obs_small_2, ids_2 = \
    team_step(team2,params[0],acts_2,layers)
env.set_action(team2, acts_2.detach().numpy().astype(np.int32))
Shouldn't it be params[0] for team 1 and params[1] for team 2? That's the behaviour shown later when calling train:
loss1 = train(batch_size,replay1,params[0],layers=layers,J=N1)
loss2 = train(batch_size,replay2,params[1],layers=layers,J=N1)
To run this in Colab, it seems necessary to add:
mp.set_start_method('spawn', force = True)
Listing 3.7 uses both experience replay and a target network to improve stability.
However, in the replay loop:
if len(replay) > batch_size:
    minibatch = random.sample(replay, batch_size)
    ...
    action_batch = torch.Tensor([a for (s1,a,r,s2,d) in minibatch])
Python raises this error:
---> 42 action_batch = torch.Tensor([a for (s1,a,r,s2,d) in minibatch])
too many dimensions 'str'
I suppose that when an experience is stored in the replay memory, the action is represented by a character, whereas the corresponding number is needed here.
So I propose adding a reverse action mapping to perform this conversion.
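The proposed reverse mapping could look like the sketch below. The contents of action_set are an assumption about how the notebook maps action indices to Gridworld move characters, and the replay entries are made up for illustration.

```python
# Assumed mapping from action index to the character passed to the game.
action_set = {0: 'u', 1: 'd', 2: 'l', 3: 'r'}

# Reverse mapping: character -> action index.
action_index = {char: idx for idx, char in action_set.items()}

# Hypothetical replay entries (s1, a, r, s2, done) that stored characters.
minibatch = [(None, 'u', -1, None, False), (None, 'r', 10, None, True)]

# Convert to integers before building the action tensor.
action_ids = [action_index[a] for (s1, a, r, s2, d) in minibatch]
print(action_ids)
```

The list action_ids can then be passed to torch.Tensor in place of the characters. If the replay buffer is filled by your own loop, it may be simpler to store the integer index in the experience tuple to begin with, so that no conversion is needed.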
get_action_value is not defined anywhere in the notebook.
I am very happy and grateful to be reading this brilliant book!
However, I recently found that some figures in the book are blank. In my case, Figures 3.17, 3.18, and 4.5 are all blank.
I am reading the eBook on O'Reilly, and I hope these figures can be restored so that readers can follow all of the authors' ideas.
I have attempted to run the Chapter 8 code, as a Python file, on an Ubuntu 18.04 rig with 32Gb of CPU RAM and a 16Gb NVidia 1800 GTi GPU card. However, my RAM utilisation grows excessively as the training epochs run, exceeding 30 Gb by the time I hit 1800 epochs on the Super Mario curiosity training code.
The book suggested that this code would take only 30 minutes on a MacBook Air (no GPU), so I don't understand why RAM use grows so much as training proceeds.
I am interested in anyone else's experience with this, or in why I might be seeing such excessive and growing RAM utilisation.
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==0.4.0
Could not find a version that satisfies the requirement torch==0.4.0 (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2, 0.4.1, 0.4.1.post2, 1.0.0, 1.0.1, 1.0.1.post2, 1.1.0, 1.2.0, 1.2.0+cpu, 1.2.0+cu92, 1.3.0, 1.3.0+cpu, 1.3.0+cu100, 1.3.0+cu92, 1.3.1, 1.3.1+cpu, 1.3.1+cu100, 1.3.1+cu92, 1.4.0, 1.4.0+cpu, 1.4.0+cu100, 1.4.0+cu92, 1.5.0, 1.5.0+cpu, 1.5.0+cu101, 1.5.0+cu92, 1.5.1, 1.5.1+cpu, 1.5.1+cu101, 1.5.1+cu92, 1.6.0, 1.6.0+cpu, 1.6.0+cu101, 1.6.0+cu92, 1.7.0, 1.7.0+cpu, 1.7.0+cu101, 1.7.0+cu110, 1.7.0+cu92, 1.7.1, 1.7.1+cpu, 1.7.1+cu101, 1.7.1+cu110, 1.7.1+cu92, 1.7.1+rocm3.7, 1.7.1+rocm3.8)
No matching distribution found for torch==0.4.0
import multiprocessing as mp
import numpy as np
def square(x): #A
    return np.square(x)
x = np.arange(64) #B
print(x)
print(mp.cpu_count())
pool = mp.Pool(4) #C
squared = pool.map(square, [x[8*i:8*i+8] for i in range(4)])
print(squared)
Unfortunately, I receive the following error after running the code.
print(squared)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63]
4
Process SpawnPoolWorker-9:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 315, in _bootstrap
self.run()
File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 114, in worker
task = get()
File "C:\ProgramData\Anaconda3\lib\multiprocessing\queues.py", line 358, in get
return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'square' on <module '__main__' (built-in)>
Process SpawnPoolWorker-10:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 315, in _bootstrap
self.run()
File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 114, in worker
task = get()
File "C:\ProgramData\Anaconda3\lib\multiprocessing\queues.py", line 358, in get
return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'square' on <module '__main__' (built-in)>
Process SpawnPoolWorker-11:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 315, in _bootstrap
self.run()
File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 114, in worker
task = get()
File "C:\ProgramData\Anaconda3\lib\multiprocessing\queues.py", line 358, in get
return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'square' on <module '__main__' (built-in)>
Process SpawnPoolWorker-12:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 315, in _bootstrap
self.run()
File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 114, in worker
task = get()
File "C:\ProgramData\Anaconda3\lib\multiprocessing\queues.py", line 358, in get
return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'square' on <module '__main__' (built-in)>
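The worker names (SpawnPoolWorker) show that the processes are started with the spawn method, so each worker has to import square and cannot find it in an interactive __main__. A workaround that is often suggested is to put the worker function in its own module file. The sketch below does that; the module name square_worker and the temporary directory are arbitrary choices, and the worker uses plain Python lists in place of np.square to stay dependency-free.

```python
import multiprocessing as mp
import os
import sys
import tempfile

# Write the worker to its own module so spawned processes can import it.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "square_worker.py"), "w") as f:
    f.write("def square(chunk):\n    return [v * v for v in chunk]\n")
sys.path.insert(0, module_dir)

from square_worker import square  # picklable by reference: square_worker.square

if __name__ == "__main__":
    x = list(range(64))
    chunks = [x[8 * i:8 * i + 8] for i in range(4)]
    with mp.Pool(4) as pool:
        squared = pool.map(square, chunks)
    print(squared)
```

In a real project the module would simply be a .py file saved next to the notebook; the temporary file is only there to keep the sketch in one piece.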
I am reading Chapter 2 of the book and have a question about the paragraph below.
In the reward function, are we assuming that each arm is played 10 times, each time checking whether a random value is less than prob? And is the returned reward an actual reward or an estimated one?
Per our casino example, we will be solving a 10-armed bandit problem, hence n = 10. We’ve also defined a numpy array of length n filled with random floats that can be understood as probabilities. The way we've chosen to implement our reward probability distributions for each arm/lever/slot machine is this: Each arm will have a probability, e.g. 0.7. The maximum reward is $10. We will setup a for loop to 10 and at each step, it will add +1 to the reward if a random float is less than the arm's probability. Thus on the first loop, it makes up a random float (e.g. 0.4). 0.4 is less than 0.7, so reward += 1. On the next iteration, it makes up another random float (e.g. 0.6) which is also less than 0.7, thus reward += 1. This continues until we complete 10 iterations and then we return the final total reward, which could be anything between 0 and 10. With an arm probability of 0.7, the average reward of doing this to infinity would be 7, but on any single play, it could be more or less.
import random

def reward(prob, n=10):
    reward = 0
    for i in range(n):
        if random.random() < prob:
            reward += 1
    return reward
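Going by the quoted paragraph, each call to reward simulates a single play of one arm: the 10 iterations are 10 coin flips inside that play, and the returned total is the actual (sampled) reward for it. The estimate is what the agent builds by averaging many such rewards. A small simulation illustrates this; the seed and the number of plays are arbitrary choices.

```python
import random

def reward(prob, n=10):
    """One play of an arm: count successes in n Bernoulli(prob) trials."""
    total = 0
    for _ in range(n):
        if random.random() < prob:
            total += 1
    return total

random.seed(0)
samples = [reward(0.7) for _ in range(10_000)]   # 10,000 separate plays
mean_reward = sum(samples) / len(samples)        # estimate of the expected reward
print(f"first play: {samples[0]}, mean over {len(samples)} plays: {mean_reward:.2f}")
```

Individual plays vary between 0 and 10, while the mean settles close to n * prob = 7, which matches the paragraph's statement about the long-run average.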
Both numpy and pytorch links in readme file are invalid.
Almost all the inline code snippets in chapter 2 are not working.
E.g:
plt.xlabel("Plays")
plt.ylabel("Avg Reward")
for i in range(500):
    if random.random() > eps:
        choice = get_best_arm(pastRewards, actions)
    else:
        choice = np.where(arms == np.random.choice(arms))[0][0]
    thisAV = np.array([[choice, reward(arms[choice])]])
    av = np.vstack((av, thisAV))
    percCorrect = 100*(len(av[np.where(av[:,0] == np.argmax(arms))])/len(av))
    runningMean = np.mean(av[:,1])
    plt.scatter(i, runningMean)
NameError: name 'pastRewards' is not defined
for i in range(500):
    choice = np.random.choice(arms, p=av_softmax)
    counts[choice] += 1
    k = counts[choice]
    rwd = reward(arms[choice])
    old_avg = av[choice]
    new_avg = old_avg + (1/k)*(rwd - old_avg)
    av[choice] = new_avg
    av_softmax = softmax(av)
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
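The IndexError comes from np.random.choice(arms, p=av_softmax), which returns one of the values stored in arms (a float probability) and not an arm index. Sampling from the index range fixes that. The sketch below is a self-contained variant under assumptions: the temperature tau, the seed, and the binomial draw standing in for reward(arms[choice]) are illustrative choices, not the book's code.

```python
import numpy as np

def softmax(av, tau=1.12):
    """Turn action values into selection probabilities (tau is a temperature)."""
    exps = np.exp(av / tau)
    return exps / exps.sum()

rng = np.random.default_rng(0)
n = 10
arms = rng.random(n)      # hidden payout probability of each arm
av = np.ones(n)           # running average reward per arm
counts = np.zeros(n)      # number of times each arm was played
av_softmax = softmax(av)

for _ in range(500):
    choice = rng.choice(n, p=av_softmax)   # integer arm index in [0, n)
    counts[choice] += 1
    k = counts[choice]
    rwd = rng.binomial(10, arms[choice])   # stand-in for reward(arms[choice])
    av[choice] += (rwd - av[choice]) / k   # incremental mean update
    av_softmax = softmax(av)
```

In the original loop the equivalent one-line change would be to sample from np.arange(len(arms)) in place of arms.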
>>> x = torch.Tensor([2,4]) #input data
>>> m = torch.randn(2, requires_grad=True) #parameter 1
>>> b = torch.randn(1, requires_grad=True) #parameter 2
>>> y = m*x+b #linear model
>>> loss = (torch.sum(y_known - y))**2 #loss function
>>> loss.backward() #calculate gradients
>>> m.grad
tensor([ 0.7734, -90.4993])
NameError: name 'y_known' is not defined
model = torch.nn.Sequential(
    torchh.nn.Linear(10, 150),
    torch.nn.ReLU(),
    torch.nn.Linear(150, 4),
    torch.nn.ReLU(),
)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
NameError: name 'torchh' is not defined
Does the Errata folder contain the correct versions of the notebooks or older incorrect versions? I think it would be helpful to have that information in the README.md or to simply include only corrected versions of the notebooks.
The German translation of the book (p. 134) promises:
"However, if you want, you can easily adapt the algorithm to a more difficult game like Pong in OpenAI Gym; you can find such an implementation on the GitHub page for this chapter: http://mng.bz/JzKp."
Unfortunately I couldn't find anything about it! Anyone know where the code is?
On page 134 of the book (Hanser) it says:
"However, if you want, you can easily adapt the algorithm to a more difficult game like Pong in OpenAI Gym; you can find such an implementation on the GitHub page for this chapter: http://mng.bz/JzKp."
Unfortunately I found nothing about it! Does anyone know where the code is?
When I try to run the notebook I am getting the following error in the cell "Without experience replay"
'Variable' object has no attribute 'reshape'
The error occurs in the line
newQ = model(new_state.reshape(1,64)).data.numpy()
I am running pytorch 0.3.1 on Windows 10