shangtongzhang / reinforcement-learning-an-introduction
Python Implementation of Reinforcement Learning: An Introduction
License: MIT License
Hi @ShangtongZhang, could you add a standard license, like MIT for example, to this project? Otherwise it is not exactly clear how it may be used.
The code from the author (Zhang) is:

    # get the n-step return from the given time
    def nStepReturnFromTime(self, n, time):
        # gamma is always 1 and rewards are zero except for the last reward
        # the formula can be simplified
        endTime = min(time + n, self.T)
        returns = self.value(self.trajectory[endTime])
        if endTime == self.T:
            returns += self.reward
        return returns

which I think should be like this:

    # get the n-step return from the given time
    def nStepReturnFromTime(self, n, time):
        # gamma is always 1 and rewards are zero except for the last reward
        # the formula can be simplified
        endTime = min(time + n, self.T)
        if endTime == self.T:
            returns = self.reward
        else:
            returns = self.value(self.trajectory[endTime])
        return returns
I am confused; can somebody help me out?
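One way to see the relationship between the two snippets: they agree exactly when the value estimate of the terminal state is fixed at 0, which random-walk examples usually guarantee. A minimal sketch (standalone functions with illustrative names, not the repo's actual class methods):

```python
# Sketch: the two versions of the n-step return coincide whenever the
# terminal state's value estimate is 0.

def n_step_return_author(value, trajectory, T, reward, n, time):
    """Author's version: always bootstrap, then add the reward at the end."""
    end_time = min(time + n, T)
    returns = value[trajectory[end_time]]
    if end_time == T:
        returns += reward
    return returns

def n_step_return_asker(value, trajectory, T, reward, n, time):
    """Asker's version: at termination, use the reward alone."""
    end_time = min(time + n, T)
    if end_time == T:
        return reward
    return value[trajectory[end_time]]

# With value[terminal] == 0 the two give identical returns:
value = {0: 0.1, 1: 0.4, 2: 0.0}   # state 2 is terminal, value fixed at 0
trajectory = [0, 1, 2]
T = 2
for t in range(T):
    a = n_step_return_author(value, trajectory, T, reward=1.0, n=2, time=t)
    b = n_step_return_asker(value, trajectory, T, reward=1.0, n=2, time=t)
    assert a == b
```

So the rewrite is not wrong, but it is only equivalent to the original under that (common) convention.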
Hi, we are currently reading the book and posted a question on StackOverflow. Could you please give any hints or thoughts?
This is the link:
In the getAction method, I see that the indices are shuffled upon exploration. Since the shuffle permutes the indices in place (permanently), it may affect the exploitation mechanism when performing np.random.choice(self.indices, p=self.actionProb).
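The concern hinges on a real NumPy behavior, illustrated below (a minimal demonstration, not the repo's code): np.random.shuffle mutates its argument in place, while np.random.permutation returns a shuffled copy and leaves the original untouched.

```python
import numpy as np

# np.random.shuffle reorders the array IN PLACE:
indices = np.arange(5)
np.random.shuffle(indices)          # `indices` itself is now reordered
assert sorted(indices) == list(range(5))

# np.random.permutation returns a shuffled COPY:
safe_indices = np.arange(5)
permuted = np.random.permutation(safe_indices)
assert list(safe_indices) == [0, 1, 2, 3, 4]   # original untouched
assert sorted(permuted) == list(range(5))
```

If the same index array is later paired element-wise with a probability array, an in-place shuffle would silently break that pairing, so permutation (or shuffling a copy) is the safer choice here.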
Hi, may I ask about the Blackjack.py file in chapter5?
Why does line 195 add the initial state to the trajectory again? Won't this cause the initial state to be counted repeatedly, since line 137 already appends all states to the trajectory? Or do I misunderstand something?
Thanks.
Dear Shangtong,
Code line 157 should change from "argmax(values)" to "np.argmax(values)",
since the original code raises a NameError: "global name 'argmax' is not defined".
Best Regards,
yuqing
Is your code one of the exercises from Chap. 1? Mine is a port of the tic-tac-toe Lisp example to Haskell (https://github.com/mohanr/Reinforcement-Learning-An-Introduction-by-Richard-S.-Sutton-and-Andrew-G.-Barto/blob/master/tictactoe.hs).
What is the result from 'run' or 'runs' that proves the learning is happening?
My code returns a value between 40 and 50.
I mean this section.
(defun run ()
(loop repeat 40 do (print (/ (loop repeat 100 sum (game t))
100.0))))
My results are like this.
Played 100 times 42.0 0.42
Played 100 times 43.0 0.43
Played 100 times 42.5 0.425
Played 100 times 40.5 0.405
Played 100 times 43.0 0.43
Played 100 times 43.0 0.43
Played 100 times 42.0 0.42
Did you create the example with a view to actually proving that the RL algorithm learns?
With 3 nested loops, isn't the number of simulations nBandits * nBandits * time?
Hi,
When comparing the backup updates between the two methods for the extended mazes, you multiply the actual steps taken by n (planning_steps). Shouldn't this be multiplied by n+1, since each real step uses n updates from simulated experience plus 1 more update from the real experience?
Hi,
When I run CliffWalking.py from Chapter 6 I get this as optimal policy output ( graphs look great btw ):
Sarsa Optimal Policy:
['R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'D']
['U', 'U', 'U', 'U', 'U', 'R', 'U', 'U', 'U', 'U', 'R', 'D']
['R', 'U', 'L', 'R', 'L', 'U', 'R', 'U', 'U', 'L', 'R', 'D']
['U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'G']
Q-Learning Optimal Policy:
['L', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'D']
['R', 'R', 'R', 'R', 'R', 'R', 'D', 'R', 'R', 'R', 'R', 'D']
['R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'D']
['U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'G']
These are not optimal; if you follow them in the cliff gridworld, they lead nowhere.
Any comments ?
Regards, Ranko.
In GridWorld.py, you wrote the following code, using the Bellman equation to update the value function:

    while True:
        # keep iterating until convergence
        newWorld = np.zeros((WORLD_SIZE, WORLD_SIZE))
        for i in range(0, WORLD_SIZE):
            for j in range(0, WORLD_SIZE):
                for action in actions:
                    newPosition = nextState[i][j][action]
                    # bellman equation
                    newWorld[i, j] += actionProb[i][j][action] * (actionReward[i][j][action] + discount * world[newPosition[0], newPosition[1]])
        if np.sum(np.abs(world - newWorld)) < 1e-4:
            print('Random Policy')
            print(newWorld)
            break
        world = newWorld
Regarding the last line, world = newWorld: this assignment makes world and newWorld refer to the same object, sharing the same address and contents. That could also be seen as asynchronous updating, but if you wanted to implement the asynchronous way, there would be no need for two variables. So I believe this is an error, and the right implementation should be:

    world = copy.deepcopy(newWorld)

Thanks for your reply in advance.
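The distinction raised here is easy to demonstrate in isolation (an illustration, not the repo's code): plain assignment aliases a NumPy array, while .copy() (or copy.deepcopy) produces an independent buffer.

```python
import numpy as np

# Plain assignment only binds a second name to the SAME array:
world = np.zeros((2, 2))
alias = world
alias[0, 0] = 1.0
assert world[0, 0] == 1.0       # the "other" array changed too

# .copy() allocates a fresh, independent buffer:
independent = world.copy()
independent[0, 0] = 2.0
assert world[0, 0] == 1.0       # original untouched
```

Note that in the loop above, newWorld is re-created with np.zeros at the top of every iteration, so the rebinding world = newWorld never mutates a shared buffer mid-sweep; whether a copy is still needed depends on whether any other code holds a reference to the old array.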
In the takeAction method, updating qEst may have a problem. The update should apply (1 - π) for the selected action, and simply π for the other actions. But the following code updates the selected action with (1 - π) and the other actions with (0 - π):

    self.qEst = self.qEst + self.stepSize * (reward - baseline) * (oneHot - self.actionProb)
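For what it's worth, the one-hot form can be checked in isolation. A sketch (variable names taken from the issue, not verified against the repo) showing what the vectorized expression actually applies per action:

```python
import numpy as np

# The single vectorized update
#   qEst += stepSize * (reward - baseline) * (oneHot - actionProb)
# applies (1 - pi(selected)) to the selected action and (0 - pi(other)),
# i.e. -pi(other), to every other action, matching the gradient-bandit
# update in the book (Eq. 2.12), where the non-selected term enters with
# a minus sign.
k = 3
actionProb = np.array([0.2, 0.5, 0.3])
action = 1
oneHot = np.zeros(k)
oneHot[action] = 1.0

grad = oneHot - actionProb
assert np.isclose(grad[action], 1 - actionProb[action])
assert np.allclose(np.delete(grad, action), -np.delete(actionProb, action))
```

So the code and the book agree once the sign convention for non-selected actions is taken into account.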
Ch. 2, line 62:

Issue #1: If, on line 48, you set self.actionCount to an array of 10 zeros, then on line 62 the expression np.asarray(self.actionCount) + 1 adds +1 to the count of EVERY actionCount, not the SPECIFIC actionCount; this seems to be an error because...; note: asarray doesn't copy, it's referenced. Then, on line 77, self.actionCount[action] += 1.

Issue #2: Also, on line 62, self.qEst + self.UCBParam * np.sqrt(np.log(self.time + 1) / (np.asarray(self.actionCount) + 1)) adds the bonus term to EVERY action estimate (qEst), not just the SPECIFIC one.

The book does both of these calculations for each SPECIFIC action, not EVERY action; see page 37, Upper-Confidence-Bound Action Selection.
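A small sketch (not the repo's code, names illustrative) may resolve both issues: computing the UCB score for every action at once and then taking an argmax is equivalent to the book's per-action calculation, and np.asarray(x) + 1 builds a new array rather than incrementing the counts in place.

```python
import numpy as np

# Vectorized UCB scores over ALL actions at once:
q_est = np.array([0.1, 0.5, 0.2])
action_count = np.array([3, 10, 0])   # counts are NOT modified below
t = 13
c = 2.0
ucb = q_est + c * np.sqrt(np.log(t + 1) / (action_count + 1))

# Equivalent per-action computation, as written in the book:
per_action = [q_est[a] + c * np.sqrt(np.log(t + 1) / (action_count[a] + 1))
              for a in range(3)]
assert np.allclose(ucb, per_action)

# The "+ 1" creates a new array; the original counts are untouched:
assert list(action_count) == [3, 10, 0]
```

The "+ 1" inside the square root is commonly a guard against division by zero for never-tried actions, not a mutation of the counts; the specific action is then singled out by the argmax over the vector of scores.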
Hi,
Noticed the link in the readme is no longer right. It looks like all URIs are now without the "/sutton/" path.
Many thanks.
In line 218 of Blackjack.py:

    _, reward, trajectory = play(behaviorPolicy, initialState, initialAction)

Here initialAction is set to be a random action. In the implementation of the play function, if initialAction is not None, policyPlayerFn will not be used. This means that behaviorPolicy will not take effect. Does it have any special meaning here?
Any update on when the 2nd edition would be released into print?
Best,
Andrew
Hi,
I forked your repository, cloned it to my local workspace, and installed the whole package with python setup.py install. The package was successfully installed and I can see the package name with pip list. The problem is that I can run every script and function successfully in a terminal with python <name>.py, but I cannot run a module on its own, i.e., open a specific file and run it in an IDE like Spyder, or use Ctrl+B in Sublime. I thought it was related to the Python packaging functions. Could you give me some explanation?
Thanks
There is a mistake in the book:
for the gambler's problem in Chapter 4, there is no unique optimal policy.
The plot of optimal actions is missing other, equally valuable actions.
For example
There are no floating-point problems in your implementation.
The correct solution (I hope) is displayed in the images on the page
https://github.com/idsc-frazzoli/subare
Suggestion: prohibit the gambler from betting an amount of 0 as long as his/her cash is between 0 and 100. This will speed up the convergence.
The relevant code is:

    priority = abs(R + gamma * max_a Q(s', a) - Q(s, a))

However, stateActionValue should be updated using R + gamma * max_a Q(s', a) - Q(s, a), which isn't equal to priority.
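The point is that prioritized sweeping uses the TD error twice, in different forms. A sketch (illustrative names, not the repo's code):

```python
import numpy as np

# Prioritized sweeping: the PRIORITY is the absolute TD error (used to
# order the queue), while the value UPDATE uses the signed TD error
# scaled by the step size.

def td_error(q, state, action, reward, next_state, gamma):
    return reward + gamma * np.max(q[next_state]) - q[state, action]

q = np.zeros((2, 2))
delta = td_error(q, 0, 1, reward=1.0, next_state=1, gamma=0.9)

priority = abs(delta)        # what goes into the priority queue
q[0, 1] += 0.5 * delta       # what the Q-value update actually applies

assert priority == 1.0
assert q[0, 1] == 0.5
```

Reusing the unsigned priority as the update amount would flip the sign of negative TD errors, so the two quantities must be kept separate even though they come from the same expression.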
If two actions have the same state-action value, then np.argmax() will always return the first action. So I changed the implementation (MaximizationBias.py):

    # choose an action based on the epsilon-greedy algorithm
    def chooseAction(state, stateActionValues):
        if np.random.binomial(1, EPSILON) == 1:
            return np.random.choice(stateActions[state])
        elif state == STATE_A and len(set(stateActionValues[0])) == 1:
            return np.random.choice(stateActions[state])
        else:
            return np.argmax(stateActionValues[state])
Hi Shangtong,
When I tested this tic-tac-toe AI program, I found that the AI didn't play well as player 2. I think the tie rewards are not set properly. If the tie rewards are set to 0.1/0.5 for player 1/player 2, then the program works very well for both players. Below is the modified code.

    # give reward to two players
    def giveReward(self):
        if self.currentState.winner == self.p1Symbol:
            self.p1.feedReward(1)
            self.p2.feedReward(0)
        elif self.currentState.winner == self.p2Symbol:
            self.p1.feedReward(0)
            self.p2.feedReward(1)
        else:
            self.p1.feedReward(0.1)
            self.p2.feedReward(0.5)
One issue with np.argmax is that it always returns the first index even if there are many maximal values. Instead we should use

    np.random.choice([action for action, value in enumerate(values) if value == np.max(values)])

At the very beginning I implemented my own argmax in utils like this. However, utils was later removed due to many complaints about import errors, and at that time I switched back to np.argmax. Now it turns out to be problematic, especially for some tabular examples, where it may lead to an infinite loop. I fixed it where it's used for the behavior policy, as I think that's the only case where np.argmax causes a problem. However, I'll leave this issue open to give a hint in case you find some strange bug.
Observation: If, on line 46, self.qTrue.append(np.random.randn() + trueReward), where trueReward = 0,
then why does line 74 use reward = np.random.randn() + self.qTrue[action] instead of
reward = self.qTrue[action]?
Offer to help: I'm new to Python data analysis but not to financial programs (which is what, from my perspective, RL overall is); so if you want to assign me coding and/or analysis tasks in response to your 'help wanted' request, please ask. All I need is a requirements statement. Why? Because, and this may sound a bit weird, I think the survival of the human species depends on the development of RL over the next 4 years; after that it might be too late to help.
Hi Shangtong,
First of all, thank you so much for writing these python codes for the RL textbook.
For chapter03's GridWorld.py, I believe there might be an error in the update code for the 'Optimal Policy', i.e., the Bellman optimality equation: the update does not take into account the probability of actions, pi.
I understand that in the book, the Bellman optimality equation (3.19) does not include the probability of actions. But if you inspect closer, the maximization over action probabilities is done through the last term V*(s'), which is defined as the max over pi of V(s').
May I suggest that the code in lines 140-142 be fixed as follows (suggestion in the last line):

    for action in actions:
        newPosition = nextState[i][j][action]
        values.append(actionReward[i][j][action] + discount * world[newPosition[0], newPosition[1]])
    newWorld[i][j] = np.max([a * b for a, b in zip(actionProb[i][j].values(), values)])

The probability of actions, pi, is important because one could run into a case where the future rewards are large but the probability of an action is very, very small, making the expected value very small. If we just maximize the rewards (without taking the action probabilities into account), we might pick the wrong optimal policy.
Please let me know what you think.
I can be reached at [email protected]
Sincerely,
Chong Yi Xiang
When I run some examples, I get "ImportError: No module named utils.utils". I see there are subdirectories named "utils" and "extra" under the main directory, but I don't know how to load these two self-defined packages into the programs.
May I propose the following line to substitute for line 54:

    policy[state] = actions[np.argmax(np.round(action_returns[1:], 5))]

The [1:] avoids choosing the '0' action, which changes neither the state nor the expected returns. Since numpy.argmax chooses the first option in case of ties, rounding the near-ties ensures the one associated with the smallest action (or bet) is selected. The output of the app now resembles Figure 4.3 in Sutton and Barto's book.
PS: I just signed up for GitHub, so I may not be using it optimally yet.
https://github.com/superxingzheng/test/blob/master/reinforcement_learning/GamblersProblem/main.cpp
This is my implementation. Not only can tied best actions cause different plots: if we add the +1 reward gained on the transition to the goal (capital 100) state, the value of the capital-99 state will approach 2.0, not 1.0.
Hi,
In the RentalCar.py code, I noticed an issue where policy evaluation is only iterated to convergence the first time. On subsequent passes, policy evaluation is iterated just once before the policy-improvement step.
To fix this, the boolean policyImprovement needs to be set to false each time the policy-improvement block of code runs.
This changes the final state values, and convergence occurs in fewer steps.
Let me know if you agree with my findings, and I can submit a pull request for the fix.
Thank you for sharing this code. I find this very helpful in my study of the 2nd edition book.
The implementation between lines 154 and 170 in blackjack.py seems to count all aces the dealer might get after the first two cards as 1. What if the dealer gets an ace afterwards that can be used as 11? For example, the dealer gets the cards (2, 3, A).
Even with 500,000 episodes of monteCarloES, I sometimes get different results, which suggests that convergence is not really reached. Do you have any idea of the best way to check convergence?
Thanks a lot.
Hi there, I have a question about the implementation of the MC on-policy evaluation algorithm. Regarding First-Visit MC on page 76: in the final draft version of the book, it says "...for each state s appearing in the episode...", while in the code (https://github.com/ShangtongZhang/reinforcement-learning-an-introduction/blob/master/chapter05/Blackjack.py), at line 189, it seems only the initial state of each episode is sampled, and all the other states in playerTrajectory are discarded. Those states also appear for the first time in the episode, so it might be beneficial to include them when sampling the state values. I tested this change by modifying line 189 to

    for _, (usableAce, playerSum, dealerCard) in playerTrajectory:
        state[0] = usableAce
        state[1] = playerSum
        state[2] = dealerCard

Now all the states are sampled, and the results look quite similar to the old results.
Do you have any ideas? I might be wrong, and if so, thank you for pointing it out.
As a volunteer, I want to know how to view the front-end part of the project.
So please add a link to the project in README.md.
Ch. 2, line 62: what does the trailing '\' after the '+' do? It makes the code look confusing if '\' does nothing. Is '\' just an escape to continue the calculation on the next line?
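Yes: outside brackets, a trailing backslash is Python's explicit line-continuation character; it joins the physical lines into one logical expression and does nothing else. A quick illustration:

```python
# Explicit continuation with a trailing backslash:
total = 1 + \
        2 + \
        3
assert total == 6

# The usually preferred alternative is implicit continuation
# inside parentheses, which needs no backslash:
total2 = (1 +
          2 +
          3)
assert total2 == 6
```

PEP 8 recommends the parenthesized form, since a stray space after a backslash is a syntax error that is hard to spot.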
In the example (5.3) the book states
As the initial policy we use the policy evaluated in the previous blackjack example, that which sticks only on 20 or 21.
However, the implementation in this repository seems to use random selection as the initial policy. In fact, I'm unable to reproduce the results in the book if the initial policy is the previous one (targetPolicyPlayer in this repository's implementation).
Using random selection makes sense if the Monte Carlo ES algorithm is strictly followed (initial policies are "arbitrary"). How did the authors of this implementation know to use random selection instead of the previous policy? If I missed something, knowing what I missed may help me as I continue reading the book.
In Figure 4.1 of GridWorld, you update the values with the "two array" version, which is exactly the version used in the figure, but I suggest also including the "in place" version for comparison, since it converges faster.
Hi there, I have some confusion about your code for the gambler's problem in Chapter 4. Here is part of the code:

    while True:
        delta = 0.0
        for state in states[1:GOAL]:
            # get possible actions for current state
            actions = np.arange(min(state, GOAL - state) + 1)
            actionReturns = []
            for action in actions:
                actionReturns.append(headProb * stateValue[state + action] + (1 - headProb) * stateValue[state - action])
            newValue = np.max(actionReturns)
            delta += np.abs(stateValue[state] - newValue)
            # update state value
            stateValue[state] = newValue
        if delta < 1e-9:
            break
I wonder why you didn't add the win reward of 1 when the coin comes up heads in the following line:

    actionReturns.append(headProb * stateValue[state + action] + (1 - headProb) * stateValue[state - action])

PS: I added the win reward of 1 when the coin comes up heads, but then I get values over 2, not limited to 1 like a probability. This confused me, because the value iteration update rule says:
Look forward to your reply!
Thank you!
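A likely resolution of the confusion above, sketched on a tiny instance (illustrative code, not the repo's): the two common formulations of the gambler's problem are equivalent. Either the terminal value stateValue[GOAL] is pinned to 1 and all rewards are 0 (as the code does), or stateValue[GOAL] is 0 and a +1 reward is paid on the transition into GOAL. Doing BOTH double-counts the win, which is why values above 1 appear.

```python
import numpy as np

GOAL, p_head = 4, 0.4   # tiny instance so the sweep is easy to follow

def backup(values, reward_at_goal):
    """One value-iteration sweep; reward_at_goal is paid on reaching GOAL."""
    new = values.copy()
    for s in range(1, GOAL):
        returns = []
        for a in range(1, min(s, GOAL - s) + 1):
            win = values[s + a] + (reward_at_goal if s + a == GOAL else 0.0)
            returns.append(p_head * win + (1 - p_head) * values[s - a])
        new[s] = max(returns)
    return new

# Formulation A: terminal value 1, no reward (the code above).
va = np.zeros(GOAL + 1); va[GOAL] = 1.0
# Formulation B: terminal value 0, +1 reward on reaching the goal.
vb = np.zeros(GOAL + 1)
for _ in range(50):
    va, vb = backup(va, 0.0), backup(vb, 1.0)

assert np.allclose(va[1:GOAL], vb[1:GOAL])   # identical interior values
```

Adding the +1 reward while also keeping stateValue[GOAL] = 1 mixes the two formulations, producing values near 2 instead of the win probabilities.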
This will accelerate training significantly; thanks @datahaki for this optimization.
Hi there. I am also trying to write some code for the exercises in this book,
but I got stuck at Chapter 4's gambler problem.
I referred to your code and compared the results with mine: we both get similar results, but they are far from the result in the book.
I want to know the reasons, or your opinion on this question.
Thank you!
Hi.
The question is related to TD learning using the n-step Tree Backup Algorithm and also to the later unification work: https://arxiv.org/abs/1703.01327
In the book, it is mentioned a few times that the Tree Backup algorithm is a natural extension of Q-learning to a multi-step setting. However, considering its "off-policyness", I figured out that the algorithm does not seem to be able to find the greedy optimal policy in which only the optimal actions have positive probability and all others are zero. On the other hand, that is exactly what one-step Q-learning does (which is also the reason Q-learning is an off-policy method, in contrast to SARSA).
It seems the chain of n steps is torn in the places where the optimal policy emerges, if the target policy is absolutely greedy. With the more sophisticated unifying algorithm, such breaks should be filled in by going through the Expected SARSA leaves, but we can't, because all other actions have zero probability under an absolutely greedy target policy.
In the paper, we see some experiments with epsilon-greedy target policies, but absolutely greedy cases were not mentioned, nor was it discussed whether a target policy like the one in one-step Q-learning is possible/reasonable.
I'm curious about your explicit thoughts on this. I found some mention of the Tree Backup algorithm's inability to search efficiently because it tends to tear sequences, as you described in the summary after the chapter, but it was not discussed anywhere the algorithms are presented. So could you please share your experience with cases like this?
Thanks in advance and many thanks for your work!
Hello,
Very nice implementation of the examples from the book. Really helpful for better understanding of the ideas!
figure6_3(): It seems that for big alphas (e.g. 0.1), the batch updates for the random-walk example sometimes enter an infinite loop while trying to reach convergence of the updates array. The deltas crawl towards the 0.001 threshold very slowly and the program seems to hang. Putting a max_iterations parameter in place of "while True" fixes this, but then causes the next episode to increase each subsequent delta instead of decreasing it as the iterations progress.
Sometimes the deltas seem to increase during the "while True" iterations instead of going down, even without a preceding loop termination due to max_iterations (in case "while True" was replaced with "while iteration < max_iterations").
Did you have any problems of this kind, and do you have any idea why this might be happening?
Thank you
I tried to re-implement your code but have run into some problems, so I wonder, would you mind helping to answer some questions?
Since I just started from chapter one, the first question is about the tic-tac-toe code. Here are parts of your code:

    results.append(0)
    for i in range(0, BOARD_ROWS):
        results[-1] += self.data[i, i]
    results.append(0)
    for i in range(0, BOARD_ROWS):
        results[-1] += self.data[i, BOARD_ROWS - 1 - i]

Why do you write results.append(0), and what is the difference between the targets of those two for loops?
Hi ShangtongZhang,
This is regarding CarRental.py for Chapter 04. I'm curious about the expectedReturn function: in lines 115 and 116, for the variables realRentalFirst/SecondLoc, why do you require the valid rental requests to be no more than the actual number of cars?
Wouldn't you want to maximize your returns by actually renting out the maximum number of cars? When I changed lines 115 and 116 to the following:

    realRentalFirstLoc = max(numOfCarsFirstLoc, rentalRequestFirstLoc)
    realRentalSecondLoc = max(numOfCarsSecondLoc, rentalRequestSecondLoc)

I actually get higher expected returns in the range of 1650-1750, instead of 550-650. Let me know if this makes sense to you.
Regards,
Yi Xiang
[email protected]
Hi, Zhang
I have some confusion about the sample-average equation used in the code:

    self.qEst[action] += 1.0 / self.actionCount[action] * (reward - self.qEst[action])

In the book, the equation is:

    self.qEst[action] = sum(rewards received when this action was taken in the past) / self.actionCount[action]

I don't think these two equations are equivalent.
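The two forms are in fact the same estimator: the incremental rule Q_{n+1} = Q_n + (1/n)(R_n - Q_n) reproduces the running sample average exactly (this is the derivation behind Eq. 2.3 in the book). A quick check:

```python
# Incremental sample-average update vs. explicit sum / count.
rewards = [1.0, 0.0, 2.0, 4.0]

q_incremental = 0.0
for n, r in enumerate(rewards, start=1):
    # On the first step (n = 1) the initial estimate is fully overwritten,
    # so the initial value of q_incremental never biases the average.
    q_incremental += (r - q_incremental) / n

q_explicit = sum(rewards) / len(rewards)
assert abs(q_incremental - q_explicit) < 1e-12
```

The incremental form just avoids storing all past rewards; algebraically, expanding the recursion term by term recovers sum(rewards) / n.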
Thanks for your contribution! And do you have time to finish chapter 13, policy gradient methods?
I noticed that in cliff_walking.py, line 126, you mention that the q_learning function has an @expected parameter, when this only applies to Sarsa.
The current method of prediction (policy evaluation) is an iterative update of the value function under equiprobable actions.
But what if we update the values and action probabilities concurrently?
Or is what I am describing the policy iteration concept?
Can I modify the version of GridWorld in chapter04 and open a pull request?
@ShangtongZhang
Hello,
How did you calculate and implement the gradient descent in the differential semi-gradient Sarsa algorithm?
I am trying to understand the implementation, and for some reason I can't see how the gradient descent is implemented.
Thanks
    for i in self.data.reshape(BOARD_ROWS * BOARD_COLS):

What is the purpose of the reshape statement on line 34? It works even without it.
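The reshape does change what the loop iterates over (an illustration, assuming self.data is a 2-D NumPy array as in the tic-tac-toe example): iterating a 2-D array yields whole rows, while reshaping to 1-D first yields individual cell values.

```python
import numpy as np

BOARD_ROWS = BOARD_COLS = 3
data = np.arange(9).reshape(BOARD_ROWS, BOARD_COLS)

rows = [i for i in data]                                    # 3 row arrays
cells = [i for i in data.reshape(BOARD_ROWS * BOARD_COLS)]  # 9 scalar cells

assert len(rows) == 3
assert len(cells) == 9
```

So code that works "even without it" is likely only working by accident for that particular use; any logic that expects `i` to be a single cell value needs the flattened iteration.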