shangtongzhang / reinforcement-learning-an-introduction
Python Implementation of Reinforcement Learning: An Introduction
License: MIT License
Hi @ShangtongZhang, could you add a standard license, like MIT for example, to this project? Otherwise it is not exactly clear how it may be used.
The code from the author (Zhang) is:

    # get the n-step return from the given time
    def nStepReturnFromTime(self, n, time):
        # gamma is always 1 and rewards are zero except for the last reward
        # the formula can be simplified
        endTime = min(time + n, self.T)
        returns = self.value(self.trajectory[endTime])
        if endTime == self.T:
            returns += self.reward
        return returns

which I think should be like this:

    # get the n-step return from the given time
    def nStepReturnFromTime(self, n, time):
        # gamma is always 1 and rewards are zero except for the last reward
        # the formula can be simplified
        endTime = min(time + n, self.T)
        if endTime == self.T:
            returns = self.reward
        else:
            returns = self.value(self.trajectory[endTime])
        return returns
I am confused; can somebody help me out?
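One way to see the relationship between the two snippets: they agree exactly when the value estimate of the terminal state is fixed at 0, which random-walk examples usually guarantee. A minimal sketch (standalone functions with illustrative names, not the repo's actual class methods):

```python
# Sketch: the two versions of the n-step return coincide whenever the
# terminal state's value estimate is 0.

def n_step_return_author(value, trajectory, T, reward, n, time):
    """Author's version: always bootstrap, then add the reward at the end."""
    end_time = min(time + n, T)
    returns = value[trajectory[end_time]]
    if end_time == T:
        returns += reward
    return returns

def n_step_return_asker(value, trajectory, T, reward, n, time):
    """Asker's version: at termination, use the reward alone."""
    end_time = min(time + n, T)
    if end_time == T:
        return reward
    return value[trajectory[end_time]]

# With value[terminal] == 0 the two give identical returns:
value = {0: 0.1, 1: 0.4, 2: 0.0}   # state 2 is terminal, value fixed at 0
trajectory = [0, 1, 2]
T = 2
for t in range(T):
    a = n_step_return_author(value, trajectory, T, reward=1.0, n=2, time=t)
    b = n_step_return_asker(value, trajectory, T, reward=1.0, n=2, time=t)
    assert a == b
```

So the rewrite is not wrong, but it is only equivalent to the original under that (common) convention.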
Hi, we are currently reading the book and posted a question on StackOverflow. Could you please give any hints or thoughts?
This is the link:
In the getAction method, I see that the indices are shuffled upon exploration. Since the shuffle permutes the indices in place (permanently), it may affect the exploitation mechanism when performing np.random.choice(self.indices, p=self.actionProb).
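The concern hinges on a real NumPy behavior, illustrated below (a minimal demonstration, not the repo's code): np.random.shuffle mutates its argument in place, while np.random.permutation returns a shuffled copy and leaves the original untouched.

```python
import numpy as np

# np.random.shuffle reorders the array IN PLACE:
indices = np.arange(5)
np.random.shuffle(indices)          # `indices` itself is now reordered
assert sorted(indices) == list(range(5))

# np.random.permutation returns a shuffled COPY:
safe_indices = np.arange(5)
permuted = np.random.permutation(safe_indices)
assert list(safe_indices) == [0, 1, 2, 3, 4]   # original untouched
assert sorted(permuted) == list(range(5))
```

If the same index array is later paired element-wise with a probability array, an in-place shuffle would silently break that pairing, so permutation (or shuffling a copy) is the safer choice here.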
Hi, may I ask about the Blackjack.py file in chapter5?
Why does line 195 add the initial state to the trajectory again? Won't this cause the initial state to be counted repeatedly, since line 137 already appends all states to the trajectory? Or do I misunderstand something?
Thanks.
Dear Shangtong,
Code line 157 should change from "argmax(values)" to "np.argmax(values)",
since the original code raises a NameError: "global name 'argmax' is not defined".
Best Regards,
yuqing
Is your code one of the exercises from Chap. 1? Mine is a port of the tic-tac-toe Lisp example to Haskell (https://github.com/mohanr/Reinforcement-Learning-An-Introduction-by-Richard-S.-Sutton-and-Andrew-G.-Barto/blob/master/tictactoe.hs).
What is the result from 'run' or 'runs' that proves the learning is happening?
My code returns a value between 40 and 50.
I mean this section.
(defun run ()
(loop repeat 40 do (print (/ (loop repeat 100 sum (game t))
100.0))))
My results are like this.
Played 100 times 42.0 0.42
Played 100 times 43.0 0.43
Played 100 times 42.5 0.425
Played 100 times 40.5 0.405
Played 100 times 43.0 0.43
Played 100 times 43.0 0.43
Played 100 times 42.0 0.42
Did you create the example with a view to actually proving that the RL algorithm learns?
With 3 nested loops, isn't the number of simulations nBandits * nBandits * time?
Hi,
When comparing the backup updates between the two methods for the extended mazes, you multiply the actual steps taken by n (planning_steps). Shouldn't this be multiplied by n+1, since each real step uses n updates from simulated experience plus 1 more update from the real experience?
Hi,
When I run CliffWalking.py from Chapter 6 I get this as optimal policy output ( graphs look great btw ):
Sarsa Optimal Policy:
['R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'D']
['U', 'U', 'U', 'U', 'U', 'R', 'U', 'U', 'U', 'U', 'R', 'D']
['R', 'U', 'L', 'R', 'L', 'U', 'R', 'U', 'U', 'L', 'R', 'D']
['U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'G']
Q-Learning Optimal Policy:
['L', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'D']
['R', 'R', 'R', 'R', 'R', 'R', 'D', 'R', 'R', 'R', 'R', 'D']
['R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'D']
['U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'G']
These are not optimal; if you follow them in the cliff gridworld, they lead nowhere.
Any comments ?
Regards, Ranko.
In GridWorld.py, you wrote the following code, using the Bellman equation to update the value function:

    while True:
        # keep iterating until convergence
        newWorld = np.zeros((WORLD_SIZE, WORLD_SIZE))
        for i in range(0, WORLD_SIZE):
            for j in range(0, WORLD_SIZE):
                for action in actions:
                    newPosition = nextState[i][j][action]
                    # bellman equation
                    newWorld[i, j] += actionProb[i][j][action] * (actionReward[i][j][action] + discount * world[newPosition[0], newPosition[1]])
        if np.sum(np.abs(world - newWorld)) < 1e-4:
            print('Random Policy')
            print(newWorld)
            break
        world = newWorld
Regarding the last line, world = newWorld: this assignment makes world and newWorld refer to the same object, sharing the same address and contents. That could also be seen as asynchronous updating, but if you wanted to implement the asynchronous way, there would be no need for two variables. So I believe this is an error, and the right implementation should be:

    world = copy.deepcopy(newWorld)

Thanks for your reply in advance.
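The distinction raised here is easy to demonstrate in isolation (an illustration, not the repo's code): plain assignment aliases a NumPy array, while .copy() (or copy.deepcopy) produces an independent buffer.

```python
import numpy as np

# Plain assignment only binds a second name to the SAME array:
world = np.zeros((2, 2))
alias = world
alias[0, 0] = 1.0
assert world[0, 0] == 1.0       # the "other" array changed too

# .copy() allocates a fresh, independent buffer:
independent = world.copy()
independent[0, 0] = 2.0
assert world[0, 0] == 1.0       # original untouched
```

Note that in the loop above, newWorld is re-created with np.zeros at the top of every iteration, so the rebinding world = newWorld never mutates a shared buffer mid-sweep; whether a copy is still needed depends on whether any other code holds a reference to the old array.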
In the takeAction method, updating qEst may have a problem. The update should apply (1 - π) for the selected action, and simply π for the other actions. But the following code updates the selected action with (1 - π) and the other actions with (0 - π):

    self.qEst = self.qEst + self.stepSize * (reward - baseline) * (oneHot - self.actionProb)
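For what it's worth, the one-hot form can be checked in isolation. A sketch (variable names taken from the issue, not verified against the repo) showing what the vectorized expression actually applies per action:

```python
import numpy as np

# The single vectorized update
#   qEst += stepSize * (reward - baseline) * (oneHot - actionProb)
# applies (1 - pi(selected)) to the selected action and (0 - pi(other)),
# i.e. -pi(other), to every other action, matching the gradient-bandit
# update in the book (Eq. 2.12), where the non-selected term enters with
# a minus sign.
k = 3
actionProb = np.array([0.2, 0.5, 0.3])
action = 1
oneHot = np.zeros(k)
oneHot[action] = 1.0

grad = oneHot - actionProb
assert np.isclose(grad[action], 1 - actionProb[action])
assert np.allclose(np.delete(grad, action), -np.delete(actionProb, action))
```

So the code and the book agree once the sign convention for non-selected actions is taken into account.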
Ch. 2, line 62:

Issue #1: If, on line 48, you set self.actionCount to an array of 10 zeros, then on line 62 the expression np.asarray(self.actionCount) + 1 adds +1 to the count of EVERY actionCount, not the SPECIFIC actionCount; this seems to be an error because...; note: asarray doesn't copy, it's referenced. Then, on line 77, self.actionCount[action] += 1.

Issue #2: Also, on line 62, self.qEst + self.UCBParam * np.sqrt(np.log(self.time + 1) / (np.asarray(self.actionCount) + 1)) adds the bonus term to EVERY action estimate (qEst), not just the SPECIFIC one.

The book does both of these calculations for each SPECIFIC action, not EVERY action; see page 37, Upper-Confidence-Bound Action Selection.
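A small sketch (not the repo's code, names illustrative) may resolve both issues: computing the UCB score for every action at once and then taking an argmax is equivalent to the book's per-action calculation, and np.asarray(x) + 1 builds a new array rather than incrementing the counts in place.

```python
import numpy as np

# Vectorized UCB scores over ALL actions at once:
q_est = np.array([0.1, 0.5, 0.2])
action_count = np.array([3, 10, 0])   # counts are NOT modified below
t = 13
c = 2.0
ucb = q_est + c * np.sqrt(np.log(t + 1) / (action_count + 1))

# Equivalent per-action computation, as written in the book:
per_action = [q_est[a] + c * np.sqrt(np.log(t + 1) / (action_count[a] + 1))
              for a in range(3)]
assert np.allclose(ucb, per_action)

# The "+ 1" creates a new array; the original counts are untouched:
assert list(action_count) == [3, 10, 0]
```

The "+ 1" inside the square root is commonly a guard against division by zero for never-tried actions, not a mutation of the counts; the specific action is then singled out by the argmax over the vector of scores.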
Hi,
Noticed the link in the readme is no longer right. It looks like all URIs are now without the "/sutton/" path.
Many thanks.
In line 218 of Blackjack.py:

    _, reward, trajectory = play(behaviorPolicy, initialState, initialAction)

Here initialAction is set to be a random action. In the implementation of the play function, if initialAction is not None, policyPlayerFn will not be used. This means that behaviorPolicy will not take effect. Does it have any special meaning here?
Any update on when the 2nd edition would be released into print?
Best,
Andrew
Hi,
I forked your repository, cloned it to my local workspace, and installed the whole package with python setup.py install. The package was successfully installed and I can see the package name with pip list. The problem is that I can run every script and function successfully in a terminal with python <name>.py, but I cannot run a module on its own, i.e., open a specific file and run it in an IDE like Spyder, or use Ctrl+B in Sublime. I thought it was related to the Python packaging functions. Could you give me some explanation?
Thanks
There is a mistake in the book:
for the gambler's problem in Chapter 4, there is no unique optimal policy.
The plot of optimal actions is missing other, equally valuable actions.
For example
There are no floating-point problems in your implementation.
The correct solution (I hope) is displayed in the images on the page
https://github.com/idsc-frazzoli/subare
Suggestion: prohibit the gambler from betting an amount of 0 as long as his/her cash is between 0 and 100. This will speed up the convergence.
The relevant code is:

    priority = abs(R + gamma * max_a Q(s', a) - Q(s, a))

However, stateActionValue should be updated using R + gamma * max_a Q(s', a) - Q(s, a), which isn't equal to priority.
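The point is that prioritized sweeping uses the TD error twice, in different forms. A sketch (illustrative names, not the repo's code):

```python
import numpy as np

# Prioritized sweeping: the PRIORITY is the absolute TD error (used to
# order the queue), while the value UPDATE uses the signed TD error
# scaled by the step size.

def td_error(q, state, action, reward, next_state, gamma):
    return reward + gamma * np.max(q[next_state]) - q[state, action]

q = np.zeros((2, 2))
delta = td_error(q, 0, 1, reward=1.0, next_state=1, gamma=0.9)

priority = abs(delta)        # what goes into the priority queue
q[0, 1] += 0.5 * delta       # what the Q-value update actually applies

assert priority == 1.0
assert q[0, 1] == 0.5
```

Reusing the unsigned priority as the update amount would flip the sign of negative TD errors, so the two quantities must be kept separate even though they come from the same expression.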
If two actions have the same state-action value, then np.argmax() will always return the first action. So I changed the implementation (MaximizationBias.py):

    # choose an action based on the epsilon-greedy algorithm
    def chooseAction(state, stateActionValues):
        if np.random.binomial(1, EPSILON) == 1:
            return np.random.choice(stateActions[state])
        elif state == STATE_A and len(set(stateActionValues[0])) == 1:
            return np.random.choice(stateActions[state])
        else:
            return np.argmax(stateActionValues[state])
Hi Shangtong,
When I tested this tic-tac-toe AI program, I found that the AI didn't play well as player 2. I think the tie rewards are not set properly. If the tie rewards are set to 0.1/0.5 for player 1/player 2, then the program works very well for both players. Below is the modified code.

    # give reward to two players
    def giveReward(self):
        if self.currentState.winner == self.p1Symbol:
            self.p1.feedReward(1)
            self.p2.feedReward(0)
        elif self.currentState.winner == self.p2Symbol:
            self.p1.feedReward(0)
            self.p2.feedReward(1)
        else:
            self.p1.feedReward(0.1)
            self.p2.feedReward(0.5)
One issue with np.argmax is that it always returns the first index even if there are many maximal values. Instead we should use

    np.random.choice([action for action, value in enumerate(values) if value == np.max(values)])

At the very beginning I implemented my own argmax in utils like this. However, utils was later removed due to many complaints about import errors, and at that time I switched back to np.argmax. Now it turns out to be problematic, especially for some tabular examples, where it may lead to an infinite loop. I fixed it where it's used for the behavior policy, as I think that's the only case where np.argmax causes a problem. However, I'll leave this issue open to give a hint in case you find some strange bug.
Observation: If, on line 46, self.qTrue.append(np.random.randn() + trueReward), where trueReward = 0,
then why does line 74 use reward = np.random.randn() + self.qTrue[action] instead of
reward = self.qTrue[action]?
Offer to help: I'm new to Python data analysis but not to financial programs (which is what, from my perspective, RL overall is); so if you want to assign me coding and/or analysis tasks in response to your 'help wanted' request, please ask. All I need is a requirements statement. Why? Because, and this may sound a bit weird, I think the survival of the human species depends on the development of RL over the next 4 years; after that it might be too late to help.
Hi Shangtong,
First of all, thank you so much for writing these python codes for the RL textbook.
For chapter03's GridWorld.py, I believe there might be an error in the update code for the 'Optimal Policy', i.e., the Bellman optimality equation: the update does not take into account the probability of actions, pi.
I understand that in the book, the Bellman optimality equation (3.19) does not include the probability of actions. But if you inspect closer, the maximization over action probabilities is done through the last term V*(s'), which is defined as the max over pi of V(s').
May I suggest that the code in lines 140-142 be fixed as follows (suggestion in the last line):

    for action in actions:
        newPosition = nextState[i][j][action]
        values.append(actionReward[i][j][action] + discount * world[newPosition[0], newPosition[1]])
    newWorld[i][j] = np.max([a * b for a, b in zip(actionProb[i][j].values(), values)])

The probability of actions, pi, is important because one could run into a case where the future rewards are large but the probability of an action is very, very small, making the expected value very small. If we just maximize the rewards (without taking the action probabilities into account), we might pick the wrong optimal policy.
Please let me know what you think.
I can be reached at [email protected]
Sincerely,
Chong Yi Xiang
When I run some examples, I get "ImportError: No module named utils.utils". I see there are subdirectories named "utils" and "extra" under the main directory, but I don't know how to load these two self-defined packages into the programs.
May I propose the following line to substitute for line 54:

    policy[state] = actions[np.argmax(np.round(action_returns[1:], 5))]

The [1:] avoids choosing the '0' action, which changes neither the state nor the expected returns. Since numpy.argmax chooses the first option in case of ties, rounding the near-ties ensures the one associated with the smallest action (or bet) is selected. The output of the app now resembles Figure 4.3 in Sutton and Barto's book.
PS: I just signed up for GitHub, so I may not be using it optimally yet.
https://github.com/superxingzheng/test/blob/master/reinforcement_learning/GamblersProblem/main.cpp
This is my implementation. Not only can tied best actions cause different plots: if we add the +1 reward gained on the transition to the goal (capital 100) state, the value of the capital-99 state will approach 2.0, not 1.0.
Hi,
In the RentalCar.py code, I noticed an issue where policy evaluation is only iterated to convergence the first time. On subsequent passes, policy evaluation is iterated just once before the policy-improvement step.
To fix this, the boolean policyImprovement needs to be set to false each time the policy-improvement block of code runs.
This changes the final state values, and convergence occurs in fewer steps.
Let me know if you agree with my findings, and I can submit a pull request for the fix.
Thank you for sharing this code. I find this very helpful in my study of the 2nd edition book.
The implementation between lines 154 and 170 in blackjack.py seems to count all aces the dealer might get after the first two cards as 1. What if the dealer gets an ace afterwards that can be used as 11? For example, the dealer gets the cards (2, 3, A).
Even with 500,000 episodes of monteCarloES, I sometimes get different results, which suggests that convergence is not really reached. Do you have any idea of the best way to check convergence?
Thanks a lot.
Hi there, I have a question about the implementation of the MC on-policy evaluation algorithm. Regarding First-Visit MC on page 76: in the final draft version of the book, it says "...for each state s appearing in the episode...", while in the code (https://github.com/ShangtongZhang/reinforcement-learning-an-introduction/blob/master/chapter05/Blackjack.py), at line 189, it seems only the initial state of each episode is sampled, and all the other states in playerTrajectory are discarded. Those states also appear for the first time in the episode, so it might be beneficial to include them when sampling the state values. I tested this change by modifying line 189 to

    for _, (usableAce, playerSum, dealerCard) in playerTrajectory:
        state[0] = usableAce
        state[1] = playerSum
        state[2] = dealerCard

Now all the states are sampled, and the results look quite similar to the old results.
Do you have any ideas? I might be wrong, and if so, thank you for pointing it out.
As a volunteer, I want to know how to view the front-end part of the project.
So please add a link to the project in README.md.
Ch. 2, line 62: what does the trailing '\' after the '+' do? It makes the code look confusing if '\' does nothing. Is '\' just an escape to continue the calculation on the next line?
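Yes: outside brackets, a trailing backslash is Python's explicit line-continuation character; it joins the physical lines into one logical expression and does nothing else. A quick illustration:

```python
# Explicit continuation with a trailing backslash:
total = 1 + \
        2 + \
        3
assert total == 6

# The usually preferred alternative is implicit continuation
# inside parentheses, which needs no backslash:
total2 = (1 +
          2 +
          3)
assert total2 == 6
```

PEP 8 recommends the parenthesized form, since a stray space after a backslash is a syntax error that is hard to spot.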
In the example (5.3) the book states
As the initial policy we use the policy evaluated in the previous blackjack example, that which sticks only on 20 or 21.
However, the implementation in this repository seems to use random selection as the initial policy. In fact, I'm unable to reproduce the results in the book if the initial policy is the previous one (targetPolicyPlayer in this repository's implementation).
Using random selection makes sense if the Monte Carlo ES algorithm is strictly followed (initial policies are "arbitrary"). How did the authors of this implementation know to use random selection instead of the previous policy? If I missed something, knowing what I missed may help me as I continue reading the book.
In Figure 4.1 of GridWorld, you update the values with the "two array" version, which is exactly the version used in the figure, but I suggest also including the "in place" version for comparison, since it converges faster.
Hi there, I have some confusion about your code for the gambler's problem in Chapter 4. Here is part of the code:

    while True:
        delta = 0.0
        for state in states[1:GOAL]:
            # get possible actions for current state
            actions = np.arange(min(state, GOAL - state) + 1)
            actionReturns = []
            for action in actions:
                actionReturns.append(headProb * stateValue[state + action] + (1 - headProb) * stateValue[state - action])
            newValue = np.max(actionReturns)
            delta += np.abs(stateValue[state] - newValue)
            # update state value
            stateValue[state] = newValue
        if delta < 1e-9:
            break
I wonder why you didn't add the win reward of 1 when the coin comes up heads in the following line:

    actionReturns.append(headProb * stateValue[state + action] + (1 - headProb) * stateValue[state - action])

PS: I added the win reward of 1 when the coin comes up heads, but then I get values over 2, not limited to 1 like a probability. This confused me, because the value iteration update rule says:
Look forward to your reply!
Thank you!
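A likely resolution of the confusion above, sketched on a tiny instance (illustrative code, not the repo's): the two common formulations of the gambler's problem are equivalent. Either the terminal value stateValue[GOAL] is pinned to 1 and all rewards are 0 (as the code does), or stateValue[GOAL] is 0 and a +1 reward is paid on the transition into GOAL. Doing BOTH double-counts the win, which is why values above 1 appear.

```python
import numpy as np

GOAL, p_head = 4, 0.4   # tiny instance so the sweep is easy to follow

def backup(values, reward_at_goal):
    """One value-iteration sweep; reward_at_goal is paid on reaching GOAL."""
    new = values.copy()
    for s in range(1, GOAL):
        returns = []
        for a in range(1, min(s, GOAL - s) + 1):
            win = values[s + a] + (reward_at_goal if s + a == GOAL else 0.0)
            returns.append(p_head * win + (1 - p_head) * values[s - a])
        new[s] = max(returns)
    return new

# Formulation A: terminal value 1, no reward (the code above).
va = np.zeros(GOAL + 1); va[GOAL] = 1.0
# Formulation B: terminal value 0, +1 reward on reaching the goal.
vb = np.zeros(GOAL + 1)
for _ in range(50):
    va, vb = backup(va, 0.0), backup(vb, 1.0)

assert np.allclose(va[1:GOAL], vb[1:GOAL])   # identical interior values
```

Adding the +1 reward while also keeping stateValue[GOAL] = 1 mixes the two formulations, producing values near 2 instead of the win probabilities.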
This will accelerate training significantly; thanks @datahaki for this optimization.
Hi there. I am also trying to write some code for the exercises in this book,
but I got stuck at Chapter 4's gambler problem.
I referred to your code and compared the results with mine: we both get similar results, but they are far from the result in the book.
I want to know the reasons, or your opinion on this question.
Thank you!
Hi.
The question is related to TD learning using the n-step Tree Backup Algorithm and also to the later unification work: https://arxiv.org/abs/1703.01327
In the book, it is mentioned a few times that the Tree Backup algorithm is a natural extension of Q-learning to a multi-step setting. However, considering its "off-policyness", I figured out that the algorithm does not seem to be able to find the greedy optimal policy in which only the optimal actions have positive probability and all others are zero. On the other hand, that is exactly what one-step Q-learning does (which is also the reason Q-learning is an off-policy method, in contrast to SARSA).
It seems the chain of n steps is torn in the places where the optimal policy emerges, if the target policy is absolutely greedy. With the more sophisticated unifying algorithm, such breaks should be filled in by going through the Expected SARSA leaves, but we can't, because all other actions have zero probability under an absolutely greedy target policy.
In the paper, we see some experiments with epsilon-greedy target policies, but absolutely greedy cases were not mentioned, nor was it discussed whether a target policy like the one in one-step Q-learning is possible/reasonable.
I'm curious about your explicit thoughts on this. I found some mention of the Tree Backup algorithm's inability to search efficiently because it tends to tear sequences, as you described in the summary after the chapter, but it was not discussed anywhere the algorithms are presented. So could you please share your experience with cases like this?
Thanks in advance and many thanks for your work!
Hello,
Very nice implementation of the examples from the book. Really helpful for better understanding of the ideas!
figure6_3(): It seems that for big alphas (e.g. 0.1), the batch updates for the random-walk example sometimes enter an infinite loop while trying to reach convergence of the updates array. The deltas crawl towards the 0.001 threshold very slowly and the program seems to hang. Putting a max_iterations parameter in place of "while True" fixes this, but then causes the next episode to increase each subsequent delta instead of decreasing it as the iterations progress.
Sometimes the deltas seem to increase during the "while True" iterations instead of going down, even without a preceding loop termination due to max_iterations (in case "while True" was replaced with "while iteration < max_iterations").
Did you have any problems of this kind, and do you have any idea why this might be happening?
Thank you
I tried to re-implement your code but have run into some problems, so I wonder, would you mind helping to answer some questions?
Since I just started from chapter one, the first question is about the tic-tac-toe code. Here are parts of your code:

    results.append(0)
    for i in range(0, BOARD_ROWS):
        results[-1] += self.data[i, i]
    results.append(0)
    for i in range(0, BOARD_ROWS):
        results[-1] += self.data[i, BOARD_ROWS - 1 - i]

Why do you write results.append(0), and what is the difference between the targets of those two for loops?
Hi ShangtongZhang,
This is regarding CarRental.py for Chapter 04. I'm curious about the expectedReturn function: in lines 115 and 116, for the variables realRentalFirst/SecondLoc, why do you require the valid rental requests to be no more than the actual number of cars?
Wouldn't you want to maximize your returns by actually renting out the maximum number of cars? When I changed lines 115 and 116 to the following:

    realRentalFirstLoc = max(numOfCarsFirstLoc, rentalRequestFirstLoc)
    realRentalSecondLoc = max(numOfCarsSecondLoc, rentalRequestSecondLoc)

I actually get higher expected returns in the range of 1650-1750, instead of 550-650. Let me know if this makes sense to you.
Regards,
Yi Xiang
[email protected]
Hi, Zhang
I have some confusion about the sample-average equation used in the code:

    self.qEst[action] += 1.0 / self.actionCount[action] * (reward - self.qEst[action])

In the book, the equation is:

    self.qEst[action] = sum(rewards received when this action was taken in the past) / self.actionCount[action]

I don't think these two equations are equivalent.
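The two forms are in fact the same estimator: the incremental rule Q_{n+1} = Q_n + (1/n)(R_n - Q_n) reproduces the running sample average exactly (this is the derivation behind Eq. 2.3 in the book). A quick check:

```python
# Incremental sample-average update vs. explicit sum / count.
rewards = [1.0, 0.0, 2.0, 4.0]

q_incremental = 0.0
for n, r in enumerate(rewards, start=1):
    # On the first step (n = 1) the initial estimate is fully overwritten,
    # so the initial value of q_incremental never biases the average.
    q_incremental += (r - q_incremental) / n

q_explicit = sum(rewards) / len(rewards)
assert abs(q_incremental - q_explicit) < 1e-12
```

The incremental form just avoids storing all past rewards; algebraically, expanding the recursion term by term recovers sum(rewards) / n.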
Thanks for your contribution! And do you have time to finish chapter 13, policy gradient methods?
I noticed that in cliff_walking.py, line 126, you mention that the q_learning function has an @expected parameter, when this only applies to Sarsa.
The current method of prediction (policy evaluation) is an iterative update of the value function under equiprobable actions.
But what if we update the values and action probabilities concurrently?
Or is what I am describing the policy iteration concept?
Can I modify the version of GridWorld in chapter04 and open a pull request?
@ShangtongZhang
Hello,
How did you calculate and implement the gradient descent in the differential semi-gradient Sarsa algorithm?
I am trying to understand the implementation, and for some reason I can't see how the gradient descent is implemented.
Thanks
    for i in self.data.reshape(BOARD_ROWS * BOARD_COLS):

What is the purpose of the reshape statement on line 34? It works even without it.
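The reshape does change what the loop iterates over (an illustration, assuming self.data is a 2-D NumPy array as in the tic-tac-toe example): iterating a 2-D array yields whole rows, while reshaping to 1-D first yields individual cell values.

```python
import numpy as np

BOARD_ROWS = BOARD_COLS = 3
data = np.arange(9).reshape(BOARD_ROWS, BOARD_COLS)

rows = [i for i in data]                                    # 3 row arrays
cells = [i for i in data.reshape(BOARD_ROWS * BOARD_COLS)]  # 9 scalar cells

assert len(rows) == 3
assert len(cells) == 9
```

So code that works "even without it" is likely only working by accident for that particular use; any logic that expects `i` to be a single cell value needs the flattened iteration.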