5 different techniques are used to solve the control problem of balancing a pole on a cart. The environment is taken from the OpenAI gym CartPole-v0.
- Random Search (Actor only)
- Hill Climbing (Actor only)
- Q-learning (Critic only)
- Sarsa (Critic only)
- Monte Carlo (Critic only)
python Main.py
- gym
- itertools
- math
- matplotlib
- networkx
- numpy
The goal is to move the cart left and right in such a way that the pole on top of it does not fall over. The state of the environment consists of 4 elements: cart position (x), cart velocity (xdot), pole angle (theta) and pole angular velocity (thetadot). For each time step in which the pole is still balanced on the cart we receive a reward of 1. The problem is considered solved when the average reward over 100 consecutive episodes is at least 195.
If we translate this problem into reinforcement learning terminology:
- the action space consists of two actions: 0 (push left) and 1 (push right)
- the state space is the set of all 4-element vectors (x, xdot, theta, thetadot) of continuous values
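For orientation, here is a minimal interaction loop with the environment using the classic gym API (where `env.step` returns `(observation, reward, done, info)`); the random action selection is only for illustration and is not part of any of the methods below:

```python
import gym

env = gym.make('CartPole-v0')
observation = env.reset()            # observation = [x, xdot, theta, thetadot]
total_reward, done = 0, False
while not done:
    action = env.action_space.sample()                 # 0 = push left, 1 = push right
    observation, reward, done, info = env.step(action)
    total_reward += reward                             # +1 for every step the pole stays up
print(total_reward)                                    # the episode return
```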
The policy/action is determined simply by the following linear decision rule:
if w_0 * x + w_1 * xdot + w_2 * theta + w_3 * thetadot + b > 0:
    return 1
else:
    return 0
In each episode we randomly generate the weights w = (w_0, w_1, w_2, w_3, b) and use the above decision rule to move the cart. If we stumble upon a winner (a w that survives for 200 steps), we save it for later. From a certain episode onward, we switch to a "greedy mode" and only pick randomly among the previously winning policies. If a policy wins again, we keep it in the winners list; if it does not, we remove it.
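A minimal sketch of the random-search phase could look as follows (the `run_episode` helper, the sampling range and the episode count are illustrative assumptions, and the later greedy phase over the winners list is omitted):

```python
import gym
import numpy as np

def run_episode(env, w):
    """Play one episode with the linear policy w = (w_0, w_1, w_2, w_3, b) and return the total reward."""
    observation, total_reward, done = env.reset(), 0, False
    while not done:
        action = 1 if np.dot(w[:4], observation) + w[4] > 0 else 0
        observation, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward

env = gym.make('CartPole-v0')
winners = []                                  # policies that reached 200 steps
for episode in range(1000):
    w = np.random.uniform(-1, 1, size=5)      # random weights and bias
    if run_episode(env, w) == 200:
        winners.append(w)                     # keep the winner for the later greedy mode
```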
We use the same decision rule as in Random Search, but instead of generating a new w each episode, we simply wait until a good enough policy is found (> 150 time steps) and then apply the following logic until a winner is found.
while reward(w_old) < 200:
    w_new = w_old + random_noise
    if reward(w_new) > reward(w_old):
        w_old = w_new
    else:
        pass  # keep w_old and try a new perturbation
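Here `reward(w)` stands for the total reward of one episode played with the policy w. A self-contained sketch of the loop, reusing the same kind of episode evaluator as in the Random Search sketch (the noise scale and starting point are illustrative assumptions):

```python
import gym
import numpy as np

def episode_reward(env, w):
    """Total reward of one episode played with the linear policy w = (w_0, w_1, w_2, w_3, b)."""
    observation, total_reward, done = env.reset(), 0, False
    while not done:
        action = 1 if np.dot(w[:4], observation) + w[4] > 0 else 0
        observation, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward

env = gym.make('CartPole-v0')
w_old = np.random.uniform(-1, 1, size=5)   # in the text: a policy that already reached > 150 time steps
while episode_reward(env, w_old) < 200:
    w_new = w_old + 0.1 * np.random.randn(5)               # small random perturbation of the weights
    if episode_reward(env, w_new) > episode_reward(env, w_old):
        w_old = w_new                                       # keep the improvement, otherwise discard it
```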
For the following 3 methods it is necessary to discretize the state space. To do this, we select a list of threshold points for each element of the observation and use these to discretize the continuous values. A discretization that is sufficient for all three algorithms is, for example, the following:
threshold_points = [[-1, 0, 1],                              # x
                    [-0.5, 0, 0.5],                          # xdot
                    [-math.radians(5), 0, math.radians(5)],  # theta
                    [-math.radians(5), 0, math.radians(5)]]  # thetadot
This results in a state space of 256 discrete states (4 bins per element, 4^4 = 256).
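One possible way to map a continuous observation onto one of these 256 discrete states (the `discretize` helper below is an illustrative sketch, not necessarily how Main.py implements it):

```python
import math

threshold_points = [[-1, 0, 1],                              # x
                    [-0.5, 0, 0.5],                          # xdot
                    [-math.radians(5), 0, math.radians(5)],  # theta
                    [-math.radians(5), 0, math.radians(5)]]  # thetadot

def discretize(observation):
    """Map a continuous observation to a tuple of bin indices.

    With 3 thresholds per element each value falls into one of 4 bins, giving 4**4 = 256 states.
    """
    return tuple(sum(value > t for t in thresholds)
                 for value, thresholds in zip(observation, threshold_points))

# Example: cart slightly right of center and moving right, pole tilted a little
print(discretize([0.2, 0.3, -math.radians(2), 0.0]))   # -> (2, 2, 1, 1)
```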
The Q value function update is the following; it can be applied online, without having to wait until the end of the episode.
Q(S_t, A_t) += alpha * (R_t + discount_factor * max_a Q(S_{t+1}, a) - Q(S_t, A_t))
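As a sketch, the tabular update could be implemented like this (the Q dictionary, the hyperparameter values and the use of the discretized state as a key are illustrative assumptions):

```python
from collections import defaultdict

Q = defaultdict(float)        # Q[(state, action)], initialized to 0
alpha = 0.1                   # learning rate
discount_factor = 0.99

def q_learning_update(state, action, reward, next_state):
    # Off-policy target: reward plus the discounted value of the best action in the next state
    best_next = max(Q[(next_state, a)] for a in (0, 1))
    Q[(state, action)] += alpha * (reward + discount_factor * best_next - Q[(state, action)])
```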
The Sarsa update has the following form:
Q(S_t, A_t) += alpha * (R_t + discount_factor * Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t))
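Compared with Q-learning, the only change is that the target uses the action actually taken in the next state (on-policy). A self-contained sketch with the table passed in explicitly (names and defaults are illustrative):

```python
def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.1, discount_factor=0.99):
    """One Sarsa update on a tabular Q stored as a dict keyed by (state, action)."""
    # On-policy target: uses the next action the agent will actually take,
    # not the max over actions as in Q-learning.
    target = reward + discount_factor * Q.get((next_state, next_action), 0.0)
    q_sa = Q.get((state, action), 0.0)
    Q[(state, action)] = q_sa + alpha * (target - q_sa)
```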
Monte Carlo approximates the Q function with the sample mean of the observed returns:
Q(S, A) = np.mean([observed returns starting from the state-action pair (S, A)])
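A sketch of an every-visit Monte Carlo estimate, updated once a full episode has finished (the episode format and the discount factor are illustrative assumptions):

```python
from collections import defaultdict
import numpy as np

returns = defaultdict(list)   # all returns observed so far per (state, action) pair
Q = {}                        # Monte Carlo estimate of the action-value function
discount_factor = 0.99

def monte_carlo_update(episode):
    """episode is a list of (state, action, reward) tuples from one finished episode."""
    G = 0.0
    # Walk backwards so that G is the discounted return from each step onward
    for state, action, reward in reversed(episode):
        G = reward + discount_factor * G
        returns[(state, action)].append(G)
        Q[(state, action)] = np.mean(returns[(state, action)])
```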
This project is licensed under the MIT License - see the LICENSE.md file for details
- https://gym.openai.com/envs/CartPole-v0
- http://kvfrans.com/simple-algoritms-for-solving-cartpole/ - amazing blog post about Random Search and Hill Climbing