
Balancing the Pendulum

Q-learning with a Q-table is implemented to drive a simple pendulum to the desired (inverted) position.

Simulation of Simple Pendulum

Problem Formulation

The pendulum has a limit on the maximum torque it can apply; therefore, it must perform a few "back and forth" swings to reach the inverted position ( $\theta=\pi$ ) from the resting, non-inverted position ( $\theta=0$ ).

Pendulum Model

In the following, the state vector of the system is given by:

$$x = \begin{pmatrix} \theta \\ \omega \end{pmatrix}$$

We will also work with time-discretized dynamics and refer to $x_n$ as the state at time $t = n \Delta t$ (assuming a discretization time $\Delta t$).

We want to minimize the following discounted cost function

$$\sum_{i=0}^{\infty} \alpha^i g(x_i, u_i),$$

where

$$g(x_i, u_i) = (\theta_i-\pi)^2 + 0.01 \cdot \dot{\theta}_i^2 + 0.0001 \cdot u_i^2 \qquad \text{and} \qquad \alpha=0.99.$$

This cost mostly penalizes deviations from the inverted position, but it also encourages small velocities and small control effort.
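As a minimal sketch, the stage cost translates directly into a few lines of Python (the name `stage_cost` is illustrative, not from the repository):

```python
import numpy as np

ALPHA = 0.99  # discount factor from the text

def stage_cost(x, u):
    """Stage cost g(x, u): deviation from the inverted position,
    plus small penalties on velocity and control effort."""
    theta, omega = x
    return (theta - np.pi) ** 2 + 0.01 * omega ** 2 + 0.0001 * u ** 2
```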

Q-table

The Q-learning algorithm is implemented with a table. For the action-value function $Q(x_t, u_t)$, we assume that $u$ can take only three possible values, that $\theta$ lies in ${[0,2\pi]}$, and that $\omega$ lies in ${[-6,6]}$. To build the table, the states must be discretized: we use $\mathbf{50}$ discretized values for $\theta$ and $\mathbf{50}$ for $\omega$, so the Q-table has dimension $\mathbf{50 \times 50 \times 3}$.
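A possible way to set this up is sketched below. The nearest-neighbor state mapping, the zero initialization, and the helper names (`discretize`, `U_SPACE`) are assumptions for illustration; the torque values $\{-4, 0, 4\}$ come from the experiments later in the text.

```python
import numpy as np

N_THETA, N_OMEGA = 50, 50
U_SPACE = np.array([-4.0, 0.0, 4.0])  # the three admissible torques

theta_grid = np.linspace(0.0, 2.0 * np.pi, N_THETA)
omega_grid = np.linspace(-6.0, 6.0, N_OMEGA)

# Q-table of dimension 50 x 50 x 3, initialized to zero
Q = np.zeros((N_THETA, N_OMEGA, len(U_SPACE)))

def discretize(x):
    """Map a continuous state (theta, omega) to the nearest grid indices."""
    theta = x[0] % (2.0 * np.pi)           # wrap the angle into [0, 2*pi)
    omega = np.clip(x[1], -6.0, 6.0)
    i = int(np.argmin(np.abs(theta_grid - theta)))
    j = int(np.argmin(np.abs(omega_grid - omega)))
    return i, j
```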


The Q-table gives the quality of each state-action pair, where the value of $Q$ is given by

$$ Q(x_t,u_t)=g(x_t, u_t)+ \alpha \min_{u}Q(x_{t+1}, u) $$

To obtain the next state $x_{t+1}$ given $(x_t, u_t)$ in the equation above, a function is defined that integrates the pendulum dynamics for one step of $0.1$ seconds and returns the next state as a 2-element NumPy array.
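A minimal sketch of such a function is shown below, assuming unit mass and length, standard gravity, and simple Euler substeps; the repository's actual physical parameters and integrator are not stated in the text and may well differ.

```python
import numpy as np

DT = 0.1                     # one integration step of 0.1 s
G, M, L = 9.81, 1.0, 1.0     # assumed gravity, mass, length (not stated in the text)

def next_state(x, u, dt=DT, n_substeps=10):
    """Integrate the pendulum dynamics for one 0.1 s step using simple
    Euler substeps; return the next state as a 2-element NumPy array."""
    theta, omega = x
    h = dt / n_substeps
    for _ in range(n_substeps):
        # theta = 0 is the hanging (non-inverted) position, so gravity
        # pulls the pendulum back toward theta = 0
        omega_dot = -(G / L) * np.sin(theta) + u / (M * L ** 2)
        theta = theta + h * omega
        omega = omega + h * omega_dot
    return np.array([theta % (2.0 * np.pi), np.clip(omega, -6.0, 6.0)])
```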

Policy and value function

The action-value function summarizes the result of a one-step lookahead: for each state-action pair it stores the optimal expected long-term cost, so optimal actions can be selected without knowing future states and their values, and thus without knowing anything about the dynamics of the environment.

$$ J^{*}(x_t) = \min_{u}Q(x_{t}, u) $$

$$ \mu^{*}(x_t) = \arg \min_{u}Q(x_t, u) $$

Importantly for reinforcement learning, both the policy and the value function can be learned, and both lead to near-optimal behavior.
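With the tabular `Q` from above, both quantities fall out of simple array reductions (the function names here are illustrative):

```python
import numpy as np

def value_function(Q):
    """J*(x) = min_u Q(x, u) for every discretized state (a 50 x 50 array)."""
    return Q.min(axis=2)

def greedy_policy(Q, u_space):
    """mu*(x) = argmin_u Q(x, u), mapped back to the torque values."""
    return u_space[Q.argmin(axis=2)]
```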

$\epsilon$-greedy policy

Epsilon-greedy is a simple method to balance exploration and exploitation by choosing between them randomly. Here, $\epsilon$ is the probability of exploring: the agent exploits most of the time, with a small chance of exploring, given by:

$$ u_t = \begin{cases} \textit{random action} & \textit{with probability } \epsilon\\ \arg \min_{u} Q(x_t, u) & \textit{otherwise} \\ \end{cases} $$
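A minimal sketch of this rule, returning an action index into `U_SPACE` (the helper name and the `rng` argument are assumptions):

```python
import numpy as np

def epsilon_greedy_action(Q, i, j, n_actions, epsilon, rng):
    """Return an action index: random with probability epsilon,
    otherwise the greedy (cost-minimizing) action."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmin(Q[i, j]))            # exploit (min, since we work with costs)
```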

Q-learning

Q-learning finds an optimal policy in the sense of minimizing the expected total discounted cost over all successive steps, starting from the current state. The temporal-difference error is calculated at each step using

$$ \delta_{t} = g(x_t, u_t) + \alpha \min_{u}Q(x_{t+1}, u) - Q(x_t, u_t) $$

and the value of $Q$ is updated by adding the product of the learning rate $\gamma$ and the temporal-difference error to $Q$:

$$ Q(x_t, u_t) \gets Q(x_t, u_t) + \gamma \delta_{t} $$
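The two equations above combine into a single tabular update; this sketch keeps the text's notation, where $\gamma$ is the learning rate and $\alpha$ the discount factor:

```python
def q_update(Q, i, j, a, cost, i_next, j_next, alpha=0.99, gamma=0.1):
    """One Q-learning step: delta = g + alpha * min_u Q(x', u) - Q(x, u),
    then Q(x, u) += gamma * delta."""
    delta = cost + alpha * Q[i_next, j_next].min() - Q[i, j, a]
    Q[i, j, a] += gamma * delta
    return delta
```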

When control $u \in \{-4, 0, 4\}$

With control $u \in \{-4, 0, 4\}$, the model was trained for 10000 episodes with $\gamma = 0.1$ and $\epsilon = 0.3$. The plot of cost vs. episodes shows that the model takes around 5000 episodes to train, after which the cost stabilizes. The moving average of the cost (window size 100) is also plotted to clarify the behavior of the cost over the episodes.
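Tying the earlier sketches together, a training loop along these lines would produce the cost curve; the episode length of 100 steps is an assumption (it is not stated in the text), and the helper functions are the illustrative ones defined above.

```python
import numpy as np

rng = np.random.default_rng(0)
N_EPISODES = 10_000
STEPS_PER_EPISODE = 100     # assumed episode length; not stated in the text

episode_costs = []
for ep in range(N_EPISODES):
    x = np.array([0.0, 0.0])        # start hanging down, at rest
    total = 0.0
    for _ in range(STEPS_PER_EPISODE):
        i, j = discretize(x)
        a = epsilon_greedy_action(Q, i, j, len(U_SPACE), epsilon=0.3, rng=rng)
        u = U_SPACE[a]
        cost = stage_cost(x, u)
        x_next = next_state(x, u)
        i2, j2 = discretize(x_next)
        q_update(Q, i, j, a, cost, i2, j2, alpha=0.99, gamma=0.1)
        x = x_next
        total += cost
    episode_costs.append(total)

# moving average with window size 100, as used for the plot
moving_avg = np.convolve(episode_costs, np.ones(100) / 100.0, mode="valid")
```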

Plot of cost vs episodes

The pendulum took two swings to reach the inverted position in about 3 seconds, as can be observed from the plots of the states vs. time.

Plot of states vs time

From the control plots, it can be observed that early in the swing-up the control holds the same value for longer stretches; during this time the pendulum is building up momentum for the swing-up. Closer to the desired condition, the control switches between values more quickly, and after reaching the inverted position it oscillates at a high rate between its upper and lower limits, which is on average equivalent to zero control.

This is clearer when the plot of $\omega$ is compared with the plot of the control: whenever $\omega$ is positive the control is positive, and vice versa.

Plot of control vs time

From the plot of the value function over the states, it can be observed that the value function is maximal when $\omega$ is zero and $\theta$ is at its extreme values. The plot of the policy shows that whenever the velocity $\omega$ is positive the policy is also positive, and when the velocity is negative the policy is negative, except in a few cases.

Plot of Value Function

Plot of Policy
