The a2c's intro from pthpth

a2c's Introduction

1. Take the state as input.
2. Compute the action probabilities for that state and take an action according to those probabilities.
3. Store the probabilities.
4. Perform the chosen action.
5. Store the reward after each action.
6. Repeat steps 1-5 until the episode ends.
7. Calculate the discounted rewards for each step in the trajectory.
8. Compute the gradients.
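
The action-sampling and discounted-reward steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `sample_action` and `discounted_returns` and the default discount factor `gamma=0.99` are hypothetical and not taken from the original repository.

```python
import random


def sample_action(probs, rng=random):
    """Pick an action index at random, weighted by the policy's probabilities.

    `probs` is assumed to be a list of non-negative numbers, one per action.
    """
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def discounted_returns(rewards, gamma=0.99):
    """Return G_t for every step t, where G_t = r_t + gamma * G_{t+1}.

    The list is filled from the last step backwards so that each G_t
    reuses the already computed G_{t+1}.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Three steps with reward 1 each and gamma = 0.5.
    print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Working backwards through the episode keeps the computation linear in the episode length, since each step needs only the return of the step after it.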

The objective function is:

The gradients can be derived from:

where $$G_t$$ is the discounted reward received as a consequence of the action taken at step $$t$$.
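
For reference, the standard REINFORCE form of these two expressions can be written as follows. The symbols $$\pi_\theta$$ (the policy), $$\theta$$ (its parameters), $$\gamma$$ (the discount factor), $$r_k$$ (the reward at step $$k$$), and $$T$$ (the episode length) are assumptions introduced here, and the exact form used in the original may differ.

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ G_0 \right],
\qquad
G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_k
$$

$$
\nabla_\theta J(\theta) \approx \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]
$$

The second line is the estimator commonly used in practice: each log-probability is weighted by the discounted reward that followed the action.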

That was the implementation of vanilla Policy Gradient; now let's look at A2C, which stands for Advantage Actor Critic. In the vanilla implementation, a trajectory often contains both good and bad actions, and the two cancel each other out, so the agent doesn't learn which actions are actually good and which are bad.

So in A2C we introduce a Critic, which tells how good the action taken at a particular step was. It is basically the difference between how much we could have gotten in this state and how much we actually got. It distinguishes between individual steps instead of scoring the whole trajectory. This is the advantage part.
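
A minimal sketch of the per-step advantage, assuming it is taken as the discounted reward actually obtained minus the critic's estimate for that state (the usual sign convention, assumed here). The function name `advantages` and the plain-list inputs are hypothetical.

```python
def advantages(returns, values):
    """Per-step advantage: actual discounted reward minus the critic's estimate.

    `returns[t]` is the discounted reward obtained from step t onwards and
    `values[t]` is the critic's prediction for the state at step t.
    """
    if len(returns) != len(values):
        raise ValueError("returns and values must have the same length")
    return [g - v for g, v in zip(returns, values)]


if __name__ == "__main__":
    # Step 0 did better than predicted, step 1 matched, step 2 did worse.
    print(advantages([1.75, 1.5, 1.0], [1.0, 1.5, 2.0]))  # [0.75, 0.0, -1.0]
```

A positive value marks a step that turned out better than the critic expected and a negative value marks one that turned out worse, which is what lets individual steps be told apart.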

The changed objective function is:

The gradients will be:
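
In the usual A2C formulation, the discounted reward $$G_t$$ in the policy gradient is replaced by an advantage. The sketch below assumes a critic $$V_\phi$$ with its own parameters $$\phi$$ and a policy $$\pi_\theta$$; these symbols and the squared-error critic loss are assumptions and may differ from the original.

$$
A_t = G_t - V_\phi(s_t)
$$

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t)\, A_t \right],
\qquad
\nabla_\theta J(\theta) \approx \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t \right]
$$

$$
L(\phi) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} \left( G_t - V_\phi(s_t) \right)^2 \right]
$$

Here $$A_t$$ is treated as a constant when differentiating with respect to $$\theta$$, and the critic is trained separately by minimizing $$L(\phi)$$.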

Implementation:

There will be two neural networks: one predicting the Q values for each state, and one predicting the state value of the current state.
The function we want to maximize is the expected difference between the predicted maximum reward from this state and the actual reward we get.
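
To make the two training signals concrete, here is a framework-free sketch of the losses that would typically be minimized, assuming the log-probabilities of the chosen actions, the discounted rewards, and the critic's predictions have already been collected for one episode. The name `a2c_losses` and the averaging over steps are assumptions, not the original implementation.

```python
import math


def a2c_losses(log_probs, returns, values):
    """Compute (actor_loss, critic_loss) for one episode.

    actor_loss  = -mean(log_prob_t * advantage_t)  (minimizing it maximizes the objective)
    critic_loss =  mean(advantage_t ** 2)          (squared prediction error)
    where advantage_t = returns[t] - values[t].
    """
    n = len(returns)
    if n == 0 or len(log_probs) != n or len(values) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    adv = [g - v for g, v in zip(returns, values)]
    actor_loss = -sum(lp * a for lp, a in zip(log_probs, adv)) / n
    critic_loss = sum(a * a for a in adv) / n
    return actor_loss, critic_loss


if __name__ == "__main__":
    # Two steps, each action chosen with probability 0.5.
    log_probs = [math.log(0.5), math.log(0.5)]
    actor_loss, critic_loss = a2c_losses(log_probs, [2.0, 1.0], [1.0, 1.0])
    print(actor_loss, critic_loss)
```

In an automatic-differentiation framework, the advantage would normally be detached from the graph in the actor loss so that the actor's gradient does not flow into the critic.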

Change the environment name to try out different environments.
