yrlu / irl-imitation Goto Github PK

Implementations of Inverse Reinforcement Learning (IRL) algorithms in Python/TensorFlow: Deep MaxEnt, MaxEnt, LPIRL

Python 100.00%

imitation imitation-learning inverse-reinforcement-learning irl learning-from-demonstration lfd machine-learning ml reinforcement-learning rl tensorflow


irl-imitation's Issues

Possible bug: action determined from previous (not current) state

Hi,

I think something is wrong with the gw.step() call at
(https://github.com/stormmax/irl-imitation/blob/master/maxent_irl_gridworld.py#L95)
and
(https://github.com/stormmax/irl-imitation/blob/master/deep_maxent_irl_gridworld.py#L72).

I think

    cur_state, action, next_state, reward, is_done = gw.step(int(policy[gw.pos2idx(cur_state)]))

should be

    cur_state, action, next_state, reward, is_done = gw.step(int(policy[gw.pos2idx(next_state)]))

Calling step() advances the current state stored inside the gridworld object. So the local variable next_state (confusingly, not cur_state) always corresponds to the environment's current state, and that is what should be passed to the policy.

Am I misunderstanding something?
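The point about which state to pass to the policy can be illustrated with a minimal sketch. `ChainWorld` and `rollout` are hypothetical names, not part of the repository; the only thing borrowed from the issue is the five-value return shape of step(). The loop keeps one variable that always holds the environment's post-step state and uses it for the next policy lookup.

```python
class ChainWorld:
    """Hypothetical 1-D environment that mimics the step() return shape quoted above."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def step(self, action):
        # action: 0 = move left, 1 = move right
        cur_state = self.state
        move = 1 if action == 1 else -1
        next_state = min(max(cur_state + move, 0), self.n_states - 1)
        self.state = next_state  # the environment advances internally
        is_done = next_state == self.n_states - 1
        reward = 1.0 if is_done else 0.0
        return cur_state, action, next_state, reward, is_done


def rollout(env, policy, max_steps=20):
    """Index the policy with the state the environment is in right now."""
    trajectory = []
    state = env.state
    for _ in range(max_steps):
        cur_state, action, next_state, reward, is_done = env.step(int(policy[state]))
        trajectory.append((cur_state, action, next_state))
        state = next_state  # carry the post-step state into the next lookup
        if is_done:
            break
    return trajectory


policy = [1, 1, 1, 1, 1]  # always move right
traj = rollout(ChainWorld(), policy)
```

Each transition's next_state equals the following transition's cur_state, which is the consistency the issue asks for.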

Possible bug: state visitation frequency

Hey there,

I am not 100% sure, but I think there is something wrong with how the state visitation frequency is calculated (https://github.com/stormmax/irl-imitation/blob/master/deep_maxent_irl.py#L93).

You iterate over all states in the outer loop and, for each state, compute the frequency at every timestep:

for s in range(N_STATES):
    for t in range(T-1):
        if deterministic:
            mu[s, t+1] = sum([mu[pre_s, t]*P_a[pre_s, s, int(policy[pre_s])] for pre_s in range(N_STATES)])
        else:
            mu[s, t+1] = sum([sum([mu[pre_s, t]*P_a[pre_s, s, a1]*policy[pre_s, a1] for a1 in range(N_ACTIONS)]) for pre_s in range(N_STATES)])

In my opinion the loops should be switched:

for t in range(T-1):
    for s in range(N_STATES):
        if deterministic:
            mu[s, t+1] = sum([mu[pre_s, t]*P_a[pre_s, s, int(policy[pre_s])] for pre_s in range(N_STATES)])
        else:
            mu[s, t+1] = sum([sum([mu[pre_s, t]*P_a[pre_s, s, a1]*policy[pre_s, a1] for a1 in range(N_ACTIONS)]) for pre_s in range(N_STATES)])

This is because the visitation frequency at timestep t+1 depends on the frequencies of all states at timestep t. It also matches the formula from the original MaxEnt paper (Ziebart et al., 2008), which in the notation of the code above reads:

    mu[s, t+1] = sum over pre_s and a of mu[pre_s, t] * policy[pre_s, a] * P_a[pre_s, s, a]

Unfortunately, if I swap the loop headers, the reward is no longer recovered correctly. Do you have any hints on this?
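For reference, here is a minimal vectorized sketch of the time-outer ordering, assuming the array layout implied by the snippets above (P_a[pre_s, s, a] and a stochastic policy[pre_s, a]). The function name `state_visitation`, the `mu0` argument and the two-state example are assumptions made for illustration, not the repository's code. Filling a whole column per timestep guarantees that every mu[pre_s, t] is available before column t+1 is computed.

```python
import numpy as np


def state_visitation(P_a, policy, mu0, T):
    """Propagate state visitation frequencies forward in time.

    P_a:    (N_STATES, N_STATES, N_ACTIONS), P_a[pre_s, s, a] = P(s | pre_s, a)
    policy: (N_STATES, N_ACTIONS), stochastic policy
    mu0:    (N_STATES,), initial state distribution
    """
    n_states = P_a.shape[0]
    mu = np.zeros((n_states, T))
    mu[:, 0] = mu0
    # State-to-state transition matrix under the policy: M[pre_s, s]
    M = np.einsum("psa,pa->ps", P_a, policy)
    for t in range(T - 1):  # time is the outer (and only explicit) loop
        mu[:, t + 1] = mu[:, t] @ M  # column t is complete before t+1 is filled
    return mu


# Two states, two actions: action 0 stays, action 1 switches state.
P_a = np.zeros((2, 2, 2))
P_a[0, 0, 0] = P_a[1, 1, 0] = 1.0
P_a[0, 1, 1] = P_a[1, 0, 1] = 1.0
policy = np.full((2, 2), 0.5)
mu = state_visitation(P_a, policy, np.array([1.0, 0.0]), T=4)
```

With this ordering every column of mu remains a probability distribution, which is a quick sanity check when comparing the two loop orders.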

LPIRL: Redundant Constraints

Hi! Thank you for this great reference implementation - it is very helpful.

I was going over the LPIRL implementation, and I think there are some redundant constraints in your LP matrices. See line 59 in lp_irl.py: this loop does the same thing as the previous loop on line 55, which produces a duplicate set of constraints.

Thanks again,
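One generic way to check for and drop exactly repeated constraints is sketched below. It assumes an inequality system of the form A x <= b and does not reflect the actual matrix layout in lp_irl.py; `drop_duplicate_constraints` is a hypothetical helper. Duplicate rows do not change the feasible region, so removing them only shrinks the problem.

```python
import numpy as np


def drop_duplicate_constraints(A, b):
    """Remove exactly repeated rows from the inequality system A x <= b."""
    stacked = np.hstack([A, b.reshape(-1, 1)])
    unique = np.unique(stacked, axis=0)  # sorts rows and drops exact duplicates
    return unique[:, :-1], unique[:, -1]


# The first and third constraints are identical.
A = np.array([[1.0, -1.0], [0.0, 1.0], [1.0, -1.0]])
b = np.array([0.0, 2.0, 0.0])
A_small, b_small = drop_duplicate_constraints(A, b)
```

Note that np.unique reorders the rows, which is harmless for an LP but matters if other code relies on the row order.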

Possible bug: value iteration

Hey there,

I found another issue. Value iteration is defined like this:

    V_{k+1}(s) = max_a sum_{s', r} p(s', r | s, a) * [r + gamma * V_k(s')]

See: http://ufal.mff.cuni.cz/~straka/courses/npfl114/2016/sutton-bookdraft2016sep.pdf

Your code:

for s in range(N_STATES):
    v_s = []
    values[s] = max([sum([P_a[s, s1, a]*(rewards[s] + gamma*values_tmp[s1]) for s1 in range(N_STATES)]) for a in range(N_ACTIONS)])

https://github.com/stormmax/irl-imitation/blob/master/mdp/value_iteration.py#L42

So you are using the reward of the current state s and adding it to the discounted value of the next state s1. As I understand the formula, you should be doing:

for s in range(N_STATES):
    v_s = []
    values[s] = max([sum([P_a[s, s1, a]*(rewards[s1] + gamma*values_tmp[s1]) for s1 in range(N_STATES)]) for a in range(N_ACTIONS)])
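To make the difference between the two conventions concrete, here is a minimal sketch that supports both. The function signature and the `reward_on` switch are assumptions made for illustration, not the repository's API. It assumes P_a[s, s1, a] = P(s1 | s, a) and that each P_a[s, :, a] sums to 1.

```python
import numpy as np


def value_iteration(P_a, rewards, gamma, reward_on="next", tol=1e-8, max_iter=10_000):
    """Value iteration with a per-state reward vector.

    reward_on="next":    the reward is collected in the state that is entered (rewards[s1]).
    reward_on="current": the reward is collected in the state that is left (rewards[s]).
    """
    values = np.zeros(P_a.shape[0])
    for _ in range(max_iter):
        if reward_on == "next":
            # sum_s1 P(s1 | s, a) * (r(s1) + gamma * V(s1))
            q = np.einsum("sna,n->sa", P_a, rewards + gamma * values)
        else:
            # r(s) + gamma * sum_s1 P(s1 | s, a) * V(s1); valid when P sums to 1 over s1
            q = rewards[:, None] + gamma * np.einsum("sna,n->sa", P_a, values)
        new_values = q.max(axis=1)
        if np.max(np.abs(new_values - values)) < tol:
            return new_values
        values = new_values
    return values


# Two states, two actions: action 0 stays, action 1 switches state.
P_a = np.zeros((2, 2, 2))
P_a[0, 0, 0] = P_a[1, 1, 0] = 1.0
P_a[0, 1, 1] = P_a[1, 0, 1] = 1.0
rewards = np.array([0.0, 1.0])
v_next = value_iteration(P_a, rewards, gamma=0.5, reward_on="next")
v_cur = value_iteration(P_a, rewards, gamma=0.5, reward_on="current")
```

On this toy MDP the two conventions give different values for state 0, so the choice affects the result whenever rewards differ between neighbouring states.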

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

\irl-imitation\mdp\gridworld.py", line 151, in get_transition_states_and_probs
nei_s[1] < 0 or nei_s[1] >= self.width or self.grid[nei_s[0]][nei_s[1]] == 'x':
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
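NumPy raises this message when an array is indexed with a non-integer value such as a float. The sketch below reproduces it under the assumption that the grid is a NumPy array and that the neighbour coordinates ended up as floats; the issue does not confirm either. The names `grid` and `nei_s` only mirror the traceback. Casting the coordinates to int before indexing avoids the error.

```python
import numpy as np

grid = np.array([["0", "x"], ["0", "0"]])

# Float coordinates, for example the result of float arithmetic on a position.
nei_s = np.array([0.0, 1.0])

try:
    grid[nei_s[0]][nei_s[1]]  # float index -> IndexError
except IndexError as exc:
    message = str(exc)

# Cast to int before indexing.
row, col = int(nei_s[0]), int(nei_s[1])
cell = grid[row][col]
```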
