yrlu / irl-imitation
Implementation of Inverse Reinforcement Learning (IRL) algorithms in Python/Tensorflow. Deep MaxEnt, MaxEnt, LPIRL
Hi,
I think something is wrong with the gw.step() call at
(https://github.com/stormmax/irl-imitation/blob/master/maxent_irl_gridworld.py#L95)
and
(https://github.com/stormmax/irl-imitation/blob/master/deep_maxent_irl_gridworld.py#L72).
I think
cur_state, action, next_state, reward, is_done = gw.step(int(policy[gw.pos2idx(cur_state)]))
should be
cur_state, action, next_state, reward, is_done = gw.step(int(policy[gw.pos2idx(next_state)]))
Calling step() advances the current state stored inside the gridworld object. So after the call, the local variable next_state (confusingly, not cur_state) is the one that holds the environment's current state, and that is what should be passed to the policy.
Am I misunderstanding something?
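To make the concern concrete, here is a minimal sketch. `ChainEnv` is a hypothetical 1-D stand-in, not the repository's GridWorld; it only assumes what the post describes, namely that step() returns the pre-step state first and advances an internal state. Indexing the policy with the stale value can drive the environment through a move the policy never chose for that state.

```python
class ChainEnv:
    """Hypothetical 1-D chain with states 0..n-1 (not the repo's GridWorld)."""

    def __init__(self, n_states=4):
        self.n_states = n_states
        self.state = 0

    def reset(self, start=0):
        self.state = start
        return self.state

    def step(self, action):
        # action is +1 (right) or -1 (left); moves are clamped at the walls.
        cur = self.state
        nxt = min(max(cur + action, 0), self.n_states - 1)
        self.state = nxt  # the environment itself has now moved on
        done = nxt == self.n_states - 1
        return cur, action, nxt, float(done), done


def rollout(env, policy, start, max_steps, use_next_state):
    """Roll out `policy`, indexing it by either the returned next or cur state."""
    state = env.reset(start)
    visited = [state]
    for _ in range(max_steps):
        cur, action, nxt, reward, done = env.step(policy[state])
        state = nxt if use_next_state else cur  # cur lags one step behind
        visited.append(nxt)
        if done:
            break
    return visited


# The policy says "go left" in state 2, so it should never reach state 3.
policy = [+1, +1, -1, +1]
correct = rollout(ChainEnv(), policy, 0, 4, use_next_state=True)
stale = rollout(ChainEnv(), policy, 0, 4, use_next_state=False)
```

With the stale index, the environment is in state 2 while the policy is queried for state 1, so it moves right into state 3 even though the policy says to go left from state 2.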
Hey there,
I am not 100% sure, but I think there is something wrong with how the state visitation frequency is calculated (https://github.com/stormmax/irl-imitation/blob/master/deep_maxent_irl.py#L93).
You iterate over all the states in the outer loop and, for each state, calculate the frequency for every timestep:
for s in range(N_STATES):
    for t in range(T-1):
        if deterministic:
            mu[s, t+1] = sum([mu[pre_s, t]*P_a[pre_s, s, int(policy[pre_s])] for pre_s in range(N_STATES)])
        else:
            mu[s, t+1] = sum([sum([mu[pre_s, t]*P_a[pre_s, s, a1]*policy[pre_s, a1] for a1 in range(N_ACTIONS)]) for pre_s in range(N_STATES)])
In my opinion the loops should be switched:
for t in range(T-1):
    for s in range(N_STATES):
        if deterministic:
            mu[s, t+1] = sum([mu[pre_s, t]*P_a[pre_s, s, int(policy[pre_s])] for pre_s in range(N_STATES)])
        else:
            mu[s, t+1] = sum([sum([mu[pre_s, t]*P_a[pre_s, s, a1]*policy[pre_s, a1] for a1 in range(N_ACTIONS)]) for pre_s in range(N_STATES)])
The reason is that the visitation frequency at timestep t+1 depends on the frequencies of all states at timestep t, so every state has to be updated for t before moving on to t+1. This also matches the recurrence in the original MaxEnt paper (Ziebart et al., 2008).
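Written in the notation of the code above (a transcription of the stochastic-policy branch, not the paper's own symbols), the recurrence reads roughly as follows, with \mu_t(s) standing for mu[s, t], \pi(a \mid s') for policy[pre_s, a1] and P(s \mid s', a) for P_a[pre_s, s, a1]:

```latex
\[
\mu_{t+1}(s) \;=\; \sum_{s'} \sum_{a} \mu_t(s')\, \pi(a \mid s')\, P(s \mid s', a)
\]
```

The right-hand side only involves quantities at time t, which is why all of \mu_t has to be known before any entry of \mu_{t+1} is computed.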
Unfortunately, if I swap the two loops, the reward is no longer recovered correctly. Do you have any hints on this?
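A vectorized sketch of the time-outer ordering, for the stochastic-policy case only. The function name and the array shapes are assumptions chosen to mirror the snippets in this post (P_a[s, s1, a], policy[s, a]); this is not the repository's implementation, and it does not explain why the recovered reward changes.

```python
import numpy as np


def state_visitation(P_a, policy, mu0, T):
    """Propagate state visitation frequencies forward in time.

    Assumed shapes (illustrative):
      P_a[s, s1, a] = P(s1 | s, a), policy[s, a] = pi(a | s), mu0[s] = P(s_0 = s).
    """
    n_states = P_a.shape[0]
    # State-to-state transition matrix under the policy:
    # P_pi[s, s1] = sum_a pi(a | s) * P(s1 | s, a)
    P_pi = np.einsum("sa,sta->st", policy, P_a)
    mu = np.zeros((n_states, T))
    mu[:, 0] = mu0
    for t in range(T - 1):  # time is the only explicit loop
        mu[:, t + 1] = mu[:, t] @ P_pi  # all states updated at once
    return mu


# Two states that swap deterministically under a single action.
P_a = np.zeros((2, 2, 1))
P_a[0, 1, 0] = 1.0
P_a[1, 0, 0] = 1.0
policy = np.ones((2, 1))
mu = state_visitation(P_a, policy, mu0=np.array([1.0, 0.0]), T=3)
```

In this toy case the mass should return to state 0 at t=2. With the state-outer ordering, mu[0, 2] would be computed while mu[1, 1] is still zero, so it would come out as 0 instead of 1.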
Hi! Thank you for this great reference implementation - it is very helpful.
I was going over the LPIRL implementation and I think there are some redundant constraints in your LP matrices. See line 59 in lp_irl.py: this loop does the same thing as the previous loop on line 55, so it adds a duplicate set of constraints.
Thanks again,
Hey there,
I found another issue. Value iteration is defined like this:
See: http://ufal.mff.cuni.cz/~straka/courses/npfl114/2016/sutton-bookdraft2016sep.pdf
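For reference, the value iteration update in Sutton and Barto's notation, given here on the assumption that this is the definition the post refers to:

```latex
\[
v_{k+1}(s) \;=\; \max_{a} \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_k(s') \,\bigr]
\]
```

Here the reward r is attached to the transition, so whether it corresponds to rewards[s] or rewards[s1] in code depends on how the reward array is defined.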
Your code:
for s in range(N_STATES):
    v_s = []
    values[s] = max([sum([P_a[s, s1, a]*(rewards[s] + gamma*values_tmp[s1]) for s1 in range(N_STATES)]) for a in range(N_ACTIONS)])
https://github.com/stormmax/irl-imitation/blob/master/mdp/value_iteration.py#L42
So you take the reward of the current state s and add it to the discounted value of the next state s1. As I understand the formula, you should be using the reward of the next state s1 instead:
for s in range(N_STATES):
    v_s = []
    values[s] = max([sum([P_a[s, s1, a]*(rewards[s1] + gamma*values_tmp[s1]) for s1 in range(N_STATES)]) for a in range(N_ACTIONS)])
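The two variants above can be compared side by side. The sketch below is a vectorized illustration, not the repository's code: the function name, the `reward_on` switch and the stopping rule are assumptions, and it takes P_a[s, s1, a] to be a proper transition distribution over s1. Which variant is appropriate depends on whether rewards[s] means "reward for being in s" or "reward for arriving in s".

```python
import numpy as np


def value_iteration(P_a, rewards, gamma, reward_on="next", tol=1e-8, max_iter=10_000):
    """Value iteration with the reward read from the next or the current state."""
    n_states = P_a.shape[0]
    values = np.zeros(n_states)
    for _ in range(max_iter):
        if reward_on == "next":
            # Q[s, a] = sum_s1 P(s1 | s, a) * (R(s1) + gamma * V(s1))
            q = np.einsum("sta,t->sa", P_a, rewards + gamma * values)
        else:
            # Q[s, a] = R(s) + gamma * sum_s1 P(s1 | s, a) * V(s1)
            # (equal to the in-sum form when each P_a[s, :, a] sums to 1)
            q = rewards[:, None] + gamma * np.einsum("sta,t->sa", P_a, values)
        new_values = q.max(axis=1)
        if np.max(np.abs(new_values - values)) < tol:
            return new_values
        values = new_values
    return values


# State 0 always moves to state 1; state 1 is absorbing and carries reward 1.
P_a = np.zeros((2, 2, 1))
P_a[0, 1, 0] = 1.0
P_a[1, 1, 0] = 1.0
rewards = np.array([0.0, 1.0])
v_next = value_iteration(P_a, rewards, gamma=0.5, reward_on="next")
v_cur = value_iteration(P_a, rewards, gamma=0.5, reward_on="current")
```

In this toy chain both conventions agree on V(1) = 2 but differ on V(0): 2 when the reward is collected on arrival, 1 when it is collected in the current state. So the choice changes the values, even though neither form is wrong on its own terms.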
\irl-imitation\mdp\gridworld.py", line 151, in get_transition_states_and_probs
nei_s[1] < 0 or nei_s[1] >= self.width or self.grid[nei_s[0]][nei_s[1]] == 'x':
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
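NumPy raises this error when an array is indexed with something that is not an integer, most commonly a float. One possible cause, which is an assumption here since the traceback does not show how nei_s is built, is that the coordinates come out of a `/` division under Python 3 and end up as floats. A minimal reproduction with a hypothetical grid, plus the usual cast-to-int workaround:

```python
import numpy as np

# Hypothetical 2x2 grid; 'x' marks a blocked cell as in the traceback above.
grid = np.array([["0", "x"], ["0", "0"]])

# True division yields floats in Python 3: row == 1.0, col == 0.0.
row, col = 2 / 2, 0 / 2

try:
    grid[row][col]  # float index on a NumPy array
    message = None
except IndexError as err:
    message = str(err)

# Casting to int (or using // where the coordinates are computed) avoids the error.
cell = grid[int(row)][int(col)]
```

If this is indeed the cause, the cleaner fix is at the place where the coordinates are produced rather than at every indexing site.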