div99 / iq-learn
(NeurIPS '21 Spotlight) IQ-Learn: Inverse Q-Learning for Imitation
Home Page: https://div99.github.io/IQ-Learn/
License: Other
Hey, thanks for sharing this work! I really appreciate the in-depth, beginner-friendly blog post. I was wondering if this pseudocode was
If not, feel free to close. But I would appreciate it if you could help me understand a few parts of the code. Thanks!
def init_network():
    q_net = torch.nn.Linear(state_size, action_size)
    target_net = deepcopy(q_net)

def episode_step():
    action = softmax(q_net(state))
    next_state, reward = env.step(action)
    memory.add((state, next_state, action, reward))  # memory = collections.deque
    update_critic(memory, expert_memory)
    target_net = deepcopy(q_net)

def update_critic(memory, expert_memory):
    # The idea here is that we backprop both the rewards for the expert's actions and the agent's actions
    # the batch dimension contains examples from the expert and the agent
    state = torch.cat((memory[:][0], expert_memory[:][0]))
    next_state = torch.cat((memory[:][1], expert_memory[:][1]))
    action = torch.cat((memory[:][2], expert_memory[:][2]))
    # v = sum of future rewards for all possible actions given current state
    v = torch.logsumexp(q_net(state), dim=1, keepdim=True)
    # next_v = sum of future rewards for all possible actions given state(t+1)
    next_v = torch.logsumexp(q_net(next_state), dim=1, keepdim=True)
    # q = sum of future rewards predicted given current state, action pair
    # (gather needs the action dimension as its first argument)
    q = q_net(state).gather(1, action)
    loss = iq_loss(q, v, next_v)
    critic_optimizer.zero_grad()
    loss.backward()
    critic_optimizer.step()

def iq_loss(q, v, next_v):
    if done:
        expert_reward = q[where_expert]
        # Why is value_loss determined entirely from the model output? Wouldn't this cause the model to collapse?
        value_loss = v.mean()
    else:
        expert_reward = (q - next_v)[where_expert]
        value_loss = (v - next_v).mean()
    # Why is this negative?
    expert_reward_loss = -expert_reward.mean()
    loss = expert_reward_loss + value_loss
    return loss
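For readers following the pseudocode, note that `logsumexp` over the Q-values is a soft maximum rather than a literal sum of future rewards. The stdlib-only sketch below assumes a temperature of 1, as the pseudocode implicitly does, and uses made-up Q-values; the helper names `soft_value` and `soft_policy` are illustrative and not taken from the repository.

```python
import math

def soft_value(q_values):
    """V(s) = log(sum_a exp(Q(s, a))), computed stably by subtracting the max."""
    m = max(q_values)
    return m + math.log(sum(math.exp(q - m) for q in q_values))

def soft_policy(q_values):
    """pi(a|s) = exp(Q(s, a) - V(s)), i.e. a softmax over the Q-values."""
    v = soft_value(q_values)
    return [math.exp(q - v) for q in q_values]

q_values = [1.0, 2.0, 3.0]  # made-up Q-values for a single state
v = soft_value(q_values)
pi = soft_policy(q_values)
print(v, pi)
```

The soft value always lies between `max(q_values)` and `max(q_values) + log(num_actions)`, and the resulting policy probabilities sum to 1.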
Hi!
First, thanks for sharing the codebase of this great work.
I have a question regarding the implementation of the iq_loss function, which resides in the "iq.py" file.
In lines 26-43, I don't quite understand why you used the gradient of \phi instead of directly following Equation (9) of the IQ-Learn paper with \phi from Table 4. I thought we could express the loss function exactly as in the equation, because PyTorch will compute the gradients automatically via autograd. I wonder what I'm missing here.
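As background for this question (and not as a statement of the authors' rationale): by the chain rule, the gradient of \phi(r) with respect to the parameters equals \phi'(r) times the gradient of r, so a surrogate loss \phi'(r) * r with \phi'(r) held constant has the same gradient as \phi(r) itself. The sketch below checks this numerically with an assumed \phi(x) = 1 - exp(-x) and a toy linear reward; both are illustrative choices, not necessarily an entry of Table 4 or the model used in the repository.

```python
import math

def phi(x):
    # Assumed concave function, chosen only for illustration.
    return 1.0 - math.exp(-x)

def phi_grad(x):
    # Derivative of the assumed phi.
    return math.exp(-x)

def reward(theta, feature):
    # Toy linear stand-in for Q(s, a) - gamma * V(s').
    return theta * feature

theta, feature, eps = 0.3, 2.0, 1e-6

# Gradient of phi(reward(theta)) by central finite differences.
direct = (phi(reward(theta + eps, feature))
          - phi(reward(theta - eps, feature))) / (2 * eps)

# Gradient of the surrogate phi_grad(r) * r, with phi_grad(r) held constant.
weight = phi_grad(reward(theta, feature))
surrogate = weight * feature

print(direct, surrogate)
```

Under these assumptions the two numbers agree up to finite-difference error, so the two formulations differ in how the loss value is reported, not in the update direction.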
Hello,
I recently started working with the provided implementation; thank you for sharing it. I wanted to know why access to the env is required for offline training. If we have expert data but no access to the env during training, do we need to change the code?
Hi!
I've been trying to get some of the examples running, but I keep hitting an error when the models are instantiated. The error below is from running CartPole_v1 from run_offline.sh.
Error in call to target 'agent.softq_models.OfflineQNetwork':
TypeError("__init__() missing 1 required positional argument: 'args'")
full_key: q_net.args.agent.critic_cfg
Since I get the same error with several models and environments, I assume it's something I've overlooked. I'm not getting any errors about dependencies.
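The TypeError itself means that the target constructor was called without a value for its `args` parameter, which may point to a mismatch between the config being passed and the constructor signature. A minimal stdlib reproduction of that class of error follows; `StandInQNetwork` and its parameters are hypothetical, and the real `OfflineQNetwork` signature may differ.

```python
class StandInQNetwork:
    """Hypothetical stand-in; the real OfflineQNetwork signature may differ."""

    def __init__(self, obs_dim, action_dim, args):
        self.obs_dim = obs_dim
        self.action_dim = action_dim
        self.args = args

try:
    # Only two positional values are supplied, so `args` is left unfilled.
    StandInQNetwork(4, 2)
except TypeError as err:
    message = str(err)
    print(message)

# Supplying the missing argument resolves this particular error.
net = StandInQNetwork(4, 2, args={"gamma": 0.99})
```

Comparing the keys of the failing config node (`q_net.args.agent.critic_cfg` in the report) against the constructor's parameter list is one way to narrow this down.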
Hi, thank you for providing this wonderful code. I am trying to adopt the IQ method in my custom environment. However, I am facing a diverging critic loss. I tried copying the original code from GitHub, but the divergence keeps happening. Is this normal when IQ imitation learning is combined with SAC, or am I using it the wrong way? I have uploaded my code with this post, along with my loss function.
```python
import os

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Normal


class IQ(nn.Module):
    def __init__(self, args):
        super(IQ, self).__init__()
        self.args = args
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.actor = Actor(self.args).to(self.device)
        self.q = Critic(self.args).to(self.device)
        # self.q_2 = Critic(self.args).to(self.device)
        self.target_q = Critic(self.args).to(self.device)
        # self.target_q_2 = Critic(self.args).to(self.device)
        self.soft_update(self.q, self.target_q, 1.)
        # self.soft_update(self.q_2, self.target_q_2, 1.)
        # self.alpha = nn.Parameter(torch.tensor(self.args.alpha_init))
        self.log_alpha = nn.Parameter(torch.log(torch.tensor(1e-3)))
        self.target_entropy = - torch.tensor(self.args.p_len * self.args.state_dim)
        self.q_optimizer = optim.Adam(self.q.parameters(), lr=self.args.q_lr)
        # self.q_2_optimizer = optim.Adam(self.q_2.parameters(), lr=self.args.q_lr)
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=self.args.actor_lr)
        self.alpha_optimizer = optim.Adam([self.log_alpha], lr=self.args.q_lr)
        # check directory
        isExist = os.path.exists(self.args.pretrain_model_dir)
        if not isExist:
            os.mkdir(self.args.pretrain_model_dir)

    @property
    def alpha(self):
        return self.log_alpha.exp()

    def get_action(self, depth, imu, dir_vector):
        # normalization
        depth, imu, dir_vector = self.normalization(depth, imu, dir_vector)
        mu, std = self.actor(depth, imu, dir_vector)
        std = std.exp()
        dist = Normal(mu, std)
        u = dist.rsample()
        u_log_prob = dist.log_prob(u)
        a = torch.tanh(u)
        # change-of-variables correction for the tanh squashing
        a_log_prob = u_log_prob - torch.log(1 - torch.square(a) + 1e-3)
        return a, a_log_prob.sum(-1, keepdim=True)

    def q_update(self, current_Q, current_v, next_v, done_masks, is_expert):
        # calculate 1st term for IQ loss
        # -E_(ρ_expert)[Q(s, a) - γV(s')]
        with torch.no_grad():
            y = (1 - done_masks) * self.args.gamma * next_v
        reward = (current_Q - y)[is_expert]
        # our proposed unbiased form for fixing kl divergence
        # 1st loss function
        phi_grad = torch.exp(-reward)
        loss = -(phi_grad * reward).mean()
        ######
        # sample using expert and policy states (works online)
        # E_(ρ)[V(s) - γV(s')], 2nd loss function
        value_loss = (current_v - y).mean()
        loss += value_loss
        # Use χ2 divergence (calculate the regularization term for IQ loss using expert and policy states) (works online)
        reward = current_Q - y
        # the alpha value is fixed at 0.5
        # chi2_loss = 1/(4 * self.alpha) * (reward**2).mean()
        chi2_loss = 1/(4 * 0.5) * (reward**2).mean()
        loss += chi2_loss
        ######
        return loss

    def train_network(self, writer, n_epi, train_memory):
        print("SAC UPDATE")
        depth, imu, dir_vector, actions, rewards, next_depth, next_imu, next_dir_vector, done_masks, is_expert = \
            self.get_samples(train_memory)
        q1, q2 = self.q(depth, imu, dir_vector, actions)
        v1, v2 = self.getV(self.q, depth, imu, dir_vector)
        with torch.no_grad():
            next_v1, next_v2 = self.get_targetV(self.target_q, next_depth, next_imu, next_dir_vector)
        # q_update
        q1_loss = self.q_update(q1, v1, next_v1, done_masks, is_expert)
        q2_loss = self.q_update(q2, v2, next_v2, done_masks, is_expert)
        # define critic loss
        critic_loss = 1/2 * (q1_loss + q2_loss)
        # update
        self.q_optimizer.zero_grad()
        critic_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.q.parameters(), 1.0)
        # step critic
        self.q_optimizer.step()
        ### actor update
        actor_loss, prob = self.actor_update(depth, imu, dir_vector)
        ### alpha update
        # alpha_loss = self.alpha_update(prob)
        self.soft_update(self.q, self.target_q, self.args.soft_update_rate)
        # q_2 / target_q_2 are commented out in __init__, so this call is disabled too
        # self.soft_update(self.q_2, self.target_q_2, self.args.soft_update_rate)
        if writer is not None:
            writer.add_scalar("loss/q_1", q1_loss, n_epi)
            writer.add_scalar("loss/q_2", q2_loss, n_epi)
            writer.add_scalar("loss/actor_loss", actor_loss, n_epi)
            # alpha_loss only exists when the alpha update above is enabled
            # writer.add_scalar("loss/alpha", alpha_loss, n_epi)
        # save model
        if np.mod(n_epi, self.args.save_period) == 0 and n_epi > 0:
            # save models
            torch.save(self.actor.state_dict(), self.args.pretrain_model_dir + str('actor.pt'))

    def actor_update(self, depth, imu, dir_vector):
        now_actions, now_action_log_prob = self.get_action(depth, imu, dir_vector)
        q_1, q_2 = self.q(depth, imu, dir_vector, now_actions)
        q = torch.min(q_1, q_2)
        loss = (self.alpha.detach() * now_action_log_prob - q).mean()
        self.actor_optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.actor.parameters(), 1.0)
        self.actor_optimizer.step()
        return loss, now_action_log_prob

    def alpha_update(self, now_action_log_prob):
        loss = (- self.alpha * (now_action_log_prob + self.target_entropy).detach()).mean()
        self.alpha_optimizer.zero_grad()
        loss.backward()
        self.alpha_optimizer.step()
        return loss

    def soft_update(self, network, target_network, rate):
        # Polyak averaging; rate=1.0 copies the network into the target
        for network_params, target_network_params in zip(network.parameters(), target_network.parameters()):
            target_network_params.data.copy_(target_network_params.data * (1.0 - rate) + network_params.data * rate)

    def get_expert_data(self):
        # define train and validation dataset
        self.expert_dataloader = DataLoader(True, self.args)
        # load expert and training dataset
        expert_depth, expert_imu, expert_dir_vector, expert_action, reward, done, expert_next_depth, expert_next_imu, expert_next_dir_vector \
            = self.expert_dataloader.__getitem__(batch_size=self.args.discrim_batch_size)
        # preprocess training labels
        expert_action = self.label_preprocessing(expert_action)
        # convert numpy arrays into tensors
        expert_depth = torch.Tensor(expert_depth).cuda()
        expert_imu = torch.Tensor(expert_imu).cuda()
        expert_dir_vector = torch.Tensor(expert_dir_vector).cuda()
        expert_action = torch.Tensor(expert_action).cuda()
        expert_next_depth = torch.Tensor(expert_next_depth).cuda()
        expert_next_imu = torch.Tensor(expert_next_imu).cuda()
        expert_next_dir_vector = torch.Tensor(expert_next_dir_vector).cuda()
        return expert_depth, expert_imu, expert_dir_vector, expert_action, expert_next_depth, expert_next_imu, expert_next_dir_vector

    def getV(self, critic, depth, imu, dir_vector):
        action, log_prob = self.get_action(depth, imu, dir_vector)
        current_Q1, current_Q2 = critic(depth, imu, dir_vector, action)
        current_V1 = current_Q1 - self.alpha.detach() * log_prob
        current_V2 = current_Q2 - self.alpha.detach() * log_prob
        return current_V1, current_V2

    def get_targetV(self, critic_target, depth, imu, dir_vector):
        action, log_prob = self.get_action(depth, imu, dir_vector)
        target_Q1, target_Q2 = critic_target(depth, imu, dir_vector, action)
        target_V1 = target_Q1 - self.alpha.detach() * log_prob
        target_V2 = target_Q2 - self.alpha.detach() * log_prob
        return target_V1, target_V2
```
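The `soft_update` method in the posted class is a Polyak average of the online and target parameters. The stdlib-only sketch below mirrors that formula on plain lists of floats, which stand in for network parameters; the numbers are made up for illustration.

```python
def soft_update(source, target, rate):
    """Polyak averaging on plain lists of floats (stand-ins for parameters)."""
    return [(1.0 - rate) * t + rate * s for s, t in zip(source, target)]

q_params = [1.0, -2.0]
target_params = [0.0, 0.0]

# rate=1.0 is a hard copy, as in the call made from __init__.
hard_copy = soft_update(q_params, target_params, 1.0)

# A smaller rate moves the target only part of the way toward the online network.
target_params = soft_update(q_params, target_params, 0.25)
print(hard_copy, target_params)  # [1.0, -2.0] [0.25, -0.5]
```

A small `soft_update_rate` makes the bootstrapped target `next_v` change slowly, which is the usual motivation for keeping a separate target critic.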
When I use the trajectories in iq_learn/experts/, the results are not optimal. Are these just demo data? Also, the expert datasets from Dropbox cannot be downloaded successfully.
Thanks for your help!
Hello!
Thanks so much for sharing the code!
I am new to inverse reinforcement learning. I am now trying to apply the code to a customized environment whose reward function is unknown. Are there any metrics other than rewards that can be used to judge convergence?
Thanks ;).
Hi, thanks for making your code accessible! I was wondering whether you could also share the code to reproduce the gridworld experiments, specifically Fig. 13 from your paper.
Hi, thank you very much for sharing your paper and source code! I am new to inverse RL, and I want to implement your method on a robot.
About Ant-v2
About random seeds
I ran the following command: python train_iq.py env=humanoid agent=sac expert.demos=1 method.loss=v0 method.regularize=True agent.actor_lr=3e-05 seed=0/1/2/3/4/5 agent.init_temp=1
Your work is very valuable, and I look forward to your help in resolving my doubts.
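The `seed=0/1/2/3/4/5` in the command above presumably stands for six separate runs, one per seed. A minimal sketch of how those runs could be enumerated is shown below; the overrides are copied from the post, while `build_commands` is a hypothetical helper that is not part of the repository.

```python
# Overrides copied from the command in the post; only the seed varies.
BASE_OVERRIDES = [
    "env=humanoid",
    "agent=sac",
    "expert.demos=1",
    "method.loss=v0",
    "method.regularize=True",
    "agent.actor_lr=3e-05",
    "agent.init_temp=1",
]

def build_commands(seeds):
    """Return one argv list per seed (hypothetical helper)."""
    return [
        ["python", "train_iq.py", *BASE_OVERRIDES, f"seed={seed}"]
        for seed in seeds
    ]

commands = build_commands(range(6))
for argv in commands:
    print(" ".join(argv))
```

Each argv list could then be passed to `subprocess.run(argv, check=True)` from the directory containing `train_iq.py`, so that per-seed results can be compared.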
I tried to reproduce the results presented in the paper with a single expert demonstration on the MuJoCo tasks. However, I can't reach any score close to the reported ones. For instance, in Walker2d-v2, the maximum score I can achieve is around 3500. I used the scripts provided in iq_learn/scripts/run_mujoco.sh. Can you please share all the hyperparameters that you used in these experiments?
I've tried using IQ-Learn for tasks in robosuite, using their dataset as well as data I collected myself, and IQ-Learn doesn't perform as well as BC and the other offline RL methods in robomimic. Can you please explain what the reasons might be? Thank you very much.
I can't reproduce the Atari results reported in your paper. I used your code directly, with your hyperparameters and the same package versions, but the results do not match. The problem also holds for MuJoCo, as I explained in another issue. Can you please provide the correct hyperparameters to reproduce your results?
Hi,
What would you suggest as a basic config setup for generating expert demonstrations on a custom environment?
Thanks,
Hi, I opened this issue regarding divergence problems. I think you already suggested some solutions (#11) to this problem, but some problems remain. I have spent a couple of months on this and still face the same issue. For a better discussion, I think I need to explain the training environment.
I am using your approach for a vision-based collision avoidance problem. If the agent reaches a goal point or collides, I restart the episode at a random position. The agent seems to learn well initially; however, the critic network then starts to diverge, and the network is completely destroyed by the end. It seems that the critic network thinks the current states are very good.
Here are my training settings:
iq.py): 1
I have tried different initial alpha values and found that a higher initial alpha value makes the network more stable but results in poor performance (the agent's behavior is far from the expert's). Am I using it the wrong way, or does it need more hyperparameter tuning?
I have attached loss figures for clarity. Waiting for your response. Many thanks.
This is the code in my iq.py that updates the critic:
def q_update(self, current_Q, current_v, next_v, done_masks, is_expert):
    with torch.no_grad():
        # bootstrapped target; zero for terminal transitions
        y = (1 - done_masks) * self.args.gamma * next_v
    reward = (current_Q - y)[is_expert]
    # 1st loss function
    loss = -(reward).mean()
    # 2nd loss function
    value_loss = (current_v - y).mean()
    loss += value_loss
    # Use χ2 divergence (calculate the regularization term for IQ loss using expert and policy states) (works online)
    reward = current_Q - y
    # NOTE: as posted, chi2_loss is computed but never added to loss
    chi2_loss = 1/(4 * 0.5) * (reward**2).mean()
    return loss
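To make the three terms of this update easier to inspect, the stdlib-only sketch below evaluates them by hand on a two-sample batch: the expert term -E_expert[Q - y], the value term E[V - y], and the χ2 term 1/(4α) E[(Q - y)^2] with α fixed at 0.5 as in the post. The helper name `iq_loss_terms` and all numbers are illustrative assumptions, not taken from the repository.

```python
def iq_loss_terms(q, v, next_v, done, is_expert, gamma, alpha=0.5):
    """Return (expert_term, value_term, chi2_term) for lists of equal length."""
    # Bootstrapped target: zero for terminal transitions.
    y = [(1.0 - d) * gamma * nv for d, nv in zip(done, next_v)]
    reward = [qi - yi for qi, yi in zip(q, y)]

    expert_rewards = [r for r, e in zip(reward, is_expert) if e]
    expert_term = -sum(expert_rewards) / len(expert_rewards)
    value_term = sum(vi - yi for vi, yi in zip(v, y)) / len(v)
    chi2_term = 1.0 / (4.0 * alpha) * sum(r * r for r in reward) / len(reward)
    return expert_term, value_term, chi2_term

expert_term, value_term, chi2_term = iq_loss_terms(
    q=[1.0, 2.0], v=[0.5, 1.5], next_v=[1.0, 0.0],
    done=[0.0, 1.0], is_expert=[True, False], gamma=0.5,
)
loss_without_chi2 = expert_term + value_term
loss_with_chi2 = loss_without_chi2 + chi2_term
print(expert_term, value_term, chi2_term)  # -0.5 0.75 1.0625
```

On this toy batch the loss is 0.25 without the χ2 term and 1.3125 with it. The first two terms are linear in Q and can decrease without bound, whereas the χ2 term penalizes large implied rewards, so whether it is included in the returned loss may be worth checking when the critic diverges.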
Dear author, it's an honor to read your paper and code! I am a novice in this area and am trying to reproduce your experimental results, but I have run into some obstacles. In Half-Cheetah, I don't get the 5076.6 points reported in the paper; my reward is even below 0 in most cases. The code is unmodified. Is the hyperparameter setting the reason? If so, could you share your hyperparameter settings? Thanks for sharing!
Hello! Could you provide the hyperparameters and the number of training steps for each MuJoCo env needed to reproduce the Table 5 results (Appendix D.2 in the original paper)?
I've tried the iq_learn/scripts/run_mujoco.sh script to train on Ant-v2 for ~300k steps with default hyperparameters and 10 expert trajectories, but only got eval returns of around 3000~4000. The eval/episode_reward shows 3301.59521 and the best_returns is 4275.31665. Thank you!
Can you add expert demos and support for the Carracing-v1 environment?
Hi, thanks for sharing your work.
Currently I'm trying to reproduce the results in the pointmaze environment. I am wondering why there is a negation in the visualize_reward function in vis/maze_vis.py (line 144).
Also, I would like to know whether the only_expert_state option works in the pointmaze environment. If so, is there a suitable set of hyperparameters for it? Thank you!