
iq-learn's Introduction

Inverse Q-Learning (IQ-Learn)

[Project Page] [Blog Post] Official code base for IQ-Learn: Inverse soft-Q Learning for Imitation, NeurIPS '21 Spotlight

IQ-Learn is a simple, stable & data-efficient algorithm that's a drop-in replacement for methods like Behavior Cloning and GAIL, to boost your imitation learning pipelines!

Update: IQ-Learn was recently used to create the best AI agent for playing Minecraft, placing #1 in the NeurIPS MineRL BASALT Challenge using only recorded human player demos. (IQ-Learn also competed with methods that use human-in-the-loop interactions and surprisingly still achieved Overall Rank #2.)

We introduce Inverse Q-Learning (IQ-Learn), a state-of-the-art framework for Imitation Learning (IL) that directly learns soft Q-functions from expert data. IQ-Learn enables non-adversarial imitation learning and works in both offline and online IL settings. It is performant even with very sparse expert data, and scales to complex image-based environments, surpassing prior methods by more than 3x. It is very simple to implement, requiring ~15 lines of code on top of existing RL methods.
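
The core critic update can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration only, not the exact code in the iq_learn folder: the batch format, the function names, and the χ²-regularized variant of the loss are assumptions for a discrete-action soft Q-network.

import torch

def iq_critic_loss(q_net, expert_batch, policy_batch, gamma=0.99, reg_coef=0.5):
    # Concatenate expert and policy transitions; remember which rows are expert.
    obs = torch.cat([expert_batch["obs"], policy_batch["obs"]])
    next_obs = torch.cat([expert_batch["next_obs"], policy_batch["next_obs"]])
    act = torch.cat([expert_batch["act"], policy_batch["act"]])
    done = torch.cat([expert_batch["done"], policy_batch["done"]]).float()
    is_expert = torch.cat([torch.ones(len(expert_batch["obs"]), dtype=torch.bool),
                           torch.zeros(len(policy_batch["obs"]), dtype=torch.bool)])

    q_all = q_net(obs)                                        # [B, num_actions]
    q = q_all.gather(1, act.long().unsqueeze(1)).squeeze(1)   # Q(s, a)
    v = torch.logsumexp(q_all, dim=1)                         # soft value V(s)
    with torch.no_grad():
        next_v = torch.logsumexp(q_net(next_obs), dim=1)      # soft value V(s')

    # Implicit reward r(s, a) = Q(s, a) - gamma * V(s'), with no bootstrap at terminals.
    reward = q - (1.0 - done) * gamma * next_v

    loss = -reward[is_expert].mean()                          # maximize implied reward on expert data
    loss += (v - (1.0 - done) * gamma * next_v).mean()        # push down values on sampled states
    loss += (1.0 / (4.0 * reg_coef)) * (reward ** 2).mean()   # chi^2 regularization
    return loss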

Inverse Q-Learning is theoretically equivalent to Inverse Reinforcement Learning, i.e. learning rewards from expert data. However, it is much more powerful in practice. It admits very simple non-adversarial training and works in the fully offline IL setting (without any access to the environment), greatly exceeding Behavior Cloning.

IQ-Learn is the successor to Adversarial Imitation Learning methods like GAIL (coming from the same lab).
It extends the theoretical framework for Inverse RL to non-adversarial and scalable learning, showing guaranteed convergence for the first time.

Citation

@inproceedings{garg2021iqlearn,
title={IQ-Learn: Inverse soft-Q Learning for Imitation},
author={Divyansh Garg and Shuvam Chakraborty and Chris Cundy and Jiaming Song and Stefano Ermon},
booktitle={Thirty-Fifth Conference on Neural Information Processing Systems},
year={2021},
url={https://openreview.net/forum?id=Aeo-xqtb5p}
}

Key Advantages

✅ Drop-in replacement to Behavior Cloning
✅ Non-adversarial online IL (successor to GAIL & AIRL)
✅ Simple to implement
✅ Performant with very sparse data (single expert demo)
✅ Scales to Complex Image Envs (SOTA on Atari and playing Minecraft)
✅ Recover rewards from envs

Usage

To install and use IQ-Learn, check the instructions provided in the iq_learn folder.

Imitation

Reaching human-level performance on Atari with pure imitation:

Rewards

Recovering environment rewards on GridWorld:

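The recovered reward comes directly from the learned soft Q-function via the inverse soft Bellman operator, r(s, a) = Q(s, a) - γ E_{s'}[V(s')]. Below is a minimal sketch of this recovery step (illustrative names and a discrete-action Q-network assumed; this is not the repo's visualization code).

import torch

def recover_reward(q_net, obs, act, next_obs, gamma=0.99):
    # Implied reward under the learned soft Q-function: r(s, a) = Q(s, a) - gamma * V(s')
    with torch.no_grad():
        q = q_net(obs).gather(1, act.long().unsqueeze(1)).squeeze(1)
        next_v = torch.logsumexp(q_net(next_obs), dim=1)  # soft value of the next state
    return q - gamma * next_v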

Questions

Please feel free to email us if you have any questions.

Div Garg ([email protected])

iq-learn's People

Contributors

div99


iq-learn's Issues

expert datasets

When I use the trajectories in iq_learn/experts/, the results are not optimal. Are these just demo data? It also seems that the expert datasets on Dropbox cannot be downloaded successfully.
Thanks for your help!

Issue on reproducing MuJoCo results

Hello! Could you provide the hyperparameters and the number of training steps for each MuJoCo env to reproduce the Table 5 results (Appendix D.2 in the original paper)?
I've tried the iq_learn/scripts/run_mujoco.sh script to train on Ant-v2 for ~300k steps with default hyperparameters and 10 expert trajectories, but only got eval returns around 3000-4000. The eval/episode_reward shows 3301.59521 and the best_returns is 4275.31665. Thank you!

Atari results are not reproducible

I can't reproduce the Atari results reported in your paper. I use your code directly with your hyperparameters and the same package versions, but I can't reproduce your results. The problem also holds for MuJoCo, as I explained in another issue. Can you please provide the correct hyperparameters to reproduce your results?

Code for gridworld experiments

Hi, thanks for making your code accessible! I was wondering if you could also share the code to reproduce the gridworld experiments, specifically Fig. 13 from your paper.

Issues on reproducing MuJoCo results

I tried to reproduce the results presented in the paper with a single expert demonstration on the MuJoCo tasks. However, I can't reach any score close to the ones reported. For instance, on Walker2d-v2, the maximum score I can achieve is around 3500. I used the scripts provided in iq_learn/scripts/run_mujoco.sh. Can you please share all the hyperparameters that you used in these experiments?

Issue on robosuite tasks

I've tried using IQ-Learn for tasks in robosuite, using their dataset as well as data I've collected myself, and IQ-Learn doesn't perform as well as BC and other offline RL methods inside robomimic. Can you please explain what could cause this? Thank you very much.

Pseudocode and questions

Hey, thanks for sharing this work! And I really appreciate the in-depth, beginner-friendly blog post! I was wondering if this pseudocode is

  1. Correct
  2. Helpful to anyone else trying to understand the code

If not, feel free to close. But I would appreciate it if you could help me understand a few parts of the code! Thanks!

Questions

  1. How come the environment reward env_reward is unused and the reward is entirely dependent on the output of the model? Does this algorithm only learn from the expert and never take the environment reward into account?
  2. Why is value_loss determined entirely from the model output? Wouldn't this cause the model to collapse?

Pseudocode

def init_network():
  q_net = torch.nn.Linear(state_size, action_size)
  target_net = deepcopy(q_net)
  
def episode_step():
  action = softmax(q_net(state))
  next_state, reward = env.step(action)
  memory.add((state, next_state, action, reward)) # memory = collections.deque
  update_critic(memory, expert_memory)
  target_net = deepcopy(q_net)
  
def update_critic(memory, expert_memory):
  # The idea here is that we backprop both the rewards for the expert's actions and the agent's actions
  # the batch dimension contains examples from the expert and the agent
  state = torch.cat((memory[:][0], expert_memory[:][0]))
  next_state = torch.cat((memory[:][1], expert_memory[:][1]))
  action = torch.cat((memory[:][2], expert_memory[:][2]))
  # v = soft value of the current state: logsumexp of Q over all actions
  v = torch.logsumexp(q_net(state), dim=1, keepdim=True)
  # next_v = soft value of the next state
  next_v = torch.logsumexp(q_net(next_state), dim=1, keepdim=True)
  # q = predicted Q-value for the taken (state, action) pair
  q = q_net(state).gather(1, action)
  loss = iq_loss(q, v, next_v)
  critic_optimizer.zero_grad()
  loss.backward()
  critic_optimizer.step()
  
def iq_loss(q, v, next_v):
  if done:
    # terminal transition: no bootstrapping, so the implied reward is just Q(s, a)
    expert_reward = q[where_expert]
    # Why is value_loss determined entirely from the model output? Wouldn't this cause the model to collapse? 
    value_loss = v.mean()
  else:
    # implied reward r = Q(s, a) - gamma * V(s')
    expert_reward = (q - gamma * next_v)[where_expert]
    value_loss = (v - gamma * next_v).mean()
  # Why is this negative?
  expert_reward_loss = -expert_reward.mean()
  loss = expert_reward_loss + value_loss
  return loss

Divergence Issue

Hi, I opened this issue regarding divergence problems. I think you already suggested some solutions (#11) to this problem, but I still see issues. I spent a couple of months trying to handle this but still face the same problem. For a better discussion, let me explain the training environment.

I am using your approach for a vision-based collision avoidance problem. If the agent reaches a goal point or encounters a collision, I restart the episode at a random position. My agent seems to learn well initially; however, the critic network then starts to diverge and is eventually destroyed completely. It seems like the critic network thinks that the current states are very good.

Here are my training settings:

  • initial alpha value: 1e-3
  • actor learning rate: 3e-5
  • critic learning rate: 3e-4
  • CNN layers for actor and critic: DQN-style layers
  • Actor network uses a Squashed Normal
  • phi value (the value used in lines 26-44 of iq.py): 1
  • SAC with a single critic, without updating alpha

I have tried different initial alpha values and found that using a higher initial alpha gives more stability to the network but results in poor performance (the agent's behavior is far from the expert behavior). Am I using it the wrong way, or does it need more hyperparameter tuning?

I attach loss figures below for a clearer picture. Waiting for your response. Many thanks.
(Figures attached: rewards, regularized_loss, q_values, actor_loss)

This is my iq.py code that updates the critic.

    def q_update(self, current_Q, current_v, next_v, done_masks, is_expert):
        with torch.no_grad():
            y = (1 - done_masks) * self.args.gamma * next_v
        reward = (current_Q - y)[is_expert]
        # 1st loss function
        loss = -(reward).mean()
        # 2nd loss function
        value_loss = (current_v - y).mean()
        loss += value_loss

        # Use χ2 divergence (calculate the regularization term for the IQ loss using expert and policy states) (works online)
        reward = current_Q - y
        chi2_loss = 1 / (4 * 0.5) * (reward ** 2).mean()
        loss += chi2_loss
        return loss

How to judge the convergence

Hello!

Thanks so much for sharing the code!

I am new to inverse reinforcement learning. I am now trying to apply the code to a customized environment without knowing anything about the reward function. Are there any metrics, other than rewards, that can be used to judge convergence?

Thanks ;).

Question regarding iq_loss implementation

Hi!
First, thanks for sharing the codebase of this great work.
I have a question regarding the implementation of iq_loss function which resides in the "iq.py" file.

In lines 26-43, I don't quite understand why you used the gradient of \phi instead of directly following Equation (9) of the IQ-Learn paper with \phi from Table 4. I thought we could represent the loss function exactly as written, because PyTorch will automatically calculate gradients via autograd. I wonder what I'm missing here.
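
For context, here is my paraphrase of the objective in question (the paper's Equation (9), up to notation, with \phi a concave function from Table 4):

$$
\max_{Q}\; \mathcal{J}(\pi, Q) = \mathbb{E}_{(s,a)\sim \rho_E}\Big[\phi\big(Q(s,a) - \gamma\, \mathbb{E}_{s'\sim \mathcal{P}(\cdot\mid s,a)}\, V^{\pi}(s')\big)\Big] - (1-\gamma)\,\mathbb{E}_{s_0 \sim \rho_0}\big[V^{\pi}(s_0)\big],
\qquad
V^{\pi}(s) = \mathbb{E}_{a\sim\pi(\cdot\mid s)}\big[Q(s,a) - \log \pi(a\mid s)\big].
$$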

Issue on reproducing pointmaze experiments

Hi, thanks for sharing your work.

Currently I'm trying to reproduce the results in the pointmaze environment. I am wondering why there is a negation in the visualize_reward function in vis/maze_vis.py (line 144).

Also, I would like to know whether the only_expert_state option works in the pointmaze environment. If so, is there a suitable set of hyperparameters for it? Thank you!

Critic function is diverging while using SAC

Hi, thank you for providing this wonderful code. I am trying to adopt the IQ method in my custom environment. However, I am facing a diverging critic loss. I tried copying and pasting the original code from GitHub, but this keeps happening. Is this normal when the IQ imitation learning method is combined with SAC, or am I using it the wrong way? I uploaded my code with this post, along with my loss curve.
(Figure attached: loss_function)

class IQ(nn.Module):
    def __init__(self, args):
        super(IQ, self).__init__()

        self.args = args
        
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")        
        self.actor = Actor(self.args).to(self.device)

        self.q = Critic(self.args).to(self.device)
        # self.q_2 = Critic(self.args).to(self.device)

        self.target_q = Critic(self.args).to(self.device)
        # self.target_q_2 = Critic(self.args).to(self.device)

        self.soft_update(self.q, self.target_q, 1.)
        # self.soft_update(self.q_2, self.target_q_2, 1.)
    
        # self.alpha = nn.Parameter(torch.tensor(self.args.alpha_init))
        self.log_alpha = nn.Parameter(torch.log(torch.tensor(1e-3)))
        self.target_entropy = - torch.tensor(self.args.p_len * self.args.state_dim)

        self.q_optimizer = optim.Adam(self.q.parameters(), lr=self.args.q_lr)
        # self.q_2_optimizer = optim.Adam(self.q_2.parameters(), lr=self.args.q_lr)
        
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=self.args.actor_lr)
        self.alpha_optimizer = optim.Adam([self.log_alpha], lr=self.args.q_lr)

       # check directory
        isExist = os.path.exists(self.args.pretrain_model_dir)
        if not isExist:
            os.mkdir(self.args.pretrain_model_dir)

    @property
    def alpha(self):
        return self.log_alpha.exp()

    def get_action(self, depth, imu, dir_vector):
        # normalization
        depth, imu, dir_vector = self.normalization(depth, imu, dir_vector)
        mu,std = self.actor(depth, imu, dir_vector)
        std = std.exp()
        dist = Normal(mu, std)
        u = dist.rsample()
        u_log_prob = dist.log_prob(u)
        a = torch.tanh(u)
        a_log_prob = u_log_prob - torch.log(1 - torch.square(a) +1e-3)
        return a, a_log_prob.sum(-1, keepdim=True)

    def q_update(self, current_Q, current_v, next_v, done_masks, is_expert):
        #  calculate 1st term for IQ loss
        #  -E_(ρ_expert)[Q(s, a) - γV(s')]
        with torch.no_grad():
            y = (1 - done_masks) * self.args.gamma * next_v
        reward = (current_Q - y)[is_expert]
        
        # our proposed unbiased form for fixing kl divergence
        # 1st loss function     
        phi_grad = torch.exp(-reward)   
        loss = -(phi_grad * reward).mean()

        ######
        # sample using expert and policy states (works online)
        # E_(ρ)[V(s) - γV(s')], 2nd loss function
        value_loss = (current_v - y).mean()
        loss += value_loss        

        # Use χ2 divergence (calculate the regularization term for IQ loss using expert and policy states) (works online)
        reward = current_Q - y         

        # alpha is fixed at 0.5 here instead of using self.alpha
        # chi2_loss = 1/(4 * self.alpha) * (reward**2).mean()
        chi2_loss = 1/(4 * 0.5) * (reward**2).mean()
        loss += chi2_loss
        ######
        return loss

    def train_network(self, writer, n_epi, train_memory):
        print("SAC UPDATE")
        depth, imu, dir_vector, actions, rewards, next_depth, next_imu, next_dir_vector, done_masks, is_expert = \
            self.get_samples(train_memory)
        q1, q2 = self.q(depth, imu, dir_vector, actions)    
        v1, v2 = self.getV(self.q, depth, imu, dir_vector)        
        
        with torch.no_grad():
            next_v1, next_v2 = self.get_targetV(self.target_q, next_depth, next_imu, next_dir_vector)            
        
        #q_update
        q1_loss = self.q_update(q1, v1, next_v1, done_masks, is_expert)        
        q2_loss = self.q_update(q2, v2, next_v2, done_masks, is_expert)        

        # define critic loss
        critic_loss = 1/2 * (q1_loss + q2_loss)
        # update
        self.q_optimizer.zero_grad()
        critic_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.q.parameters(), 1.0)
        # step critic
        self.q_optimizer.step()

        ### actor update
        actor_loss,prob = self.actor_update(depth, imu, dir_vector)        
        ###alpha update
        # alpha_loss = self.alpha_update(prob)
        
        self.soft_update(self.q, self.target_q, self.args.soft_update_rate)
        # self.soft_update(self.q_2, self.target_q_2, self.args.soft_update_rate)  # second critic is disabled above
        
        if writer != None:
            writer.add_scalar("loss/q_1", q1_loss, n_epi)
            writer.add_scalar("loss/q_2", q2_loss, n_epi)
            writer.add_scalar("loss/actor_loss", actor_loss, n_epi)
            writer.add_scalar("loss/alpha", alpha_loss, n_epi)
                # save model
        if np.mod(n_epi, self.args.save_period)==0 and n_epi > 0:
            # save models
            torch.save(self.actor.state_dict(), self.args.pretrain_model_dir + str('actor.pt'))            

    def actor_update(self, depth, imu, dir_vector):
        now_actions, now_action_log_prob = self.get_action(depth, imu, dir_vector)
        q_1, q_2 = self.q(depth, imu, dir_vector, now_actions)        
        q = torch.min(q_1, q_2)
        loss = (self.alpha.detach() * now_action_log_prob - q).mean()

        self.actor_optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.actor.parameters(), 1.0)
        self.actor_optimizer.step()
        return loss,now_action_log_prob
    
    def alpha_update(self, now_action_log_prob):
        loss = (- self.alpha * (now_action_log_prob + self.target_entropy).detach()).mean()
        self.alpha_optimizer.zero_grad()    
        loss.backward()
        self.alpha_optimizer.step()
        return loss
    
    def soft_update(self, network, target_network, rate):
        for network_params, target_network_params in zip(network.parameters(), target_network.parameters()):
            target_network_params.data.copy_(target_network_params.data * (1.0 - rate) + network_params.data * rate)
    
    def get_expert_data(self):        
        # define train and validation dataset        
        self.expert_dataloader = DataLoader(True, self.args)     
        # load expert and training dataset
        expert_depth, expert_imu, expert_dir_vector, expert_action, reward, done, expert_next_depth, expert_next_imu, expert_next_dir_vector \
            = self.expert_dataloader.__getitem__(batch_size=self.args.discrim_batch_size)                                                
        # prepocessing training label        
        expert_action = self.label_preprocessing(expert_action)
        
        # convert numpy array into tensor
        expert_depth = torch.Tensor(expert_depth).cuda()
        expert_imu = torch.Tensor(expert_imu).cuda()
        expert_dir_vector = torch.Tensor(expert_dir_vector).cuda()
        expert_action = torch.Tensor(expert_action).cuda()
        
        expert_next_depth = torch.Tensor(expert_next_depth).cuda()
        expert_next_imu = torch.Tensor(expert_next_imu).cuda()
        expert_next_dir_vector = torch.Tensor(expert_next_dir_vector).cuda()

        return expert_depth, expert_imu, expert_dir_vector, expert_action, expert_next_depth, expert_next_imu, expert_next_dir_vector

    def getV(self, critic, depth, imu, dir_vector):
        action, log_prob = self.get_action(depth, imu, dir_vector)        
        current_Q1, current_Q2 = critic(depth, imu, dir_vector, action)
        current_V1 = current_Q1 - self.alpha.detach() * log_prob
        current_V2 = current_Q2 - self.alpha.detach() * log_prob
        return current_V1, current_V2

    def get_targetV(self, critic_target, depth, imu, dir_vector):
        action, log_prob = self.get_action(depth, imu, dir_vector)
        target_Q1, target_Q2 = critic_target(depth, imu, dir_vector, action)
        target_V1 = target_Q1 - self.alpha.detach() * log_prob
        target_V2 = target_Q2 - self.alpha.detach() * log_prob
        return target_V1, target_V2

Config for expert generation

Hi,

What would you suggest as a basic config setup for generating expert demonstrations on a custom environment?

Thanks,

Issue on Ant-v2 expert data and Humanoid-v2 random seed experiments

Hi! Thank you very much for sharing your paper and source code! I am new to inverse RL and recently I have wanted to implement your method on a robot.
About Ant-v2

  1. I found that the reward for each step in your Ant-v2 expert data is 1. Why is the reward set like this? And how do I run SQIL correctly with your code?

About random seeds

  1. I found that the results with different random seeds in the Humanoid experiments vary a lot; some results are around 1500 points. Is this because the number of learning steps is only 50000, or because there is only 1 expert demo?

I ran this: python train_iq.py env=humanoid agent=sac expert.demos=1 method.loss=v0 method.regularize=True agent.actor_lr=3e-05 seed=0/1/2/3/4/5 agent.init_temp=1
(Figure attached: results across seeds)
Your work is very valuable and I look forward to your help in resolving my doubts.

Issue on reproducing MuJoCo results - HalfCheetah-v2

Dear authors, it's an honor to see your paper and code! I am a novice in this area and I am now trying to reproduce your experiments, but I have encountered some obstacles. On HalfCheetah, I don't get the 5076.6 points reported in the paper; my reward is even less than 0 in most cases, and the code is not modified. Is the reason the hyperparameter settings? If so, could you share your hyperparameter settings? Thanks for sharing!

Offline learning without access to the environment

Hello,
I started working with the provided implementation recently; thank you for sharing it. I just wanted to know why we require access to the environment for offline training. Do we need to make changes to the code if we have expert data but no access to the environment during training?

Getting missing args error running train_iq.py examples from run_offline.sh

Hi!

I've been trying to get some of the examples going, but I keep getting an issue when instantiating the models. The error below is from running the CartPole-v1 example from run_offline.sh.

Error in call to target 'agent.softq_models.OfflineQNetwork':
TypeError("init() missing 1 required positional argument: 'args'")
full_key: q_net.args.agent.critic_cfg

As I get the same error when trying several models and envs, I assume it's something I've overlooked. I'm not getting any errors about dependencies.
