div99 / iq-learn



(NeurIPS '21 Spotlight) IQ-Learn: Inverse Q-Learning for Imitation

Home Page: https://div99.github.io/IQ-Learn/

License: Other

Python 97.96% Shell 2.04%

imitation-learning inverse-reinforcement-learning reinforcement-learning

iq-learn's Issues

Pseudocode and questions

Hey, thanks for sharing this work! I really appreciate the in-depth, beginner-friendly blog post. I was wondering whether this pseudocode is:

Correct
Helpful to anyone else trying to understand the code

If not, feel free to close. But I would appreciate it if you could help me understand a few parts of the code. Thanks!

Questions

How come the environment reward env_reward is unused and reward is entirely dependent on the output of the model? Does this algorithm only learn the expert and never take into account environment reward?
Why is value_loss determined entirely from the model output? Wouldn't this cause the model to collapse?

Pseudocode

def init_network():
  q_net = torch.nn.Linear(state_size, action_size)
  target_net = deepcopy(q_net)
  
def episode_step():
  action = softmax(q_net(state))
  next_state, reward = env.step(action)
  memory.add((state, next_state, action, reward)) # memory = collections.deque
  update_critic(memory, expert_memory)
  target_net = deepcopy(q_net)
  
def update_critic(memory, expert_memory):
  # The idea here is that we backprop both the rewards for the expert's actions and the agent's actions
  # the batch dimension contains examples from the expert and the agent
  state = torch.cat((memory[:][0], expert_memory[:][0]))
  next_state = torch.cat((memory[:][1], expert_memory[:][1]))
  action = torch.cat((memory[:][2], expert_memory[:][2]))
  # v = sum of future rewards for all possible actions given current state
  v = torch.logsumexp(q_net(state), dim=1, keepdim=True)
  # next_v = sum of future rewards for all possible actions given state(t+1)
  next_v = torch.logsumexp(q_net(next_state), dim=1, keepdim=True)
  # q = sum of future rewards predicted given current state, action pair
  q = q_net(state).gather(1, action)  # pick Q(s, a) along the action dimension
  loss = iq_loss(q, v, next_v)
  critic_optimizer.zero_grad()
  loss.backward()
  critic_optimizer.step()
  
def iq_loss(q, v, next_v):
  if done:
    expert_reward = q[where_expert]
    # Why is value_loss determined entirely from the model output? Wouldn't this cause the model to collapse? 
    value_loss = v.mean()
  else:
    expert_reward = (q - next_v)[where_expert]
    value_loss = (v - next_v).mean()
  # Why is this negative?
  expert_reward_loss = -expert_reward.mean()
  loss = expert_reward_loss + value_loss
  return loss
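
The pseudocode above can be mirrored in plain Python for a toy batch. This is a minimal sketch, not the repository's implementation: the dictionary keys and the `gamma` discount are assumptions (the pseudocode omits a discount; the SAC snippet in a later issue uses one). It shows that the two loss terms depend only on Q-values: the quantity the pseudocode calls the expert reward is Q(s, a) minus the discounted soft value of the next state, and the environment reward never appears.

```python
import math

def logsumexp(values):
    # Numerically stable log(sum(exp(v))) over a list of floats.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def iq_loss(batch, gamma=0.99):
    """batch: list of dicts with hypothetical keys
    q_s (Q-values at s), q_next (Q-values at s'), action, done, is_expert."""
    expert_rewards, value_terms = [], []
    for t in batch:
        v = logsumexp(t["q_s"])  # soft value V(s)
        next_v = 0.0 if t["done"] else gamma * logsumexp(t["q_next"])
        if t["is_expert"]:
            # Reward implied by Q on expert transitions: Q(s, a) - gamma * V(s')
            expert_rewards.append(t["q_s"][t["action"]] - next_v)
        value_terms.append(v - next_v)
    expert_term = -sum(expert_rewards) / len(expert_rewards)
    value_term = sum(value_terms) / len(value_terms)
    return expert_term + value_term

toy_batch = [
    {"q_s": [0.0, 0.0], "q_next": [0.0, 0.0], "action": 0, "done": False, "is_expert": True},
]
print(iq_loss(toy_batch, gamma=1.0))
```

The first term is negated because minimizing it raises the implied reward on expert transitions, while the second term pushes the soft values down over all sampled states.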

Question regarding iq_loss implementation

Hi!
First, thanks for sharing the codebase of this great work.
I have a question regarding the implementation of iq_loss function which resides in the "iq.py" file.

In lines 26-43, I don't quite understand why you used the gradient of \phi instead of directly following Equation (9) of the IQ-Learn paper with \phi from Table 4. I thought we could write the loss function exactly as in the equation, since PyTorch computes gradients automatically via autograd. I wonder what I'm missing here.
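
One relevant fact, offered as a possible reading rather than the authors' answer: by the chain rule, multiplying a gradient-free copy of \phi'(r) by r gives the same parameter gradient as differentiating \phi(r) directly. The sketch below checks this with finite differences in plain Python. The choice of \phi and the one-parameter reward are hypothetical and are not taken from the paper's Table 4.

```python
import math

def phi(r):
    # Hypothetical concave phi chosen so that phi'(r) = exp(-r).
    return -math.exp(-r)

def phi_grad(r):
    return math.exp(-r)

def reward(theta):
    # Toy "network": the reward is linear in a single parameter.
    return 2.0 * theta

def numeric_grad(f, x, eps=1e-6):
    # Central finite difference.
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

theta0 = 0.3
weight = phi_grad(reward(theta0))  # evaluated once and held constant (no gradient)

grad_direct = numeric_grad(lambda t: -phi(reward(t)), theta0)
grad_surrogate = numeric_grad(lambda t: -weight * reward(t), theta0)
print(grad_direct, grad_surrogate)
```

The two loss values differ, but their gradients at `theta0` agree, so an optimizer step would be the same for either form.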

Offline Learning without access to environment.

Hello,
I recently started working with the provided implementation; thank you for sharing it. I wanted to know why access to env is required for offline training. Do we need to change the code if we have expert data but no access to env during training?

Getting missing args error running train_iq.py examples from run_offline.sh

Hi!

I've been trying to get some of the examples going, but I keep getting an issue with initiating the models. The error below is from running CartPole_v1 from run_offline.sh.

Error in call to target 'agent.softq_models.OfflineQNetwork':
TypeError("__init__() missing 1 required positional argument: 'args'")
full_key: q_net.args.agent.critic_cfg

As I get the same error with several models and envs, I assume it's something I've overlooked. I'm not getting any errors about dependencies.
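
The message itself says that the network constructor was called without its `args` parameter. The sketch below reproduces that kind of TypeError with a hypothetical stand-in class (it is not the repository's `agent.softq_models.OfflineQNetwork`, and the parameter names other than `args` are assumptions). It does not identify why the config in the report failed to supply the argument.

```python
class OfflineQNetworkSketch:
    """Hypothetical stand-in for a network whose constructor requires `args`."""

    def __init__(self, obs_dim, action_dim, args):
        self.obs_dim = obs_dim
        self.action_dim = action_dim
        self.args = args

def try_build(**kwargs):
    # Return "ok" on success, otherwise the TypeError message.
    try:
        OfflineQNetworkSketch(**kwargs)
        return "ok"
    except TypeError as exc:
        return str(exc)

missing = try_build(obs_dim=4, action_dim=2)                      # `args` omitted
fixed = try_build(obs_dim=4, action_dim=2, args={"gamma": 0.99})  # `args` supplied
print(missing)
print(fixed)
```

If the reported error has the same cause, the thing to check would be how `args` reaches the target listed under `full_key`.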

Critic function is diverging while using SAC

Hi, thank you for providing this wonderful code. I am trying to apply the IQ method to my custom environment, but the critic loss diverges. I tried copying the original code from GitHub, but it keeps happening. Is this normal when IQ imitation learning is combined with SAC, or am I using it incorrectly? I have included my code, along with my loss function, in this post.

class IQ(nn.Module):
    def __init__(self, args):
        super(IQ, self).__init__()

        self.args = args
        
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")        
        self.actor = Actor(self.args).to(self.device)

        self.q = Critic(self.args).to(self.device)
        # self.q_2 = Critic(self.args).to(self.device)

        self.target_q = Critic(self.args).to(self.device)
        # self.target_q_2 = Critic(self.args).to(self.device)

        self.soft_update(self.q, self.target_q, 1.)
        # self.soft_update(self.q_2, self.target_q_2, 1.)
    
        # self.alpha = nn.Parameter(torch.tensor(self.args.alpha_init))
        self.log_alpha = nn.Parameter(torch.log(torch.tensor(1e-3)))
        self.target_entropy = - torch.tensor(self.args.p_len * self.args.state_dim)

        self.q_optimizer = optim.Adam(self.q.parameters(), lr=self.args.q_lr)
        # self.q_2_optimizer = optim.Adam(self.q_2.parameters(), lr=self.args.q_lr)
        
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=self.args.actor_lr)
        self.alpha_optimizer = optim.Adam([self.log_alpha], lr=self.args.q_lr)

       # check directory
        isExist = os.path.exists(self.args.pretrain_model_dir)
        if not isExist:
            os.mkdir(self.args.pretrain_model_dir)

    @property
    def alpha(self):
        return self.log_alpha.exp()

    def get_action(self, depth, imu, dir_vector):
        # normalization
        depth, imu, dir_vector = self.normalization(depth, imu, dir_vector)
        mu,std = self.actor(depth, imu, dir_vector)
        std = std.exp()
        dist = Normal(mu, std)
        u = dist.rsample()
        u_log_prob = dist.log_prob(u)
        a = torch.tanh(u)
        a_log_prob = u_log_prob - torch.log(1 - torch.square(a) +1e-3)
        return a, a_log_prob.sum(-1, keepdim=True)

    def q_update(self, current_Q, current_v, next_v, done_masks, is_expert):
        #  calculate 1st term for IQ loss
        #  -E_(ρ_expert)[Q(s, a) - γV(s')]
        with torch.no_grad():
            y = (1 - done_masks) * self.args.gamma * next_v
        reward = (current_Q - y)[is_expert]
        
        # our proposed unbiased form for fixing kl divergence
        # 1st loss function     
        phi_grad = torch.exp(-reward)   
        loss = -(phi_grad * reward).mean()

        ######
        # sample using expert and policy states (works online)
        # E_(ρ)[V(s) - γV(s')], 2nd loss function
        value_loss = (current_v - y).mean()
        loss += value_loss        

        # Use χ2 divergence (calculate the regularization term for IQ loss using expert and policy states) (works online)
        reward = current_Q - y         

        # the alpha value is fixed at 0.5
        # chi2_loss = 1/(4 * self.alpha) * (reward**2).mean()
        chi2_loss = 1/(4 * 0.5) * (reward**2).mean()
        loss += chi2_loss
        ######
        return loss

    def train_network(self, writer, n_epi, train_memory):
        print("SAC UPDATE")
        depth, imu, dir_vector, actions, rewards, next_depth, next_imu, next_dir_vector, done_masks, is_expert = \
            self.get_samples(train_memory)
        q1, q2 = self.q(depth, imu, dir_vector, actions)    
        v1, v2 = self.getV(self.q, depth, imu, dir_vector)        
        
        with torch.no_grad():
            next_v1, next_v2 = self.get_targetV(self.target_q, next_depth, next_imu, next_dir_vector)            
        
        #q_update
        q1_loss = self.q_update(q1, v1, next_v1, done_masks, is_expert)        
        q2_loss = self.q_update(q2, v2, next_v2, done_masks, is_expert)        

        # define critic loss
        critic_loss = 1/2 * (q1_loss + q2_loss)
        # update
        self.q_optimizer.zero_grad()
        critic_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.q.parameters(), 1.0)
        # step critic
        self.q_optimizer.step()

        ### actor update
        actor_loss,prob = self.actor_update(depth, imu, dir_vector)        
        ###alpha update
        # alpha_loss = self.alpha_update(prob)
        
        self.soft_update(self.q, self.target_q, self.args.soft_update_rate)
        # self.soft_update(self.q_2, self.target_q_2, self.args.soft_update_rate)
        
        if writer is not None:
            writer.add_scalar("loss/q_1", q1_loss, n_epi)
            writer.add_scalar("loss/q_2", q2_loss, n_epi)
            writer.add_scalar("loss/actor_loss", actor_loss, n_epi)
            # alpha_loss is undefined while the alpha update above is commented out
            # writer.add_scalar("loss/alpha", alpha_loss, n_epi)
                # save model
        if np.mod(n_epi, self.args.save_period)==0 and n_epi > 0:
            # save models
            torch.save(self.actor.state_dict(), self.args.pretrain_model_dir + str('actor.pt'))            

    def actor_update(self, depth, imu, dir_vector):
        now_actions, now_action_log_prob = self.get_action(depth, imu, dir_vector)
        q_1, q_2 = self.q(depth, imu, dir_vector, now_actions)        
        q = torch.min(q_1, q_2)
        loss = (self.alpha.detach() * now_action_log_prob - q).mean()

        self.actor_optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.actor.parameters(), 1.0)
        self.actor_optimizer.step()
        return loss,now_action_log_prob
    
    def alpha_update(self, now_action_log_prob):
        loss = (- self.alpha * (now_action_log_prob + self.target_entropy).detach()).mean()
        self.alpha_optimizer.zero_grad()    
        loss.backward()
        self.alpha_optimizer.step()
        return loss
    
    def soft_update(self, network, target_network, rate):
        for network_params, target_network_params in zip(network.parameters(), target_network.parameters()):
            target_network_params.data.copy_(target_network_params.data * (1.0 - rate) + network_params.data * rate)
    
    def get_expert_data(self):        
        # define train and validation dataset        
        self.expert_dataloader = DataLoader(True, self.args)     
        # load expert and training dataset
        expert_depth, expert_imu, expert_dir_vector, expert_action, reward, done, expert_next_depth, expert_next_imu, expert_next_dir_vector \
            = self.expert_dataloader.__getitem__(batch_size=self.args.discrim_batch_size)                                                
        # prepocessing training label        
        expert_action = self.label_preprocessing(expert_action)
        
        # convert numpy array into tensor
        expert_depth = torch.Tensor(expert_depth).cuda()
        expert_imu = torch.Tensor(expert_imu).cuda()
        expert_dir_vector = torch.Tensor(expert_dir_vector).cuda()
        expert_action = torch.Tensor(expert_action).cuda()
        
        expert_next_depth = torch.Tensor(expert_next_depth).cuda()
        expert_next_imu = torch.Tensor(expert_next_imu).cuda()
        expert_next_dir_vector = torch.Tensor(expert_next_dir_vector).cuda()

        return expert_depth, expert_imu, expert_dir_vector, expert_action, expert_next_depth, expert_next_imu, expert_next_dir_vector

    def getV(self, critic, depth, imu, dir_vector):
        action, log_prob = self.get_action(depth, imu, dir_vector)        
        current_Q1, current_Q2 = critic(depth, imu, dir_vector, action)
        current_V1 = current_Q1 - self.alpha.detach() * log_prob
        current_V2 = current_Q2 - self.alpha.detach() * log_prob
        return current_V1, current_V2

    def get_targetV(self, critic_target, depth, imu, dir_vector):
        action, log_prob = self.get_action(depth, imu, dir_vector)
        target_Q1, target_Q2 = critic_target(depth, imu, dir_vector, action)
        target_V1 = target_Q1 - self.alpha.detach() * log_prob
        target_V2 = target_Q2 - self.alpha.detach() * log_prob
        return target_V1, target_V2
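
The `soft_update` method in the code above is a Polyak average of parameters. A minimal plain-Python sketch of the same rule on lists of floats (the function below is illustrative and works on numbers rather than network parameters):

```python
def soft_update(params, target_params, rate):
    # target <- (1 - rate) * target + rate * param, element by element
    return [(1.0 - rate) * t + rate * p for p, t in zip(params, target_params)]

net = [1.0, 2.0]
target = [0.0, 0.0]

hard_copy = soft_update(net, target, 1.0)     # rate 1.0 copies the network exactly
slow_track = soft_update(net, target, 0.005)  # a small rate moves the target slightly
print(hard_copy, slow_track)
```

With a rate of 1.0, as used once in `__init__`, the target becomes an exact copy of the critic. With a small rate, the target trails the critic slowly.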

expert datasets

When I use the trajectories in iq_learn/experts/, the results are not optimal. Are these just demo data? It also seems that the expert datasets on Dropbox cannot be downloaded successfully.
Thanks for your help!

How to judge the convergence

Hello!

Thanks so much for sharing the code!

I am new to inverse reinforcement learning. I am trying to apply the code to a customized environment whose reward function is unknown. Are there any metrics other than rewards that can be used to judge convergence?

Thanks ;).

Code for gridworld experiments

Hi, thanks for making your code accessible! I was wondering whether you could also share the code to reproduce the gridworld experiments, specifically Fig 13 from your paper.

Issue on Ant-v2 expert data and Humanoid-v2 random seed experiments

Hi, thank you very much for sharing your paper and source code! I am new to inverse RL and want to implement your method on a robot.

About Ant-v2

I found that the reward for every step in your Ant-v2 expert data is 1. Why is the reward set like this? And how do I run SQIL correctly with your code?

About random seeds

I found that results in the Humanoid experiments vary widely across random seeds; some runs score around 1500 points. Is this because the number of learning steps is only 50000, or because only one expert demonstration is used?

I ran the following command (one run per seed): python train_iq.py env=humanoid agent=sac expert.demos=1 method.loss=v0 method.regularize=True agent.actor_lr=3e-05 seed=0/1/2/3/4/5 agent.init_temp=1
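
Assuming the `seed=0/1/2/3/4/5` shorthand means one separate run per seed, the per-seed command lines can be generated as in the sketch below. The overrides are copied from the command above; the helper name is hypothetical, and the commands are only printed, not executed.

```python
BASE_COMMAND = [
    "python", "train_iq.py",
    "env=humanoid", "agent=sac", "expert.demos=1",
    "method.loss=v0", "method.regularize=True",
    "agent.actor_lr=3e-05", "agent.init_temp=1",
]

def commands_for_seeds(seeds):
    # One full command line per seed, with the seed override appended last.
    return [BASE_COMMAND + [f"seed={seed}"] for seed in seeds]

commands = commands_for_seeds(range(6))
for command in commands:
    print(" ".join(command))
```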

Your work is very valuable and I look forward to your help in solving my doubts.

Issues on reproducing MuJoCo results

I tried to reproduce your results presented in the paper with single expert demonstration in MuJoCo tasks. However, I can't reach any score close to the ones reported. For instance, in Walker2d-v2, the maximum score that I can achieve is around 3500. I used scripts provided in iq_learn/scripts/run_mujoco.sh. Can you please share all hyper parameters that you used in these experiments?

Poor performance on robosuite tasks

I've tried using IQ-Learn for tasks in robosuite, using their dataset as well as data I collected myself, and IQ-Learn doesn't perform as well as BC and the other offline RL methods in robomimic. Can you please explain what the reasons might be? Thank you very much.

Atari results are not reproducible

I can't reproduce the Atari results reported in your paper. I directly use your code with your hyper parameters, with same package versions. However, I can't reproduce your results. The problem also holds for MuJoCo as I explained in another issue. Can you please provide correct hyperparameters to reproduce your results?

Config for expert generation

Hi,

What would you suggest as a basic config setup for generating expert demonstrations on a custom environment?

Thanks,

Divergence Issue

Hi, I opened this issue regarding divergence problems. You already suggested some solutions to this problem (#11), but some problems remain. I have spent a couple of months on it and still face the same issue. For a better discussion, I should explain the training environment.

I am using your approach for a vision-based collision avoidance problem. If the agent reaches a goal point or collides, I restart the episode at a random position. The agent seems to learn well initially; however, the critic network then starts to diverge, and by the end the network is completely destroyed. It seems that the critic network thinks the current states are very good.

Here are my training settings:

initial alpha value: 1e-3
actor learning rate: 3e-5
critic learning rate: 3e-4
CNN layers for actor and critic: DQN structure layers
Actor network uses Squashed Normal
phi value (the value in lines 26-44 of iq.py): 1
SAC with single critic without updating alpha

I have tried different initial alpha values and found that a higher initial alpha makes the network more stable but gives poor performance (the agent's behavior is far from the expert's). Am I using the method incorrectly, or does it need more hyperparameter tuning?

I attach loss figures for clarity. Waiting for your response. Many thanks.

This is my iq.py to update critic.

   def q_update(self, current_Q, current_v, next_v, done_masks, is_expert):
        with torch.no_grad():
            y = (1 - done_masks) * self.args.gamma * next_v
        reward = (current_Q - y)[is_expert]        
        # 1st loss function             
        loss = -(reward).mean()
        # 2nd loss function
        value_loss = (current_v - y).mean()
        loss += value_loss        

        # Use χ2 divergence (calculate the regularization term for IQ loss using expert and policy states) (works online)
        reward = current_Q - y                 
        # chi2_loss is computed here but is not added to loss before the return
        chi2_loss = 1/(4 * 0.5) * (reward**2).mean()
        return loss
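
The snippet above computes three quantities: the expert term, the value term, and the χ2 regularizer. The plain-Python sketch below returns them separately for a toy batch so that each can be logged or summed on its own. It works on lists of floats rather than tensors, its function name is hypothetical, and `alpha=0.5` mirrors the fixed value in the snippet. Whether the regularizer should be added to the returned loss is for the authors to confirm; in the snippet as posted, `chi2_loss` is computed but `loss` is returned without it.

```python
def q_update_terms(current_q, current_v, next_v, done_masks, is_expert,
                   gamma=0.99, alpha=0.5):
    # y = (1 - done) * gamma * V(s')
    y = [(1.0 - d) * gamma * nv for d, nv in zip(done_masks, next_v)]
    rewards = [q - yi for q, yi in zip(current_q, y)]

    # 1st term: -E_expert[Q(s, a) - gamma * V(s')]
    expert_rewards = [r for r, e in zip(rewards, is_expert) if e]
    expert_term = -sum(expert_rewards) / len(expert_rewards)

    # 2nd term: E[V(s) - gamma * V(s')] over expert and policy samples
    value_term = sum(v - yi for v, yi in zip(current_v, y)) / len(y)

    # chi2 regularizer: 1 / (4 * alpha) * E[reward^2] over all samples
    chi2_term = sum(r * r for r in rewards) / len(rewards) / (4.0 * alpha)
    return expert_term, value_term, chi2_term

terms = q_update_terms(
    current_q=[1.0, 0.0], current_v=[2.0, 1.0], next_v=[1.0, 1.0],
    done_masks=[0.0, 1.0], is_expert=[True, False], gamma=0.5,
)
print(terms, sum(terms))
```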

Issue on reproduce MuJoCo results-HalfCheetah-v2

Dear Author, it's an honor to see your paper and code! I am a novice in this area and am trying to reproduce your experimental results, but I have run into some obstacles. In HalfCheetah, I don't get the 5076.6 points reported in the paper; my reward is even below 0 in most cases. I have not modified the code. Is the hyperparameter setting the reason? If so, could you share your hyperparameter settings? Thanks for sharing!

Issue on reproduce MuJoCo results

Hello! Could you provide the hyperparameters and the number of training steps for each MuJoCo env needed to reproduce the Table 5 results (Appendix D.2 in the original paper)?
I tried the iq_learn/scripts/run_mujoco.sh script to train on Ant-v2 for ~300k steps with default hyperparameters and 10 expert trajectories, but only got eval returns of around 3000~4000. The eval/episode_reward shows 3301.59521 and the best_returns is 4275.31665. Thank you!

Can you provide expert demos for the CarRacing-v1 environment?

Can you add expert demos and support for the CarRacing-v1 environment?

Issue on reproducing pointmaze experiments

Hi, thanks for sharing your work.

I'm currently trying to reproduce the results in the pointmaze environment. I am wondering why there is a negation in the visualize_reward function in vis/maze_vis.py (line 144).

Also, I would like to know whether the only_expert_state option works in the pointmaze environment. If so, is there a suitable set of hyperparameters for the pointmaze environment? Thank you!
