Comments (17)
Hi, I observed similar behaviors.
In your code you set
reward = (current_Q - y)[is_expert]
and compute the chi2 regularization only for expert "reward" as
chi2_loss = 1/(4 * 0.5) * (reward**2).mean()
which in my experience leads to divergence. The reason is that these "rewards" are in fact very large. If you look at the iq.py file, you see that the authors compute the chi2 regularization on both the policy's and the expert's "reward". With that change I no longer have the divergence problem, but I am still not able to get good policies.
Another thing to point out is that the authors do not update alpha.
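A minimal sketch of that fix, assuming NumPy arrays and a hypothetical `chi2_regularized_loss` helper (`alpha=0.5` matches the `1/(4 * 0.5)` coefficient above):

```python
import numpy as np

def chi2_regularized_loss(current_Q, y, is_expert, alpha=0.5):
    # implicit IQ-Learn reward: r = Q(s, a) - y
    reward = current_Q - y
    # first term: maximize the reward on expert transitions only
    loss = -reward[is_expert].mean()
    # chi^2 regularizer over BOTH expert and policy samples
    # (not just reward[is_expert]), which keeps all rewards small
    loss += 1.0 / (4 * alpha) * (reward ** 2).mean()
    return loss
```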
from iq-learn.
@mw9385 What about printing out your rewards? If you include the chi2 term, in theory you should have very small rewards, which should help prevent divergence.
For me, it turns out that the key is to use a single Q-function as the critic, as opposed to SAC's double-Q solution.
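A quick diagnostic sketch for that suggestion, assuming NumPy arrays (`reward_stats` is a hypothetical helper name):

```python
import numpy as np

def reward_stats(current_Q, y, is_expert):
    # implicit IQ-Learn reward r = Q(s, a) - y; with the chi2 term
    # active, these should stay small on both expert and policy batches
    reward = current_Q - y
    return {
        "expert_mean": float(reward[is_expert].mean()),
        "policy_mean": float(reward[~is_expert].mean()),
        "max_abs": float(np.abs(reward).max()),
    }
```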
Thanks. I will try without training alpha and let you know the results :)
These are my hyperparameters:
parser.add_argument("--memory_size", type=int, default=20000)
parser.add_argument("--random_action", type=int, default=1000)#Don't need seeding for IL (Use 1000 for RL)
parser.add_argument("--min_samples_to_start", type=int, default=1000)
parser.add_argument("--alpha_init", type=float, default=0.5)
parser.add_argument("--soft_update_rate", type=float, default=0.005)
parser.add_argument("--mini_batch_size", type=int, default=128)
parser.add_argument("--save_period", type=int, default=200)
parser.add_argument("--gamma", type=float, default=0.99)
parser.add_argument("--lambda_", type=float, default=0.95)
parser.add_argument("--actor_lr", type=float, default=3e-5)
parser.add_argument("--q_lr", type=float, default=3e-5)
parser.add_argument("--actor_train_epoch", type=int, default=1)
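For reference, the soft_update_rate above is the Polyak averaging coefficient (tau) for the target network. A sketch of the update, assuming NumPy parameter arrays (`soft_update` is a hypothetical helper):

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    # Polyak averaging: target <- (1 - tau) * target + tau * online
    for t, o in zip(target_params, online_params):
        t[:] = (1.0 - tau) * t + tau * o
```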
@Altriaex Thanks for your reply! Actually, I computed my reward using both the expert and learner datasets. In the first loss term, I set my reward as
reward = (current_Q - y)[is_expert]
and then the corresponding loss is defined as:
loss = -reward.mean()
In the chi2 regularization, I again set my reward as
reward = (current_Q - y)
and the corresponding chi2_loss is defined as:
chi2_loss = 1/(4 * 0.5) * (reward**2).mean()
which already uses both the expert and learner datasets.
Should I change my first reward from (current_Q - y)[is_expert] to current_Q - y for all loss terms, or apply current_Q - y only in the chi2_loss?
I will try without updating alpha. And if you have any loss plots for your own custom environment, could you share them?
Many thanks.
I myself still cannot make this algo work, so I also don't know what the best thing to do is.
@Div99 Hi, is divergence of the critic function a normal phenomenon in IQ-Learn, or am I using the code in a wrong way?
Thanks in advance :)
Hi, sorry for the delay in replying. I have observed that for continuous spaces you need to add the chi2 regularization on both the policy and the expert samples. The reason is that you have a separate policy network in the continuous setting, and without also regularizing policy samples, we can learn large negative rewards for the policy that diverge toward negative infinity, preventing the method from converging.
For IQ-Learn on continuous spaces, I would recommend setting method.regularize=True
to enable the above behavior, training with a single Q-network (instead of a double critic), disabling alpha training, and playing with small alpha values like 1e-3 or 1e-2. If you are using the original code in the repo, you can try one of the settings used in our Mujoco experiments script run_mujoco.sh
For using automatic alpha training, you can see this issue: #5
In general, we want the imitation policy to have very low entropy compared to SAC, and setting entropy_target = -4 * dim(A)
works well on most Mujoco environments when learning alpha.
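Under those assumptions, the SAC-style temperature objective with the lower entropy target would look roughly like this (a sketch; `alpha_loss` is a hypothetical helper name):

```python
import numpy as np

def alpha_loss(alpha, log_probs, action_dim):
    # lower entropy target than SAC's usual -dim(A): imitation policies
    # in IQ-Learn should have very low entropy per the discussion above
    target_entropy = -4.0 * action_dim
    # standard SAC temperature objective, with the new target plugged in
    return -(alpha * (log_probs + target_entropy)).mean()
```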
> @Div99 Hi, is divergence of the critic function a normal phenomenon in IQ-Learn, or am I using the code in a wrong way? Thanks in advance :)
No, the critic should not diverge if the method is working well. It likely indicates a bug in the code or a wrong hyperparameter setting.
@Div99 Thanks for your reply, I was waiting for you! I am running my custom code in a vision-based collision-avoidance environment. The policy network takes visual inputs and produces a collision-free trajectory (3 points in 2D space, so dim(A) = 6). The policy and critic networks follow the same structure as the ones used in the Atari example.
I have tried with the following settings:
- method.loss = "value"
- method.regularize = True
- Double Q-networks
- Disabled alpha training
- Initial alpha value = 1e-3
- Learning rate of actor and critic = 3e-5
- Soft target update with tau rate 0.005
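For comparison, the repo's Hydra-style overrides for some of those settings would look roughly like this (a sketch; only method.loss and method.regularize are flag names confirmed in this thread, the rest of the command is illustrative):

```shell
# illustrative invocation; any other override names are assumptions
python train_iq.py method.loss=value method.regularize=True
```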
After training, I got these two loss curves: the first is the actor loss and the other is the Q loss. I am still suffering from Q-function divergence. When I printed the Q-values, I could see some large negative values, which result in a huge Q loss. I need to check whether I am using your method correctly by running your original code. Or can you guess any potential cause of the divergence?
[Actor loss plot]
[Q loss plot]
We use critic rate = 3e-4, so that could be one source of divergence.
I would also recommend trying a higher alpha like 1e-2 or 1e-1 if the above fix doesn't help. There could also be a potential issue with how the expert data is generated and whether it matches exactly with the policy data (obs normalization, etc.)
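One way to sanity-check that last point, assuming NumPy observation batches (`check_obs_match` is a hypothetical helper):

```python
import numpy as np

def check_obs_match(expert_obs, policy_obs, atol=0.1):
    # compare per-dimension statistics of expert vs. policy observations;
    # a large mismatch often means normalization differs between the two
    mean_ok = np.allclose(expert_obs.mean(0), policy_obs.mean(0), atol=atol)
    std_ok = np.allclose(expert_obs.std(0), policy_obs.std(0), atol=atol)
    return mean_ok and std_ok
```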
> @mw9385 What about printing out your rewards? If you include the chi2 term, in theory you should have very small rewards, which should help prevent divergence.
> For me, it turns out that the key is to use a single Q-function as the critic, as opposed to SAC's double-Q solution.
Great! Glad the single Q-network worked. It's not clear why the double-Q trick works for SAC but not here; maybe the min prevents learning the correct rewards.
> @mw9385 What about printing out your rewards? If you include the chi2 term, in theory you should have very small rewards, which should help prevent divergence.
> For me, it turns out that the key is to use a single Q-function as the critic, as opposed to SAC's double-Q solution.
When I use double Q-networks, the reward values are in the [-1, 1] range, which is not that high. I will try with a single Q-function! Thank you so much.
> We use critic rate = 3e-4, so that could be one source of divergence.
> I would also recommend trying a higher alpha like 1e-2 or 1e-1 if the above fix doesn't help. There could also be a potential issue with how the expert data is generated and whether it matches exactly with the policy data (obs normalization, etc.)
I will try a critic learning rate of 3e-4 with a single Q-network, and set my initial alpha value to 1e-2. I will let you know my results. I will also check whether my network inputs are correctly normalized.
@Altriaex @Div99 Hi, I have tried a single-Q critic and it works; I didn't see any divergence of the critic loss. I ran the original code in the repo and my loss shows similar behavior. The reason for the divergence is that the critic produces negative outputs (meaning the critic thinks the current states and actions are bad), and as training goes on the Q-values become more and more negative, resulting in divergence. Using a single critic removes this bug.
Many thanks :)
@Div99 Sorry, I have to reopen the issue, because the loss function seems to be very unstable; it fluctuates with a large magnitude. The following are my hyperparameters:
- actor learning rate: 3e-5
- critic learning rate: 3e-4
- initial temperature: 1e-2
- soft target update: True
- single Q-network
I have solved this issue by tuning hyperparameters. Closing this issue.