
Comments (17)

Altriaex commented on May 26, 2024

Hi, I observed similar behaviors.
In your code you set

reward = (current_Q - y)[is_expert]

and compute the chi2 regularization only for expert "reward" as

chi2_loss = 1/(4 * 0.5) * (reward**2).mean()

which, in my experience, leads to divergence. The reason is that these "rewards" are in fact very large. If you look at the iq.py file, you will see that the authors compute the chi2 regularization on both the policy's and the expert's "reward". With that version I do not get the divergence problem, but I am still not able to learn good policies.

Another thing to point out is that the authors do not update alpha.
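
For concreteness, a minimal sketch of the two variants being discussed, assuming the tensor names current_Q, y, and is_expert from the snippets above and alpha = 0.5 (an illustration, not the repo's exact code):

import torch

def chi2_regularizer(current_Q: torch.Tensor,
                     y: torch.Tensor,
                     is_expert: torch.Tensor,
                     alpha: float = 0.5,
                     expert_only: bool = False) -> torch.Tensor:
    # chi^2 penalty 1/(4*alpha) * E[r^2] on the implicit reward r = Q(s, a) - y
    reward = current_Q - y
    if expert_only:
        # variant above: penalize only the expert transitions (diverged in practice)
        reward = reward[is_expert]
    # iq.py variant: penalize both policy and expert transitions
    return 1.0 / (4 * alpha) * (reward ** 2).mean()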

Altriaex commented on May 26, 2024

@mw9385 What about printing out your rewards? If you include the chi2 term, you should in theory get very small rewards, which should help prevent divergence.

For me, it turns out that the key is to use a single Q function as the critic, as opposed to SAC's double-Q solution.
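
As a rough illustration of the "print out your rewards" suggestion, one could log the implicit reward at each critic update (a sketch; the names current_Q and y follow the snippets above):

import torch

def log_implicit_reward(current_Q: torch.Tensor, y: torch.Tensor) -> None:
    # r = Q(s, a) - y is the implicit reward; with the chi2 term active it should stay small,
    # and steadily growing magnitudes are an early warning of divergence
    with torch.no_grad():
        r = current_Q - y
        print(f"reward mean {r.mean():.4f}, min {r.min():.4f}, max {r.max():.4f}")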

mw9385 commented on May 26, 2024

Thanks. I will try without training alpha and let you know the results :)

mw9385 commented on May 26, 2024

These are my hyperparameters:

import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--memory_size", type=int, default=20000)
parser.add_argument("--random_action", type=int, default=1000)  # Don't need seeding for IL (use 1000 for RL)
parser.add_argument("--min_samples_to_start", type=int, default=1000)
parser.add_argument("--alpha_init", type=float, default=0.5)
parser.add_argument("--soft_update_rate", type=float, default=0.005)
parser.add_argument("--mini_batch_size", type=int, default=128)
parser.add_argument("--save_period", type=int, default=200)
parser.add_argument("--gamma", type=float, default=0.99)
parser.add_argument("--lambda_", type=float, default=0.95)
parser.add_argument("--actor_lr", type=float, default=3e-5)
parser.add_argument("--q_lr", type=float, default=3e-5)
parser.add_argument("--actor_train_epoch", type=int, default=1)

mw9385 commented on May 26, 2024

@Altriaex Thanks for your reply! Actually, I compute my reward using both the expert and learner data sets. In the first loss term, I set my reward as
reward = (current_Q - y)[is_expert]
and the corresponding loss is defined as
loss = -(reward)

In the chi2 regularization, I again set my reward as
reward = current_Q - y
and the corresponding chi2 loss is defined as
chi2_loss = 1/(4 * 0.5) * (reward**2).mean()
which already uses both the expert and learner data sets.

Should I change my first reward from (current_Q - y)[is_expert] to current_Q - y for all loss terms, or apply current_Q - y only in the chi2_loss?

I will try without updating alpha. And if you have any loss plots from your own custom environment, could you share them?

Many thanks.

Altriaex commented on May 26, 2024

I myself still cannot make this algorithm work, so I also don't know what the best thing to do is.

mw9385 commented on May 26, 2024

@Div99 Hi, is divergence of the critic a normal phenomenon in IQ-Learn, or am I using the code in a wrong way?
Thanks in advance :)

Div99 commented on May 26, 2024

Hi, sorry for the delay in replying. I have observed that for continuous spaces you need to add the chi2 regularization on both the policy and the expert samples. The reason is that you have a separate policy network in the continuous setting, and without also regularizing policy samples, the method can learn large negative rewards for the policy that diverge toward negative infinity, preventing convergence.

For IQ-Learn on continuous spaces, I would recommend setting method.regularize=True to enable the above behavior, training with a single Q-network (instead of a double critic), and disabling alpha training while playing with small alpha values like 1e-3 or 1e-2. If you are using the original code in the repo, you can try one of the settings used in our Mujoco experiments script run_mujoco.sh.

For automatic alpha training, see this issue: #5
In general, we want the imitation policy to have very low entropy compared to SAC, and setting entropy_target = -4 * dim(A) works well on most Mujoco environments when learning the alpha.
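
A minimal sketch of what learning alpha with that entropy target could look like, using a standard SAC-style temperature update (the names log_alpha and alpha_optimizer and the learning rate are assumptions, not the repo's exact code):

import torch

action_dim = 6                       # replace with dim(A) of your environment
target_entropy = -4 * action_dim     # much lower entropy target than SAC's usual -dim(A)

log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_prob: torch.Tensor) -> float:
    # log_prob: log pi(a|s) for actions sampled from the current policy
    alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()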

Div99 commented on May 26, 2024

@Div99 Hi, is divergence of the critic a normal phenomenon in IQ-Learn, or am I using the code in a wrong way? Thanks in advance :)

No, the critic should not diverge if the method is working well. It likely indicates a bug in the code or a wrong hyperparameter setting.

mw9385 commented on May 26, 2024

@Div99 Thanks for your reply, I was waiting for you! I am running my custom code in a vision-based collision-avoidance environment. The policy network takes visual inputs and produces a collision-free trajectory (3 points in 2D space, so dim(A) = 6). The policy and critic networks follow the same structure as the one used in the Atari example.

I have tried with the following settings:

  • method.loss = "value"
  • method.regularize= True
  • Double Q-networks
  • Disabled alpha training
  • Initial alpha value = 1e-3
  • Learning rate of actor and critic = 3e-5
  • Soft target update with tau rate 0.005

After training, I got the two loss curves below: the first is the actor loss and the other is the Q loss. I am still suffering from Q-function divergence. When I printed the Q values, I could see some large negative values, which result in a huge Q loss. I need to check again whether I am using your method correctly by running your original code. Or can you guess any potential cause of the divergence?

[Plots of the actor loss and the Q loss omitted]

Div99 commented on May 26, 2024

We use a critic learning rate of 3e-4, so that could be one source of divergence.

I would also recommend trying a higher alpha like 1e-2 or 1e-1 if the above fix doesn't help. There could also be an issue with how the expert data is generated and whether it matches the policy data exactly (obs normalization, etc.).
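
One quick way to check for such an expert/policy data mismatch is to compare basic statistics of the observations from both sources (a rough diagnostic sketch; expert_obs and policy_obs are assumed to be arrays you already have):

import numpy as np

def compare_obs_stats(expert_obs: np.ndarray, policy_obs: np.ndarray) -> None:
    # large differences in scale or range usually mean the expert data was preprocessed
    # (normalized, scaled, stacked) differently from the live policy observations
    for name, obs in [("expert", expert_obs), ("policy", policy_obs)]:
        print(f"{name}: mean {obs.mean():.3f}, std {obs.std():.3f}, "
              f"min {obs.min():.3f}, max {obs.max():.3f}")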

Div99 commented on May 26, 2024

@mw9385 What about printing out your rewards? If you include the chi2 term, you should in theory get very small rewards, which should help prevent divergence.

For me, it turns out that the key is to use a single Q function as the critic, as opposed to SAC's double-Q solution.

Great! Glad the single Q-network worked. It's not clear why the double-Q trick works for SAC but not here; maybe the min prevents learning the correct rewards.
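
For reference, a sketch of the difference being discussed, i.e. SAC's clipped double-Q trick versus a single critic (illustrative only; q1 and q2 stand for two critic networks):

import torch

def critic_value(q1, q2, obs, act, use_double_q: bool) -> torch.Tensor:
    if use_double_q:
        # SAC-style pessimism: element-wise minimum of two critics. In IQ-Learn the
        # Q-values encode the recovered rewards, so this min may bias them, which
        # could be why the trick hurts here.
        return torch.min(q1(obs, act), q2(obs, act))
    # single critic, which worked better in this thread
    return q1(obs, act)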

mw9385 commented on May 26, 2024

@mw9385 What about printing out your rewards? If you include the chi2 term, you should in theory get very small rewards, which should help prevent divergence.

For me, it turns out that the key is to use a single Q function as the critic, as opposed to SAC's double-Q solution.

When I use double Q-networks, the reward values are in the [-1, 1] range, which is not that high. I will try with a single Q-function! Thank you so much.

mw9385 commented on May 26, 2024

We use a critic learning rate of 3e-4, so that could be one source of divergence.

I would also recommend trying a higher alpha like 1e-2 or 1e-1 if the above fix doesn't help. There could also be an issue with how the expert data is generated and whether it matches the policy data exactly (obs normalization, etc.).

I will try a critic learning rate of 3e-4 with a single Q-network, and set my initial alpha value to 1e-2. I will let you know my results. I will also check whether my network inputs are correctly normalized.

mw9385 commented on May 26, 2024

@Altriaex @Div99 Hi, I have tried a single Q critic and it works; I didn't see any divergence of the critic loss. I ran the original code in the repo and my loss shows similar behavior. The cause of the divergence was that the critic produces negative outputs (meaning the critic thinks the current states and actions are bad), and as training goes on the Q values become more and more negative, resulting in divergence. Using a single critic removes this bug.

Many thanks :)

mw9385 commented on May 26, 2024

@Div99 Sorry, I have to reopen the issue because the loss seems to be very unstable; it is fluctuating with a large magnitude. The following are my hyperparameters:

  • actor learning rate: 3e-5
  • critic learning rate: 3e-4
  • initial temperature: 1e-2
  • soft target update: True
  • single Q-network

[Screenshots of the unstable loss curves (2023-01-20) omitted]

mw9385 commented on May 26, 2024

I have solved this issue by tuning hyperparameters. Closing this issue.
