Comments (17)
Hi, I observed similar behaviors.
In your code you set
reward = (current_Q - y)[is_expert]
and compute the chi2 regularization only for expert "reward" as
chi2_loss = 1/(4 * 0.5) * (reward**2).mean()
which in my experience leads to divergence. The reason is that these "rewards" are in fact very large. If you look at the iq.py file, you see that the authors compute the chi2 regularization on both the policy's and the expert's "reward". With that change I no longer have the divergence problem, but I am still not able to get good policies.
Another thing to point out is that the authors do not update alpha.
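A minimal sketch of that fix, assuming NumPy arrays and a hypothetical `chi2_regularized_loss` helper (`alpha=0.5` matches the `1/(4 * 0.5)` coefficient above):

```python
import numpy as np

def chi2_regularized_loss(current_Q, y, is_expert, alpha=0.5):
    # implicit IQ-Learn reward: r = Q(s, a) - y
    reward = current_Q - y
    # first term: maximize the reward on expert transitions only
    loss = -reward[is_expert].mean()
    # chi^2 regularizer over BOTH expert and policy samples
    # (not just reward[is_expert]), which keeps all rewards small
    loss += 1.0 / (4 * alpha) * (reward ** 2).mean()
    return loss
```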
from iq-learn.
@mw9385 What about printing out your rewards? If you include the chi2 term, in theory you should have very small rewards, which should help prevent divergence.
For me, it turns out that the key is to use a single Q-function as the critic, as opposed to SAC's double-Q solution.
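A quick diagnostic sketch for that suggestion, assuming NumPy arrays (`reward_stats` is a hypothetical helper name):

```python
import numpy as np

def reward_stats(current_Q, y, is_expert):
    # implicit IQ-Learn reward r = Q(s, a) - y; with the chi2 term
    # active, these should stay small on both expert and policy batches
    reward = current_Q - y
    return {
        "expert_mean": float(reward[is_expert].mean()),
        "policy_mean": float(reward[~is_expert].mean()),
        "max_abs": float(np.abs(reward).max()),
    }
```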
Thanks. I will try without training alpha and let you know the results :)
These are my hyperparameters:
parser.add_argument("--memory_size", type=int, default=20000)
parser.add_argument("--random_action", type=int, default=1000)#Don't need seeding for IL (Use 1000 for RL)
parser.add_argument("--min_samples_to_start", type=int, default=1000)
parser.add_argument("--alpha_init", type=float, default=0.5)
parser.add_argument("--soft_update_rate", type=float, default=0.005)
parser.add_argument("--mini_batch_size", type=int, default=128)
parser.add_argument("--save_period", type=int, default=200)
parser.add_argument("--gamma", type=float, default=0.99)
parser.add_argument("--lambda_", type=float, default=0.95)
parser.add_argument("--actor_lr", type=float, default=3e-5)
parser.add_argument("--q_lr", type=float, default=3e-5)
parser.add_argument("--actor_train_epoch", type=int, default=1)
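For reference, the soft_update_rate above is the Polyak averaging coefficient (tau) for the target network. A sketch of the update, assuming NumPy parameter arrays (`soft_update` is a hypothetical helper):

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    # Polyak averaging: target <- (1 - tau) * target + tau * online
    for t, o in zip(target_params, online_params):
        t[:] = (1.0 - tau) * t + tau * o
```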
@Altriaex Thanks for your reply! Actually, I computed my reward using both the expert and learner datasets. In the first loss term, I set my reward as
reward = (current_Q - y)[is_expert]
and then the corresponding loss is defined as:
loss = -reward.mean()
In the chi2 regularization, I again set my reward as
reward = (current_Q - y)
and the corresponding chi2_loss is defined as:
chi2_loss = 1/(4 * 0.5) * (reward**2).mean()
which already uses both the expert and learner datasets.
Should I change my first reward from (current_Q - y)[is_expert] to current_Q - y for all loss terms, or apply current_Q - y only in the chi2_loss?
I will try without updating alpha. And if you have any loss plots for your own custom environment, could you share them?
Many thanks.
I myself still cannot make this algo work, so I also don't know what the best thing to do is.
@Div99 Hi, is divergence of the critic function a normal phenomenon in IQ-Learn, or am I using the code in a wrong way?
Thanks in advance :)
Hi, sorry for the delay in replying. I have observed that for continuous spaces you need to add the chi2 regularization on both the policy and the expert samples. The reason is that you have a separate policy network in the continuous setting, and without also regularizing policy samples, we can learn large negative rewards for the policy that diverge toward negative infinity, preventing the method from converging.
For IQ-Learn on continuous spaces, I would recommend setting method.regularize=True
to enable the above behavior, training with a single Q-network (instead of a double critic), disabling alpha training, and playing with small alpha values like 1e-3 or 1e-2. If you are using the original code in the repo, you can try one of the settings used in our Mujoco experiments script run_mujoco.sh
For using automatic alpha training, you can see this issue: #5
In general, we want the imitation policy to have very low entropy compared to SAC, and setting entropy_target = -4 * dim(A)
works well on most Mujoco environments when learning alpha.
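Under those assumptions, the SAC-style temperature objective with the lower entropy target would look roughly like this (a sketch; `alpha_loss` is a hypothetical helper name):

```python
import numpy as np

def alpha_loss(alpha, log_probs, action_dim):
    # lower entropy target than SAC's usual -dim(A): imitation policies
    # in IQ-Learn should have very low entropy per the discussion above
    target_entropy = -4.0 * action_dim
    # standard SAC temperature objective, with the new target plugged in
    return -(alpha * (log_probs + target_entropy)).mean()
```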
> @Div99 Hi, is divergence of the critic function a normal phenomenon in IQ-Learn, or am I using the code in a wrong way? Thanks in advance :)
No, the critic should not diverge if the method is working well. It likely indicates a bug in the code or a wrong hyperparameter setting.
@Div99 Thanks for your reply, I was waiting for you! I am running my custom code in a vision-based collision-avoidance environment. The policy network takes visual inputs and produces a collision-free trajectory (3 points in 2D space, so dim(A) = 6). The policy and critic networks follow the same structure as the ones used in the Atari example.
I have tried with the following settings:
- method.loss = "value"
- method.regularize = True
- Double Q-networks
- Disabled alpha training
- Initial alpha value = 1e-3
- Learning rate of actor and critic = 3e-5
- Soft target update with tau rate 0.005
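For comparison, the repo's Hydra-style overrides for some of those settings would look roughly like this (a sketch; only method.loss and method.regularize are flag names confirmed in this thread, the rest of the command is illustrative):

```shell
# illustrative invocation; any other override names are assumptions
python train_iq.py method.loss=value method.regularize=True
```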
After training, I got these two loss curves: the first is the actor loss and the other is the Q loss. I am still suffering from Q-function divergence. When I printed the Q-values, I could see some large negative values, which result in a huge Q loss. I need to check whether I am using your method correctly by running your original code. Or can you guess any potential cause of the divergence?
[Actor loss plot]
[Q loss plot]
We use critic rate = 3e-4, so that could be one source of divergence.
I would also recommend trying a higher alpha like 1e-2 or 1e-1 if the above fix doesn't help. There could also be a potential issue with how the expert data is generated and whether it matches exactly with the policy data (obs normalization, etc.)
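One way to sanity-check that last point, assuming NumPy observation batches (`check_obs_match` is a hypothetical helper):

```python
import numpy as np

def check_obs_match(expert_obs, policy_obs, atol=0.1):
    # compare per-dimension statistics of expert vs. policy observations;
    # a large mismatch often means normalization differs between the two
    mean_ok = np.allclose(expert_obs.mean(0), policy_obs.mean(0), atol=atol)
    std_ok = np.allclose(expert_obs.std(0), policy_obs.std(0), atol=atol)
    return mean_ok and std_ok
```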
> @mw9385 What about printing out your rewards? If you include the chi2 term, in theory you should have very small rewards, which should help prevent divergence.
> For me, it turns out that the key is to use a single Q-function as the critic, as opposed to SAC's double-Q solution.
Great! Glad the single Q-network worked. It's not clear why the double-Q trick works for SAC but not here; maybe the min prevents learning the correct rewards.
> @mw9385 What about printing out your rewards? If you include the chi2 term, in theory you should have very small rewards, which should help prevent divergence.
> For me, it turns out that the key is to use a single Q-function as the critic, as opposed to SAC's double-Q solution.
When I use double Q-networks, the reward values are in the [-1, 1] range, which is not that high. I will try with a single Q-function! Thank you so much.
> We use critic rate = 3e-4, so that could be one source of divergence.
> I would also recommend trying a higher alpha like 1e-2 or 1e-1 if the above fix doesn't help. There could also be a potential issue with how the expert data is generated and whether it matches exactly with the policy data (obs normalization, etc.)
I will try a critic learning rate of 3e-4 with a single Q-network, and set my initial alpha value to 1e-2. I will let you know my results. I will also check whether my network inputs are correctly normalized.
@Altriaex @Div99 Hi, I have tried a single-Q critic and it works; I didn't see any divergence of the critic loss. I ran the original code in the repo and my loss shows similar behavior. The reason for the divergence is that the critic produces negative outputs (meaning the critic thinks the current states and actions are bad), and as training goes on the Q-values become more and more negative, resulting in divergence. Using a single critic removes this bug.
Many thanks :)
@Div99 Sorry, I have to reopen the issue, because the loss function seems to be very unstable; it fluctuates with a large magnitude. The following are my hyperparameters:
- actor learning rate: 3e-5
- critic learning rate: 3e-4
- initial temperature: 1e-2
- soft target update: True
- single Q-network
I have solved this issue by tuning hyperparameters. Closing this issue.