
Comments (4)

ranahanocka commented on August 20, 2024

Any update on this? I am also wondering the same thing.

If this implementation uses rewards to train the LSTM, then it is no longer a model-based approach (which would be very different from the whole concept of the paper).


ctallec commented on August 20, 2024

We did model the reward as well in the previous version. That does not change the fact that this is a model-based approach: the reward is simply included among the things about the world we aim to model. Besides, I've just pushed a version where the reward is no longer modelled by default. I don't expect that to significantly change the results.


ranahanocka commented on August 20, 2024

Thanks for the reply. But I don't understand how the rewards that drive the controller would also drive the model of the world (the dynamics). A huge advantage of model-based RL is the separation of the reward from the dynamics, right?

For example, the trained LSTM will be used to train the controller to follow a lane. Suppose the LSTM also used the lane-following reward during training; now, if I wanted a different policy (e.g., having the car drive on the left side of the road, because it's England), it wouldn't be enough to just retrain the controller. My dynamics model may no longer work for this new policy, since the reward changed.

Re: your new fix
BTW, the MDRNN network (models/mdrnn.py) still regresses a reward.
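For readers without the repo open, here is a minimal, illustrative sketch of the kind of MDN-RNN being discussed (names and dimensions are made up for illustration; the real class lives in models/mdrnn.py). The point is that the network keeps a dedicated output for the reward, regardless of whether that output is trained:

    import torch
    import torch.nn as nn

    class MDRNNSketch(nn.Module):
        """Illustrative MDN-RNN: an LSTM over (latent, action) pairs whose
        output parametrizes a Gaussian mixture over the next latent, plus
        scalar heads for reward and terminal prediction."""

        def __init__(self, latents=32, actions=3, hiddens=256, gaussians=5):
            super().__init__()
            self.rnn = nn.LSTM(latents + actions, hiddens)
            # One linear layer produces mixture means, log-sigmas and logits,
            # plus one scalar each for the reward and terminal heads.
            self.out_linear = nn.Linear(hiddens, (2 * latents + 1) * gaussians + 2)
            self.latents, self.gaussians = latents, gaussians

        def forward(self, actions, latents):
            # actions: (seq, batch, actions); latents: (seq, batch, latents)
            outs, _ = self.rnn(torch.cat([actions, latents], dim=-1))
            out = self.out_linear(outs)
            stride = self.gaussians * self.latents
            mus = out[..., :stride]
            sigmas = torch.exp(out[..., stride:2 * stride])
            logpi = out[..., 2 * stride:2 * stride + self.gaussians]  # unnormalized
            rewards = out[..., -2]   # the reward head in question
            dones = out[..., -1]     # terminal head
            return mus, sigmas, logpi, rewards, dones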


ctallec commented on August 20, 2024

What you might be missing here is that the model is trained to predict the reward, not to optimize it. Modelling the problem's reward is unlikely to degrade your dynamics model: if the dynamics are perfectly modelled, then predicting the reward should not be problematic and should come at little computational overhead (since the reward is mostly a deterministic function of the dynamics' hidden state). On the other hand, if the dynamics are very hard to model, you are giving the network hints as to which parts of the environment may be of use for your own task, and probably for many other tasks as well.
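To make the predict-versus-optimize distinction concrete, here is a hedged sketch (a hypothetical helper, reusing the illustrative forward signature from the sketch above, not the repo's actual training code):

    import torch.nn.functional as F

    # Supervised fit of the reward head: the target is the reward the
    # environment actually returned, so gradients move the *prediction*
    # towards the data, never the policy towards a higher reward.
    def reward_prediction_step(mdrnn, optimizer, actions, latents, observed_rewards):
        _, _, _, predicted_rewards, _ = mdrnn(actions, latents)
        loss = F.mse_loss(predicted_rewards, observed_rewards)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Reward *optimization* happens elsewhere: the controller's parameters
    # are searched (with CMA-ES in this repo) to maximize cumulative reward,
    # with the world model frozen during that search.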

Predicting the reward for one task may be useful in a wide variety of tasks other than the original one. In general, model-based RL is all about predicting things that are not necessarily directly related to your own task, but which could help as side information. Typically, in your example, if you want both to move forward and to remain on the left side of the road, having access to variables that are predictive of whether you are moving forward or not (basically the variables you learnt by modelling the reward of the original task) is likely to be useful. This is quite general: if your task is to learn how to walk, having access to variables that are predictive of how high your head is (which is a proper reward for the task "standing up") is also likely to be useful.

Overall, what I am saying is that by additionally modelling the original task's reward, we are not constraining the model, since we are only telling it to predict more things, not fewer. The only problem there could be is one of network capacity: the network might only be capable of modelling either the dynamics or the reward, but not both. As the reward is a simple function of the dynamics, and is low dimensional compared to the latent dynamics, my guess is that this is not the bottleneck here. Hope this makes things clearer. If not, I'd be happy to discuss this further.

Besides, if I haven't made any mistake in the code (entirely possible), the reward is no longer regressed. The reward prediction loss is zeroed by default in trainmdrnn.py (l137-142). You still have a network head that could be used to predict the reward, but this head is no longer trained, and the corresponding error is no longer backpropagated in the LSTM:
    if include_reward:
        mse = f.mse_loss(rs, reward)
        scale = LSIZE + 2
    else:
        mse = 0
        scale = LSIZE + 1
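For context, a paraphrased sketch of how that branch folds into the total loss in the training script (not a verbatim quote): gmm is the mixture negative log-likelihood on the next latent and bce the terminal-prediction loss; when include_reward is False, mse is the constant 0, so nothing from the reward head is backpropagated.

    # Paraphrased from the surrounding code in trainmdrnn.py:
    #   gmm - GMM negative log-likelihood of the next latent
    #   bce - binary cross-entropy on the terminal flag
    #   mse - reward term from the branch above (a plain 0 by default)
    # With mse == 0, the sum carries no reward gradient, so the reward head
    # contributes nothing to the LSTM's updates.
    loss = (gmm + bce + mse) / scale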

