Hi, these implementations are amazing, thank you for sharing them. I have a question o

How are we using rewards in imitation learning? (pytorch-rl issue, closed)

khrylx commented on August 23, 2024

How are we using rewards in imitation learning?

Comments (4)

Khrylx commented on August 23, 2024

The reward is computed as the negative log of the discriminator output, as shown in this line:

PyTorch-RL/gail/gail_gym.py

Line 99 in d94e147

return -math.log(discrim_net(state_action)[0].item())

zbzhu99 commented on August 23, 2024

The reward is computed as the negative log of the discriminator output, as shown in this line:

PyTorch-RL/gail/gail_gym.py

Line 99 in d94e147

return -math.log(discrim_net(state_action)[0].item())

This is exactly how the reward should be calculated. However, everywhere else it is written as the expectation of the derivative of the log term (see above images). Can you please tell me why the 'minus' sign is removed in the expectation term?

The D in this code is actually equivalent to the minus D in the original paper.

PyTorch-RL/gail/gail_gym.py

Lines 125 to 126 in d94e147

 discrim_loss = discrim_criterion(g_o, ones((states.shape[0], 1), device=device)) + \
     discrim_criterion(e_o, zeros((expert_traj.shape[0], 1), device=device))

From the above two lines of code, it can be seen that the discriminator is trained to output 1 for the generated data g_o and 0 for the expert data e_o. The goal of the policy update is therefore to minimize the discriminator's output on generated data, i.e., to maximize the reward -log(D(g_o)), which is what makes the training adversarial.
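To make the label convention above concrete, here is a minimal sketch in plain Python. It assumes that discrim_criterion is a mean-reduced binary cross-entropy, which the thread does not state. The helper names bce, discrim_loss and gail_reward, and the probabilities used, are hypothetical and chosen only for illustration.

```python
import math

def bce(p, target):
    """Binary cross-entropy for one probability p in (0, 1) and a 0/1 target."""
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def discrim_loss(gen_probs, expert_probs):
    """Mean BCE with label 1 for generated pairs plus mean BCE with label 0 for expert pairs."""
    gen_term = sum(bce(p, 1.0) for p in gen_probs) / len(gen_probs)
    exp_term = sum(bce(p, 0.0) for p in expert_probs) / len(expert_probs)
    return gen_term + exp_term

def gail_reward(d_out):
    """Policy reward under this convention: -log D(s, a)."""
    return -math.log(d_out)

# A discriminator that separates the two sources well has a lower loss ...
good_d = discrim_loss(gen_probs=[0.9, 0.8], expert_probs=[0.1, 0.2])
bad_d = discrim_loss(gen_probs=[0.2, 0.1], expert_probs=[0.8, 0.9])

# ... and the policy earns more reward when D judges its pair expert-like (output near 0).
expert_like = gail_reward(0.1)
policy_like = gail_reward(0.9)
```

With these numbers, good_d is smaller than bad_d, and expert_like is larger than policy_like, so maximizing -log(D) pushes the policy toward state-action pairs that the discriminator cannot tell apart from expert data.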

SiddharthSingi commented on August 23, 2024

Thank you for your response. For anyone else wondering the same thing, please check this line as well:

PyTorch-RL/core/agent.py

Line 42 in d94e147

reward = custom_reward(state, action)
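A rough sketch of what that line implies: during sample collection, the environment's reward is replaced by the discriminator-based reward whenever a custom reward function is supplied. This is not the repository's code. The names collect_rewards, toy_discriminator and expert_reward are hypothetical, and the fixed probabilities stand in for a trained discrim_net.

```python
import math

def collect_rewards(transitions, custom_reward=None):
    """Return one reward per (state, action, env_reward) tuple.

    If custom_reward is given, it overrides the environment reward.
    """
    rewards = []
    for state, action, env_reward in transitions:
        reward = env_reward
        if custom_reward is not None:
            reward = custom_reward(state, action)
        rewards.append(reward)
    return rewards

def toy_discriminator(state, action):
    # Hypothetical stand-in for discrim_net: low output means "looks like expert data".
    return 0.25 if action == state else 0.75

def expert_reward(state, action):
    # Same form as the reward line quoted in this thread: -log D(s, a).
    return -math.log(toy_discriminator(state, action))

transitions = [(0, 0, 1.0), (0, 1, 1.0)]
env_rewards = collect_rewards(transitions)
gail_rewards = collect_rewards(transitions, custom_reward=expert_reward)
```

Here env_rewards keeps the environment's values, while gail_rewards contains -log(0.25) for the first pair and -log(0.75) for the second, so the policy never sees the true environment reward during imitation.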

SiddharthSingi commented on August 23, 2024

The reward is computed as the negative log of the discriminator output, as shown in this line:

PyTorch-RL/gail/gail_gym.py

Line 99 in d94e147

return -math.log(discrim_net(state_action)[0].item())

This is exactly how the reward should be calculated. However, everywhere else it is written as the expectation of the derivative of the log term (see above images). Can you please tell me why the 'minus' sign is removed in the expectation term?
