Comments (18)
Hi, sorry for the late response. Could you please provide your code snippet for the network? (When I add one line of F.softmax in the Actor net, it works well without an exception:)
def forward(self, s, **kwargs):
    if not isinstance(s, torch.Tensor):
        s = torch.tensor(s, device=self.device, dtype=torch.float)
    batch = s.shape[0]
    s = s.view(batch, -1)
    logits = self.model(s)
    logits = F.softmax(logits, dim=-1)  # <--- here
    logits = self._max * torch.tanh(logits)
    return logits, None
I'll check all the implemented algorithms again this week, especially PPO.
Hi, @liamz39
There may be some misunderstanding here.
- DDPG is meant to solve continuous control tasks, so there is no reason to use a softmax layer on the output. You can check here: https://github.com/thu-ml/tianshou/blob/master/discrete/net.py. For discrete control tasks, they do add a softmax layer, and it works fine for DQN.
- actor_loss = -self.critic(batch.obs, self(batch, eps=0).act).mean()
  When the backward() method is called, PyTorch calculates the corresponding partial derivatives for us automatically.
F.Y.I. self(batch, eps=0) will call the forward() method. I don't like this way either; I prefer to use self.actor(batch) instead :)
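A minimal sketch of that actor update step, assuming plain PyTorch modules and an optimizer (hypothetical stand-ins, not tianshou's internals):

import torch
import torch.nn as nn

# hypothetical stand-ins for the actor and critic networks
actor = nn.Sequential(nn.Linear(3, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(3 + 1, 1))
actor_optim = torch.optim.Adam(actor.parameters(), lr=1e-3)

obs = torch.randn(128, 3)                                           # a batch of observations
actor_loss = -critic(torch.cat([obs, actor(obs)], dim=-1)).mean()   # maximize Q(s, pi(s))
actor_optim.zero_grad()
actor_loss.backward()   # autograd computes the partial derivatives through the critic
actor_optim.step()      # only the actor's parameters are updated here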
I think there might be some potential bugs in the implementation. When I set the number of training environments to 1, the provided test_ddpg and test_td3 both fail to solve Pendulum... But SAC works well.
@junfanlin I played with the test script; in fact, it can work with only one env. Change reward_normalization=False first, because in a single env the reward bias is larger than in the multi-env case. Then:
python3 test/continuous/test_ddpg.py --seed 0 --training-num 1 --collect-per-step 1
python3 test/continuous/test_td3.py --seed 0 --training-num 1 --collect-per-step 1
It takes about 4 epochs on my machine:
± % python3 test/continuous/test_ddpg.py --seed 0 --training-num 1 --collect-per-step 2 !10150
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Epoch #1: 2401it [00:14, 165.06it/s, len=200.00, loss/actor=74.273026, loss/critic=0.385957, n/ep=1.00, n/st=200.00, rew=-1514.12, v/ep=10.75, v/st=2150.88]
Epoch #1: test_reward: -1219.784197, best_reward: -1219.784197 in #1
Epoch #2: 2401it [00:14, 164.65it/s, len=200.00, loss/actor=119.941541, loss/critic=0.696872, n/ep=1.00, n/st=200.00, rew=-946.87, v/ep=10.78, v/st=2156.70]
Epoch #2: test_reward: -857.544738, best_reward: -857.544738 in #2
Epoch #3: 2401it [00:25, 95.51it/s, len=200.00, loss/actor=133.435159, loss/critic=1.195252, n/ep=1.00, n/st=200.00, rew=-254.75, v/ep=10.79, v/st=2158.83]
Epoch #3: test_reward: -283.552947, best_reward: -283.552947 in #3
Epoch #4: 29%|#########################3 | 700/2400 [00:12<00:30, 55.36it/s, len=200.00, n/ep=1.00, n/st=200.00, rew=-248.40, v/ep=10.79, v/st=2157.22]
{'best_reward': -233.88520793467467,
'duration': '70.94s',
'test_episode': 1700.0,
'test_speed': '14988.37 step/s',
'test_step': 340000,
'test_time': '22.68s',
'train_episode': 80.0,
'train_speed': '331.59 step/s',
'train_step': 16000,
'train_time/collector': '7.43s',
'train_time/model': '40.82s'}
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Final reward: -123.6083534985761, length: 200.0
± % python3 test/continuous/test_td3.py --seed 0 --training-num 1 --collect-per-step 1 !10151
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Epoch #1: 2401it [00:14, 161.66it/s, len=200.00, loss/actor=43.141165, loss/critic1=0.086161, loss/critic2=0.065396, n/ep=1.00, n/st=200.00, rew=-1258.21, v/ep=10.59, v/st=2118.76]
Epoch #1: test_reward: -1430.842098, best_reward: -1430.842098 in #1
Epoch #2: 2401it [00:14, 161.55it/s, len=200.00, loss/actor=79.333228, loss/critic1=0.171390, loss/critic2=0.161232, n/ep=1.00, n/st=200.00, rew=-1512.23, v/ep=10.63, v/st=2126.16]
Epoch #2: test_reward: -1194.858885, best_reward: -1194.858885 in #2
Epoch #3: 2401it [00:14, 161.87it/s, len=200.00, loss/actor=107.440990, loss/critic1=0.355920, loss/critic2=0.340280, n/ep=1.00, n/st=200.00, rew=-944.28, v/ep=10.67, v/st=2133.35]
Epoch #3: test_reward: -848.226526, best_reward: -848.226526 in #3
Epoch #4: 92%|#############################################################################9 | 2200/2400 [00:15<00:01, 145.75it/s, len=200.00, n/ep=1.00, n/st=200.00, rew=-127.55, v/ep=10.67, v/st=2134.07]
{'best_reward': -219.068775899375,
'duration': '63.66s',
'test_episode': 400.0,
'test_speed': '15019.68 step/s',
'test_step': 80000,
'test_time': '5.33s',
'train_episode': 48.0,
'train_speed': '164.57 step/s',
'train_step': 9600,
'train_time/collector': '4.51s',
'train_time/model': '53.83s'}
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Final reward: -352.5223017492268, length: 200.0
Thank you for the extremely quick reply :)
Hey @Trinkle23897, thanks so much for your reply! Really appreciate it. Yeah, I added that softmax in the model's Sequential.
I understand that putting a softmax in this continuous action space might not be a good option, but nevertheless, mathematically it should work since it is a differentiable function.
...
self.model += [nn.Linear(256, np.prod(action_shape))]
...
self.model += [nn.Softmax(dim=1)]
self.model = nn.Sequential(*self.model)
I didn't add F.softmax in the forward function like you did, but directly added it to self.model. What's the difference between constructing the network in self.model versus directly adding the computation in the forward function? Why doesn't mine work?
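As far as I can tell the two forms should be equivalent; here is a minimal sketch (with a hypothetical small net) that produces the same output either way:

import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Linear(4, 3)
x = torch.randn(8, 4)
out_module = nn.Sequential(net, nn.Softmax(dim=-1))(x)   # softmax as a layer in Sequential
out_functional = F.softmax(net(x), dim=-1)                # softmax applied in forward()
print(torch.allclose(out_module, out_functional))         # True: same computation either way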
Actually, adding F.softmax to my forward still raises the same exception... Here is my Actor network code:
class Actor(nn.Module):
    def __init__(self, layer_num, state_shape, action_shape,
                 device='cpu'):
        super().__init__()
        self.device = device
        self.model = [
            nn.Linear(np.prod(state_shape), 256),
            nn.ReLU()]
        for i in range(layer_num):
            self.model += [nn.Linear(256, 256), nn.ReLU()]
        self.model += [nn.Linear(256, np.prod(action_shape))]
        self.model = nn.Sequential(*self.model)
        # self._max = max_action

    def forward(self, s, **kwargs):
        s = torch.tensor(s, device=self.device, dtype=torch.float)
        batch = s.shape[0]
        s = s.view(batch, -1)
        score = self.model(s)
        score_norm = F.softmax(score, dim=-1)
        # logits = self._max * torch.tanh(logits)
        return score_norm, None
Could you please provide the exception log?
sure.
Epoch #1: 0%| | 0/2400 [00:00<?, ?it/s]Warning: Traceback of forward call that caused the error:
File "***", line 118, in <module>
test_ddpg()
File "***", line 103, in test_ddpg
args.batch_size, stop_fn=stop_fn, save_fn=save_fn, writer=writer)
File "***python3.7/site-packages/tianshou/trainer/offpolicy.py", line 87, in offpolicy_trainer
losses = policy.learn(train_collector.sample(batch_size))
File "***python3.7/site-packages/tianshou/policy/modelfree/ddpg.py", line 147, in learn
actor_loss = -self.critic(batch.obs, self(batch, eps=0).act).mean()
File "***python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "***python3.7/site-packages/tianshou/policy/modelfree/ddpg.py", line 119, in forward
logits, h = model(obs, state=state, info=batch.info)
File "***python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "***", line 28, in forward
score_norm = F.softmax(score, dim=-1)
File "***python3.7/site-packages/torch/nn/functional.py", line 1231, in softmax
ret = input.softmax(dim)
(print_stack at /opt/conda/conda-bld/pytorch_1579022060824/work/torch/csrc/autograd/python_anomaly_mode.cpp:57)
Epoch #1: 0%| | 0/2400 [00:02<?, ?it/s]
Traceback (most recent call last):
File "***", line 118, in <module>
test_ddpg()
File "***", line 103, in test_ddpg
args.batch_size, stop_fn=stop_fn, save_fn=save_fn, writer=writer)
File "***python3.7/site-packages/tianshou/trainer/offpolicy.py", line 87, in offpolicy_trainer
losses = policy.learn(train_collector.sample(batch_size))
File "***python3.7/site-packages/tianshou/policy/modelfree/ddpg.py", line 149, in learn
actor_loss.backward()
File "***python3.7/site-packages/torch/tensor.py", line 195, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "***python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128, 21]], which is output 0 of SoftmaxBackward, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
I changed that line to:
actor_loss = -self.critic(batch.obs, self.actor(batch.obs)[0]).mean()
and it no longer raises the exception.
You cannot use softmax when the action space dimension is 1.
For example, if the logits' shape before softmax is torch.Size([128, 1]), then after the softmax operation the output is going to be [1, 1, 1, ..., 1] of length 128.
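A minimal sketch of this degenerate case (assuming an action dimension of 1):

import torch
import torch.nn.functional as F

logits = torch.randn(128, 1)        # batch of 128, action dimension 1
probs = F.softmax(logits, dim=-1)   # softmax over a single element per row
print(probs.unique())               # tensor([1.]) -- every output is exactly 1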
By the way, softmax is only used for the Categorical distribution in discrete action tasks. It is not recommended to use softmax over a continuous action space.
For your reference: #29 (comment)
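For discrete tasks, a minimal sketch of the intended use (hypothetical shapes):

import torch
import torch.nn.functional as F
from torch.distributions import Categorical

logits = torch.randn(128, 4)        # 128 states, 4 discrete actions
probs = F.softmax(logits, dim=-1)   # per-state probabilities over the 4 actions
dist = Categorical(probs=probs)
actions = dist.sample()             # shape [128], one sampled action index per state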
This is what my forward returns:
def forward(self, s, **kwargs):
    ...
    return logits, None
so self.actor(batch.obs)[0] will return the logits; its shape will be [128, whatever my output dim size is].
When I used DDPG to train LunarLanderContinuous-v2, the reward did not seem to increase at all. Is any modification needed for this environment?
BTW, the SAC algorithm might need a .sum(-1, keepdims=True) when calculating log_prob, and it does not seem to converge on LunarLanderContinuous-v2 either.
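A minimal sketch of what I mean, assuming a hypothetical 2-D Gaussian policy:

import torch
from torch.distributions import Normal

mu, sigma = torch.zeros(128, 2), torch.ones(128, 2)  # 2-D continuous action, batch of 128
dist = Normal(mu, sigma)
act = dist.sample()
log_prob = dist.log_prob(act)                # shape [128, 2]: one log-prob per action dim
log_prob = log_prob.sum(-1, keepdim=True)    # joint log-prob of the full action, shape [128, 1]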
Thanks in advance.
@junfanlin I'll try it later. Thanks for pointing this out! You can try observation & reward normalization first.
- I have marked all applicable categories:
  - exception-raising bug
  - RL algorithm bug
  - documentation request (i.e. "X is missing from the documentation.")
  - new feature request
- I have visited the source website, and in particular read the known issues
- I have searched through the issue tracker for duplicates
- I have mentioned version numbers, operating system and environment, where applicable:
When I run /tianshou/test/discrete/test_dqn.py for testing, the following issue happens:
Traceback (most recent call last):
File "E:/PycharmProjects/tianshou/tianshou/test/discrete/test_dqn.py", line 114, in <module>
test_dqn(get_args())
File "E:/PycharmProjects/tianshou/tianshou/test/discrete/test_dqn.py", line 76, in test_dqn
train_collector.collect(n_step=args.batch_size)
File "D:\Users\Administrator\anaconda3\envs\tianshou\lib\site-packages\tianshou\data\collector.py", line 296, in collect
self.step_speed.add(cur_step / duration)
ZeroDivisionError: float division by zero
Process finished with exit code 1
In the first two episodes it works just fine; in the third episode, the above issue happens. Could you please tell me why?
@dylan0828 I fixed this issue in #26; you can re-install the newest version of tianshou from GitHub: pip3 install git+https://github.com/thu-ml/tianshou.git@master
Further questions are welcome.
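Roughly, the fix guards the speed computation against a zero elapsed time on coarse timers; a minimal sketch of the idea (see #26 for the actual change):

import time

start_time = time.time()
# ... collect a (possibly very fast) batch of steps ...
cur_step = 64
duration = max(time.time() - start_time, 1e-9)  # avoid ZeroDivisionError when the timer reports 0
step_speed = cur_step / duration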