
Comments (18)

Trinkle23897 commented on July 16, 2024

Hi, sorry for the late response. Could you please provide your code snippet for the network? When I add one line of F.softmax in the Actor net, it works well without an exception:

def forward(self, s, **kwargs):
    if not isinstance(s, torch.Tensor):
        s = torch.tensor(s, device=self.device, dtype=torch.float)
    batch = s.shape[0]
    s = s.view(batch, -1)
    logits = self.model(s)
    logits = F.softmax(logits, dim=-1)  # <--- here
    logits = self._max * torch.tanh(logits)
    return logits, None


Trinkle23897 commented on July 16, 2024

I'll check all the implemented algorithms again this week, especially PPO.


fengredrum commented on July 16, 2024

Hi @liamz39,
Maybe there's some misunderstanding here.

  1. DDPG is meant to solve continuous control tasks, so there is no reason to use a SoftMax layer for the output. You can check here: https://github.com/thu-ml/tianshou/blob/master/discrete/net.py. For discrete control tasks, they do add a SoftMax layer, and it works fine for DQN.

  2. actor_loss = -self.critic(batch.obs, self(batch, eps=0).act).mean()
    When the backward() method is called, PyTorch calculates the corresponding partial derivatives for us automatically (see the sketch after this note).

F.Y.I., self(batch, eps=0) will call the forward() method. I don't like this style either; I prefer to use self.actor(batch) instead :)
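
To illustrate point 2: a minimal standalone sketch (toy linear networks, not tianshou's actual classes) showing that backward() differentiates the actor loss through the critic with no manual chain-rule work:

import torch
import torch.nn as nn

actor = nn.Linear(4, 2)        # toy: observation -> action
critic = nn.Linear(4 + 2, 1)   # toy: (observation, action) -> Q-value

obs = torch.randn(128, 4)
act = actor(obs)
q = critic(torch.cat([obs, act], dim=-1))
actor_loss = -q.mean()

actor_loss.backward()           # autograd fills actor.weight.grad, actor.bias.grad
print(actor.weight.grad.shape)  # torch.Size([2, 4])

(In real DDPG only the actor's optimizer would step on this loss, even though the critic's gradients are populated too.)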


junfanlin commented on July 16, 2024

I think there might be some subtle bugs in the implementation. When I set the number of training environments to 1, the provided test_ddpg and test_td3 both fail to solve Pendulum ... But SAC works well.


Trinkle23897 commented on July 16, 2024

@junfanlin I played with the test script; in fact, it can work with only one env:

Set reward_normalization=False first, because with a single env the reward bias is larger than with multiple envs. Then:

python3 test/continuous/test_ddpg.py --seed 0 --training-num 1 --collect-per-step 1
python3 test/continuous/test_td3.py --seed 0 --training-num 1 --collect-per-step 1

It takes about 4 epochs on my computer:

± % python3 test/continuous/test_ddpg.py --seed 0 --training-num 1 --collect-per-step 2
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Epoch #1: 2401it [00:14, 165.06it/s, len=200.00, loss/actor=74.273026, loss/critic=0.385957, n/ep=1.00, n/st=200.00, rew=-1514.12, v/ep=10.75, v/st=2150.88]                                                        
Epoch #1: test_reward: -1219.784197, best_reward: -1219.784197 in #1
Epoch #2: 2401it [00:14, 164.65it/s, len=200.00, loss/actor=119.941541, loss/critic=0.696872, n/ep=1.00, n/st=200.00, rew=-946.87, v/ep=10.78, v/st=2156.70]                                                        
Epoch #2: test_reward: -857.544738, best_reward: -857.544738 in #2
Epoch #3: 2401it [00:25, 95.51it/s, len=200.00, loss/actor=133.435159, loss/critic=1.195252, n/ep=1.00, n/st=200.00, rew=-254.75, v/ep=10.79, v/st=2158.83]                                                         
Epoch #3: test_reward: -283.552947, best_reward: -283.552947 in #3
Epoch #4:  29%|#########################3                                                             | 700/2400 [00:12<00:30, 55.36it/s, len=200.00, n/ep=1.00, n/st=200.00, rew=-248.40, v/ep=10.79, v/st=2157.22]
{'best_reward': -233.88520793467467,
 'duration': '70.94s',
 'test_episode': 1700.0,
 'test_speed': '14988.37 step/s',
 'test_step': 340000,
 'test_time': '22.68s',
 'train_episode': 80.0,
 'train_speed': '331.59 step/s',
 'train_step': 16000,
 'train_time/collector': '7.43s',
 'train_time/model': '40.82s'}
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Final reward: -123.6083534985761, length: 200.0
± % python3 test/continuous/test_td3.py --seed 0 --training-num 1 --collect-per-step 1
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Epoch #1: 2401it [00:14, 161.66it/s, len=200.00, loss/actor=43.141165, loss/critic1=0.086161, loss/critic2=0.065396, n/ep=1.00, n/st=200.00, rew=-1258.21, v/ep=10.59, v/st=2118.76]                                
Epoch #1: test_reward: -1430.842098, best_reward: -1430.842098 in #1
Epoch #2: 2401it [00:14, 161.55it/s, len=200.00, loss/actor=79.333228, loss/critic1=0.171390, loss/critic2=0.161232, n/ep=1.00, n/st=200.00, rew=-1512.23, v/ep=10.63, v/st=2126.16]                                
Epoch #2: test_reward: -1194.858885, best_reward: -1194.858885 in #2
Epoch #3: 2401it [00:14, 161.87it/s, len=200.00, loss/actor=107.440990, loss/critic1=0.355920, loss/critic2=0.340280, n/ep=1.00, n/st=200.00, rew=-944.28, v/ep=10.67, v/st=2133.35]                                
Epoch #3: test_reward: -848.226526, best_reward: -848.226526 in #3
Epoch #4:  92%|#############################################################################9       | 2200/2400 [00:15<00:01, 145.75it/s, len=200.00, n/ep=1.00, n/st=200.00, rew=-127.55, v/ep=10.67, v/st=2134.07]
{'best_reward': -219.068775899375,
 'duration': '63.66s',
 'test_episode': 400.0,
 'test_speed': '15019.68 step/s',
 'test_step': 80000,
 'test_time': '5.33s',
 'train_episode': 48.0,
 'train_speed': '164.57 step/s',
 'train_step': 9600,
 'train_time/collector': '4.51s',
 'train_time/model': '53.83s'}
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Final reward: -352.5223017492268, length: 200.0


junfanlin commented on July 16, 2024

Thank you for the extremely quick reply :)


lmzhangucsd commented on July 16, 2024

Hey @Trinkle23897, thanks so much for your reply! Really appreciate it. Yeah, I added that softmax in the model's Sequential.

I understand that putting a softmax in this continuous action space might not be a good option, but nevertheless, mathematically it should work, since it is a differentiable function.

...
self.model += [nn.Linear(256, np.prod(action_shape))]
...
self.model += [nn.Softmax(dim=1)]
self.model = nn.Sequential(*self.model)

I didn't add F.softmax in the forward function like you did, but added it directly to self.model.

What's the difference between constructing the network in self.model versus adding the computation directly in the forward function? Why doesn't mine work?
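
For what it's worth, the two placements build the same autograd graph, so the difference should not come from where the softmax lives. A quick standalone check (a sketch with a shared linear layer; not tianshou code):

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
linear = nn.Linear(8, 3)
seq = nn.Sequential(linear, nn.Softmax(dim=-1))  # softmax inside the Sequential

x = torch.randn(5, 8)
out_seq = seq(x)                         # softmax as a module
out_fwd = F.softmax(linear(x), dim=-1)   # softmax applied in forward()
print(torch.allclose(out_seq, out_fwd))  # True -- identical outputs and graphs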


lmzhangucsd commented on July 16, 2024

Actually, adding F.softmax to my forward still raises the same exception...

Here is my Actor network code:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    def __init__(self, layer_num, state_shape, action_shape,
                 device='cpu'):
        super().__init__()
        self.device = device
        self.model = [
            nn.Linear(np.prod(state_shape), 256),
            nn.ReLU()]
        for i in range(layer_num):
            self.model += [nn.Linear(256, 256), nn.ReLU()]
        self.model += [nn.Linear(256, np.prod(action_shape))]
        self.model = nn.Sequential(*self.model)
        # self._max = max_action

    def forward(self, s, **kwargs):
        s = torch.tensor(s, device=self.device, dtype=torch.float)
        batch = s.shape[0]
        s = s.view(batch, -1)
        score = self.model(s)
        score_norm = F.softmax(score, dim=-1)

        # logits = self._max * torch.tanh(logits)
        return score_norm, None


Trinkle23897 commented on July 16, 2024

Could you please provide the exception log?


lmzhangucsd commented on July 16, 2024

Sure.

Epoch #1:   0%|          | 0/2400 [00:00<?, ?it/s]Warning: Traceback of forward call that caused the error:
  File "***", line 118, in <module>
    test_ddpg()
  File "***", line 103, in test_ddpg
    args.batch_size, stop_fn=stop_fn, save_fn=save_fn, writer=writer)
  File "***python3.7/site-packages/tianshou/trainer/offpolicy.py", line 87, in offpolicy_trainer
    losses = policy.learn(train_collector.sample(batch_size))
  File "***python3.7/site-packages/tianshou/policy/modelfree/ddpg.py", line 147, in learn
    actor_loss = -self.critic(batch.obs, self(batch, eps=0).act).mean()
  File "***python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "***python3.7/site-packages/tianshou/policy/modelfree/ddpg.py", line 119, in forward
    logits, h = model(obs, state=state, info=batch.info)
  File "***python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "***", line 28, in forward
    score_norm = F.softmax(score, dim=-1)
  File "***python3.7/site-packages/torch/nn/functional.py", line 1231, in softmax
    ret = input.softmax(dim)
 (print_stack at /opt/conda/conda-bld/pytorch_1579022060824/work/torch/csrc/autograd/python_anomaly_mode.cpp:57)
Epoch #1:   0%|          | 0/2400 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "***", line 118, in <module>
    test_ddpg()
  File "***", line 103, in test_ddpg
    args.batch_size, stop_fn=stop_fn, save_fn=save_fn, writer=writer)
  File "***python3.7/site-packages/tianshou/trainer/offpolicy.py", line 87, in offpolicy_trainer
    losses = policy.learn(train_collector.sample(batch_size))
  File "***python3.7/site-packages/tianshou/policy/modelfree/ddpg.py", line 149, in learn
    actor_loss.backward()
  File "***python3.7/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "***python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128, 21]], which is output 0 of SoftmaxBackward, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
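
For context, a minimal standalone snippet (plain PyTorch, unrelated to tianshou's classes) that reproduces this class of error: SoftmaxBackward saves the softmax output for the backward pass, so any in-place modification of that output invalidates the saved tensor.

import torch
import torch.nn.functional as F

x = torch.randn(128, 21, requires_grad=True)
y = F.softmax(x, dim=-1)  # y itself is saved for SoftmaxBackward
y.clamp_(min=0.1)         # any in-place op bumps y's version counter
y.sum().backward()        # RuntimeError: ... modified by an inplace operation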


lmzhangucsd commented on July 16, 2024

I changed that line to:

actor_loss = -self.critic(batch.obs, self.actor(batch.obs)[0]).mean()

and it no longer raises the exception.


Trinkle23897 commented on July 16, 2024

You cannot use softmax when the action dimension is 1.
For example, if the logits' shape before softmax is torch.Size([128, 1]), then after the softmax operation it becomes [1, 1, 1, ..., 1] of length 128, because each row is normalized over a single element.
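
A quick standalone demonstration (plain PyTorch; the [128, 1] shape corresponds to a single action dimension, as in Pendulum):

import torch
import torch.nn.functional as F

logits = torch.randn(128, 1)     # one action dimension per sample
out = F.softmax(logits, dim=-1)  # each row is normalized over a single element
print(out.unique())              # tensor([1.]) -- every output collapses to 1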


Trinkle23897 commented on July 16, 2024

By the way, softmax is only used for the Categorical distribution in discrete-action tasks. It is not recommended to use softmax over a continuous action space.
For your reference: #29 (comment)
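
For completeness, a minimal sketch of where softmax does belong (standalone PyTorch, assuming a hypothetical 4-action discrete task):

import torch
from torch.distributions import Categorical

logits = torch.randn(128, 4)       # hypothetical 4 discrete actions
dist = Categorical(logits=logits)  # applies softmax internally
actions = dist.sample()            # shape [128], integer action indices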


lmzhangucsd commented on July 16, 2024

This is what my forward returns:

def forward(self, s, **kwargs):
    ...
    return logits, None

So self.actor(batch.obs)[0] will return the logits; the shape will be [128, whatever my output dim is].


junfanlin commented on July 16, 2024

When I used DDPG to train LunarLanderContinuous-v2, the reward doesn't seem to increase at all. Is there any modification needed for this environment?

BTW, the SAC algorithm might need a .sum(-1, keepdim=True) when calculating log_prob, and it doesn't seem to converge on LunarLanderContinuous-v2 either.
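
Concretely, what I mean is roughly the following (a standalone sketch with a hypothetical 4-dimensional action and a diagonal Gaussian policy; tianshou's actual SAC code may differ):

import torch
from torch.distributions import Normal

dist = Normal(torch.zeros(128, 4), torch.ones(128, 4))
a = dist.rsample()
log_prob = dist.log_prob(a)                # per-dimension, shape [128, 4]
log_prob = log_prob.sum(-1, keepdim=True)  # joint log-prob, shape [128, 1]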

Thanks in advance.


Trinkle23897 commented on July 16, 2024

@junfanlin I'll try it later. Thanks for pointing this out! You can try observation & reward normalization first.


dylanqyuan commented on July 16, 2024
  • I have marked all applicable categories:
    • exception-raising bug
    • RL algorithm bug
    • documentation request (i.e. "X is missing from the documentation.")
    • new feature request
  • I have visited the source website, and in particular read the known issues
  • I have searched through the issue tracker for duplicates
  • I have mentioned version numbers, operating system and environment, where applicable:

When I run /tianshou/test/discrete/test_dqn.py for testing, the following issue happens:

Traceback (most recent call last):
  File "E:/PycharmProjects/tianshou/tianshou/test/discrete/test_dqn.py", line 114, in <module>
    test_dqn(get_args())
  File "E:/PycharmProjects/tianshou/tianshou/test/discrete/test_dqn.py", line 76, in test_dqn
    train_collector.collect(n_step=args.batch_size)
  File "D:\Users\Administrator\anaconda3\envs\tianshou\lib\site-packages\tianshou\data\collector.py", line 296, in collect
    self.step_speed.add(cur_step / duration)
ZeroDivisionError: float division by zero

Process finished with exit code 1

In the first two episodes it works just fine; in the third episode the above issue happens. Could you please tell me why?


Trinkle23897 commented on July 16, 2024

@dylan0828 I fixed this issue in #26; you can re-install the newest version of tianshou from GitHub: pip3 install git+https://github.com/thu-ml/tianshou.git@master
Further questions are welcome.
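
For context, the crash happens when the measured collection duration rounds to exactly zero (e.g. on a coarse clock). A generic guard looks like the sketch below; this is only an illustration, not necessarily the actual patch in #26:

import time

def step_speed(cur_step: int, start_time: float) -> float:
    # Clamp the elapsed time away from zero so a very fast collect
    # cannot trigger ZeroDivisionError.
    duration = max(time.time() - start_time, 1e-9)
    return cur_step / duration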

