Comments (18)
Hi, sorry for the late response. Could you please provide your code snippet for the network? (When I add one line of F.softmax in the Actor net, it works well without an exception:)
def forward(self, s, **kwargs):
    if not isinstance(s, torch.Tensor):
        s = torch.tensor(s, device=self.device, dtype=torch.float)
    batch = s.shape[0]
    s = s.view(batch, -1)
    logits = self.model(s)
    logits = F.softmax(logits, dim=-1)  # <--- here
    logits = self._max * torch.tanh(logits)
    return logits, None
I'll check all the implemented algorithms again this week, especially PPO.
Hi, @liamz39
There may be some misunderstanding here.
- DDPG is meant to solve continuous control tasks, so there is no reason to use a softmax layer on the output. You can check here: https://github.com/thu-ml/tianshou/blob/master/discrete/net.py. For discrete control tasks, they do add a softmax layer, and it works fine for DQN.
- actor_loss = -self.critic(batch.obs, self(batch, eps=0).act).mean()
  When the backward() method is called, PyTorch calculates the corresponding partial derivatives for us automatically.
F.Y.I. self(batch, eps=0) will call the forward() method. I don't like this way either; I prefer to use self.actor(batch) instead :)
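A minimal sketch of that actor update step, assuming plain PyTorch modules and an optimizer (hypothetical stand-ins, not tianshou's internals):

import torch
import torch.nn as nn

# hypothetical stand-ins for the actor and critic networks
actor = nn.Sequential(nn.Linear(3, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(3 + 1, 1))
actor_optim = torch.optim.Adam(actor.parameters(), lr=1e-3)

obs = torch.randn(128, 3)                                           # a batch of observations
actor_loss = -critic(torch.cat([obs, actor(obs)], dim=-1)).mean()   # maximize Q(s, pi(s))
actor_optim.zero_grad()
actor_loss.backward()   # autograd computes the partial derivatives through the critic
actor_optim.step()      # only the actor's parameters are updated here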
I think there might be some potential bugs in the implementation. When I set the number of training environments to 1, the provided test_ddpg and test_td3 both fail to solve Pendulum... But SAC works well.
@junfanlin I played with the test script; in fact, it can work with only one env. Change reward_normalization=False first, because in a single env the reward bias is larger than in the multi-env case. Then:
python3 test/continuous/test_ddpg.py --seed 0 --training-num 1 --collect-per-step 1
python3 test/continuous/test_td3.py --seed 0 --training-num 1 --collect-per-step 1
It takes about 4 epochs on my machine:
± % python3 test/continuous/test_ddpg.py --seed 0 --training-num 1 --collect-per-step 2 !10150
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Epoch #1: 2401it [00:14, 165.06it/s, len=200.00, loss/actor=74.273026, loss/critic=0.385957, n/ep=1.00, n/st=200.00, rew=-1514.12, v/ep=10.75, v/st=2150.88]
Epoch #1: test_reward: -1219.784197, best_reward: -1219.784197 in #1
Epoch #2: 2401it [00:14, 164.65it/s, len=200.00, loss/actor=119.941541, loss/critic=0.696872, n/ep=1.00, n/st=200.00, rew=-946.87, v/ep=10.78, v/st=2156.70]
Epoch #2: test_reward: -857.544738, best_reward: -857.544738 in #2
Epoch #3: 2401it [00:25, 95.51it/s, len=200.00, loss/actor=133.435159, loss/critic=1.195252, n/ep=1.00, n/st=200.00, rew=-254.75, v/ep=10.79, v/st=2158.83]
Epoch #3: test_reward: -283.552947, best_reward: -283.552947 in #3
Epoch #4: 29%|#########################3 | 700/2400 [00:12<00:30, 55.36it/s, len=200.00, n/ep=1.00, n/st=200.00, rew=-248.40, v/ep=10.79, v/st=2157.22]
{'best_reward': -233.88520793467467,
'duration': '70.94s',
'test_episode': 1700.0,
'test_speed': '14988.37 step/s',
'test_step': 340000,
'test_time': '22.68s',
'train_episode': 80.0,
'train_speed': '331.59 step/s',
'train_step': 16000,
'train_time/collector': '7.43s',
'train_time/model': '40.82s'}
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Final reward: -123.6083534985761, length: 200.0
± % python3 test/continuous/test_td3.py --seed 0 --training-num 1 --collect-per-step 1 !10151
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Epoch #1: 2401it [00:14, 161.66it/s, len=200.00, loss/actor=43.141165, loss/critic1=0.086161, loss/critic2=0.065396, n/ep=1.00, n/st=200.00, rew=-1258.21, v/ep=10.59, v/st=2118.76]
Epoch #1: test_reward: -1430.842098, best_reward: -1430.842098 in #1
Epoch #2: 2401it [00:14, 161.55it/s, len=200.00, loss/actor=79.333228, loss/critic1=0.171390, loss/critic2=0.161232, n/ep=1.00, n/st=200.00, rew=-1512.23, v/ep=10.63, v/st=2126.16]
Epoch #2: test_reward: -1194.858885, best_reward: -1194.858885 in #2
Epoch #3: 2401it [00:14, 161.87it/s, len=200.00, loss/actor=107.440990, loss/critic1=0.355920, loss/critic2=0.340280, n/ep=1.00, n/st=200.00, rew=-944.28, v/ep=10.67, v/st=2133.35]
Epoch #3: test_reward: -848.226526, best_reward: -848.226526 in #3
Epoch #4: 92%|#############################################################################9 | 2200/2400 [00:15<00:01, 145.75it/s, len=200.00, n/ep=1.00, n/st=200.00, rew=-127.55, v/ep=10.67, v/st=2134.07]
{'best_reward': -219.068775899375,
'duration': '63.66s',
'test_episode': 400.0,
'test_speed': '15019.68 step/s',
'test_step': 80000,
'test_time': '5.33s',
'train_episode': 48.0,
'train_speed': '164.57 step/s',
'train_step': 9600,
'train_time/collector': '4.51s',
'train_time/model': '53.83s'}
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Final reward: -352.5223017492268, length: 200.0
Thank you for the extremely quick reply :)
Hey @Trinkle23897, thanks so much for your reply! Really appreciate it. Yeah, I added that softmax in the model's Sequential.
I understand that putting a softmax in this continuous action space might not be a good option, but nevertheless, mathematically it should work since it is a differentiable function.
...
self.model += [nn.Linear(256, np.prod(action_shape))]
...
self.model += [nn.Softmax(dim=1)]
self.model = nn.Sequential(*self.model)
I didn't add F.softmax in the forward function like you did, but directly added it to self.model. What's the difference between constructing the network in self.model versus directly adding the computation in the forward function? Why doesn't mine work?
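As far as I can tell the two forms should be equivalent; here is a minimal sketch (with a hypothetical small net) that produces the same output either way:

import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Linear(4, 3)
x = torch.randn(8, 4)
out_module = nn.Sequential(net, nn.Softmax(dim=-1))(x)   # softmax as a layer in Sequential
out_functional = F.softmax(net(x), dim=-1)                # softmax applied in forward()
print(torch.allclose(out_module, out_functional))         # True: same computation either way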
Actually, adding F.softmax to my forward still raises the same exception... Here is my Actor network code:
class Actor(nn.Module):
    def __init__(self, layer_num, state_shape, action_shape,
                 device='cpu'):
        super().__init__()
        self.device = device
        self.model = [
            nn.Linear(np.prod(state_shape), 256),
            nn.ReLU()]
        for i in range(layer_num):
            self.model += [nn.Linear(256, 256), nn.ReLU()]
        self.model += [nn.Linear(256, np.prod(action_shape))]
        self.model = nn.Sequential(*self.model)
        # self._max = max_action

    def forward(self, s, **kwargs):
        s = torch.tensor(s, device=self.device, dtype=torch.float)
        batch = s.shape[0]
        s = s.view(batch, -1)
        score = self.model(s)
        score_norm = F.softmax(score, dim=-1)
        # logits = self._max * torch.tanh(logits)
        return score_norm, None
Could you please provide the exception log?
sure.
Epoch #1: 0%| | 0/2400 [00:00<?, ?it/s]Warning: Traceback of forward call that caused the error:
File "***", line 118, in <module>
test_ddpg()
File "***", line 103, in test_ddpg
args.batch_size, stop_fn=stop_fn, save_fn=save_fn, writer=writer)
File "***python3.7/site-packages/tianshou/trainer/offpolicy.py", line 87, in offpolicy_trainer
losses = policy.learn(train_collector.sample(batch_size))
File "***python3.7/site-packages/tianshou/policy/modelfree/ddpg.py", line 147, in learn
actor_loss = -self.critic(batch.obs, self(batch, eps=0).act).mean()
File "***python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "***python3.7/site-packages/tianshou/policy/modelfree/ddpg.py", line 119, in forward
logits, h = model(obs, state=state, info=batch.info)
File "***python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "***", line 28, in forward
score_norm = F.softmax(score, dim=-1)
File "***python3.7/site-packages/torch/nn/functional.py", line 1231, in softmax
ret = input.softmax(dim)
(print_stack at /opt/conda/conda-bld/pytorch_1579022060824/work/torch/csrc/autograd/python_anomaly_mode.cpp:57)
Epoch #1: 0%| | 0/2400 [00:02<?, ?it/s]
Traceback (most recent call last):
File "***", line 118, in <module>
test_ddpg()
File "***", line 103, in test_ddpg
args.batch_size, stop_fn=stop_fn, save_fn=save_fn, writer=writer)
File "***python3.7/site-packages/tianshou/trainer/offpolicy.py", line 87, in offpolicy_trainer
losses = policy.learn(train_collector.sample(batch_size))
File "***python3.7/site-packages/tianshou/policy/modelfree/ddpg.py", line 149, in learn
actor_loss.backward()
File "***python3.7/site-packages/torch/tensor.py", line 195, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "***python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128, 21]], which is output 0 of SoftmaxBackward, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
I changed that line to:
actor_loss = -self.critic(batch.obs, self.actor(batch.obs)[0]).mean()
and it no longer raises the exception.
You cannot use softmax when the action space dimension is 1.
For example, if the logits' shape before softmax is torch.Size([128, 1]), then after the softmax operation the output is going to be [1, 1, 1, ..., 1] of length 128.
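A minimal sketch of this degenerate case (assuming an action dimension of 1):

import torch
import torch.nn.functional as F

logits = torch.randn(128, 1)        # batch of 128, action dimension 1
probs = F.softmax(logits, dim=-1)   # softmax over a single element per row
print(probs.unique())               # tensor([1.]) -- every output is exactly 1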
By the way, softmax is only used for the Categorical distribution in discrete action tasks. It is not recommended to use softmax over a continuous action space.
For your reference: #29 (comment)
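For discrete tasks, a minimal sketch of the intended use (hypothetical shapes):

import torch
import torch.nn.functional as F
from torch.distributions import Categorical

logits = torch.randn(128, 4)        # 128 states, 4 discrete actions
probs = F.softmax(logits, dim=-1)   # per-state probabilities over the 4 actions
dist = Categorical(probs=probs)
actions = dist.sample()             # shape [128], one sampled action index per state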
This is what my forward returns:
def forward(self, s, **kwargs):
    ...
    return logits, None
so self.actor(batch.obs)[0] will return the logits; its shape will be [128, whatever my output dim size is].
When I used DDPG to train LunarLanderContinuous-v2, the reward did not seem to increase at all. Is any modification needed for this environment?
BTW, the SAC algorithm might need a .sum(-1, keepdims=True) when calculating log_prob, and it does not seem to converge on LunarLanderContinuous-v2 either.
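A minimal sketch of what I mean, assuming a hypothetical 2-D Gaussian policy:

import torch
from torch.distributions import Normal

mu, sigma = torch.zeros(128, 2), torch.ones(128, 2)  # 2-D continuous action, batch of 128
dist = Normal(mu, sigma)
act = dist.sample()
log_prob = dist.log_prob(act)                # shape [128, 2]: one log-prob per action dim
log_prob = log_prob.sum(-1, keepdim=True)    # joint log-prob of the full action, shape [128, 1]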
Thanks in advance.
@junfanlin I'll try it later. Thanks for pointing this out! You can try observation & reward normalization first.
- I have marked all applicable categories:
  - exception-raising bug
  - RL algorithm bug
  - documentation request (i.e. "X is missing from the documentation.")
  - new feature request
- I have visited the source website, and in particular read the known issues
- I have searched through the issue tracker for duplicates
- I have mentioned version numbers, operating system and environment, where applicable:
When I run /tianshou/test/discrete/test_dqn.py for testing, the following issue happens:
Traceback (most recent call last):
File "E:/PycharmProjects/tianshou/tianshou/test/discrete/test_dqn.py", line 114, in <module>
test_dqn(get_args())
File "E:/PycharmProjects/tianshou/tianshou/test/discrete/test_dqn.py", line 76, in test_dqn
train_collector.collect(n_step=args.batch_size)
File "D:\Users\Administrator\anaconda3\envs\tianshou\lib\site-packages\tianshou\data\collector.py", line 296, in collect
self.step_speed.add(cur_step / duration)
ZeroDivisionError: float division by zero
Process finished with exit code 1
In the first two episodes it works just fine; in the third episode, the above issue happens. Could you please tell me why?
@dylan0828 I fixed this issue in #26; you can re-install the newest version of tianshou from GitHub: pip3 install git+https://github.com/thu-ml/tianshou.git@master
Further questions are welcome.
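Roughly, the fix guards the speed computation against a zero elapsed time on coarse timers; a minimal sketch of the idea (see #26 for the actual change):

import time

start_time = time.time()
# ... collect a (possibly very fast) batch of steps ...
cur_step = 64
duration = max(time.time() - start_time, 1e-9)  # avoid ZeroDivisionError when the timer reports 0
step_speed = cur_step / duration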