deeprl-chinese's Issues
15_mac_a2c.py: the target network has no effect
Chapter 15 is meant to compute the TD target with a target network, but the code below argues that the reward already fully captures the predicted value, which makes defining the target network pointless:
# Note: in simple_spread_v2 the reward is computed from the distance between the current state and the target position, so it is more appropriate to use the reward directly as the TD target.
# with torch.no_grad():
# td_value = self.target_value(bns).squeeze()
# td_value = br + self.gamma * td_value * (1 - bd)
A separate question: on page 127, Chapter 8 updates the target network gradually using a hyperparameter r. If we instead set a sync frequency, i.e. copy the target network once every n steps rather than blending it proportionally every step as in the book, what difference does this make to training performance?
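For reference, here is a minimal sketch of the two schemes (the names soft_update/hard_update and the value of tau are illustrative, not from the repo): soft (Polyak) blending applied every step, versus a hard copy every n steps. Both are common; the hard copy just moves the TD targets in larger, less frequent jumps.
import copy
import torch
import torch.nn as nn

q_net = nn.Linear(4, 2)              # stand-in for the online Q-network
target_net = copy.deepcopy(q_net)    # target network starts as a copy

def soft_update(target: nn.Module, source: nn.Module, tau: float = 0.01):
    # Polyak averaging: target <- (1 - tau) * target + tau * source, called every step.
    with torch.no_grad():
        for t_param, s_param in zip(target.parameters(), source.parameters()):
            t_param.mul_(1 - tau).add_(tau * s_param)

def hard_update(target: nn.Module, source: nn.Module):
    # Periodic sync: copy all weights at once, e.g. once every n training steps.
    target.load_state_dict(source.state_dict())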
In DQN's compute_loss function, when computing the loss, why is the TD target not treated as a constant when taking the gradient, as the book describes?
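For reference, the usual way to treat the TD target as a constant is to compute it under torch.no_grad() (or to call .detach()), so gradients flow only through the online network's Q(s, a). A minimal sketch with placeholder names (model, target_model, bs, ba, br, bd, bns), not the repo's exact code:
import torch
import torch.nn.functional as F

def compute_loss(model, target_model, bs, ba, br, bd, bns, gamma=0.99):
    # Q(s, a) from the online network: the only term gradients flow through.
    qvals = model(bs).gather(1, ba.unsqueeze(1)).squeeze(1)
    # The TD target is computed without tracking gradients, so it acts as a constant.
    with torch.no_grad():
        next_qvals = target_model(bns).max(dim=1).values
        td_target = br + gamma * next_qvals * (1 - bd)
    return F.mse_loss(qvals, td_target)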
Redundant Categorical operation in 15_mac_a2c.py
class MAC(nn.Module):
    def policy(self, observation, agent):
        # See https://pytorch.org/docs/stable/distributions.html#score-function
        log_prob_action = self.agent2policy[agent].policy(observation)
        m = Categorical(logits=log_prob_action)  # should be passed as probs
        action = m.sample()
        log_prob_a = m.log_prob(action)
        return action.item(), log_prob_a
The policy function defined above returns normalized probabilities and normalized log-probabilities, so when constructing the Categorical object the argument should be passed as probs rather than logits:
m = Categorical(probs=log_prob_action)
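For reference, PyTorch's torch.distributions.Categorical accepts either probs= (probabilities) or logits= (unnormalized log-probabilities); the keyword is probs, not prob. A minimal sketch of both forms, using a made-up output tensor in place of whatever the policy network actually returns:
import torch
from torch.distributions import Categorical

out = torch.log_softmax(torch.randn(4), dim=-1)  # e.g. normalized log-probabilities

# If the policy returns (log-)probabilities, exponentiate and pass them as probs.
m_probs = Categorical(probs=out.exp())

# Passing them via logits also samples correctly: Categorical re-normalizes with
# log-softmax, which leaves already-normalized log-probabilities unchanged.
m_logits = Categorical(logits=out)

action = m_probs.sample()
log_prob_a = m_probs.log_prob(action)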
The training loss oscillates wildly and shows no downward trend
07_reinforce and 08_reinforce_with_baseline seem to be duplicates?
Shouldn't 07_reinforce.py be the REINFORCE algorithm that uses a Monte Carlo estimate of the return?
04_dqn.py has a bug
When I run the file with the command: python .\04_dqn.py --do_train
Traceback (most recent call last):
File "E:\code\github\wangshusen\DeepRL-Chinese\04_dqn.py", line 242, in <module>
main()
File "E:\code\github\wangshusen\DeepRL-Chinese\04_dqn.py", line 235, in main
train(args, env, agent)
File "E:\code\github\wangshusen\DeepRL-Chinese\04_dqn.py", line 129, in train
action = agent.get_action(torch.from_numpy(state))
File "E:\code\github\wangshusen\DeepRL-Chinese\04_dqn.py", line 43, in get_action
qvals = self.Q(state)
File "E:\develop\anaconda3\envs\ray\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "E:\code\github\wangshusen\DeepRL-Chinese\04_dqn.py", line 29, in forward
x = F.relu(self.fc1(state))
File "E:\develop\anaconda3\envs\ray\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "E:\develop\anaconda3\envs\ray\lib\site-packages\torch\nn\modules\linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
The file should be changed like this:
Line 129 in 55a9f5e
action = agent.get_action(torch.from_numpy(state).to(args.device))
Lines 158 to 162 in 55a9f5e
bs = torch.tensor(bs, dtype=torch.float32, device=args.device)
ba = torch.tensor(ba, dtype=torch.long, device=args.device)
br = torch.tensor(br, dtype=torch.float32, device=args.device)
bd = torch.tensor(bd, dtype=torch.float32, device=args.device)
bns = torch.tensor(bns, dtype=torch.float32, device=args.device)
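In other words, the pattern is to keep the network and every tensor fed to it on the same device. A minimal self-contained sketch (the Linear layer and shapes are placeholders, not the repo's QNet):
import numpy as np
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# The network is moved to the chosen device once.
q_net = nn.Linear(4, 2).to(device)

# Every tensor fed to it is created on, or moved to, the same device.
state = np.zeros(4, dtype=np.float32)
q_values = q_net(torch.from_numpy(state).to(device))

# Batch tensors built from the replay buffer follow the same rule.
bs = torch.tensor(np.zeros((32, 4), dtype=np.float32), device=device)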
The DDQN implementation is incorrect
Below is the DDQN implementation from the repo. When the target network computes the Q value of the next state, the action it uses is not obtained from self.model; it simply takes the action with the maximum value under the target network at the next state. This is basically DQN with a target network, not true DDQN.
class DoubleDQN:
    def __init__(self, dim_obs=None, num_act=None, discount=0.9):
        self.discount = discount
        self.model = QNet(dim_obs, num_act)
        self.target_model = QNet(dim_obs, num_act)
        self.target_model.load_state_dict(self.model.state_dict())

    def get_action(self, obs):
        qvals = self.model(obs)
        return qvals.argmax()

    def compute_loss(self, s_batch, a_batch, r_batch, d_batch, next_s_batch):
        # Compute current Q value based on current states and actions.
        qvals = self.model(s_batch).gather(1, a_batch.unsqueeze(1)).squeeze()
        # The next-state value is excluded from gradient computation to avoid divergence.
        next_qvals, _ = self.target_model(next_s_batch).detach().max(dim=1)
        loss = F.mse_loss(r_batch + self.discount * next_qvals * (1 - d_batch), qvals)
        return loss
A true DDQN should instead be written as:
def ddqn_compute_loss(self, s_batch, a_batch, r_batch, d_batch, next_s_batch):
    # Compute current Q value based on current states and actions.
    qvals = self.model(s_batch).gather(1, a_batch.unsqueeze(1)).squeeze()
    # Select the next action with the online network, evaluate it with the target network.
    next_s_action = self.model(next_s_batch).argmax(dim=1)
    next_qvals = self.target_model(next_s_batch).gather(1, next_s_action.unsqueeze(1)).squeeze(1).detach()
    loss = F.mse_loss(r_batch + self.discount * next_qvals * (1 - d_batch), qvals)
    return loss
Testing with the original code, the former's average eval reward is -142.57142857142858 and the latter's is -138.25,
so DDQN does indeed perform better. However, DDQN seems to need more training time: if I set max step to 200k,
DDQN actually ends up worse, which is very strange.
Strange thing 2: I re-ran max step = 100k, and this time the original compute_loss collapsed completely (avg reward = -200), while DDQN reached an avg reward of -136. Second conclusion: DQN training is far too unstable.
I also swapped ER for PER and the results are still very unstable: sometimes PER is better, sometimes ER is better. RL really is unstable; even my DDQN here is not necessarily better than the earlier version.
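One way to make such comparisons less noisy (a sketch, not part of the repo) is to average evaluation returns over several random seeds rather than relying on a single run; make_agent and train_and_eval below are hypothetical stand-ins for the repo's training and evaluation code:
import numpy as np

def evaluate_over_seeds(make_agent, train_and_eval, seeds=(0, 1, 2, 3, 4)):
    # Train and evaluate once per seed, then report the mean and std of the returns.
    returns = []
    for seed in seeds:
        np.random.seed(seed)
        agent = make_agent()
        returns.append(train_and_eval(agent, seed))
    return float(np.mean(returns)), float(np.std(returns))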
09_trpo throws an error when run; how should I fix it?
python -u 09_trpo.py --do_train --output_dir output/trpo 2>&1 | tee o ...
Traceback (most recent call last):
  File "09_trpo.py", line 497, in <module>
    main()
  File "09_trpo.py", line 490, in main
    train(args, env, agent)
  File "09_trpo.py", line 400, in train
    action = agent.get_action(torch.from_numpy(state)).item()
TypeError: expected np.ndarray (got tuple)
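This error usually means env.reset() returned an (observation, info) tuple, as it does in gym >= 0.26 and gymnasium, while the script expects a bare ndarray; that diagnosis is an assumption here, since it depends on the installed gym version. A minimal sketch that handles both APIs (CartPole-v1 is only a placeholder environment):
import numpy as np
import gym  # or: import gymnasium as gym

env = gym.make("CartPole-v1")
reset_out = env.reset()
# Newer gym/gymnasium return (obs, info); older gym returns obs directly.
state = reset_out[0] if isinstance(reset_out, tuple) else reset_out
state = np.asarray(state, dtype=np.float32)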
Is there no PPO code?
Hello, I don't see any PPO code in the repo. Will it be added later?
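Until PPO code is added, here is a minimal sketch of the standard PPO clipped surrogate loss (a generic formulation with placeholder names, not the author's implementation):
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    # L = -E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)], where r is the probability ratio.
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()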