Comments (5)
- What episode reward does your agent finally reach?
- What version of gym are you using (are you using the package versions from this repo's conda env)?
- Are you using a normally distributed policy or the Beta distribution policy?
Honestly it's been so long since I worked with this code that I can't say too much.
Here's a link to some loss plots from a report I made back then. Maybe you can compare against those to diagnose what isn't working for you:
https://wandb.ai/timo_kk/a0c/reports/A0C-loss-vs-A0C-Q-loss--VmlldzoyNTA3ODQ?accessToken=wzwqwdv9sku8l90i3gyufgrwb7go5uzxt3pbxommmovakhs9w52tpdexnm3r87ow
Here's a set of parameters from a run that worked:
Agent epsilon greedy: 0
Batch size: 32
Clamp log param: true
Clamp loss: Loss scaling
Date: 2020-12-22 08:16:39
Discount factor: 1
Distribution: Squashed Normal
Environment: Pendulum-v0
Environment seed: 34
Episode length: 200
Final selection policy: max_visit
LayerNorm: false
Learning rate: 0.001
Log counts scaling factor [tau]: 0.1
Log prob scale: Corrected entropy
Loss lr: 0.001
Loss reduction: mean
Loss type: A0C loss tuned
MCTS epsilon greedy: 0
MCTS rollouts: 25
Network hidden layers: [128, 128, 128]
Network hidden units: 3
Network nonlinearity: elu
Num mixture components: 2
Optimizer: Adam
Policy coefficient: 0.1
Progressive widening exponent [kappa]: 0.5
Progressive widening factor [c_pw]: 1
Replay buffer size: 3000
Target entropy: -1
Training episodes: 45
Training epochs: 1
UCT constant: 0.05
V target policy: off_policy
Value coefficient: 1
Weight decay: 0.0001
_wandb: {cli_version: 0.10.12, framework: torch, is_jupyter_run: false, is_kaggle_kernel: false, python_version: 3.8.5}
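In case it helps, here's a minimal sketch of what a "Squashed Normal" policy with a clamped log-std parameter and the tanh log-prob correction generally looks like. This is a generic SAC-style sketch rather than the exact module in this repo, and the class name, clamp bounds, and layer sizes are just illustrative:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

# Illustrative clamp bounds for the log-std parameter ("Clamp log param")
LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0


class SquashedNormalPolicy(nn.Module):
    """Tanh-squashed Gaussian policy head (generic sketch, not this repo's module)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
        )
        self.mu_head = nn.Linear(hidden, act_dim)
        self.log_std_head = nn.Linear(hidden, act_dim)

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        mu = self.mu_head(h)
        log_std = self.log_std_head(h).clamp(LOG_STD_MIN, LOG_STD_MAX)
        dist = Normal(mu, log_std.exp())
        u = dist.rsample()                 # pre-squash sample
        a = torch.tanh(u)                  # squash into (-1, 1)
        # Change-of-variables correction for the tanh squashing
        log_prob = dist.log_prob(u) - torch.log(1.0 - a.pow(2) + 1e-6)
        return a, log_prob.sum(-1)


# Pendulum-like dimensions: 3-dim observation, 1-dim action
policy = SquashedNormalPolicy(obs_dim=3, act_dim=1)
action, log_prob = policy(torch.randn(32, 3))
```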
from alphazero-gym.
@cz11233, just one thing to mention: in the context of reinforcement learning, the policy loss is not a good measure of progress, because every time the policy improves, a better TARGET policy is produced as well. That means there is always a gap between the policy (model) being trained and the target policy, especially for AlphaZero/MCTS-style algorithms.
If, however, the episode return is still not improving after long training, then there may indeed be a real problem.
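Here's a tiny toy illustration of what I mean. In AlphaZero-style training the policy target comes from search (e.g. normalized visit counts) and the policy loss is some divergence to that target; this is a generic cross-entropy sketch, not this repo's exact A0C loss:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy setup: one state, 5 discrete actions.
# "target" plays the role of a search-derived distribution (e.g. normalized visit counts).
target = F.softmax(torch.randn(5), dim=-1)

# Even a policy that matches the target exactly does not reach zero loss:
policy = target.clone()
cross_entropy = -(target * policy.log()).sum()       # H(target) + KL(target || policy)
target_entropy = -(target * target.log()).sum()
print(cross_entropy.item(), target_entropy.item())   # identical: the loss floor is H(target)
```

And because the target is recomputed by search every iteration as the networks improve, that floor keeps shifting, so a flat or noisy policy-loss curve tells you little about the episode return.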
As for me, I once had a situation where an RL algorithm (not this one) never improved even after I had checked every detail of the implementation. In the end it started improving simply because I increased a fully-connected layer's output size from 64 to 256.
from alphazero-gym.
@timoklein Well, I know what the problem is: I shouldn't have used the Beta distribution. So do you know why it doesn't work?
Besides, in your paper "Combining Reinforcement Learning and Search for Cooperative Trajectory Planning", I noticed that the loss plots are given in Figure 30, where the policy loss decreases consistently, neither rising nor flattening. Is this because the network in the paper is different from the one in this repo? Or is it something else, e.g. multi_agent?
from alphazero-gym.
@dbsxdbsx Thanks for your answer, it clears up my confusion about the policy loss.
from alphazero-gym.
I shouldn't have used the Beta distribution. So do you know why it doesn't work?
It might just be that I implemented it wrong. I wrote some tests for my thesis and the outputs etc. looked correct, but it never learned anything. Since the other policies worked, I pursued those further.
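For what it's worth, a Beta policy head for bounded actions would look roughly like the sketch below. This is a generic illustration (the names, sizes, and bounds are made up, and it's not the code from this repo); the usual pitfalls are forgetting the +1 shift on the concentrations, which keeps the density unimodal, and the change-of-variables term when rescaling the (0, 1) sample to the environment's action range:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Beta


class BetaPolicy(nn.Module):
    """Beta-distribution policy head for bounded actions (illustrative sketch)."""

    def __init__(self, obs_dim: int, act_dim: int, act_low: float, act_high: float, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU())
        self.alpha_head = nn.Linear(hidden, act_dim)
        self.beta_head = nn.Linear(hidden, act_dim)
        self.act_low, self.act_high = act_low, act_high

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        # softplus(...) + 1 keeps both concentrations > 1, so the density stays unimodal
        alpha = F.softplus(self.alpha_head(h)) + 1.0
        beta = F.softplus(self.beta_head(h)) + 1.0
        dist = Beta(alpha, beta)
        x = dist.rsample()                              # sample in (0, 1)
        scale = self.act_high - self.act_low
        action = self.act_low + scale * x               # rescale to the action bounds
        # change of variables: subtract log|d(action)/dx| = log(scale)
        log_prob = (dist.log_prob(x) - torch.log(torch.tensor(scale))).sum(-1)
        return action, log_prob


# Pendulum-v0: 3-dim observation, 1-dim action in [-2, 2]
policy = BetaPolicy(obs_dim=3, act_dim=1, act_low=-2.0, act_high=2.0)
action, log_prob = policy(torch.randn(8, 3))
```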
Is this because the network in the paper is different from the one in this repo?
I would think so. The architecture is totally different (a much larger model). If I remember correctly, the policy loss would also plateau if you ran it further, for the reason outlined by @dbsxdbsx.
from alphazero-gym.
Related Issues (7)
- can you share a requirements.txt of this project? HOT 3
- Should entropy bonus be also calculated during planning? HOT 6
- Config Error HOT 2
- Use custom environment to run this code HOT 2
- How to replay game with mcts and trained agent? HOT 1
- Does game Pendulum really need take up to 1 hour to converge? HOT 2