
Comments (5)

timoklein commented on August 11, 2024
  • What episode reward does your agent finally reach?
  • What version of gym are you using (are you using the package versions from this repo's conda env)?
  • Are you using a normally distributed policy or the Beta distribution policy?

Honestly it's been so long since I worked with this code that I can't say too much.

Here's a link to some loss plots from a report I made back then. Maybe you can compare against those to diagnose what isn't working for you:
https://wandb.ai/timo_kk/a0c/reports/A0C-loss-vs-A0C-Q-loss--VmlldzoyNTA3ODQ?accessToken=wzwqwdv9sku8l90i3gyufgrwb7go5uzxt3pbxommmovakhs9w52tpdexnm3r87ow

Here's a set of parameters from a run that worked:

Agent epsilon greedy: 0
Batch size: 32
Clamp log param: true
Clamp loss: Loss scaling
Date: 2020-12-22 08:16:39
Discount factor: 1
Distribution: Squashed Normal
Environment: Pendulum-v0
Environment seed: 34
Episode length: 200
Final selection policy: max_visit
LayerNorm: false
Learning rate: 0.001
Log counts scaling factor [tau]: 0.1
Log prob scale: Corrected entropy
Loss lr: 0.001
Loss reduction: mean
Loss type: A0C loss tuned
MCTS epsilon greedy: 0
MCTS rollouts: 25
Network hidden layers: [128, 128, 128]
Network hidden units: 3
Network nonlinearity: elu
Num mixture components: 2
Optimizer: Adam
Policy coefficient: 0.1
Progressive widening exponent [kappa]: 0.5
Progressive widening factor [c_pw]: 1
Replay buffer size: 3000
Target entropy: -1
Training episodes: 45
Training epochs: 1
UCT constant: 0.05
V target policy: off_policy
Value coefficient: 1
Weight decay: 0.0001
_wandb: {cli_version: 0.10.12, framework: torch, is_jupyter_run: false, is_kaggle_kernel: false, python_version: 3.8.5}
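
For anyone trying to reproduce this, here is a rough, untested sketch (not the repo's actual code) of how the progressive-widening and UCT parameters above are typically used during node selection and expansion in continuous-action MCTS. The names `Node` and `select_or_expand` are made up for illustration:

```python
import math

# Hypothetical node container; field names are illustrative, not from the repo.
class Node:
    def __init__(self):
        self.children = []      # expanded child actions
        self.visits = 0         # N(s)
        self.child_visits = []  # n(s, a) per child
        self.child_values = []  # mean value estimate Q(s, a) per child

C_PW = 1.0      # "Progressive widening factor [c_pw]" from the config above
KAPPA = 0.5     # "Progressive widening exponent [kappa]"
C_UCT = 0.05    # "UCT constant"

def select_or_expand(node, sample_action):
    """Expand a new action while |children| < c_pw * N(s)^kappa, else pick by UCT."""
    if len(node.children) < C_PW * (node.visits + 1) ** KAPPA:
        # Progressive widening: sample a fresh continuous action from the policy.
        action = sample_action()
        node.children.append(action)
        node.child_visits.append(0)
        node.child_values.append(0.0)
        return len(node.children) - 1
    # Otherwise choose among existing children with a UCT-style rule.
    scores = [
        q + C_UCT * math.sqrt(math.log(node.visits + 1) / (n + 1))
        for q, n in zip(node.child_values, node.child_visits)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```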


dbsxdbsx commented on August 11, 2024

@cz11233, just one thing to mention: in reinforcement learning, the policy loss is not a good measure of progress. Every time the policy improves, a better TARGET policy is produced as well, so there is always a gap between the policy (model) being trained and its target, especially for AlphaZero/MCTS-style algorithms.

But if the episode return does not improve after long training, something may indeed be wrong.
In my own experience, I once had an RL algorithm (not this one) that never improved even after I checked every detail of the implementation. In the end it started improving simply after widening a fully-connected layer from 64 to 256 units.
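
To make the "moving target" concrete: in AlphaZero-style training the policy is regressed onto a target built from fresh MCTS visit counts at every iteration, so the loss is measured against a target that itself keeps improving. Roughly (the exact A0C continuous-action formulation used in this repo differs in detail):

$$\hat{\pi}(a \mid s) \;\propto\; n(s,a)^{1/\tau}, \qquad \mathcal{L}_{\pi}(\theta) \;=\; -\sum_{a} \hat{\pi}(a \mid s)\,\log \pi_{\theta}(a \mid s)$$

Because $\hat{\pi}$ is recomputed from new searches after every policy update, a non-decreasing $\mathcal{L}_{\pi}$ does not by itself mean the agent is failing; the episode return is the more reliable signal.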


cz11233 commented on August 11, 2024

@timoklein Well, I know what the problem is: I shouldn't have used the Beta distribution. Do you know why it doesn't work?
Besides, in your paper "Combining Reinforcement Learning and Search for Cooperative Trajectory Planning", I noticed the loss plots in Figure 30, where the policy loss decreases consistently, neither rising nor flattening. Is this because the network in the paper differs from the one in this repo? Or is it something else, e.g. the multi-agent setup?
[Screenshot: loss plots from Figure 30 of the paper]


cz11233 commented on August 11, 2024

@dbsxdbsx Thanks for your answer, it clears up my confusion about the policy loss.


timoklein commented on August 11, 2024

I shouldn't have used beta distribution. So do you know why it doesn't work?

It might just be that I implemented it wrong. I wrote some tests for my thesis and the outputs looked correct, but it never learned anything. Since the other policies worked, I pursued those further.

Is this due to the network of the paper being different from this repo?

I would think so. The architecture is completely different (a much larger model). If I remember correctly, the policy loss would also plateau there if you trained longer, for the reason @dbsxdbsx outlined.
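
For reference, the "Squashed Normal" policy from the working config above is, in most implementations, a tanh-squashed Gaussian with a log-probability correction for the squashing. A minimal sketch in PyTorch, not necessarily identical to this repo's implementation:

```python
import torch
from torch import nn
from torch.distributions import Normal

class SquashedNormalPolicy(nn.Module):
    """Tanh-squashed Gaussian policy head (illustrative sketch, not the repo's code)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
        )
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        h = self.net(obs)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-5, 2)
        dist = Normal(mu, log_std.exp())
        u = dist.rsample()               # pre-squash sample (reparameterized)
        a = torch.tanh(u)                # squash into (-1, 1)
        # Log-prob correction for the tanh squashing (the "corrected" log prob):
        log_prob = dist.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)
        return a, log_prob.sum(-1)
```

A Beta policy, by contrast, needs its samples rescaled from (0, 1) to the environment's action bounds and its concentration parameters kept above 1 to stay unimodal, which are common places for subtle implementation bugs.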

