Comments (5)
- What episode reward does your agent finally reach?
- What version of gym are you using (are you using the package versions from this repo's conda env)?
- Are you using a normally distributed policy or the Beta distribution policy?
Honestly it's been so long since I worked with this code that I can't say too much.
Here's a link to some loss plots from a report I made back then. Maybe you can compare against those to diagnose what isn't working for you:
https://wandb.ai/timo_kk/a0c/reports/A0C-loss-vs-A0C-Q-loss--VmlldzoyNTA3ODQ?accessToken=wzwqwdv9sku8l90i3gyufgrwb7go5uzxt3pbxommmovakhs9w52tpdexnm3r87ow
Here's a set of parameters from a run that worked:
Agent epsilon greedy: 0
Batch size: 32
Clamp log param: true
Clamp loss: Loss scaling
Date: 2020-12-22 08:16:39
Discount factor: 1
Distribution: Squashed Normal
Environment: Pendulum-v0
Environment seed: 34
Episode length: 200
Final selection policy: max_visit
LayerNorm: false
Learning rate: 0.001
Log counts scaling factor [tau]: 0.1
Log prob scale: Corrected entropy
Loss lr: 0.001
Loss reduction: mean
Loss type: A0C loss tuned
MCTS epsilon greedy: 0
MCTS rollouts: 25
Network hidden layers: [128, 128, 128]
Network hidden units: 3
Network nonlinearity: elu
Num mixture components: 2
Optimizer: Adam
Policy coefficient: 0.1
Progressive widening exponent [kappa]: 0.5
Progressive widening factor [c_pw]: 1
Replay buffer size: 3000
Target entropy: -1
Training episodes: 45
Training epochs: 1
UCT constant: 0.05
V target policy: off_policy
Value coefficient: 1
Weight decay: 0.0001
_wandb: {cli_version: 0.10.12, framework: torch, is_jupyter_run: false, is_kaggle_kernel: false, python_version: 3.8.5}
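In case it helps, here's a minimal sketch of what a "Squashed Normal" policy with a clamped log-std parameter and the tanh log-prob correction generally looks like. This is a generic SAC-style sketch rather than the exact module in this repo, and the class name, clamp bounds, and layer sizes are just illustrative:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

# Illustrative clamp bounds for the log-std parameter ("Clamp log param")
LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0


class SquashedNormalPolicy(nn.Module):
    """Tanh-squashed Gaussian policy head (generic sketch, not this repo's module)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
        )
        self.mu_head = nn.Linear(hidden, act_dim)
        self.log_std_head = nn.Linear(hidden, act_dim)

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        mu = self.mu_head(h)
        log_std = self.log_std_head(h).clamp(LOG_STD_MIN, LOG_STD_MAX)
        dist = Normal(mu, log_std.exp())
        u = dist.rsample()                 # pre-squash sample
        a = torch.tanh(u)                  # squash into (-1, 1)
        # Change-of-variables correction for the tanh squashing
        log_prob = dist.log_prob(u) - torch.log(1.0 - a.pow(2) + 1e-6)
        return a, log_prob.sum(-1)


# Pendulum-like dimensions: 3-dim observation, 1-dim action
policy = SquashedNormalPolicy(obs_dim=3, act_dim=1)
action, log_prob = policy(torch.randn(32, 3))
```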
from alphazero-gym.
@cz11233, just one thing to mention: in the context of reinforcement learning, the policy loss is not a good measure of progress, because every time the policy improves, a better TARGET policy is produced as well. That means there is always a gap between the policy (model) being trained and the target policy, especially for AlphaZero/MCTS-style algorithms.
If, however, the episode return is still not improving after long training, then there may indeed be a real problem.
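Here's a tiny toy illustration of what I mean. In AlphaZero-style training the policy target comes from search (e.g. normalized visit counts) and the policy loss is some divergence to that target; this is a generic cross-entropy sketch, not this repo's exact A0C loss:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy setup: one state, 5 discrete actions.
# "target" plays the role of a search-derived distribution (e.g. normalized visit counts).
target = F.softmax(torch.randn(5), dim=-1)

# Even a policy that matches the target exactly does not reach zero loss:
policy = target.clone()
cross_entropy = -(target * policy.log()).sum()       # H(target) + KL(target || policy)
target_entropy = -(target * target.log()).sum()
print(cross_entropy.item(), target_entropy.item())   # identical: the loss floor is H(target)
```

And because the target is recomputed by search every iteration as the networks improve, that floor keeps shifting, so a flat or noisy policy-loss curve tells you little about the episode return.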
As for me, I once had a situation where an RL algorithm (not this one) never improved even after I had checked every detail of the implementation. In the end it started improving simply because I increased a fully-connected layer's output size from 64 to 256.
from alphazero-gym.
@timoklein Well, I know what the problem is: I shouldn't have used the Beta distribution. So do you know why it doesn't work?
Besides, in your paper "Combining Reinforcement Learning and Search for Cooperative Trajectory Planning", I noticed that the loss plots are given in Figure 30, where the policy loss decreases consistently, neither rising nor flattening. Is this because the network in the paper is different from the one in this repo? Or is it something else, e.g. multi_agent?
from alphazero-gym.
@dbsxdbsx Thanks for your answer, it clears up my confusion about the policy loss.
from alphazero-gym.
I shouldn't have used the Beta distribution. So do you know why it doesn't work?
It might just be that I implemented it wrong. I wrote some tests for my thesis and the outputs etc. looked correct, but it never learned anything. Since the other policies worked, I pursued those further.
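For what it's worth, a Beta policy head for bounded actions would look roughly like the sketch below. This is a generic illustration (the names, sizes, and bounds are made up, and it's not the code from this repo); the usual pitfalls are forgetting the +1 shift on the concentrations, which keeps the density unimodal, and the change-of-variables term when rescaling the (0, 1) sample to the environment's action range:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Beta


class BetaPolicy(nn.Module):
    """Beta-distribution policy head for bounded actions (illustrative sketch)."""

    def __init__(self, obs_dim: int, act_dim: int, act_low: float, act_high: float, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU())
        self.alpha_head = nn.Linear(hidden, act_dim)
        self.beta_head = nn.Linear(hidden, act_dim)
        self.act_low, self.act_high = act_low, act_high

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        # softplus(...) + 1 keeps both concentrations > 1, so the density stays unimodal
        alpha = F.softplus(self.alpha_head(h)) + 1.0
        beta = F.softplus(self.beta_head(h)) + 1.0
        dist = Beta(alpha, beta)
        x = dist.rsample()                              # sample in (0, 1)
        scale = self.act_high - self.act_low
        action = self.act_low + scale * x               # rescale to the action bounds
        # change of variables: subtract log|d(action)/dx| = log(scale)
        log_prob = (dist.log_prob(x) - torch.log(torch.tensor(scale))).sum(-1)
        return action, log_prob


# Pendulum-v0: 3-dim observation, 1-dim action in [-2, 2]
policy = BetaPolicy(obs_dim=3, act_dim=1, act_low=-2.0, act_high=2.0)
action, log_prob = policy(torch.randn(8, 3))
```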
Is this because the network in the paper is different from the one in this repo?
I would think so. The architecture is totally different (a much larger model). If I remember correctly, the policy loss would also plateau if you ran it further, for the reason outlined by @dbsxdbsx.
from alphazero-gym.
Related Issues (7)
- can you share a requirements.txt of this project? HOT 3
- Should entropy bonus be also calculated during planning? HOT 6
- Config Error HOT 2
- Use custom environment to run this code HOT 2
- How to replay game with mcts and trained agent? HOT 1
- Does game Pendulum really need take up to 1 hour to converge? HOT 2