timoklein / alphazero-gym
AlphaZero for continuous control tasks
License: MIT License
Is there any sample code?
I cannot install this project on macOS 11.2.3 using the latest version of conda. Maybe a requirements.txt would be better.
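For reference, a minimal sketch of what such a requirements.txt could look like; the packages and version pins below are assumptions about a typical PyTorch/gym setup, not this repository's actual dependency list:

```
# Hypothetical requirements.txt sketch -- packages and versions are assumptions,
# not this repository's actual dependencies.
torch>=1.8
gym>=0.18
numpy
hydra-core>=1.0
```

A pinned file like this can be generated from a working environment with `pip freeze > requirements.txt`.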
Hi,
I want to use the A0 single-player method on my custom environment.
My state is an image with shape [2, 50, 50], and the action space is discrete. I changed the DiscretePolicy self.trunk from a fully connected layer to a CNN, but I'm really confused about why it doesn't work. The policy loss increases and it seems like the agent cannot learn anything.
I would appreciate any suggestions.
Sincerely
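A minimal sketch of the kind of CNN trunk swap being described, assuming the trunk only has to map the observation to a hidden feature vector; the layer sizes and the interface are assumptions, not the repo's actual DiscretePolicy code:

```python
import torch
import torch.nn as nn

class ConvTrunk(nn.Module):
    """Hypothetical CNN trunk for [2, 50, 50] image observations.
    Layer sizes are illustrative, not the repository's actual architecture."""

    def __init__(self, in_channels: int = 2, hidden_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=2, padding=1),  # 50x50 -> 25x25
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),           # 25x25 -> 13x13
            nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened feature size with a dummy forward pass.
        with torch.no_grad():
            n_flat = self.conv(torch.zeros(1, in_channels, 50, 50)).shape[1]
        self.fc = nn.Linear(n_flat, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 2, 50, 50) float tensor, ideally scaled to [0, 1]
        return torch.relu(self.fc(self.conv(x)))
```

If a trunk along these lines still doesn't learn, it is worth checking that the pixel inputs are normalized and that the learning rate, originally tuned for a small fully connected trunk, isn't too large for the CNN.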
I was trying to run 'run_continuous.py' to test the code but I got some errors with the config files.
MCTSContinous.yaml is missing a field "model".
RMSProp.yaml is missing a field "params".
Both are required arguments and I cannot figure out how to solve the problem.
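The filenames and the "missing field" errors look like Hydra-style structured configs in which some values are meant to be filled in at runtime rather than in the YAML itself. A minimal sketch of that mechanism, assuming a Hydra entry point and cfg.model / cfg.optimizer keys that may not match the repo's actual config layout:

```python
import hydra
from hydra.utils import instantiate
from omegaconf import DictConfig

# Hypothetical entry point: config_path/config_name and the cfg.model /
# cfg.optimizer keys are assumptions, not the repository's actual layout.
@hydra.main(config_path="config", config_name="run_continuous")
def main(cfg: DictConfig) -> None:
    model = instantiate(cfg.model)
    # A field such as "params" that is required/MISSING in the YAML can be
    # supplied as an override at instantiation time instead of in the file.
    optimizer = instantiate(cfg.optimizer, params=model.parameters())
    print(type(model).__name__, type(optimizer).__name__)

if __name__ == "__main__":
    main()
```

If the repo instead expects those fields to be set directly in MCTSContinous.yaml and RMSProp.yaml, concrete values would have to go there; the sketch only illustrates the usual way a torch optimizer's "params" field gets filled in.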
I executed the run_continuous.py file for the continuous agent and found that the policy loss increased approximately linearly with training episodes until it stabilized. Why does the policy loss not decrease?
I tried tuning some hyperparameters, such as n_rollouts and hidden_dimensions, but that did not reduce the policy loss either. The episode reward also didn't improve further over the course of training. Is that normal behavior for this repo?
I recently finished reading this repo's code and found that the SAC-style entropy bonus on the state value is only added at the final output step.
This made me wonder:
If the target is to find an action that maximizes environment reward plus entropy, why not account for the entropy during planning as well?
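For concreteness, a sketch in my own notation (not the repo's) of the distinction the question is about: the maximum-entropy return credits the entropy bonus at every step, whereas adding it only at the leaf of the search gives a different quantity.

```latex
% Entropy bonus at every step of the return (maximum-entropy objective)
V^{\text{soft}}(s_0) = \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^t \bigl(r_t + \alpha\,\mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr)\bigr)\right]

% Entropy bonus only at the leaf state s_T (what the question describes)
V^{\text{leaf}}(s_0) = \mathbb{E}\!\left[\sum_{t=0}^{T-1} \gamma^t r_t \;+\; \gamma^T \bigl(V(s_T) + \alpha\,\mathcal{H}\bigl(\pi(\cdot \mid s_T)\bigr)\bigr)\right]
```

Propagating the per-step entropy terms through the tree backups would make the search optimize the first quantity rather than the second, which is what the question is suggesting.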
I noticed the statement in the README: "If your laptop is decent it shouldn't take more than an hour." I have no idea how long Pendulum usually takes to converge (few of the papers I have read plot a learning curve for Pendulum).
More to the point, what does "convergence" mean in the context of Pendulum?
Do you mean that with your algorithm, Pendulum converges to a score near 0 (the best possible) within one hour of training?
Personally, with plain SAC I can get Pendulum to converge to a score in roughly the [-500, -200) range within one minute on a laptop with a GeForce 940M GPU, and it is hard to improve the score beyond that, even with more training time.