ntt123 / a0-jax

68.0 68.0 17.0 3.47 MB

AlphaZero in JAX

Home Page: https://go.ntt123.repl.co

License: MIT License

Python 88.22% HTML 11.11% Shell 0.67%

a0-jax's People

mbrukman johnppp vlin02 hyu2000 xieren58 zpyoung jcfrw oriskunk antonyjia159 siasio ajcutuli gy11564 davidrsewell ztztztztztztz shawwn

a0-jax's Issues

Killed unexpectedly in Colab with TPU

I'm on a budget, so I'm running train_agent.py for Caro on Colab with a TPU.
However, the process always gets killed during iteration #1, at around 64% of self play, without a stack trace.

Any experiences or theories on why this may happen?

!TF_CPP_MIN_LOG_LEVEL=0
!time python3 train_agent.py \
    --game-class="caro_game.CaroGame" \
    --agent-class="resnet_policy.ResnetPolicyValueNet128" \
    --selfplay-batch-size=1024 \
    --training-batch-size=1024 \
    --num-simulations-per-move=32 \
    --num-self-plays-per-iteration=102400 \
    --learning-rate=1e-2 \
    --random-seed=42 \
    --ckpt-filename="./caro_agent_9x9_128.ckpt" \
    --num-iterations=100 \
    --lr-decay-steps=500000

2022-11-25 08:59:37.077139: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Cores: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(0,0,0), core_on_chip=1), TpuDevice(id=2, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,0,0), core_on_chip=1), TpuDevice(id=4, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=5, process_index=0, coords=(0,1,0), core_on_chip=1), TpuDevice(id=6, process_index=0, coords=(1,1,0), core_on_chip=0), TpuDevice(id=7, process_index=0, coords=(1,1,0), core_on_chip=1)]
Loading weights at ./caro_agent_9x9_128.ckpt
Iteration 1
self play [######################--------------] 63% 00:09:41
/bin/bash: line 1: 2377 Killed python3 train_agent.py --game-class="caro_game.CaroGame" --agent-class="resnet_policy.ResnetPolicyValueNet128" --selfplay-batch-size=1024 --training-batch-size=1024 --num-simulations-per-move=32 --num-self-plays-per-iteration=102400 --learning-rate=1e-2 --random-seed=42 --ckpt-filename="./caro_agent_9x9_128.ckpt" --num-iterations=100 --lr-decay-steps=500000

real 17m19.797s
user 10m5.645s
sys 5m3.467s
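
A bare "Killed" from bash with no Python traceback means the process received SIGKILL. On Linux this is often the out-of-memory killer acting on host RAM, although the log above does not confirm it. The following is a minimal sketch, assuming host memory is the suspect, of a helper that reports the peak resident memory of the current process. The name peak_rss_mib is hypothetical and not part of a0-jax.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    # Hypothetical usage: print this periodically from the self-play loop.
    print(f"peak host memory: {peak_rss_mib():.1f} MiB")
```

If the reported peak climbs toward the Colab host limit before the kill, lowering --num-self-plays-per-iteration would be one thing to try.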

Training on external environments

I've run into a concretization issue while trying to implement a new environment that calls an external application for the game logic. I need to call that application in step to get the new state, but at that point action is a batched tracer, so I can't extract its value with call, because call is not implemented for batched input.

class CheckersGame(Environment):
    ...

    def _step(self, action: chex.Array) -> Tuple["CheckersGame", chex.Array]:
        action = self._prepare_action(action) # get a concrete value of action
        new_state, reward = call_external_env(action)
        return self, jnp.array(reward, dtype=jnp.int32)

    @pax.pure
    def step(self, action: chex.Array) -> Tuple["CheckersGame", chex.Array]:
        # batched action comes in, but concrete value is required
        env, reward = jax.vmap(lambda a: self._step(a))(action.reshape(-1, 1))
        return self, reward

    ...

I can tap into action with id_print or id_tap here, but I can't make _step wait for the result that way.

What's the correct way to do that?
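
One possible direction, sketched here under assumptions rather than taken from a0-jax, is to pass the whole batch to the host with jax.pure_callback and loop over it in NumPy, instead of wrapping a per-item call in jax.vmap. The function host_step is a hypothetical stand-in for the external engine, and the toy rule inside it is invented for illustration. The game state is passed in and returned explicitly, because JAX treats the callback as pure and may skip or repeat it.

```python
import jax
import jax.numpy as jnp
import numpy as np


def host_step(boards, actions):
    """Hypothetical stand-in for the external engine; runs on the host."""
    boards = np.array(boards, dtype=np.int32)  # concrete copy of the batch
    actions = np.asarray(actions)
    rewards = np.zeros(actions.shape[0], dtype=np.int32)
    for i, a in enumerate(actions):
        boards[i, int(a)] = 1  # toy rule: mark the chosen cell
        # toy rule: reward 1 once every cell of the board is marked
        rewards[i] = int(boards[i].sum() == boards.shape[1])
    return boards, rewards


@jax.jit
def batched_step(boards, actions):
    # Declare the shapes and dtypes the host function will return.
    result_shapes = (
        jax.ShapeDtypeStruct(boards.shape, jnp.int32),
        jax.ShapeDtypeStruct(actions.shape, jnp.int32),
    )
    return jax.pure_callback(host_step, result_shapes, boards, actions)


boards = jnp.zeros((2, 3), dtype=jnp.int32)
actions = jnp.array([0, 2], dtype=jnp.int32)
new_boards, rewards = batched_step(boards, actions)
```

If the step is itself called under an outer jax.vmap, how the callback is batched depends on the installed JAX version, so the pure_callback documentation for that version is worth checking.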

Consider using qtransform_completed_by_mix_value.

Thanks for the nice project.
Have you tried using the default qtransform_completed_by_mix_value for the gumbel_muzero_policy?

The qtransform_by_min_max gives zero values to unvisited actions. That does not have a good theoretical justification.
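
To make the difference concrete, here is a minimal pure-Python sketch of the completion step only, assuming the mixed-value formula from the Gumbel MuZero paper. It is not the mctx implementation, which also rescales the completed values, and the function names are hypothetical.

```python
def mix_value(raw_value, qvalues, visit_counts, prior_probs):
    """Interpolate the network value with the prior-weighted Q of visited actions."""
    total_visits = sum(visit_counts)
    visited = [n > 0 for n in visit_counts]
    prior_mass = sum(p for p, v in zip(prior_probs, visited) if v)
    if total_visits == 0 or prior_mass == 0:
        return raw_value  # nothing visited yet: fall back to the network value
    weighted_q = sum(
        p * q for p, q, v in zip(prior_probs, qvalues, visited) if v
    ) / prior_mass
    return (raw_value + total_visits * weighted_q) / (1 + total_visits)


def complete_qvalues(raw_value, qvalues, visit_counts, prior_probs):
    """Keep Q for visited actions; fill unvisited ones with the mixed value."""
    v_mix = mix_value(raw_value, qvalues, visit_counts, prior_probs)
    return [q if n > 0 else v_mix for q, n in zip(qvalues, visit_counts)]


completed = complete_qvalues(
    raw_value=0.2,
    qvalues=[0.5, 0.0, -0.1],
    visit_counts=[3, 0, 1],
    prior_probs=[0.5, 0.3, 0.2],
)
```

In this example the unvisited middle action receives a value between the two visited Q-values instead of being pinned to the minimum.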

2 player games with non-alternating turns.

I've implemented a game which doesn't have a strictly alternating turn order (some actions change player, others don't). How could this be used in your framework? I think it's the discount, but wanted to check. Should the discount returned be 1 for any action that doesn't change player and -1 otherwise?
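
The following sketch illustrates the reasoning behind that guess. It assumes values are expressed from the perspective of the player to move and that the search backs values up as reward + discount * value. The function backup_value is hypothetical and not part of a0-jax.

```python
def backup_value(leaf_value, steps):
    """Back a leaf value up to the root.

    steps is a list of (reward, player_changed) pairs ordered from root to leaf.
    """
    value = leaf_value
    for reward, player_changed in reversed(steps):
        # Flip the sign only when the move hands the turn to the opponent.
        discount = -1.0 if player_changed else 1.0
        value = reward + discount * value
    return value


# Player A moves twice in a row, then the turn passes to B.
# The leaf value 0.8 is from B's perspective, so the root value for A is -0.8.
root_value = backup_value(0.8, [(0.0, False), (0.0, True)])
```

Under these assumptions, a discount of 1 for moves that keep the turn and -1 for moves that pass it gives the root player a consistently signed value.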

Support MuZero

Great job! I learned a lot from your repo. Where can I find the implementation of MuZero using mctx? Thanks a lot.

Tic Tac Toe - Missing winning condition

Hi,

We have the winning conditions identified:

I think we might have missed winning by 3 in a row through the middle (vertically and horizontally), i.e., with spaces 1, 4, 7 (vertical) and 3, 4, 5 (horizontal).

I have no idea how to fix this in your code :( sorry!
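
For reference, here is a minimal sketch that enumerates every winning line on a board whose cells are numbered 0 to 8 row by row, which is the numbering the report above appears to use. It is an illustration of the check, not the repository's code, and the function names are hypothetical.

```python
def winning_lines(size=3):
    """Return all rows, columns and both diagonals as lists of cell indices."""
    rows = [[r * size + c for c in range(size)] for r in range(size)]
    cols = [[r * size + c for r in range(size)] for c in range(size)]
    diagonal = [i * size + i for i in range(size)]
    anti_diagonal = [i * size + (size - 1 - i) for i in range(size)]
    return rows + cols + [diagonal, anti_diagonal]


def has_won(board, player, size=3):
    """board is a flat list of cell owners; True if player fills any line."""
    return any(
        all(board[cell] == player for cell in line)
        for line in winning_lines(size)
    )


lines = winning_lines()
# Middle column occupied by player 1.
middle_column_win = has_won([0, 1, 0, 0, 1, 0, 0, 1, 0], 1)
```

A 3x3 board has eight such lines, including the middle column 1, 4, 7 and the middle row 3, 4, 5.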

Question on 9x9 Go agent training

The 9x9 Go agent is pretty strong! How many iterations was it trained for? How long does it take to train (I saw it's on TPUs)?

Can I contact you directly?

Hello Mr NTT,
Can I contact you directly? It's about AlphaZero.
My email is [email protected]
