
tnt-battlesnake's People

Contributors

fkluger, frederikschubert, sunnihu


tnt-battlesnake's Issues

Experiment with different state representation

Currently the state is encoded as a (width + 1, height) tensor that contains an encoding of the different entities at their positions. The last column additionally encodes the health points of the agent snake.

An example state looks like this:

32   32   32   32   32   32   32  67
32   00   42   00   00   31   32  00
32   00   00   00   00   32   32  00
32  -45   32   33   00   33   32  00
32   32   32   32   32   32   32  00

This encodes the following state:

  • width = 7
  • height = 5
  • own health = 67
  • fruit @ [2, 1]
  • agent snake @ [[1, 3], [2, 3], [3, 3]] = [head, body, tail]
  • enemy snake @ [[5, 1], [5, 2], [5, 3]] = [head, body, tail]

It might make training easier if the different entities were represented as separate channels of the state: one channel each for the agent snake, the fruits, and the obstacles (enemy snakes and walls). For this the observe method has to be modified.
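A minimal sketch of such a channel split, using hypothetical entity codes (the real codes used by the observe method may differ from these):

```python
import numpy as np

# Hypothetical entity codes; the actual values in the encoding differ.
AGENT, FRUIT, ENEMY, WALL = 1, 2, 3, 4

def observe_channels(grid):
    """Split an integer-coded (height, width) grid into three binary
    channels: agent snake, fruits, and obstacles (enemy snakes + walls)."""
    agent = (grid == AGENT).astype(np.float32)
    fruit = (grid == FRUIT).astype(np.float32)
    obstacles = np.isin(grid, (ENEMY, WALL)).astype(np.float32)
    return np.stack([agent, fruit, obstacles], axis=-1)
```

The health column would be dropped before the split (or fed to the network separately), since it is a scalar rather than a spatial feature.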

Implement Double-Q Agent

Problem

The DQN agent overestimates Q-Values because of the max operation.

Solution

Use a copy of the network that is updated less frequently to evaluate the target Q-Value, while still selecting the greedy action with the existing network. Decoupling action selection from action evaluation in this way reduces the overestimation.

Steps

  1. Implement double_dqn.py here where the second network is updated every n training steps of the first network.
    • Updating the second network means copying the weights from the first network to it.
  2. Change this line to use the second network.
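The target computation from the steps above can be sketched with numpy; q_online_next and q_target_next stand for the two networks' predictions on a batch of next states (the function name and signature are illustrative). In Keras, updating the second network amounts to target_model.set_weights(model.get_weights()).

```python
import numpy as np

def double_q_targets(rewards, q_online_next, q_target_next, dones, gamma=0.99):
    """rewards, dones: shape (batch,); q_*_next: shape (batch, num_actions)."""
    # Select the greedy next action with the first (online) network ...
    best_actions = np.argmax(q_online_next, axis=1)
    # ... but evaluate that action with the less frequently updated network.
    evaluated = q_target_next[np.arange(len(rewards)), best_actions]
    return rewards + gamma * evaluated * (1.0 - dones)
```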

Perform multiple replays

Performing one replay of (for example) 64 samples per new observation seems to slow down training significantly. We could instead perform multiple replays in succession every n observations to speed up the training iterations.
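One way to schedule this; the constants and the agent.replay call are placeholders for the existing training code, not taken from the repository:

```python
# Instead of one replay per observation, run several replays in
# succession every REPLAY_INTERVAL observations.
REPLAY_INTERVAL = 4    # n: observations between training phases
REPLAYS_PER_PHASE = 4  # replays performed back to back
BATCH_SIZE = 64

def maybe_replay(agent, step):
    if step > 0 and step % REPLAY_INTERVAL == 0:
        for _ in range(REPLAYS_PER_PHASE):
            agent.replay(BATCH_SIZE)
```

On average this performs the same number of gradient steps per observation, but amortizes the per-replay overhead (sampling, batching, session setup) over fewer training phases.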

Implement Dueling network

Problem

For some states it does not matter which action is taken. Still, the network has to approximate the Q-Value of every action in every state, so to arrive at correct Q-Values for such a state it has to update its weights once for each possible action.

Solution

Separate the Value of the state s from the Value of the action a according to the following formula:

Q(s, a) = V(s) + A(s, a) - mean(A(s, a'))

Intuitively, this computes the Q-Value from the Value V of the state and the advantage A of an action over the other actions in that state.

Steps

  1. Implement dueling_dqn.py here.
    • An implementation can be found here.
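The aggregation in the formula above can be sketched independently of the network; in Keras it could live in a Lambda layer on top of two separate Dense output streams (that wiring is an assumption, not the repository's code):

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine V(s) and A(s, a) as Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a').

    value: shape (batch, 1); advantages: shape (batch, num_actions).
    """
    return value + advantages - advantages.mean(axis=1, keepdims=True)
```

Subtracting the mean advantage makes the decomposition identifiable: adding a constant to V and subtracting it from all advantages would otherwise leave Q unchanged.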

Stop and continue training

A mechanism should be implemented to stop and continue training. For this, the following data should be persisted regularly and restored on restart:

  • DQN weights (Use model.save(...) and keras.models.load_model(...))
  • Episode count

Additionally, a file should be created that documents the hyperparameters of each training run.
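A sketch of such a checkpoint routine, using the model.save / keras.models.load_model calls mentioned above; the directory and file names are illustrative:

```python
import json
import os

def save_checkpoint(model, episode, hyperparameters, directory="checkpoints"):
    """Persist the DQN weights plus the episode count and hyperparameters."""
    os.makedirs(directory, exist_ok=True)
    model.save(os.path.join(directory, "dqn.h5"))  # weights + architecture
    with open(os.path.join(directory, "state.json"), "w") as f:
        json.dump({"episode": episode, "hyperparameters": hyperparameters}, f)

def load_checkpoint(directory="checkpoints"):
    """Restore the model and training state written by save_checkpoint."""
    from keras.models import load_model  # deferred so the sketch stays importable
    model = load_model(os.path.join(directory, "dqn.h5"))
    with open(os.path.join(directory, "state.json")) as f:
        state = json.load(f)
    return model, state["episode"], state["hyperparameters"]
```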

Implement prioritized replay

Problem

Sampling observations with uniform probability leads to an overrepresentation of frequently occurring states, while rare but informative observations are replayed too seldom.

Solution

Assign a priority to each observation in the memory. The priority p of an observation can be defined via the error in the estimation of its Q-Value, where epsilon is a small constant that guarantees a minimum priority and alpha controls the amount of prioritization (alpha = 0 recovers uniform sampling).

p = (error + epsilon)^alpha

For efficient insertion and retrieval we will use a Sum-Tree.

Steps

  1. Implement experienced_replay.py using the Sum-Tree here.
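A minimal sketch of the Sum-Tree and the priority function; the capacity handling (assumed to be a power of two) and the default constants are assumptions:

```python
import numpy as np

class SumTree:
    """Leaves store priorities, inner nodes the sum of their children,
    so updating a priority and sampling proportional to priority are
    both O(log n)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)  # inner nodes + leaves
        self.data = [None] * capacity
        self.write = 0  # next leaf slot, wraps around to overwrite the oldest

    def add(self, priority, sample):
        index = self.write + self.capacity - 1  # leaf position in the tree
        self.data[self.write] = sample
        self.update(index, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, index, priority):
        change = priority - self.tree[index]
        self.tree[index] = priority
        while index != 0:  # propagate the change up to the root
            index = (index - 1) // 2
            self.tree[index] += change

    def total(self):
        return self.tree[0]  # sum of all priorities

    def get(self, value):
        """Return (tree_index, priority, sample) for a value in [0, total())."""
        index = 0
        while index < self.capacity - 1:  # descend until a leaf is reached
            left = 2 * index + 1
            if value <= self.tree[left]:
                index = left
            else:
                value -= self.tree[left]
                index = left + 1
        return index, self.tree[index], self.data[index - self.capacity + 1]

def priority(error, epsilon=0.01, alpha=0.6):
    """Priority of an observation from its Q-Value estimation error."""
    return (abs(error) + epsilon) ** alpha
```

To sample a batch, draw uniform values from [0, total()) and call get for each; the returned tree index is kept so the priority can be updated after the observation has been replayed.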

Create Tensorboard summaries

The following summaries will be useful for debugging the training:

  • Loss
  • Rewards
  • Episode length
  • Action distribution
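The per-episode bookkeeping for these summaries could look like the sketch below; the aggregated values would then be written out with a TensorBoard summary writer. EpisodeStats is a hypothetical helper, not part of the code base:

```python
import collections

class EpisodeStats:
    """Collect loss, rewards, episode length and action counts per episode."""

    def __init__(self):
        self.rewards = []
        self.losses = []
        self.actions = collections.Counter()

    def record_step(self, reward, loss, action):
        self.rewards.append(reward)
        if loss is not None:  # no loss on steps without a replay
            self.losses.append(loss)
        self.actions[action] += 1

    def summaries(self):
        steps = len(self.rewards)
        return {
            "reward": sum(self.rewards),
            "loss": sum(self.losses) / max(len(self.losses), 1),
            "episode_length": steps,
            "action_distribution": {a: n / steps for a, n in self.actions.items()},
        }
```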

For evaluation the summaries should also include:
