zackattack614 / blackbird



Board game self-learning algorithm.

License: MIT License

Python 100.00%

learning-algorithm alphazero blackbird python neural-network mcts


blackbird's Issues

Reuse of variables in nested generators scares me

children_probs = [ (child.N ** (1/self.temperature)) / sum([child.N ** (1/self.temperature) for child in self.root.children]) for child in self.root.children]

It's not clear in this expression which iterator child comes from when evaluating child.N, because the same name is bound by both the outer and the inner comprehension.
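One way to remove the ambiguity is to compute each weight once and normalize afterwards, so no name is reused across nested comprehensions. This is only a sketch: the Node class and the visit_probabilities function are hypothetical stand-ins, not names from the repository.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical stand-in for an MCTS child; only the visit count N matters here.
    N: int


def visit_probabilities(children, temperature):
    """Return N^(1/T) / sum(N^(1/T)) for each child."""
    exponent = 1.0 / temperature
    # Each weight is computed exactly once, outside any nested comprehension.
    weights = [child.N ** exponent for child in children]
    total = sum(weights)
    return [weight / total for weight in weights]


probs = visit_probabilities([Node(1), Node(3), Node(0)], temperature=1.0)
```

The sketch assumes at least one child has been visited; a total of zero would need separate handling.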

U values need to be updated on the fly

Consider the following scenario:

A has children b,c,d

We explore A. Expand the children and explore c.
A.N = 1
c.N = 1
b.N = 0
d.N = 0

The U values for b and d haven't been updated to account for the change in c.N.

Also, sum(child.N for child in parent.children) = parent.N.
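A possible fix is to compute U when it is needed instead of storing it. The sketch below assumes an AlphaZero-style exploration term, U = c_puct * P * sqrt(sum of sibling visits) / (1 + N); the formula actually used in the repository may differ, and the Node fields and the u_value method are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    P: float = 0.0  # prior probability from the policy head (assumed field)
    N: int = 0      # visit count
    children: list = field(default_factory=list)

    def u_value(self, parent, c_puct=1.0):
        # Recomputed on every call, so it always reflects the siblings' current counts.
        sibling_visits = sum(child.N for child in parent.children)
        return c_puct * self.P * math.sqrt(sibling_visits) / (1 + self.N)


b, c, d = Node(P=0.2), Node(P=0.5), Node(P=0.3)
A = Node(children=[b, c, d])

u_b_before = b.u_value(A)
c.N += 1  # exploring c changes U for b and d as well
u_b_after = b.u_value(A)
```

Computing U lazily costs a sum over the siblings per lookup; caching the parent's total visit count would avoid that if it becomes a bottleneck.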

Chess BoardState

A chess BoardState class needs to exist for BlackBird to learn the game.

The class should inherit from GameState, and override all functions.

TeacherPolicy not used

https://github.com/jordan-singer/BlackBird/blob/master/src/Network.py#L219

The teacherPolicy variable is not used anywhere.

Deserialize Game States from Protocol Buffers

Given a game state written in a protobuf, the corresponding BoardState object should be able to deserialize and return a full game state to train on.

Expose python APIs for stateless training commands.

We stop updating MCTS values when we find an end game

The MCTS algorithm doesn't back up the visit count when we hit an end-game state. This occasionally results in almost no exploration: we can reach a game end that the AI thinks has good value (regardless of whether it does), and then we keep going down that branch and quitting.

The simulations should not stop just because we stumbled upon an end game.
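The intended behaviour might look like the sketch below, where a terminal leaf is treated like any other leaf for the purposes of backup. The Node class, backup, and run_playout are hypothetical names, and the sign flip at each ply assumes a two-player zero-sum game.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.N = 0    # visit count
        self.W = 0.0  # total backed-up value

    @property
    def Q(self):
        return self.W / self.N if self.N else 0.0


def backup(node, value):
    """Propagate the value and the visit from the leaf up to the root."""
    while node is not None:
        node.N += 1
        node.W += value
        value = -value  # assumption: players alternate, so the sign flips each ply
        node = node.parent


def run_playout(leaf, is_terminal, terminal_value, evaluate):
    # A terminal leaf uses the actual game result, and is still backed up.
    value = terminal_value(leaf) if is_terminal(leaf) else evaluate(leaf)
    backup(leaf, value)


root = Node()
leaf = Node(parent=root)
for _ in range(3):
    run_playout(leaf, lambda n: True, lambda n: 1.0, lambda n: 0.0)
```

Because the visit counts keep growing on the terminal branch, an exploration term that shrinks with N will eventually steer later playouts elsewhere.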

The gamestate classes shouldn't know about policies

#64

Serialization of policy and evaluation shouldn't be handled by the gamestate class.

Loss Not Appropriately Defined

Loss, as defined here, is just the first element of a column vector. It should use reduce_sum over the vector, not just return one element of that vector.

https://github.com/jordan-singer/BlackBird/blob/ec37781c312623d3863a8b6adbc8841280c1e5df/src/network.py#L85
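The difference between the two definitions can be illustrated in plain Python. The squared-error loss and the sample values below are made up for illustration only; in TensorFlow the summed version would correspond to tf.reduce_sum over the loss vector.

```python
def squared_errors(predictions, targets):
    """Per-example squared error, one entry per training example."""
    return [(p - t) ** 2 for p, t in zip(predictions, targets)]


per_example = squared_errors([0.5, -1.0, 1.0], [1.0, 1.0, 1.0])

first_only = per_example[0]  # what the issue describes: every other example is ignored
summed = sum(per_example)    # analogous to a reduce_sum over the whole vector
```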

Create python API for exposing supported python operations.

Network Architecture GUI

An end user should be able to modify the architecture of their neural network via a GUI.

Create Electron app with graph generator

Learning Rate Annealing

The current training setup uses only a constant learning rate. This rate should decrease over time according to an annealing schedule.
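Exponential decay is one common schedule, shown here as a sketch; the issue does not say which schedule is wanted, and the function name and parameter values are assumptions.

```python
def exponential_decay(initial_rate, step, decay_steps, decay_rate):
    """Multiply the rate by decay_rate once per decay_steps training steps."""
    return initial_rate * decay_rate ** (step / decay_steps)


# Hypothetical settings: halve the learning rate every 1000 steps.
rates = [exponential_decay(0.01, step, 1000, 0.5) for step in (0, 1000, 2000)]
```

Step-wise or cosine schedules would fit the same function signature if a different shape turns out to train better.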

Typo?

Should this read example.State.Player?

https://github.com/jordan-singer/BlackBird/blob/c7b6d3558b4af5b042a3fe3f781158b53b88d921/src/blackbird.py#L48

Migrate to tf.data

The BlackBird.TrainingExample class should be removed, and replaced with a tf.data.Dataset. The Network.train() method should use a Dataset object.

Load Training Statistics to SQLite3 DB

The win/loss/draw counts vs random, old, and standard MCTS should be logged in the TrainingStatisticsFact table.
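A minimal sketch using the standard-library sqlite3 module follows. Only the table name TrainingStatisticsFact comes from the issue; the column names, the log_statistics helper, and the sample counts are invented for illustration.

```python
import sqlite3


def log_statistics(conn, opponent, wins, losses, draws):
    # Column layout is an assumption; the issue only names the table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS TrainingStatisticsFact ("
        "Opponent TEXT, Wins INTEGER, Losses INTEGER, Draws INTEGER)"
    )
    conn.execute(
        "INSERT INTO TrainingStatisticsFact VALUES (?, ?, ?, ?)",
        (opponent, wins, losses, draws),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database for the example
sample_results = {"random": (9, 1, 0), "old": (6, 3, 1), "mcts": (4, 4, 2)}
for opponent, record in sample_results.items():
    log_statistics(conn, opponent, *record)

rows = conn.execute(
    "SELECT Opponent, Wins FROM TrainingStatisticsFact ORDER BY Wins"
).fetchall()
```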

Creating monitoring windows in electron app.

Dirichlet Noise Applied During Evaluation

Dirichlet noise should only be applied in self-play in order to aid in exploration in training, not during network evaluation or official play.
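One way to enforce this is an explicit flag on the function that prepares the root priors, sketched below. The function names, the self_play flag, and the default epsilon and alpha values are assumptions rather than the repository's API.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Sample a symmetric Dirichlet by normalizing independent gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [draw / total for draw in draws]


def root_priors(priors, self_play, epsilon=0.25, alpha=0.3, rng=None):
    # Evaluation and official play: return the network priors untouched.
    if not self_play:
        return list(priors)
    rng = rng or random.Random()
    noise = sample_dirichlet(alpha, len(priors), rng)
    # Self-play only: blend the priors with noise to encourage exploration.
    return [(1 - epsilon) * p + epsilon * n for p, n in zip(priors, noise)]


priors = [0.5, 0.3, 0.2]
eval_priors = root_priors(priors, self_play=False)
noisy_priors = root_priors(priors, self_play=True, rng=random.Random(0))
```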

MCTS shouldn't be backing up values from unexplored children

How the code is now:

Expand until you find an un-expanded branch node.
Run the NN to get the policy for that node.
For each child, run the net and get the expected value of that child.
Back up the values of each of those children.

The problem here is that we (1) chose a node, then (2) updated the value of that node to be the average of its children's values. This doesn't make sense, since an intelligent network would never have chosen all of those children.

Instead, we should only be backing up the value of the move we thought was realistic to make.


Repeated State Pane

A BoardState's Board member object should include a constant-valued pane recording how many times the current position has been seen in the game's history.

This is helpful, for example, in informing BlackBird how close it is to a triple repetition in chess.
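A sketch of how such a pane could be built is below. It assumes positions can be represented as hashable values (tuples here) and that the count includes the current occurrence; repetition_pane is a hypothetical helper, not an existing method.

```python
from collections import Counter


def repetition_pane(history, current, rows, cols):
    """Return a rows x cols pane filled with how often `current` appears in `history`."""
    count = Counter(history)[current]
    return [[count] * cols for _ in range(rows)]


# Hypothetical positions encoded as tuples; the last entry is the current position.
history = [(0, 0, 0), (1, 0, 0), (0, 0, 0)]
pane = repetition_pane(history, history[-1], rows=2, cols=2)
```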

Policy Head Softmax Applied Twice

The softmax function is applied twice in the network's policy head; it should only be applied once. Also note that the output size of the policy is hard-coded to 9, rather than a variable size representing the shape of the board.
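The effect of the double application can be seen with a small plain-Python softmax. The logits are arbitrary example values; the point is only that a second softmax flattens the distribution.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.0, -2.0]
once = softmax(logits)  # the intended policy distribution
twice = softmax(once)   # inputs now lie in [0, 1], so the output is much flatter
```

Since the second softmax only sees inputs between 0 and 1, the network can never express a confident policy, however large its logits are.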

It is still randomly training rewards

We iterate over the entire state history to generate rewards, not just the states in that game.
This happens every game: the code iterates over the entire history and adds effectively random rewards to the list.
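A sketch of the intended behaviour assigns rewards only to the states of the game that just finished, then appends them to the shared history. The (state, player) tuples, the +1/-1/0 reward convention, and the assign_rewards name are assumptions for illustration.

```python
def assign_rewards(game_states, winner):
    """Label the states of ONE finished game; winner is a player id or None for a draw."""
    examples = []
    for state, player in game_states:
        if winner is None:
            reward = 0.0
        else:
            reward = 1.0 if player == winner else -1.0
        examples.append((state, reward))
    return examples


history = []
finished_games = [
    ([("s0", 1), ("s1", 2)], 1),  # hypothetical game won by player 1
    ([("t0", 1)], None),          # hypothetical drawn game
]
for game_states, winner in finished_games:
    # Earlier entries in `history` are never revisited or relabelled.
    history.extend(assign_rewards(game_states, winner))
```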

Publish Training Games to Cloud

Training games that are generated on a client computer should be able to be published for a centralized server to train the next network on.

Serialize Game States in Protocol Buffers

To ensure that game states are as compact as possible before they are transferred over the wire to a central repository, game states should be serialized in a ProtoBuf. States are currently serialized as JSON, which is much less efficient.

Self-Play ELO Rating System

BlackBird needs a rating system so that performance across training sessions can be measured.
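The standard Elo update could serve as a starting point. The sketch below uses the usual logistic expectation with a 400-point scale; the K-factor of 32 and the starting rating of 1500 are conventional values chosen for illustration, not settings from the project.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss."""
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b - k * (score_a - expected_a)  # zero-sum: B loses what A gains
    return new_a, new_b


# Two equally rated network versions; A wins one game.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
```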

Compile graph from JS generated JSON through python API.

History Panes

BoardState arrays should include historical game state data. This will affect the shape of the neural network input, and how data is serialized.

MCTS.getBestMove doesn't re-sample the full branch

To sample the best branch for exploration + exploitation, the relative expected values of all of the nodes need to be compared after every update.

This code
selected_node = self.root
is only called once. Once it is set to the root, all of the subsequent playouts dive deeper into the same branch:
while current_playouts < self.max_playouts:
    while any(selected_node.children):
        children_QU = [child.Q + child.U for child in selected_node.children]
        selected_node = selected_node.children[np.argmax(children_QU)]

It should instead start from the root again and recheck the values to make sure that it is exploring the optimal path, and not something it discovered to suck.
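The restart-from-root behaviour could look like the sketch below. The Node class, select_leaf, run_playouts, and the visit callback are hypothetical; the callback stands in for whatever expansion, evaluation, and backup the real code performs.

```python
class Node:
    def __init__(self, name, Q=0.0, U=0.0, children=None):
        self.name = name
        self.Q, self.U = Q, U
        self.children = children or []


def select_leaf(root):
    """Walk down from the root, always taking the child with the highest Q + U."""
    node = root
    while node.children:
        node = max(node.children, key=lambda child: child.Q + child.U)
    return node


def run_playouts(root, max_playouts, visit):
    for _ in range(max_playouts):
        leaf = select_leaf(root)  # selection restarts at the root on every playout
        visit(leaf)               # hypothetical hook: expand, evaluate, back up


visited = []


def discouraging_visit(leaf):
    # Pretend each visit reveals the leaf to be worse than previously believed.
    visited.append(leaf.name)
    leaf.Q -= 0.2


root = Node("root", children=[Node("a", Q=0.5), Node("b", Q=0.4)])
run_playouts(root, max_playouts=3, visit=discouraging_visit)
```

In this toy run the search switches from a to b as soon as a looks worse, which is the behaviour the original loop cannot show because it never returns to the root.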