deepmine's People

Contributors

bahriddin, lotharjiang, zhenxiangwang

Forkers

zhenxiangwang

deepmine's Issues

Train an AI Player for the Abalone Game based on Alpha Zero's algorithm.

Our aim is to train an artificial intelligence player for the Abalone game based on Alpha Zero's algorithm.

Abalone is an award-winning two-player abstract strategy board game. Players are represented by opposing black and white marbles on a hexagonal board, with the objective of pushing six of the opponent's marbles off the edge of the board. The hexagonal board, larger action space, and more complex rules make this game harder to implement than Go.

We want to:

  1. test the generality of Alpha Zero's algorithm on more complex problems,
  2. see whether our AI player can discover some remarkable Abalone game knowledge during
    its self-play training process,
  3. compare the performance of Alpha Zero's algorithm with other algorithms.

To keep the game from being so complex that training cannot be completed within three months, we have also identified alternative games, such as Oware.
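As a rough illustration of the core component we would need to build, here is a minimal sketch of an AlphaZero-style Monte Carlo Tree Search guided by a single policy/value network (no rollouts). The `game` and `net` interfaces (`legal_actions`, `next_state`, `is_terminal`, `predict`) are placeholders we would define ourselves, not an existing API.

```python
import math

class Node:
    """Per-edge search statistics for one (state, action) pair."""
    def __init__(self, prior):
        self.prior = prior        # P(s, a) from the policy head
        self.visits = 0           # N(s, a)
        self.value_sum = 0.0      # W(s, a)
        self.children = {}        # action -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=1.5):
    """PUCT selection: pick the child maximising Q(s, a) + U(s, a)."""
    total = sum(c.visits for c in node.children.values())
    return max(node.children.items(),
               key=lambda ac: ac[1].q() + c_puct * ac[1].prior
               * math.sqrt(total + 1) / (1 + ac[1].visits))

def run_mcts(game, state, net, simulations=400):
    """Return root visit counts; they define the improved move policy."""
    priors, _ = net.predict(state)
    root = Node(prior=1.0)
    for a in game.legal_actions(state):
        root.children[a] = Node(priors[a])
    for _ in range(simulations):
        node, s, path = root, state, []
        while node.children:                      # selection
            a, node = select_child(node)
            s = game.next_state(s, a)
            path.append(node)
        priors, value = net.predict(s)            # expansion + evaluation
        if not game.is_terminal(s):
            for a in game.legal_actions(s):
                node.children[a] = Node(priors[a])
        for n in reversed(path):                  # backup (zero-sum game)
            n.visits += 1
            n.value_sum += value
            value = -value
    return {a: c.visits for a, c in root.children.items()}
```

Self-play training would then repeatedly call `run_mcts`, sample a move from the visit counts, and store (state, visit distribution, final result) tuples for training the network.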

Related work:
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., ... & Chen, Y. (2017). Mastering the game of go without human knowledge. Nature, 550(7676), 354.

Ozcan, E., & Hulagu, B. (2004). A simple intelligent agent for playing abalone game: Abla. In Proc. of the 13th Turkish Symposium on Artificial Intelligence and Neural Networks (pp. 281-290).

Lemmens, N. P. P. M. (2005, June). Constructing an abalone game-playing agent. In Bachelor Conference Knowledge Engineering, Universiteit Maastricht.

Collaborative Game Challenge

Minecraft Mafia

There are two teams in an M x N world. Initially, the agents' positions and the
walls are placed symmetrically. Each team has the same number of agents, K, and
each team's task is to eliminate the enemy team. The killing conditions are as
follows (a short code sketch follows the list):

  • If two opposing players are facing each other, one of them (chosen at
    random) dies: an equal fight.
  • Otherwise, a player that is adjacent to an enemy and facing it kills that
    enemy: a winning position.
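A minimal sketch of how the kill rule above could be resolved each turn; the grid representation, the facing offsets, and the use of `random` for the equal-fight tie-break are our assumptions:

```python
import random

# Facing directions as (dx, dy) grid offsets.
NORTH, SOUTH, EAST, WEST = (0, -1), (0, 1), (1, 0), (-1, 0)

def faces(attacker, target):
    """True if `attacker` is adjacent to `target` and facing its cell."""
    dx, dy = attacker["facing"]
    return (attacker["x"] + dx, attacker["y"] + dy) == (target["x"], target["y"])

def resolve_fight(a, b):
    """Return the agent that dies, or None if neither can attack."""
    if faces(a, b) and faces(b, a):
        return random.choice([a, b])   # equal fight: random loser
    if faces(a, b):
        return b                       # a faces an adjacent enemy: a wins
    if faces(b, a):
        return a
    return None
```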

Some more complex situations are illustrated in the accompanying figure.

All of these actions have the same cost:

  • move forward, backward, right, or left
  • turn left
  • turn right

So turning around takes two steps: either two left turns or two right turns.
Each team's goal is to eliminate all members of the opposing team. Team members play turn by turn, so if we call the teams A and B, the move order is A1-B1-A2-B2-...
If neither team has won after (N+M) x K x c moves (where c is a constant), the game stops and a draw is recorded.

This is a partially observable game: agents have exact knowledge of what they can see, and if an enemy is closer than 4 blocks they hear it and can infer that an enemy is nearby.
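A sketch of the observation an agent might receive under this partial-observability rule; the field names and the use of Manhattan distance for "closer than 4 blocks" are assumptions:

```python
def observe(agent, enemies, visible_cells):
    """Build the partial observation for one agent on its turn."""
    # Agents know exactly what they can see...
    seen = [e for e in enemies if (e["x"], e["y"]) in visible_cells]
    # ...and only hear (without an exact position) enemies closer than 4 blocks.
    heard = any(abs(e["x"] - agent["x"]) + abs(e["y"] - agent["y"]) < 4
                for e in enemies)
    return {"seen": seen, "enemy_nearby": heard}
```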

Questions

  1. Can AI play better than humans?
  2. Which algorithms give the best results on this problem: which learn fastest, perform best, and are most robust?

Hypothesis

In team-based shooting/fighting games set in complex worlds (such as Counter-Strike, Quake, or tank games), we can build an AI team smart enough to beat human teams.

The Effect of Different Reinforcement Learning Algorithms on the Performance of AI for Survival Game in MineCraft

The Effect of Different Reinforcement Learning Algorithms on the Performance of AI for Lunar Lander

TITLE

The Effect of Different Reinforcement Learning Algorithms on the Performance of AI for Survival Game in MineCraft.

GAME

The game is based on mob_fun.py, a demo of the mob_spawner block: it creates an arena, lines it with mob spawners of a given type, and then tries to keep an agent alive.
Mobs will continue to spawn throughout the game.

The agent loses health when it is hit by the mobs. If its health drops to 0, the game ends. Apples are distributed randomly in the arena, and the agent scores points by eating them. The goal of our agent is to survive and accumulate as high a score as possible. We may change some of the rules to better suit our experiments.

HYPOTHESIS

  1. Using Double DQN, Prioritized Replay, and Dueling DQN can significantly improve scores and shorten agent training time compared with Natural DQN.
  2. If we combine a value-based reinforcement learning algorithm with a policy-based reinforcement learning algorithm, our AI agent can achieve higher scores in less time than with either algorithm alone.
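To make hypothesis 1 concrete, the main difference between Natural DQN and Double DQN is how the bootstrap target is computed. A hedged PyTorch sketch (the `q_net`/`target_net` names and tensor shapes are assumptions, not part of the game code):

```python
import torch

def bootstrap_targets(rewards, next_states, dones, q_net, target_net,
                      gamma=0.99, double=True):
    """Compute TD targets for Natural DQN (double=False) or Double DQN."""
    with torch.no_grad():
        if double:
            # Double DQN: the online net picks the action,
            # the target net evaluates it (reduces over-estimation).
            next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
            next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        else:
            # Natural DQN: the target net both picks and evaluates the action.
            next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q
```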

INDEPENDENT VARIABLE

Reinforcement learning algorithms used to train the agent

LEVELS OF INDEPENDENT VARIABLE AND NUMBERS OF REPEATED TRIALS

Seven levels, each with 3 repeated trials:

  1. Simple Rules (control)
  2. DQN
  3. Double DQN
  4. Prioritized Replay
  5. Dueling DQN
  6. Policy Gradient
  7. Actor Critic

DEPENDENT VARIABLE AND HOW MEASURED

  1. The score the agent has at the end of the game, measured as the number of apples eaten.
  2. Agent training time, measured in minutes.

CONSTANTS

  1. All agents are trained in games of the same size
  2. All agents are trained in games with the same rules and scoring conditions
  3. Game states are fully observable for all agents
  4. All agents are trained and compared using the same computing resources

Reference

  1. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., ... & Chen, Y. (2017). Mastering the game of go without human knowledge. Nature, 550(7676), 354.
  2. Osband, I., Blundell, C., Pritzel, A., & Van Roy, B. (2016). Deep exploration via bootstrapped DQN. In Advances in neural information processing systems (pp. 4026-4034).
  3. Ontanón, S., Synnaeve, G., Uriarte, A., Richoux, F., Churchill, D., & Preuss, M. (2013). A survey of real-time strategy game ai research and competition in starcraft. IEEE Transactions on Computational Intelligence and AI in games, 5(4), 293-311.
  4. Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems (pp. 1057-1063).

AI Agents for Gomoku

Gomoku, also called Gobang or Five in a Row, is an abstract strategy board game. It is traditionally played with Go pieces (black and white stones) on a Go board, using 15×15 of the 19×19 grid intersections. Players alternate turns placing a stone of their color on an empty intersection. The winner is the first player to form an unbroken chain of five stones horizontally, vertically, or diagonally.
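Since the rules are this simple, the win condition can be stated precisely in code. A minimal sketch (the board representation as a dict of placed stones is an assumption):

```python
def wins(board, x, y):
    """Return True if the stone just placed at (x, y) completes five in a row.

    `board` maps (x, y) -> 'B' or 'W' for every placed stone.
    """
    colour = board[(x, y)]
    for dx, dy in [(1, 0), (0, 1), (1, 1), (1, -1)]:   # the four line directions
        count = 1
        for sign in (1, -1):                            # walk both ways from (x, y)
            nx, ny = x + sign * dx, y + sign * dy
            while board.get((nx, ny)) == colour:
                count += 1
                nx, ny = nx + sign * dx, ny + sign * dy
        if count >= 5:
            return True
    return False
```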

We want to start researching from the following questions:

  1. Can Alpha Zero's algorithm be applied well to Gomoku?
  2. Can an AI agent using Alpha Zero's algorithm discover some remarkable Gomoku game knowledge during its self-play training process?
  3. In Gomoku, can Alpha Zero's algorithm perform better than other algorithms?
  4. After comparing the performance of multiple algorithms, can we make a better algorithm (in the future)?

Methods:

  1. Deep Reinforcement Learning + Monte Carlo Tree Search (Alpha Zero)
  2. Alpha–beta Pruning (a minimal sketch follows this list)
  3. Adaptive Dynamic Programming
  4. Hand-coded rules
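Of the methods above, alpha-beta pruning is the most conventional baseline. A minimal sketch of the search itself; the `evaluate`, `moves` and `apply_move` functions stand in for a Gomoku-specific evaluation function and move generator that we would still have to write:

```python
def alphabeta(state, depth, alpha, beta, maximizing, evaluate, moves, apply_move):
    """Depth-limited minimax with alpha-beta pruning."""
    legal = moves(state)
    if depth == 0 or not legal:
        return evaluate(state)
    if maximizing:
        best = float("-inf")
        for m in legal:
            best = max(best, alphabeta(apply_move(state, m), depth - 1,
                                       alpha, beta, False,
                                       evaluate, moves, apply_move))
            alpha = max(alpha, best)
            if alpha >= beta:
                break            # beta cutoff: opponent will avoid this line
        return best
    best = float("inf")
    for m in legal:
        best = min(best, alphabeta(apply_move(state, m), depth - 1,
                                   alpha, beta, True,
                                   evaluate, moves, apply_move))
        beta = min(beta, best)
        if alpha >= beta:
            break                # alpha cutoff
    return best
```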

Baseline:
Some naive strategies used by human beginners.

Related work:
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., ... & Chen, Y. (2017). Mastering the game of go without human knowledge. Nature, 550(7676), 354.

Zhao, D., Zhang, Z., & Dai, Y. (2012). Self-teaching adaptive dynamic programming for Gomoku. Neurocomputing, 78(1), 23-29.

Shao, K., Zhao, D., Tang, Z., & Zhu, Y. (2016, November). Move prediction in Gomoku using deep learning. In Chinese Association of Automation (YAC), Youth Academic Annual Conference of (pp. 292-297). IEEE.

Tan, Q., & Hu, X. CS221 Project Final Report Gomoku Game Agent.

Robustness and generality of Alpha Zero

Hypothesis

The authors of the recently published AlphaZero research state that the technique can be generalised to other problems without significant human effort and that it outperforms other state-of-the-art alternatives [1]. They tested the system on three board games: Go, Chess, and Shogi. We would like to test it on a different game (or games) with comparable state-space and game-tree complexities.

Questions

  • How well can the Alpha Zero algorithm perform in games other than Go, Chess and Shogi?
  • Can Alpha Zero perform better than the alternatives?
  • Can an AI agent discover some remarkable knowledge during its self-play training process using Alpha Zero?
  • After comparing the performance of multiple algorithms, can we make a better one (in the future)?

Methods

  • Deep Reinforcement Learning + Monte Carlo Tree Search (Alpha Zero)
  • Alpha-beta Pruning
  • Adaptive Dynamic Programming
  • Hand-coded rules

This list can be modified in the later stages of the research.

Candidate Games

We would like to find an answer to our research questions using at least one of these board games:

Gomoku

*Gomoku*, also called Gobang or Five in a Row, is an abstract strategy board game. It is traditionally played with Go pieces (black and white stones) on a Go board, using 15×15 of the 19×19 grid intersections. Because pieces are not moved or removed from the board, Gomoku may also be played as a paper and pencil game. The game is known in several countries under different names.

-- Wikipedia

Its state-space complexity on a 15×15 board is about 3^225, while its game-tree complexity is about 10^70 [3].

Abalone

Abalone is an award-winning two-player abstract strategy board game designed by Michel Lalet and Laurent Lévi in 1987. Players are represented by opposing black and white marbles on a hexagonal board with the objective of pushing six of the opponent's marbles off the edge of the board.

-- Wikipedia

Its state-space complexity is about 6.5 × 10^23, while the average branching factor is 60 and the game-tree complexity is about 5 × 10^154 [4].

Related Work:

  1. Silver D. et al. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. Retrieved from https://arxiv.org/abs/1712.01815.
  2. Camacho Collados, J. (2017, December 11). Is AlphaZero really a scientific breakthrough in AI? Retrieved from https://medium.com/@josecamachocollados/is-alphazero-really-a-scientific-breakthrough-in-ai-bf66ae1c84f2
  3. Loos A. (2012). Machine Learning for k-in-a-row Type Games Using Random Forest and Genetic Algorithm. Retrieved from https://comserv.cs.ut.ee/home/files/thesis_final_mod.pdf?study=ATILoputoo&reference=5D52AF13A55F51ADB1F03E3C1EEAF628BA1BC580
  4. Lemmens N. (2005). Constructing an Abalone Game-Playing Agent. Retrieved from https://project.dke.maastrichtuniversity.nl/games/files/bsc/Lemmens_BSc-paper.pdf
  5. Zhao, D., Zhang, Z., & Dai, Y. (2012). Self-teaching adaptive dynamic programming for Gomoku. Neurocomputing, 78(1), 23-29.
  6. Shao, K., Zhao, D., Tang, Z., & Zhu, Y. (2016, November). Move prediction in Gomoku using deep learning. In Chinese Association of Automation (YAC), Youth Academic Annual Conference of (pp. 292-297). IEEE.
  7. Tan, Q., & Hu, X. CS221 Project Final Report Gomoku Game Agent.
  8. Aichholzer, O., Aurenhammer, F., & Werner, T. Algorithmic Fun - Abalone. Retrieved from http://www.ist.tugraz.at/aichholzer/research/rp/abalone/tele1-02_aich-abalone.pdf

Multi-Agent Scheduling and Path Finding

Problem

Imagine a future scenario in which several unmanned vehicles are scheduled to pick up and transport passengers. The map of the city is stored in each vehicle. The problem is how to plan a globally optimal (or near-optimal) path for each vehicle so that no passenger has to wait too long. We also don't want the vehicles to waste energy on unnecessary wandering.

We define several steps that can help us to solve this problem gradually.

Step 1

In the first step, we ignore time and use only one agent. All passengers are waiting at their positions from the beginning; they don't care when they are picked up, in other words, they will wait forever. This is a classic planning problem, and under these conditions the optimal path can be found with A*.

  • We have n passengers P1 . . . Pn. For each i ∈ [1,n], there exist a start position Si and a goal position Gi. A solution to this problem is a plan π, a sequence of actions that allows agent A to carry each Pi from Si to Gi. At each time step the agent performs one of seven actions: move to one of the four neighbouring cells, wait at its current location, pick up a passenger, or drop off a passenger (an A* sketch follows this definition).
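A minimal A* sketch under the simplifications of step 1 (move actions only, Manhattan-distance heuristic); the grid representation is an assumption, and the full seven-action version would extend the state with the passenger being carried:

```python
import heapq

MOVES = [(0, -1), (0, 1), (1, 0), (-1, 0)]      # the four move actions

def a_star(free_cells, start, goal):
    """Plain A* on a 4-connected grid of traversable (x, y) cells."""
    def h(c):
        return abs(c[0] - goal[0]) + abs(c[1] - goal[1])

    frontier = [(h(start), 0, start, [start])]   # (f, g, cell, path)
    best_g = {start: 0}
    while frontier:
        _, g, cell, path = heapq.heappop(frontier)
        if cell == goal:
            return path
        for dx, dy in MOVES:
            nxt = (cell[0] + dx, cell[1] + dy)
            if nxt in free_cells and g + 1 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g + 1
                heapq.heappush(frontier,
                               (g + 1 + h(nxt), g + 1, nxt, path + [nxt]))
    return None                                  # goal unreachable
```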

Step 2

In this step, we put several agents into the scenario and consider the real multi-agent problem. We will try different algorithms to compute the best schedule, which minimises each passenger's waiting time, and the shortest paths, which save the most fuel. Meanwhile, the agents need to avoid collisions.

  • We have n passengers P1 . . . Pn. For each i ∈ [1,n], there exist a start position Si and a goal position Gi. A plan π consists of m sequences π1, . . . , πm of move/wait actions such that π allows agents A1 . . . Am to carry each Pi from Si to Gi. During each time step, each agent can perform one of the seven actions listed above.

Step 3

Finally, we take time into consideration. Passengers can place their orders at any time, so our agents need to be able to replan their schedules at any time step. We also introduce the concept of a time window: if a passenger waits too long, he or she may lose patience and cancel the order. The agents' goal is to make every passenger happy if possible; if not, to minimise the loss of passengers while also minimising energy consumption.

  • We have n passengers P1 . . . Pn. For each i ∈ [1,n], there exist a start position Si, a goal position Gi, a waiting start time step TSi, and a waiting deadline time step TDi. A plan π consists of m sequences π1, . . . , πm of actions such that π allows agents A1 . . . Am to pick up each Pi between time steps TSi and TDi and carry it from Si to Gi. During each time step, each agent can perform one of the seven actions listed above.

Plan validation

To decide whether a plan is valid/optimal, we need to consider:

  1. K-robustness
    The plan must be k-conflict free. A k-conflict occurs iff there exist a time step t and ∆ ∈ [0,k] such that two agents are located in the same location at time steps t and t + ∆ (a validation sketch follows this list).

  2. The loss of passengers
    As mentioned above, passengers cancel their orders when they lose patience. This is the most important metric in step 3: the plan should first minimise the loss of passengers, and only then optimise the other metrics.

  3. Total waiting time of passengers
    We don't want our passengers to wait too long, even though the waiting time is within their tolerance.

  4. Total path length
    This metric reflects the energy consumption by all unmanned vehicles.

  5. Others
    E.g. the variance of waiting time across passengers.
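A sketch of the k-robustness check from point 1; the path representation (one cell per time step, agents waiting on their final cell after finishing) is an assumption:

```python
def is_k_conflict_free(paths, k):
    """Return True if no two agents occupy the same cell within k time steps."""
    horizon = max(len(p) for p in paths.values())

    def at(path, t):
        return path[min(t, len(path) - 1)]       # stay on the last cell after finishing

    agents = list(paths)
    for i, a in enumerate(agents):
        for b in agents[i + 1:]:
            for t in range(horizon):
                for d in range(k + 1):
                    # k-conflict: the same cell is occupied at t and t + d
                    if (at(paths[a], t) == at(paths[b], t + d) or
                            at(paths[b], t) == at(paths[a], t + d)):
                        return False
    return True
```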

Map the problem into Minecraft

  1. We create a map representing a city road network, which is stored in the agents' memory as a grid.

  2. Passengers are represented by blocks located at arbitrary positions on the map. They can be picked up by the agents and dropped off when they arrive at their destinations.

  3. Players (agents) in Minecraft play the role of the unmanned vehicles in the real scenario; they perform one of the seven possible actions at each time step.

Metrics

  1. Completeness
    The method must be guaranteed to find a solution when one exists.

  2. Optimality
    The method must return the optimal solution if one exists.

  3. Time complexity
    Measured as the number of states (nodes) expanded during problem solving; the fewer nodes we need to expand, the less time we need to obtain a result.

  4. Space complexity
    The memory the method needs in order to solve the problem.

  5. Scalability
    Including scalability of agents/passengers/size of map.

Hypothesis

This is a combination of the Multi-Agent Path Finding (MAPF) problem and a scheduling/transportation problem. We can create a smart AI that plans a globally optimal (or near-optimal) path for each agent.

Questions

  1. Can we adapt algorithms that perform well on the classic MAPF problem to the problem we define?
  2. Which algorithm performs best on this problem?
  3. Can we improve the performance of these algorithms in any way, or introduce new algorithms to solve this problem?

Methods in previous works

We have found some previous work that focuses on the MAPF problem. We will adapt it to the problem we define, compare the performance of all implementations, and draw conclusions.

  1. A* based solution
    The A* family of algorithms are natural solvers for MAPF and serve as the benchmark for the other algorithms.

  2. Standley's enhancements
    Three methods that substantially improve the basic A* setting were introduced by Standley (2010):

    Independence detection (ID): the basic idea of ID is to divide the agents into independent groups. Two groups of agents are independent if there is an optimal solution for each group such that the two solutions do not conflict.
    Conflict avoidance table (CAT): the CAT stores the location and time of every agent in every group. Then, when an MAPF solver is applied to a given group, ties between nodes with the same f-value are broken in favor of the node that has fewer entries in the CAT.
    Operator decomposition (OD): while the former two ideas are general and can be used by any MAPF solver, OD is specific to A*-based solvers. OD reduces the branching factor by introducing intermediate states between the regular states; intermediate states are generated by applying an operator to a single agent only.

  3. Conflict-based search (CBS) solution
    CBS is a two-level algorithm. At the high level, a search is performed on a tree built from conflicts between agents. At the low level, a search is performed for a single agent at a time. In many cases this reformulation lets CBS examine fewer states than A* while still maintaining optimality (a sketch of the high-level loop follows).
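A hedged sketch of the CBS high-level loop; `plan_single` (a constrained single-agent planner) and `find_conflict` are placeholders we would implement ourselves, and the `(agent, cell, time)` constraint representation is an assumption:

```python
import heapq, itertools

def cbs(agents, plan_single, find_conflict):
    """High-level Conflict-Based Search over a constraint tree."""
    tie = itertools.count()                          # tie-breaker for the heap
    root = {a: plan_single(a, []) for a in agents}
    open_list = [(sum(len(p) for p in root.values()), next(tie), [], root)]
    while open_list:
        cost, _, constraints, solution = heapq.heappop(open_list)
        conflict = find_conflict(solution)           # (agent_a, agent_b, cell, t) or None
        if conflict is None:
            return solution                          # conflict-free and lowest-cost
        agent_a, agent_b, cell, t = conflict
        for agent in (agent_a, agent_b):             # branch: forbid the cell for one agent
            new_constraints = constraints + [(agent, cell, t)]
            new_solution = dict(solution)
            new_solution[agent] = plan_single(agent, new_constraints)
            if new_solution[agent] is None:
                continue                             # no path under these constraints
            new_cost = sum(len(p) for p in new_solution.values())
            heapq.heappush(open_list, (new_cost, next(tie),
                                       new_constraints, new_solution))
    return None
```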

Related papers

  1. K-Robust Multi-Agent Path Finding (Atzmon)
    https://aaai.org/ocs/index.php/SOCS/SOCS17/paper/viewFile/15797/15072

  2. Conflict-Based Search For Optimal Multi-Agent Path Finding (Sharon)
    https://www.aaai.org/ocs/index.php/AAAI/AAAI12/paper/viewFile/5062/5239

  3. Multi-Agent Pathfinding as a Combinatorial Auction (Amir)
    https://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/viewFile/9572/9496

  4. A review of dynamic vehicle routing problems (Pillac)
    https://hal.archives-ouvertes.fr/hal-00739779/document

  5. Finding Optimal Solutions to Cooperative Pathfinding Problems (Standley)
    https://pdfs.semanticscholar.org/2529/f40c4f79ef24165dbb1a8327770e37cced2d.pdf

  6. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm
    https://link-springer-com.ezp.lib.unimelb.edu.au/article/10.1007/BF00940812
