This repository contains:
- Implementation of the gym-compatible learning environment called MicroTbs (short for Micro Turn-Based Strategy)
- Reinforcement learning algorithms aimed at solving this game:
  - DQN, inspired by DeepMind's original work
  - Synchronous Advantage Actor-Critic (A2C), based on the original A3C algorithm and the OpenAI Baselines implementation
This repository was created for educational purposes, to try out and learn different RL techniques. It may be useful for people who also want to experiment with deep RL, or who are just looking for simple and easy-to-understand implementations of existing RL algorithms.
Here's how it looks:
- A2C solves a simple version of the environment, MicroTbs-CollectWithTerrain-v2 (longer animation: https://youtu.be/JykBihC0TvM)
- Feed-forward A2C in a more challenging version of the environment, MicroTbs-CollectPartiallyObservable-v3 (clickable)
As seen in this example, a simple feed-forward method underperforms on this task, mostly for the following reasons:
- The feed-forward architecture does not "remember" where the agent has been on previous steps, so it regularly gets stuck and fails to explore unseen parts of the environment. This can be addressed by adding memory to the policy network (e.g. LSTM cells) and training it as a recurrent net.
- The agent has only visual input; numeric information such as the number of remaining movepoints is not passed to it. Therefore, in many cases it cannot plan its actions optimally.
- Quite curiously, the agent avoids stables (the brown objects that give additional movepoints). This is caused by the small negative reward the agent receives for every step that does not collect gold. Early in training, when the agent moves mostly randomly, visiting stables significantly increases the penalty accumulated by the end of the episode.
- The agent's policy is stochastic, and during training the policy net is penalized for having too low an entropy in its action distribution. This encourages exploration and prevents early convergence to sub-optimal policies, but it can sometimes lead to seemingly silly behavior when the policy is executed. Decaying the entropy penalty over the course of training might help (see the sketch after this list).
- Due to the nature of the A2C algorithm, the agent fails to plan sufficiently far ahead and thus often gets stuck on obstacles. This could be addressed by something like a Monte Carlo playout into the (predicted) future, or an imagination module.
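To illustrate the entropy penalty mentioned above, here is a minimal sketch of a generic A2C loss with an entropy bonus whose coefficient decays over training. This is written in PyTorch purely for illustration; it is not the implementation used in this repository, and the function names and coefficient values are made up for the example:

```python
# Illustrative sketch only, not this repo's code. Generic A2C loss with an
# entropy bonus whose weight can be annealed over the course of training.
import torch


def a2c_loss(logits, values, actions, returns, entropy_coeff=0.01, value_coeff=0.5):
    dist = torch.distributions.Categorical(logits=logits)
    advantages = returns - values.detach()

    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    value_loss = (returns - values).pow(2).mean()
    entropy = dist.entropy().mean()  # penalizing low entropy encourages exploration

    return policy_loss + value_coeff * value_loss - entropy_coeff * entropy


def decayed_entropy_coeff(step, total_steps, start=0.01, end=0.001):
    # Linearly anneal the entropy bonus from `start` to `end`.
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)
```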
This environment was created as a playground for experimenting with various RL algorithms. It borrows some traits from turn-based strategy games such as HOMM3, hence the name. The task is to collect as many resources (gold) as possible within a limited number of movepoints.
Cell types in the environment:
- Red - a hero
- Yellow - gold piles
- Grey - walls, obstacles
- Green - swamp, increases movepoint penalty per move
- Brown - stables, increase hero's movepoints
- Light blue - lookout tower, reveals the map within a certain radius
- Black - fog-of-war, unknown territory
Versions of the environment:
- CollectSimple - plain gold collection, no terrain or obstacles
- CollectWithTerrain - same, but with walls and obstacles in the play area
- CollectPartiallyObservable - with all cell types; the map is bigger than the view size and must be explored
There's also a PvP version of the environment that allows experiments with self-play, but it is unfinished.
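The environments can also be used programmatically through the standard gym interface. The snippet below is a minimal sketch, assuming that importing the package's envs module registers the MicroTbs-* environments with gym (the environment ID is taken from the examples above):

```python
# Minimal interaction sketch (not taken from this repo). Assumes that importing
# microtbs_rl.envs registers the MicroTbs-* environments with gym.
import gym
import microtbs_rl.envs  # noqa: F401

env = gym.make('MicroTbs-CollectWithTerrain-v2')
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random actions, just to watch the env run
    obs, reward, done, info = env.step(action)
    env.render()
env.close()
```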
Play the environment yourself, with human controls:
python -m microtbs_rl.envs.gameplay
Train a DQN agent with default parameters and see how it works:
python -m microtbs_rl.algorithms.dqn.train_dqn
python -m microtbs_rl.algorithms.dqn.enjoy_dqn
Train an A2C agent with default parameters and see how it works:
python -m microtbs_rl.algorithms.a2c.train_a2c
python -m microtbs_rl.algorithms.a2c.enjoy_a2c
Train the OpenAI Baselines DQN implementation and see how it works:
python -m microtbs_rl.algorithms.baselines.openai_baselines.train_baseline_dqn
python -m microtbs_rl.algorithms.baselines.openai_baselines.enjoy_baseline_dqn
To compare the performance and learning curves of different algorithms, modify plotter.py to add the experiments you're interested in, then run:
python -m microtbs_rl.utils.plotter
Run unit tests:
python -m unittest
You can install this package into your Python environment and use it as a dependency:
pip install -e .
If you have any questions or problems, feel free to reach me at [email protected], or just open an issue here on GitHub.