POPGym is designed to benchmark memory in deep reinforcement learning. It contains a set of environments and a collection of memory model baselines.
Packages will be published to PyPI soon. Until then, install the environments from source:
```bash
git clone https://github.com/smorad/popgym
cd popgym
pip install .
```
To install the baselines and their dependencies, first install ray:
```bash
pip install "ray[rllib]"
```
You must do this separately because ray 2.0.0 erroneously pins an old version of gym, which causes dependency conflicts. This has been patched upstream but did not make it into the latest release. Once ray is installed, install popgym with the baselines extra:
```bash
git clone https://github.com/smorad/popgym
cd popgym
pip install ".[baselines]"
```
POPGym contains Partially Observable Markov Decision Process (POMDP) environments following the OpenAI Gym interface, where every single environment is procedurally generated. We find that much of RL is a huge pain-in-the-rear to get up and running, so our environments follow a few basic tenets:
- Painless setup - `popgym` environments require only `gym`, `numpy`, and `mazelib` as core dependencies, and can be installed with a single `pip install` (see the usage sketch after this list).
- Laptop-sized tasks - None of our environments have large observation spaces or require GPUs to render. Well-designed models should be able to solve a majority of tasks in less than a day.
- True generalization - It is possible for memoryless agents to receive high rewards on environments by memorizing the layout of each level. To avoid this, all environments are heavily randomized.
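As a quick sanity check of the Gym interface, here is a minimal interaction sketch. It assumes the pre-Gymnasium `gym` API (`reset()` returning an observation, `step()` returning a 4-tuple) and one particular environment import path, which is an assumption; see `ALL_ENVS` in `popgym/__init__.py` for the full list of registered environments.

```python
# Minimal interaction sketch. The environment import path is an assumption;
# see ALL_ENVS in popgym/__init__.py for the full list of environments.
from popgym.envs.stateless_cartpole import StatelessCartPole  # assumed path

env = StatelessCartPole()
obs = env.reset()
done = False
episode_reward = 0.0
while not done:
    # Memoryless random policy; a real agent would condition on history.
    obs, reward, done, info = env.step(env.action_space.sample())
    episode_reward += reward
print("episode reward:", episode_reward)
```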
The environments are split into set and sequence tasks. Ordering matters in sequence tasks (e.g. the order of button presses in Simon matters) and does not matter in set tasks (e.g. the "count" in blackjack does not change if you swap observations o_{t-1} and o_{t-k}). We provide a table of the environments below. Frames per second (FPS) was computed by running the `popgym-perf-test.ipynb` notebook on the Google Colab free tier, stepping and resetting a single environment for 100k timesteps. We also provide the same benchmark run on a MacBook Air (2020). With `multiprocessing`, environment FPS scales roughly linearly with the number of processes (see the sketch after the table).
Environment | Problem Class | Temporal Ordering | Colab FPS | MacBook Air (2020) FPS |
---|---|---|---|---|
Battleship | Long-term memory | None | 117,158 | 235,402 |
Concentration | Long-term memory | Weak | 47,515 | 157,217 |
Higher/Lower | Card counting | None | 24,312 | 76,903 |
Labyrinth Escape | Navigation | Strong | 1,399 | 41,122 |
Labyrinth Explore | Navigation | Strong | 1,374 | 30,611 |
Minesweeper | Long-term memory | None | 8,434 | 32,003 |
Multiarmed Bandit | Noisy dynamics | None | 48,751 | 469,325 |
Autoencode | Long-term memory | Strong | 121,756 | 251,997 |
Repeat First | Simple | None | 23,895 | 155,201 |
Repeat Previous | Simple | Strong | 50,349 | 136,392 |
Stateless Cartpole | Control | Strong | 73,622 | 218,446 |
Noisy Stateless Cartpole | Noisy dynamics | Strong | 6,269 | 66,891 |
Stateless Pendulum | Control | Strong | 8,168 | 26,358 |
Noisy Stateless Pendulum | Noisy dynamics | Strong | 6,808 | 20,090 |
Feel free to rerun this benchmark using this Colab notebook.
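To check the multiprocessing scaling claim locally, here is a rough throughput sketch using Python's standard `multiprocessing.Pool`. The environment import path and the 4-tuple `step()` API are assumptions, as above.

```python
# Rough throughput check: one environment per process, sum the FPS.
# The environment import path is an assumption; substitute any POPGym env.
import time
from multiprocessing import Pool

from popgym.envs.stateless_cartpole import StatelessCartPole  # assumed path

STEPS = 100_000

def run(_):
    env = StatelessCartPole()
    env.reset()
    start = time.time()
    for _ in range(STEPS):
        _, _, done, _ = env.step(env.action_space.sample())
        if done:
            env.reset()
    return STEPS / (time.time() - start)

if __name__ == "__main__":
    for procs in (1, 2, 4):
        with Pool(procs) as pool:
            total_fps = sum(pool.map(run, range(procs)))
        print(f"{procs} process(es): {total_fps:,.0f} steps/sec total")
```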
**Concentration** - The quintessential memory game, sometimes known simply as "memory". A deck of cards is shuffled and placed face-down. The agent picks two cards to flip face-up; if the cards match in rank, they are removed from play and the agent receives a reward. If they do not match, they are returned face-down. The agent must remember where it has seen cards in the past.
**Higher/Lower** - Guess whether the next card drawn from the deck is higher or lower than the previously drawn card. The agent should keep a count, as in blackjack, and modify its bets accordingly, though this game is significantly simpler than blackjack.
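The "count" can be as simple as tracking which ranks remain unseen. A sketch of such a counting policy follows (illustrative only, not part of the library; assumes a single 52-card deck with ranks encoded 0-12):

```python
# Card-counting sketch for Higher/Lower (illustrative, not library code).
from collections import Counter

remaining = Counter({rank: 4 for rank in range(13)})  # 4 cards per rank

def observe(rank):
    """Update the count when a card is revealed."""
    remaining[rank] -= 1

def best_guess(current_rank):
    """Guess 'higher' or 'lower' based on the unseen cards."""
    higher = sum(n for r, n in remaining.items() if r > current_rank)
    lower = sum(n for r, n in remaining.items() if r < current_rank)
    return "higher" if higher >= lower else "lower"
```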
**Battleship** - One-player battleship. Select a grid square to launch an attack, and receive confirmation of whether you hit the target. The agent should use memory to recall which grid squares were hits and which were misses, completing the episode sooner.
**Multiarmed Bandit** - Over an episode, solve a multiarmed bandit problem by maximizing the expected reward. The agent should use memory to keep a running mean and variance for each bandit arm, as sketched below.
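One standard way to keep those statistics without storing the full reward history is Welford's online algorithm; a sketch (illustrative, not library code):

```python
# Welford's online algorithm: running mean and variance for one bandit arm
# without storing the reward history (illustrative sketch).
class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, reward):
        self.n += 1
        delta = reward - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (reward - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```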
**Minesweeper** - Classic minesweeper, but with reduced vision range. The agent only sees the surroundings of its most recent sweep and must use memory to remember where the mines are.
**Repeat Previous** - Output the observation from k steps ago (o_{t-k}) for a reward.
**Repeat First** - Output the zeroth observation (o_0) for a reward.
**Autoencode** - The agent receives k observations, then must output them in the same order for a reward.
**Stateless Cartpole** - Classic cartpole, except the velocity and angular velocity magnitudes are hidden. The agent must use memory to compute rates of change.
**Stateless Pendulum** - Classic pendulum, but the velocity and angular velocity are hidden from the agent. The agent must use memory to compute rates of change.
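For both stateless control tasks, the hidden velocities can in principle be recovered by differencing consecutive observations. A sketch, assuming a fixed and known timestep `dt` (the default value below is a placeholder):

```python
# Recovering hidden velocities by first-order finite differences over
# consecutive observations (illustrative; dt is a placeholder value).
def estimate_velocities(prev_obs, obs, dt=0.02):
    """Estimate rates of change between two consecutive observations."""
    return [(x - px) / dt for px, x in zip(prev_obs, obs)]
```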
**Labyrinth Escape** - Escape randomly generated labyrinths. The agent must remember the wrong turns it has taken to find the exit.
**Labyrinth Explore** - Explore as much of the labyrinth as possible in the time given. The agent must remember where it has already been to maximize reward.
We implement the following baselines as RLlib custom models:
- MLP
- Positional MLP
- Framestacking
- Temporal Convolution
- Elman Networks
- Long Short-Term Memory
- Gated Recurrent Units
- Independently Recurrent Neural Networks
- Fast Autoregressive Transformers
- Fast Weight Programmers
- Legendre Memory Units
- Diagonal State Space Models
- Differentiable Neural Computers
To add your own custom model, inherit from `BaseModel` and implement the `initial_state` and `memory_forward` functions, and define your model configuration using `MODEL_CONFIG`. To use any of these baselines or your own custom model in `ray`, simply add it to the `ray` config:
```python
import ray
from popgym.baselines.ray_models.ray_lstm import LSTM

config = {
    ...  # your other training settings
    "model": {
        "custom_model": LSTM,
        "custom_model_config": {"hidden_size": 128},
    },
}
ray.tune.run("PPO", config=config)
```
Each model defines a `MODEL_CONFIG` that you can set by adding keys and values to `custom_model_config`. See `ppo.py` for an in-depth example.
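For orientation, a rough skeleton of a custom model is shown below. The names `BaseModel`, `initial_state`, `memory_forward`, and `MODEL_CONFIG` come from the description above; the import path, method signatures, and tensor shapes are assumptions, so consult `BaseModel` and `ppo.py` for the authoritative interface.

```python
# Rough skeleton of a custom memory model. Method and attribute names come
# from the docs above; the import path, signatures, and shapes below are
# assumptions -- consult BaseModel and ppo.py for the real interface.
import torch

from popgym.baselines.ray_models.base_model import BaseModel  # assumed path


class SummingModel(BaseModel):
    # Default hyperparameters, overridable via custom_model_config.
    MODEL_CONFIG = {"hidden_size": 128}

    def initial_state(self):
        # One zeroed memory tensor at episode start (shape is an assumption).
        return [torch.zeros(self.MODEL_CONFIG["hidden_size"])]

    def memory_forward(self, z, state):
        # Toy memory: accumulate features into the running state and emit
        # the accumulated sum as the output features.
        [memory] = state
        memory = memory + z
        return memory, [memory]
```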
Steps to follow:

1. Fork this repo on GitHub
2. Clone your fork to your machine
3. Move your environment into the forked repo
4. Install pre-commit in the fork (see below)
5. Write a unit test in `tests/` (see the other tests for examples)
6. Add your environment to `ALL_ENVS` in `popgym/__init__.py`
7. Make sure you don't break any tests by running `pytest tests/`
8. Git commit and push to your fork
9. Open a pull request on GitHub
```bash
# Step 4. Install pre-commit in the fork
pip install pre-commit
git clone https://github.com/smorad/popgym
cd popgym
pre-commit install
```
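For step 5, here is a minimal test sketch. The environment name and import path are hypothetical placeholders; mirror the existing files in `tests/` for the real conventions.

```python
# tests/test_my_env.py -- minimal shape of an environment unit test.
# MyEnv and its import path are hypothetical placeholders.
from popgym.envs.my_env import MyEnv  # hypothetical


def test_episode_terminates():
    env = MyEnv()
    obs = env.reset()
    assert env.observation_space.contains(obs)
    done = False
    steps = 0
    while not done and steps < 10_000:
        obs, reward, done, info = env.step(env.action_space.sample())
        assert env.observation_space.contains(obs)
        steps += 1
    assert done, "episode should terminate within the step budget"
```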