POPGym is designed to benchmark memory in deep reinforcement learning. It contains a set of environments and a collection of memory model baselines.
Packages will be published to PyPI soon. Until then, install the environments from source:
```bash
git clone https://github.com/smorad/popgym
cd popgym
pip install .
```
To install the baselines and their dependencies, first install ray:
```bash
pip install "ray[rllib]"
```
You must do this separately because ray 2.0.0 erroneously pins an old version of gym, which causes dependency conflicts. This has been patched upstream but did not make it into the latest release. Once ray is installed, install popgym with the baselines extra:
```bash
git clone https://github.com/smorad/popgym
cd popgym
pip install ".[baselines]"
```
POPGym contains Partially Observable Markov Decision Process (POMDP) environments following the OpenAI Gym interface, where every single environment is procedurally generated. We find that much of RL is a huge pain-in-the-rear to get up and running, so our environments follow a few basic tenets:
- Painless setup - `popgym` environments require only `gym`, `numpy`, and `mazelib` as core dependencies, and can be installed with a single `pip install` (see the usage sketch after this list).
- Laptop-sized tasks - None of our environments have large observation spaces or require GPUs to render. Well-designed models should be able to solve a majority of tasks in less than a day.
- True generalization - It is possible for memoryless agents to receive high rewards on environments by memorizing the layout of each level. To avoid this, all environments are heavily randomized.
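As a quick sanity check of the Gym interface, here is a minimal interaction sketch. It assumes the pre-Gymnasium `gym` API (`reset()` returning an observation, `step()` returning a 4-tuple) and one particular environment import path, which is an assumption; see `ALL_ENVS` in `popgym/__init__.py` for the full list of registered environments.

```python
# Minimal interaction sketch. The environment import path is an assumption;
# see ALL_ENVS in popgym/__init__.py for the full list of environments.
from popgym.envs.stateless_cartpole import StatelessCartPole  # assumed path

env = StatelessCartPole()
obs = env.reset()
done = False
episode_reward = 0.0
while not done:
    # Memoryless random policy; a real agent would condition on history.
    obs, reward, done, info = env.step(env.action_space.sample())
    episode_reward += reward
print("episode reward:", episode_reward)
```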
The environments are split into set and sequence tasks. Ordering matters in sequence tasks (e.g. the order of button presses in Simon matters) and does not matter in set tasks (e.g. the "count" in blackjack does not change if you swap observations o_{t-1} and o_{t-k}). We provide a table of the environments below. Frames per second (FPS) was computed by running the `popgym-perf-test.ipynb` notebook on the Google Colab free tier, stepping and resetting a single environment for 100k timesteps. We also provide the same benchmark run on a MacBook Air (2020). With `multiprocessing`, environment FPS scales roughly linearly with the number of processes (see the sketch after the table).
Environment | Problem Class | Temporal Ordering | Colab FPS | MacBook Air (2020) FPS |
---|---|---|---|---|
Battleship | Long-term memory | None | 117,158 | 235,402 |
Concentration | Long-term memory | Weak | 47,515 | 157,217 |
Higher/Lower | Card counting | None | 24,312 | 76,903 |
Labyrinth Escape | Navigation | Strong | 1,399 | 41,122 |
Labyrinth Explore | Navigation | Strong | 1,374 | 30,611 |
Minesweeper | Long-term memory | None | 8,434 | 32,003 |
Multiarmed Bandit | Noisy dynamics | None | 48,751 | 469,325 |
Autoencode | Long-term memory | Strong | 121,756 | 251,997 |
Repeat First | Simple | None | 23,895 | 155,201 |
Repeat Previous | Simple | Strong | 50,349 | 136,392 |
Stateless Cartpole | Control | Strong | 73,622 | 218,446 |
Noisy Stateless Cartpole | Noisy dynamics | Strong | 6,269 | 66,891 |
Stateless Pendulum | Control | Strong | 8,168 | 26,358 |
Noisy Stateless Pendulum | Noisy dynamics | Strong | 6,808 | 20,090 |
Feel free to rerun this benchmark using this Colab notebook.
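To check the multiprocessing scaling claim locally, here is a rough throughput sketch using Python's standard `multiprocessing.Pool`. The environment import path and the 4-tuple `step()` API are assumptions, as above.

```python
# Rough throughput check: one environment per process, sum the FPS.
# The environment import path is an assumption; substitute any POPGym env.
import time
from multiprocessing import Pool

from popgym.envs.stateless_cartpole import StatelessCartPole  # assumed path

STEPS = 100_000

def run(_):
    env = StatelessCartPole()
    env.reset()
    start = time.time()
    for _ in range(STEPS):
        _, _, done, _ = env.step(env.action_space.sample())
        if done:
            env.reset()
    return STEPS / (time.time() - start)

if __name__ == "__main__":
    for procs in (1, 2, 4):
        with Pool(procs) as pool:
            total_fps = sum(pool.map(run, range(procs)))
        print(f"{procs} process(es): {total_fps:,.0f} steps/sec total")
```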
**Concentration** - The quintessential memory game, sometimes known simply as "memory". A deck of cards is shuffled and placed face-down. The agent picks two cards to flip face-up; if the cards match in rank, they are removed from play and the agent receives a reward. If they do not match, they are returned face-down. The agent must remember where it has seen cards in the past.
**Higher/Lower** - Guess whether the next card drawn from the deck is higher or lower than the previously drawn card. The agent should keep a count, as in blackjack, and modify its bets accordingly, though this game is significantly simpler than blackjack.
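The "count" can be as simple as tracking which ranks remain unseen. A sketch of such a counting policy follows (illustrative only, not part of the library; assumes a single 52-card deck with ranks encoded 0-12):

```python
# Card-counting sketch for Higher/Lower (illustrative, not library code).
from collections import Counter

remaining = Counter({rank: 4 for rank in range(13)})  # 4 cards per rank

def observe(rank):
    """Update the count when a card is revealed."""
    remaining[rank] -= 1

def best_guess(current_rank):
    """Guess 'higher' or 'lower' based on the unseen cards."""
    higher = sum(n for r, n in remaining.items() if r > current_rank)
    lower = sum(n for r, n in remaining.items() if r < current_rank)
    return "higher" if higher >= lower else "lower"
```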
**Battleship** - One-player battleship. Select a grid square to launch an attack, and receive confirmation of whether you hit the target. The agent should use memory to recall which grid squares were hits and which were misses, completing the episode sooner.
**Multiarmed Bandit** - Over an episode, solve a multiarmed bandit problem by maximizing the expected reward. The agent should use memory to keep a running mean and variance for each bandit arm, as sketched below.
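One standard way to keep those statistics without storing the full reward history is Welford's online algorithm; a sketch (illustrative, not library code):

```python
# Welford's online algorithm: running mean and variance for one bandit arm
# without storing the reward history (illustrative sketch).
class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, reward):
        self.n += 1
        delta = reward - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (reward - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```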
**Minesweeper** - Classic minesweeper, but with reduced vision range. The agent only sees the surroundings of its most recent sweep and must use memory to remember where the mines are.
**Repeat Previous** - Output the observation from k steps ago (o_{t-k}) for a reward.
**Repeat First** - Output the zeroth observation (o_0) for a reward.
**Autoencode** - The agent receives k observations, then must output them in the same order for a reward.
**Stateless Cartpole** - Classic cartpole, except the velocity and angular velocity magnitudes are hidden. The agent must use memory to compute rates of change.
**Stateless Pendulum** - Classic pendulum, but the velocity and angular velocity are hidden from the agent. The agent must use memory to compute rates of change.
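For both stateless control tasks, the hidden velocities can in principle be recovered by differencing consecutive observations. A sketch, assuming a fixed and known timestep `dt` (the default value below is a placeholder):

```python
# Recovering hidden velocities by first-order finite differences over
# consecutive observations (illustrative; dt is a placeholder value).
def estimate_velocities(prev_obs, obs, dt=0.02):
    """Estimate rates of change between two consecutive observations."""
    return [(x - px) / dt for px, x in zip(prev_obs, obs)]
```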
**Labyrinth Escape** - Escape randomly generated labyrinths. The agent must remember the wrong turns it has taken to find the exit.
**Labyrinth Explore** - Explore as much of the labyrinth as possible in the time given. The agent must remember where it has already been to maximize reward.
We implement the following baselines as RLlib custom models:
- MLP
- Positional MLP
- Framestacking
- Temporal Convolution
- Elman Networks
- Long Short-Term Memory
- Gated Recurrent Units
- Independently Recurrent Neural Networks
- Fast Autoregressive Transformers
- Fast Weight Programmers
- Legendre Memory Units
- Diagonal State Space Models
- Differentiable Neural Computers
To add your own custom model, inherit from `BaseModel` and implement the `initial_state` and `memory_forward` functions, and define your model configuration using `MODEL_CONFIG`. To use any of these baselines or your own custom model in `ray`, simply add it to the `ray` config:
```python
import ray
from popgym.baselines.ray_models.ray_lstm import LSTM

config = {
    ...  # your other training settings
    "model": {
        "custom_model": LSTM,
        "custom_model_config": {"hidden_size": 128},
    },
}
ray.tune.run("PPO", config=config)
```
Each model defines a `MODEL_CONFIG` that you can set by adding keys and values to `custom_model_config`. See `ppo.py` for an in-depth example.
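For orientation, a rough skeleton of a custom model is shown below. The names `BaseModel`, `initial_state`, `memory_forward`, and `MODEL_CONFIG` come from the description above; the import path, method signatures, and tensor shapes are assumptions, so consult `BaseModel` and `ppo.py` for the authoritative interface.

```python
# Rough skeleton of a custom memory model. Method and attribute names come
# from the docs above; the import path, signatures, and shapes below are
# assumptions -- consult BaseModel and ppo.py for the real interface.
import torch

from popgym.baselines.ray_models.base_model import BaseModel  # assumed path


class SummingModel(BaseModel):
    # Default hyperparameters, overridable via custom_model_config.
    MODEL_CONFIG = {"hidden_size": 128}

    def initial_state(self):
        # One zeroed memory tensor at episode start (shape is an assumption).
        return [torch.zeros(self.MODEL_CONFIG["hidden_size"])]

    def memory_forward(self, z, state):
        # Toy memory: accumulate features into the running state and emit
        # the accumulated sum as the output features.
        [memory] = state
        memory = memory + z
        return memory, [memory]
```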
Steps to follow:

1. Fork this repo on GitHub
2. Clone your fork to your machine
3. Move your environment into the forked repo
4. Install pre-commit in the fork (see below)
5. Write a unit test in `tests/` (see the other tests for examples)
6. Add your environment to `ALL_ENVS` in `popgym/__init__.py`
7. Make sure you don't break any tests by running `pytest tests/`
8. Git commit and push to your fork
9. Open a pull request on GitHub
```bash
# Step 4. Install pre-commit in the fork
pip install pre-commit
git clone https://github.com/smorad/popgym
cd popgym
pre-commit install
```
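For step 5, here is a minimal test sketch. The environment name and import path are hypothetical placeholders; mirror the existing files in `tests/` for the real conventions.

```python
# tests/test_my_env.py -- minimal shape of an environment unit test.
# MyEnv and its import path are hypothetical placeholders.
from popgym.envs.my_env import MyEnv  # hypothetical


def test_episode_terminates():
    env = MyEnv()
    obs = env.reset()
    assert env.observation_space.contains(obs)
    done = False
    steps = 0
    while not done and steps < 10_000:
        obs, reward, done, info = env.step(env.action_space.sample())
        assert env.observation_space.contains(obs)
        steps += 1
    assert done, "episode should terminate within the step budget"
```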