alex-petrenko / landmark-exploration

Attempt to develop a new RL algorithm for hard exploration problems.

License: MIT License
This might be an interesting environment to try:
https://blogs.unity3d.com/2019/01/28/obstacle-tower-challenge-test-the-limits-of-intelligence-systems/
Based on distance metric we can provide both sparse rewards (e.g. only +1 for discovering new landmarks) and dense reward (e.g. give a positive reward for getting further from the known landmarks).
Intuitively, sparse rewards should be more reliable, because dense rewards may introduce unexpected local maxima. On the other hand, dense rewards can give better sample efficiency.
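A minimal sketch of the two reward schemes described above, assuming we already have distances from the current observation to every known landmark (the function name, threshold, and scale are hypothetical, not from the repo):

```python
import numpy as np

def landmark_rewards(dist_to_landmarks, novelty_threshold=2.0, dense_scale=0.1):
    """Hypothetical reward shaping based on distances to known landmarks.

    dist_to_landmarks: distances (in some learned metric) from the current
    observation to every landmark already stored in the map.
    Returns (sparse_reward, dense_reward).
    """
    d_min = float(np.min(dist_to_landmarks))
    # Sparse: +1 only when the agent is far from every known landmark,
    # i.e. it has discovered a new region worth adding as a landmark.
    sparse = 1.0 if d_min > novelty_threshold else 0.0
    # Dense: small positive signal proportional to distance from the map;
    # this is the variant that can create unexpected local maxima.
    dense = dense_scale * d_min
    return sparse, dense
```

For example, with distances [0.5, 3.0] and the default threshold the nearest landmark is close, so the sparse reward is 0 while the dense reward is 0.05.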
The idea of a state-space map is pretty general, so we can also try experiments with sparse-reward robotic manipulation (e.g. pushing an object to a target, stacking, etc.)
Need wrappers and default algorithm parameters
For the generalization study we need elaborate randomly generated mazes; the VizDoom level generator can be used for this.
This should be pretty straightforward. We already have an implementation of ICM with A2C; we need to use it with the PPO algorithm. RND is the same but without training of the inverse dynamics model.
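To illustrate the RND idea mentioned above (a sketch of the general technique in plain numpy, not the repo's implementation): a frozen random target network and a trained predictor; the predictor's error on an observation serves as the intrinsic reward, and it shrinks as the observation becomes familiar. All names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, feat_dim = 8, 4

# Fixed random target network (never trained) -- the core of RND.
W_target = rng.normal(size=(obs_dim, feat_dim))
# Predictor network, trained to match the target's output.
W_pred = np.zeros((obs_dim, feat_dim))

def intrinsic_reward(obs):
    # Prediction error is high for novel observations, low for familiar ones.
    err = obs @ W_pred - obs @ W_target
    return float(np.mean(err ** 2))

def train_predictor(obs, lr=0.01):
    global W_pred
    err = obs @ W_pred - obs @ W_target
    # Gradient step on the mean squared prediction error w.r.t. W_pred.
    W_pred -= lr * (2.0 / feat_dim) * np.outer(obs, err)

obs = rng.normal(size=obs_dim)
reward_before = intrinsic_reward(obs)
for _ in range(200):
    train_predictor(obs)
reward_after = intrinsic_reward(obs)
```

After repeatedly training on the same observation its intrinsic reward drops, while unseen observations keep a high reward; note there is no inverse dynamics model anywhere, unlike ICM.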
Currently, neighbors are stored as plain images. Instead, it would be better to store them as feature vectors.
In DMLab we can easily query ground truth coordinates of the agent with respect to the map.
Would be very cool to adjust the graph visualization to incorporate this information.
Space discretization can be implemented in two different ways:
Option (2) has the potential to be more generalizable, and I've never seen it used this way.
We randomly generate a large number of 3D mazes and split them into training, validation, and test sets. The policy is then trained only on the training set.
We can open-source this dataset, which should be a good contribution.
The agent should be able to robustly navigate between any pair of landmarks connected by an edge in the graph.
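Navigating between an arbitrary pair of landmarks reduces to finding a path through the graph and then following it edge by edge. A minimal sketch of the path-finding part (graph representation and function name are assumptions, not the repo's API):

```python
from collections import deque

def landmark_path(graph, start, goal):
    """BFS over the landmark graph. `graph` maps a landmark id to the ids
    of its neighbors (landmarks connected by an edge). Returns the sequence
    of landmarks to traverse, or None if the goal is unreachable."""
    frontier = deque([start])
    came_from = {start: None}
    while frontier:
        node = frontier.popleft()
        if node == goal:
            # Reconstruct the path by walking the parent pointers back.
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for nb in graph.get(node, ()):
            if nb not in came_from:
                came_from[nb] = node
                frontier.append(nb)
    return None
```

For example, `landmark_path({0: [1], 1: [0, 2], 2: [1]}, 0, 2)` returns `[0, 1, 2]`; the locomotion policy would then be responsible for each single-edge hop.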
Possibilities are:
This mostly applies to training on a single environment. We need to learn an exploration policy (one that knows how to expand the graph), but we also need to keep making progress, so we should not reset the graph too often.
Currently implemented with the dynamic RNN, but we can also do deep sets and potentially other architectures.
When the exploration is stuck we want to look at the graph and select the next target for exploration. A basic policy is to sample one at random, although this might not be efficient for large environments, because we always want to stay on the "frontier" of exploration.
One idea is to use the value estimate of the exploration policy as the "potential" of the landmark for further exploration. We can also use UCB or Thompson sampling to make sure we explore all landmarks, even those that don't seem promising now.
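The UCB variant of this idea can be sketched as follows: score each landmark by its value estimate (the "potential") plus an exploration bonus that grows for rarely visited landmarks. The function name and parameters are hypothetical, assumed for illustration:

```python
import math

def select_landmark_ucb(values, visit_counts, c=1.0):
    """Pick the landmark maximizing value + c * sqrt(ln(total) / visits).

    values: per-landmark exploration 'potential', e.g. the exploration
    policy's value estimates. visit_counts: how often each landmark was
    chosen as an exploration target.
    """
    total = sum(visit_counts)
    best, best_score = None, -math.inf
    for i, (v, n) in enumerate(zip(values, visit_counts)):
        if n == 0:
            return i  # always try unvisited landmarks first
        score = v + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best, best_score = i, score
    return best
```

With equal visit counts this degenerates to picking the highest value, while a landmark that was rarely selected eventually wins even if its value estimate looks unpromising, which is exactly the behavior we want here.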
This should be TMAX but without:
I think we should just have a flag that would disable all of the above.
The baseline methods were designed mostly to explore a single sparse-reward environment rather than learn a general exploration policy.
We could generate a number of particularly challenging mazes (or use this Unity environment) to see how our method compares to baselines when trained and evaluated on a single environment.
Now it is marked as "skipped"
The first version will likely be based on just randomly picking a new landmark to explore.
The current assumption is that information about neighbors allows more "directed" exploration: the policy can decide where to go based on what it has already seen in the immediate vicinity. This, however, needs to be verified.
Train the exact same policy, but without the heuristic of returning to the next potentially interesting target.
Currently the edges are undirected, which does not make sense in Montezuma's Revenge, because many transitions in this game are irreversible (e.g. they result in death).