alex-petrenko / landmark-exploration

Attempt to develop a new RL algorithm for hard exploration problems.

License: MIT License
This might be an interesting environment to try:
https://blogs.unity3d.com/2019/01/28/obstacle-tower-challenge-test-the-limits-of-intelligence-systems/
Based on distance metric we can provide both sparse rewards (e.g. only +1 for discovering new landmarks) and dense reward (e.g. give a positive reward for getting further from the known landmarks).
Intuitively, sparse rewards should be more reliable, because dense rewards may introduce unexpected local maxima. On the other hand, dense rewards can give better sample efficiency.
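A minimal sketch of the two reward schemes described above, assuming we already have distances from the current observation to every known landmark (the function name, threshold, and scale are hypothetical, not from the repo):

```python
import numpy as np

def landmark_rewards(dist_to_landmarks, novelty_threshold=2.0, dense_scale=0.1):
    """Hypothetical reward shaping based on distances to known landmarks.

    dist_to_landmarks: distances (in some learned metric) from the current
    observation to every landmark already stored in the map.
    Returns (sparse_reward, dense_reward).
    """
    d_min = float(np.min(dist_to_landmarks))
    # Sparse: +1 only when the agent is far from every known landmark,
    # i.e. it has discovered a new region worth adding as a landmark.
    sparse = 1.0 if d_min > novelty_threshold else 0.0
    # Dense: small positive signal proportional to distance from the map;
    # this is the variant that can create unexpected local maxima.
    dense = dense_scale * d_min
    return sparse, dense
```

For example, with distances [0.5, 3.0] and the default threshold the nearest landmark is close, so the sparse reward is 0 while the dense reward is 0.05.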
The idea of a state-space map is pretty general, so we can also try experiments with sparse-reward robotic manipulation (e.g. pushing an object to a target, stacking, etc.)
Need wrappers and default algorithm parameters
For the generalization study we need elaborate randomly generated mazes; the VizDoom level generator can be used for this.
This should be pretty straightforward. We already have an implementation of ICM with A2C; we need to use it with the PPO algorithm. RND is the same but without training of the inverse dynamics model.
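To illustrate the RND idea mentioned above (a sketch of the general technique in plain numpy, not the repo's implementation): a frozen random target network and a trained predictor; the predictor's error on an observation serves as the intrinsic reward, and it shrinks as the observation becomes familiar. All names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, feat_dim = 8, 4

# Fixed random target network (never trained) -- the core of RND.
W_target = rng.normal(size=(obs_dim, feat_dim))
# Predictor network, trained to match the target's output.
W_pred = np.zeros((obs_dim, feat_dim))

def intrinsic_reward(obs):
    # Prediction error is high for novel observations, low for familiar ones.
    err = obs @ W_pred - obs @ W_target
    return float(np.mean(err ** 2))

def train_predictor(obs, lr=0.01):
    global W_pred
    err = obs @ W_pred - obs @ W_target
    # Gradient step on the mean squared prediction error w.r.t. W_pred.
    W_pred -= lr * (2.0 / feat_dim) * np.outer(obs, err)

obs = rng.normal(size=obs_dim)
reward_before = intrinsic_reward(obs)
for _ in range(200):
    train_predictor(obs)
reward_after = intrinsic_reward(obs)
```

After repeatedly training on the same observation its intrinsic reward drops, while unseen observations keep a high reward; note there is no inverse dynamics model anywhere, unlike ICM.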
Currently, neighbors are stored as plain images. Instead, it would be better to store them as feature vectors.
In DMLab we can easily query ground truth coordinates of the agent with respect to the map.
Would be very cool to adjust the graph visualization to incorporate this information.
Space discretization can be implemented in two different ways:
Option (2) has the potential to be more generalizable, and I've never seen it used this way.
We randomly generate a large number of 3D mazes and split them into training, validation, and test sets. The policy is then trained only on the training set.
We can open-source this dataset, which should be a good contribution.
The agent should be able to robustly navigate between any pair of landmarks connected by an edge in the graph.
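Navigating between an arbitrary pair of landmarks reduces to finding a path through the graph and then following it edge by edge. A minimal sketch of the path-finding part (graph representation and function name are assumptions, not the repo's API):

```python
from collections import deque

def landmark_path(graph, start, goal):
    """BFS over the landmark graph. `graph` maps a landmark id to the ids
    of its neighbors (landmarks connected by an edge). Returns the sequence
    of landmarks to traverse, or None if the goal is unreachable."""
    frontier = deque([start])
    came_from = {start: None}
    while frontier:
        node = frontier.popleft()
        if node == goal:
            # Reconstruct the path by walking the parent pointers back.
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for nb in graph.get(node, ()):
            if nb not in came_from:
                came_from[nb] = node
                frontier.append(nb)
    return None
```

For example, `landmark_path({0: [1], 1: [0, 2], 2: [1]}, 0, 2)` returns `[0, 1, 2]`; the locomotion policy would then be responsible for each single-edge hop.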
Possibilities are:
This mostly applies to training on a single environment. We need to learn an exploration policy (one that knows how to expand the graph), but we also need to keep making progress, so we should not reset the graph too often.
Currently implemented with the dynamic RNN, but we can also do deep sets and potentially other architectures.
When the exploration is stuck we want to look at the graph and select the next target for exploration. A basic policy is to sample one at random, although this might not be efficient for large environments, because we always want to stay on the "frontier" of exploration.
One idea is to use the value estimate of the exploration policy as the "potential" of the landmark for further exploration. We can also use UCB or Thompson sampling to make sure we explore all landmarks, even those that don't seem promising now.
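The UCB variant of this idea can be sketched as follows: score each landmark by its value estimate (the "potential") plus an exploration bonus that grows for rarely visited landmarks. The function name and parameters are hypothetical, assumed for illustration:

```python
import math

def select_landmark_ucb(values, visit_counts, c=1.0):
    """Pick the landmark maximizing value + c * sqrt(ln(total) / visits).

    values: per-landmark exploration 'potential', e.g. the exploration
    policy's value estimates. visit_counts: how often each landmark was
    chosen as an exploration target.
    """
    total = sum(visit_counts)
    best, best_score = None, -math.inf
    for i, (v, n) in enumerate(zip(values, visit_counts)):
        if n == 0:
            return i  # always try unvisited landmarks first
        score = v + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best, best_score = i, score
    return best
```

With equal visit counts this degenerates to picking the highest value, while a landmark that was rarely selected eventually wins even if its value estimate looks unpromising, which is exactly the behavior we want here.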
This should be TMAX but without:
I think we should just have a flag that would disable all of the above.
The baseline methods were designed mostly to explore a single sparse-reward environment rather than learn a general exploration policy.
We could generate a number of particularly challenging mazes (or use this Unity environment) to see how our method compares to baselines when trained and evaluated on a single environment.
Now it is marked as "skipped"
The first version will likely be based on just randomly picking a new landmark to explore.
The current assumption is that information about neighbors allows more "directed" exploration: the policy can decide where to go based on what it has already seen in the immediate vicinity. This, however, needs to be verified.
Train the exact same policy, but without the heuristic of returning to the next potentially interesting target.
Currently the edges are undirected, which does not make sense in Montezuma's Revenge, because many transitions in this game are irreversible (e.g. they result in death).