
License: MIT License

Language: Jupyter Notebook (100.00%)
Topics: reinforcement-learning, upside-down-reinforcement-learning, reinforcement-learning-algorithms, continuous-action-space, discrete-action-space, cartpole-environment, upside-down, pytorch, machine-learning, machine-learning-algorithms

upside-down-reinforcement-learning's Introduction

Upside-Down-Reinforcement-Learning DOI

Upside-Down Reinforcement Learning (⅂ꓤ) implementation in PyTorch.
Based on the paper published by Jürgen Schmidhuber: ⅂ꓤ-Paper

This repository contains both a discrete-action-space and a continuous-action-space implementation for the OpenAI Gym CartPole environment (the continuous implementation uses a continuous version of the environment).

The notebooks include the training of a behavior function as well as an evaluation part where you can test the trained behavior function: feed it a desired reward that the agent should achieve within a desired time horizon, and it selects the actions (see the sketch below).
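As an illustration, here is a minimal sketch of such a command-conditioned behavior function; the class name, layer sizes, and command scaling below are assumptions for the example, not the exact architecture used in the notebooks.

import torch
import torch.nn as nn

# Illustrative sketch (not the exact notebook architecture): the command
# (desired return, desired horizon) is scaled, concatenated to the state,
# and mapped to action logits.
class BehaviorFunction(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64, command_scale=0.02):
        super().__init__()
        self.command_scale = command_scale
        self.net = nn.Sequential(
            nn.Linear(state_size + 2, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, action_size),
        )

    def forward(self, state, desired_return, desired_horizon):
        command = torch.stack([desired_return, desired_horizon], dim=-1) * self.command_scale
        return self.net(torch.cat([state, command], dim=-1))

# Example: ask for a return of 200 within 200 steps in CartPole (4-dim state, 2 actions).
bf = BehaviorFunction(state_size=4, action_size=2)
logits = bf(torch.zeros(1, 4), torch.tensor([200.0]), torch.tensor([200.0]))
action = torch.distributions.Categorical(logits=logits).sample().item()

Scaling the command keeps the (potentially large) desired return in a range comparable to the state features.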

Plots for the discrete CartPole Environment: [plot]

Plots for the continuous CartPole Environment: [plot]

Plots for the LunarLander Environment: [plot]

TODO:

  • Test some of the possible improvements mentioned in the paper (Section 6, Future Research Directions).

Author

  • Sebastian Dittert

Feel free to use this code for your own projects or research. For citation, use the DOI or cite as:

@misc{Upside-Down,
  author = {Dittert, Sebastian},
  title = {PyTorch Implementation of Upside-Down RL},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/BY571/Upside-Down-Reinforcement-Learning}},
}


upside-down-reinforcement-learning's Issues

Save model before plots

Please move:

torch.save(bf.state_dict(), name)

so that it sits between these two lines:

rewards, average, d, h, loss = run_upside_down(max_episodes=200)
plt.figure(figsize=(15,8))
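In other words, the suggested ordering would be (a sketch that reuses the notebook's existing run_upside_down, bf, and name; saving first means a later plotting failure cannot lose the trained weights):

rewards, average, d, h, loss = run_upside_down(max_episodes=200)
torch.save(bf.state_dict(), name)  # save the behavior function before any plotting
plt.figure(figsize=(15,8))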

I just lost the results of a 4-day experimental run due to this error:

Episode: 7000 | Rewards: 32.87 | Mean_100_Rewards: 0.38 | Loss: 0.6333
qt.qpa.screen: QXcbConnection: Could not connect to display :50.0
Could not connect to any X display

This happened because of a bug in the x2go server that is exposed when the internet connection is interrupted and a plot function is called.

PS: I was able to construct a graph (attached) of mean reward as a function of episode number by copying the STDOUT log and parsing out the mean reward. As is apparent, I increased the number of episodes to 7000 for this experiment. At about 200 episodes the reward peaked and then gradually declined to 0. Any idea why this would happen? The "game" I had it play was very simple: track a curve that is the sum of three sine waves of varying frequency and amplitude, with a 256-time-step history available to help classify the action.

[attached plot: mean reward vs. episode number]

Can't get results for LunarLander

Hi thanks for sharing your code and implementation.

However, when running your notebook with LunarLander-v2, even with 1,000 epochs it doesn't seem to be learning anything:

[training plot]

Can you share the hyperparameters necessary to reproduce the LunarLander?

Thank you.

No grad

In the inner for loop of Algorithm 1, where you do sampling exploration, you don't want to accumulate gradients, right? So you could add 'with torch.no_grad()' there?
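For reference, a minimal sketch of what that change could look like (sample_action is a hypothetical helper; it assumes the behavior function bf maps state, desired return, and desired horizon to action logits):

import torch

# Hypothetical sketch: during exploration the behavior function is only sampled,
# not trained, so the forward pass can run without gradient tracking.
def sample_action(bf, state, desired_return, desired_horizon):
    with torch.no_grad():  # avoid accumulating gradients while collecting episodes
        logits = bf(state, desired_return, desired_horizon)
    return torch.distributions.Categorical(logits=logits).sample().item()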

Parameter

You call:
sampling_exploration(buffer)
but the function's parameter requires an int value.
