
Q-Optimality-Tightening

This is my implementation of the paper Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening.

Dependencies

  • Numpy
  • Scipy
  • Pillow
  • Matplotlib
  • Theano
  • Lasagne
  • ALE or gym

Readers can refer to https://github.com/spragunr/deep_q_rl for installation information. However, I suggest installing all the packages in a virtual environment. Please make sure the version of Theano is compatible with that of Lasagne.

Running

THEANO_FLAGS='device=gpu0,allow_gc=False' python run_OT.py -r frostbite --close2

This runs frostbite with close bounds.

THEANO_FLAGS='device=gpu1,allow_gc=False' python run_OT.py -r gopher

This runs gopher with randomly sampled bounds. By default, 4 out of 10 upper bounds are selected as U_{j,k}, and 4 out of 10 lower bounds are selected as L_{j,l}.
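To make the bound construction concrete, below is a rough NumPy sketch of how U_{j,k} and L_{j,l} could be computed and sampled from a stored trajectory. Names and defaults here are hypothetical; this only illustrates the idea from the paper and is not the repository's actual Theano code.

import numpy as np

def sample_bounds(rewards, q_taken, q_max, t, gamma=0.99, horizon=10, n_select=4):
    """Illustrative sketch of the optimality-tightening bounds for step t.

    rewards[i] : reward received at step i of the stored trajectory
    q_taken[i] : target-network Q(s_i, a_i) for the action actually taken
    q_max[i]   : target-network max_a Q(s_i, a)
    Returns n_select randomly chosen upper bounds U_{t,k} and lower bounds L_{t,l}.
    """
    upper, lower = [], []
    for k in range(1, horizon + 1):
        # Lower bound: discounted rewards after t plus a bootstrap value k steps ahead.
        if t + k < len(rewards):
            ret = sum(gamma ** i * rewards[t + i] for i in range(k))
            lower.append(ret + gamma ** k * q_max[t + k])
        # Upper bound: the same identity rearranged for the transition k steps before t.
        if t - k >= 0:
            ret = sum(gamma ** i * rewards[t - k + i] for i in range(k))
            upper.append((q_taken[t - k] - ret) / gamma ** k)
    pick = lambda xs: list(np.random.choice(xs, size=min(n_select, len(xs)), replace=False)) if xs else []
    return pick(upper), pick(lower)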

I have already provided 62 game ROMs.

If everything is configured correctly, the run should look like this:

Steps per second is usually between 105 and 140 on one Titan X. GPU utilization is about 30 percent, which means our code still has substantial room for improvement.

Experiments

First, here are two figures from runs on frostbite with --close2: frostbite_cl2_1 frostbite_cl2_2. Two other figures, from runs sampling 4 bounds out of 15, are below: frostbite_r15_1 frostbite_r15_2

frostbite's 200M baseline is 328.3

Some other games are displayed here: gopher

gopher's 200M baseline is 8520

hero

hero's 200M baseline is 19950

star_gunner

star_gunner's 200M baseline is 57997

zaxxon

zaxxon's 200M baseline is 4977

Finally, we can roughly compare our method with the state-of-the-art method A3C. Our method uses 1 CPU thread and 1 GPU (at about 30% utilization), while A3C uses multiple CPU threads.

Figure 4 from the A3C paper: A3C

Our results:

beam_rider breakout pong qbert space_invaders

From these observations, our method almost always outperforms A3C with 1, 2, and 4 threads and achieves results similar to 8-thread A3C. Note that these five games, chosen by the A3C paper, are not our method's specialties; we expect much better relative performance on games our method is well suited to.

Explanation

Gradients are also rescaled so that their magnitudes are comparable with and without the penalty.

rescale
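The exact rescaling constant is not documented here, so the following is only a sketch of the idea, assuming a penalty weight lam and a division of the combined gradient by (1 + 2 * lam); the real factor should be read off the code.

def rescaled_gradient(q, td_target, l_max, u_min, lam=4.0):
    """Sketch: gradient of the penalized objective with respect to q, rescaled so
    its magnitude stays comparable to the plain TD gradient (constants are assumptions)."""
    grad = 2.0 * (q - td_target)             # TD term: d/dq of (q - td_target)^2
    grad -= 2.0 * lam * max(0.0, l_max - q)  # lower-bound penalty, active only when q < l_max
    grad += 2.0 * lam * max(0.0, q - u_min)  # upper-bound penalty, active only when q > u_min
    return grad / (1.0 + 2.0 * lam)          # rescale back to the unpenalized magnitude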

About frames

frame

Comments

Since we never did a grid search on hyperparameters, we expect that better settings or initializations could further improve the results. More informed strategies for choosing the constraints are also possible: early in training, we may expect lower bounds from the more distant future to have a larger impact, whereas once the algorithm has almost converged, lower bounds close to the considered time step may matter more. More complex penalty functions and more sophisticated optimization approaches may yield even better results than the ones we have reported.

Please cite our paper as:

@inproceedings{HeICLR2017,
  author = {F.~S. He and Y. Liu and A.~G. Schwing and J. Peng},
  title = {{Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening}},
  booktitle = {Proc. ICLR},
  year = {2017},
}

Contributors

ShibiHe


Issues

Fewer steps per second as training progresses

Apologies if there is an obvious answer, but from the readme I gathered that, when running properly, the steps per second should remain roughly constant throughout training. Running on a GTX 970, I started out with ~90 steps per second and 25% GPU utilization. After leaving it to run overnight, I found it had only completed 6 epochs and had slowed to about 46 steps per second, with about 15% GPU utilization. Everything else runs perfectly, it takes several hours for the issue to appear, and restarting brings it back up to a normal rate. Is there a known cause/solution for this?

Thank you

Question about quadratic penalties

@ShibiHe
Hi, thanks for your great paper, and sorry to bother you.

In the paper, the upper and lower bounds are incorporated into the algorithm via quadratic penalties, but I cannot find the implementation corresponding to these two quadratic penalties.

It seems that the loss function is defined in the __init__ function of the DeepQLearner class, and no penalties are added there.

The main differences compared with the original DQN code appear in the _do_training function of the OptimalityTightening class. I am not sure what the targets1 variable means, or how this implementation acts as the two quadratic penalties in the paper.

Please correct me if I'm wrong, and thank you very much!!
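For readers with the same question, here is a plain-Python sketch of one way the penalized objective described in the paper can be written for a single transition, treating the sampled bounds as constants. All names are hypothetical, and this is not the repository's Theano code, so it does not answer what targets1 is.

def optimality_tightening_loss(q, td_target, lower_bounds, upper_bounds, lam=4.0):
    """Sketch: TD error plus quadratic penalties for violated bounds."""
    td_loss = (q - td_target) ** 2
    # Only a violated bound contributes; the tightest bounds are the binding ones.
    l_viol = max(0.0, max(lower_bounds) - q) if lower_bounds else 0.0
    u_viol = max(0.0, q - min(upper_bounds)) if upper_bounds else 0.0
    return td_loss + lam * (l_viol ** 2 + u_viol ** 2)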

Question on upper bound

@ShibiHe ,
First of all, thanks for this inspiring paper and implementation, great work!

In the paper, you use an index substitution to derive the upper bound for Q, which makes perfect sense mathematically.

However, in the implementation, the upper bound is used the same way as the lower bound, without any dependency (and thus gradient) with respect to the parameters.

This means, for example, that at time step t, in the trajectory (s[t-2], a[t-2], r[t-2], s[t-1], a[t-1], r[t-1], s[t], a[t], r[t], ...), if r[t-2] and r[t-1] are very low, we need to decrease the value of Q[t] according to the upper bounds introduced by r[t-2] and r[t-1].

So essentially, what happened before time step t has an impact on the value Q[t].

Does that conflict with the definition of discounted future reward and with the Markov assumption of the MDP?

Please correct me if anything here is wrong.

Thanks!
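For reference, the index-substitution argument mentioned above can be reconstructed as follows (a sketch in the paper's notation, not an excerpt from the paper):

\[
Q^*(s_{j-k}, a_{j-k}) \;\ge\; \sum_{\tau=0}^{k-1} \gamma^{\tau} r_{j-k+\tau} \;+\; \gamma^{k}\, Q^*(s_j, a_j)
\quad\Longrightarrow\quad
Q^*(s_j, a_j) \;\le\; \gamma^{-k}\!\left( Q^*(s_{j-k}, a_{j-k}) - \sum_{\tau=0}^{k-1} \gamma^{\tau} r_{j-k+\tau} \right) = U_{j,k}.
\]

The bound only constrains Q^*(s_j, a_j); its right-hand side is treated as a constant during training, which matches the observation above that no gradient flows through it with respect to the parameters.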
