Here I'd like to share some random thoughts on this package, organized around three aspects:
- The existing core components in the current version (v0.3.0).
- What is missing to support distributed reinforcement learning algorithms?
- The ideal way to do reinforcement learning research.
Feel free to correct me if I have misunderstood anything here.
What do we have?
RLSetup
`RLSetup` is used to organize all the necessary information in the training process. It combines the agent (`learner` and `policy` here), the environment and some parameters (like `stoppingcriterion`, `callbacks`...) together. Then we can call `learn!` for training and `run!` for testing.
Comments:
- The concept of `RLSetup` is very common and useful in software development (a very similar concept is TestSuite), and it keeps the parameters of the `callback!` function (which I'll describe soon) consistent, because everything we need in a callback is wrapped in an `RLSetup`! My only concern is that different algorithms may need different kinds of parameters for (distributed) training and testing, and it is a little vague to cover all these cases with a single `RLSetup` concept. It would be better to move the extra parameters (like `stoppingcriterion`, `callbacks`...) into the `learn!` and `run!` functions, and only keep the core components like `learner`, `buffer`, `policy` in the `RLSetup`.
- `stoppingcriterion` and `callbacks` seem to share some similarities. I tried to generalize these two here. I haven't tested whether there's any performance decrease. Doing so would also allow `stoppingcriterion` to contain multiple criteria. A concrete sketch of both ideas follows this list.
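To make this concrete, here is a minimal sketch of the proposed split (all names and signatures are my own assumptions, not the package's current API), with stopping criteria generalized so that several of them can be combined:

```julia
# A hypothetical sketch of the proposed split, not the current API.
struct RLSetup{L,P,B,E}
    learner::L
    policy::P
    buffer::B
    environment::E
end

# An illustrative criterion; any object implementing `shouldstop` works.
mutable struct StopAfterNSteps
    N::Int
    counter::Int
end
shouldstop(c::StopAfterNSteps, rls) = (c.counter += 1) > c.N

# The extra parameters move into `learn!` as keyword arguments; a vector
# of criteria stops training as soon as any one of them fires.
function learn!(rls::RLSetup; stoppingcriteria = [StopAfterNSteps(10^5, 0)],
                callbacks = [])
    while !any(c -> shouldstop(c, rls), stoppingcriteria)
        # one interaction step and one learner update would go here
        foreach(cb -> cb(rls), callbacks)
    end
end
```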
callbacks
`callbacks` are useful for debugging and statistics. Currently, to define a customized callback, we need to do something like this:
```julia
# 1. define a struct
mutable struct ReduceEpsilonPerEpisode
    ϵ0::Float64
    counter::Int64
end

# 2. extend the `callback!` function
function callback!(c::ReduceEpsilonPerEpisode, rlsetup, sraw, a, r, done)
    if done
        if c.counter == 1
            c.ϵ0 = rlsetup.policy.ϵ
        end
        c.counter += 1
        rlsetup.policy.ϵ = c.ϵ0 / c.counter
    end
end
```
Comments:
- I found that it is sometimes a little verbose to define a new struct. For example, to log the loss of each step I had to create an empty struct and print the necessary info in the extended `callback!` function. I attempted to modify the callbacks a little to turn them into closures here. But sometimes closures are not that efficient (see the discussion in JuliaLang/julia#15276), so there's a tradeoff here. (I also noticed that in recent versions of Flux.jl, some closure-based optimisers were changed to struct-based methods.) A sketch of the closure variant follows this list.
- Also, the `callback!` function can be further simplified with a more general definition `callback!(c, rlsetup, sraw, a, r, done) = callback!(c, rlsetup)`, considering that we don't need `sraw, a, r, done` in most cases.
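For reference, the closure-based variant of the example above might look like this (a sketch; note that `ϵ0` and `counter` get boxed because they are reassigned inside the closure, which is exactly the performance issue discussed in JuliaLang/julia#15276):

```julia
# A sketch of the closure-based variant: the callback state lives in
# captured variables instead of a struct.
function make_reduce_epsilon_per_episode()
    ϵ0 = 0.0
    counter = 1
    (rlsetup, sraw, a, r, done) -> begin
        if done
            counter == 1 && (ϵ0 = rlsetup.policy.ϵ)
            counter += 1
            rlsetup.policy.ϵ = ϵ0 / counter
        end
    end
end
```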
Learner and Policy
The two core functions around a learner are `selectaction` and `update`.
- `selectaction(learner, policy, state)` is called in each step (inside `step!`) to generate an action. (Maybe calling it an actor would be better?)
- `update(learner, buffer)` is called inside `learn!` to update a learner.
And we already have several well-tested learners.
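To make that contract explicit, one possible direction (a sketch under my own naming assumptions, not the package's current definitions) is an abstract type whose required methods fail loudly until implemented:

```julia
# A sketch of an explicit learner interface; not the current definitions.
abstract type AbstractLearner end

# Required: generate an action for the current state (the "actor" role).
selectaction(learner::AbstractLearner, policy, state) =
    error("selectaction not implemented for $(typeof(learner))")

# Required: update the learner from the experience buffer.
update(learner::AbstractLearner, buffer) =
    error("update not implemented for $(typeof(learner))")
```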
Comments:
- For me, the concept of `learner` is not very clear in the package (I mean it is too generic here, and maybe we can decompose it into several common components?).
- I find that the policy is sometimes included in a learner (an example is deepactorcritic.jl).
- We'd better draw a clear line between learners and actors.
Buffer
Here a buffer is used for experience replay. One of the most useful buffers is `ArrayStateBuffer`. It uses a circular buffer to store experiences.
Comments:
- I tried to make the buffers more general here, but I'm still not very satisfied with the implementation. Also see the discussions here and here. I'll document this part in detail in the next section. A minimal circular-buffer sketch follows.
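For reference, a minimal circular buffer along the lines described above might look like this (a self-contained sketch, not the package's `ArrayStateBuffer`):

```julia
# A minimal circular buffer sketch: fixed capacity, oldest entries
# overwritten first.
mutable struct CircularBuffer{T}
    data::Vector{T}
    capacity::Int
    first::Int      # index of the oldest element
    length::Int     # number of stored elements
end
CircularBuffer{T}(capacity::Int) where {T} =
    CircularBuffer{T}(Vector{T}(undef, capacity), capacity, 1, 0)

function Base.push!(b::CircularBuffer, x)
    idx = mod1(b.first + b.length, b.capacity)
    b.data[idx] = x
    if b.length < b.capacity
        b.length += 1
    else
        b.first = mod1(b.first + 1, b.capacity)  # drop the oldest element
    end
    b
end

Base.length(b::CircularBuffer) = b.length
Base.getindex(b::CircularBuffer, i::Int) = b.data[mod1(b.first + i - 1, b.capacity)]
```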
Traces
To be honest, I haven't looked into the applications of this part. But from reading the source code, I'm wondering if it could be integrated into the concept of a buffer. @jbrea
Environment
Environment-related code has been split into ReinforcementLearningBase. As @JobJob suggested, we'd better create a new repo (like Plots.jl, I guess?) to support different backends. And we can have many different wrappers to easily introduce new environments. Preprocessors can also be merged into wrappers; a rough sketch is below. I'll make an example repo later and have more discussions there.
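For example, a preprocessor could itself be expressed as a wrapper. A rough sketch (I'm assuming an `interact!(env, action)` style of interface returning state, reward and done; adjust to whatever ReinforcementLearningBase settles on):

```julia
# A rough sketch: a preprocessor expressed as an environment wrapper.
# The interact!(env, action) -> (state, reward, done) interface is an
# assumption, not a confirmed API.
struct PreprocessedEnv{E,F}
    env::E
    preprocess::F   # any function from raw states to processed states
end

function interact!(w::PreprocessedEnv, action)
    s, r, done = interact!(w.env, action)
    w.preprocess(s), r, done
end
```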
Conclusion
In my humble opinion, the components listed above are clear enough to solve many typical RL problems on a single machine. For continuous action space problems, @jbrea will take a look later. The only work left is to reorganize the source code a little and clearly define some abstract structs to guide developers on how to implement new algorithms. Some highlights in this repo are:
- Model comparison. This part will be very important in the future and needs to be enhanced to support distributed algorithms.
- A lot of very useful predefined callbacks.
What is missing?
To compete with many other RL packages, there's still a long way to go, and one of the most important parts is supporting distributed RL algorithms.
Typically, there are two directions to scale up deep reinforcement learning.
- To parallelize the computation of gradients.
- To distribute the generation and selection of experiences.
For the first one, we need an efficient parameter server and a standalone resource manager to dispatch computations. (I'm not very experienced in this field; you guys may add more details here.) Some questions in mind (see the rough sketch after this list):
- How to communicate between learners and actors? Pub-sub or poll?
- How to do fault tolerance? Maybe we can borrow some ideas from Ray.
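As a starting point for that discussion, here is a very rough poll-style parameter server built only on the Distributed standard library (all names are mine; a real implementation would also need versioning, batching and the fault tolerance mentioned above):

```julia
using Distributed

# A very rough poll-style parameter server sketch; illustrative only.
mutable struct ParameterServer
    params::Vector{Float64}
    lock::ReentrantLock
end
ParameterServer(n::Int) = ParameterServer(zeros(n), ReentrantLock())

# Learners push gradients; the lock keeps concurrent updates safe.
push_gradients!(ps::ParameterServer, grads; lr = 0.01) =
    lock(() -> (ps.params .-= lr .* grads), ps.lock)

# Actors poll for the latest parameters.
pull_params(ps::ParameterServer) = lock(() -> copy(ps.params), ps.lock)

# Workers would reach a server living on process 1 via remotecall, e.g.
# remotecall_fetch(() -> pull_params(PS), 1)  # PS: a global on process 1
```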
For the second one, I think we need to carefully design the API first. Although there are many implementations, here in Dopamine and here in Ray, none of them can be directly ported to Julia (and I believe we can have more efficient implementations). Some critical points are:
- Shared memory or not?
I have had a long discussion about this with @jbrea before. Obviously it's more efficient to treat the next start state as the end state of the current transition, but I found that it makes the code much more complicated (forgive my programming skills in Julia; maybe we can find a way to address it). Also, in the Distributed Prioritized Experience Replay paper, the last sentence of "Adding Data" in Appendix F (Implementation) states: "Note that, since we store both the start and the end state with each transition, we are storing some data twice: this costs more RAM, but simplifies the code." So I guess I'm not the only one... The two layouts are sketched after this list.
- Generalized enough for a (distributed) prioritized buffer
There are many practical issues to be addressed (one idea for the metadata question is also sketched after this list):
  - How to easily add more metadata for each transition (id, priority, rank order, last active time...)?
  - How to queue batches from each actor?
  - What is the general way to update a distributed buffer?
  - Should async be supported?
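To ground both points above, the sketch below shows the two storage layouts side by side, plus one way to keep per-transition metadata extensible (all names are illustrative, not existing APIs):

```julia
# Layout 1: store both states with each transition. Costs more RAM
# (states are duplicated) but sampling a minibatch is trivial.
struct FullTransition{S,A}
    state::S
    action::A
    reward::Float64
    nextstate::S     # duplicated: also the next transition's start state
    done::Bool
end

# Layout 2: share states between consecutive transitions. Transition i
# implicitly spans states[i] -> states[i + 1], which saves memory but
# needs careful bookkeeping at episode boundaries.
struct SharedStateBuffer{S,A}
    states::Vector{S}
    actions::Vector{A}
    rewards::Vector{Float64}
    done::Vector{Bool}
end

# Extensible metadata: wrap any transition with a NamedTuple, so fields
# like priority or last-active time can be added without touching the
# buffer type.
struct AnnotatedTransition{T,M<:NamedTuple}
    transition::T
    meta::M
end

t = AnnotatedTransition(FullTransition(1, 2, 1.0, 2, false),
                        (id = 42, priority = 0.5))
t.meta.priority   # 0.5
```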
Multi-agent
Although multi-agent scenarios are not considered in most existing packages, we'd better think about them at an early stage.
Model-Based Algorithms
- How/when should we train/update an environment model?
- What is the relationship between an environment model and the learner/policy?
Compared with Ray
According to the paper about Ray, there are three system layers:
- Global Control Store
- Bottom-up Distributed Scheduler
- In-memory Object Store
For me, the first and second parts are relatively easy to understand and re-implement, but the third part is especially difficult for me to figure out how to do in Julia. If I understand it correctly, Arrow/Plasma is used so that processes on one node can avoid serialization/deserialization. I've checked the package Arrow.jl; it seems to offer only data transformation, and I still don't know how to manage a big shared-memory Object Store in Julia across processes like the one in Ray.
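The closest built-in analogue I know of is the SharedArrays standard library: a single memory-mapped array visible to all local worker processes without serialization. It only covers bits types on one machine, so it is far from a full Plasma-style object store, but a minimal sketch looks like this:

```julia
using Distributed
addprocs(2)
@everywhere using SharedArrays

# One array backed by shared memory, visible to every local worker.
s = SharedArray{Float64}(10_000)
@sync @distributed for i in eachindex(s)
    s[i] = i   # each worker writes directly into the shared segment
end
```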
For the rllib part, the different levels of abstraction are really worth learning from.
```
Agent
└── Optimizer
    └── Policy Evaluator
        ├── Policy Graph
        └── Env Creator
            └── Base Env, Multi-Agent Env, Async Env...
```
So for me, I'm most experienced with the Env Creator part, and I can also help design the API of the other parts. But at the system design level, I really feel that I have a lot to learn.
What's the ideal way to do RL research in Julia?
- Easy to implement/reproduce the results of popular algorithms.
I emphasize implementation here because so many RL packages just provide a function with a lot of parameters and hide all the details inside (as if to say, "Hey look, I've implemented so many fancy algorithms here", when in fact it's pretty hard to figure out what is going on inside). One thing I really enjoy while learning and using Julia is that I can easily check the source code to figure out the mechanisms inside and then make improvements.
- Flexible enough to reuse existing packages.
Like rllib (in Ray), we don't want to limit users to any specific DL framework. The core components should always be replaceable.
- Easy to scale.
TODO List