
machina's Introduction




machina

machina is a library for real-world Deep Reinforcement Learning, built on top of PyTorch.
machina is officially pronounced "mάkɪnə".

Features

High Composability
Composability is an important property in computer programming: it allows program components to be switched dynamically during execution. machina was built and designed with this principle in mind, allowing for high flexibility in system and program development.
Specifically, the RL policy interacts with the environment only through the generated trajectories, which makes it simple to exchange either component. For example, using machina, it is possible to switch between a simulated and a real-world environment during the training phase.
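As an illustration, here is a minimal sketch (not taken from the repository) of swapping a simulated sampler for a real-world one in the middle of training. RealRobotEnv, switch_iter and the other loop variables are hypothetical placeholders; GymEnv and EpiSampler are used as in machina's example scripts.

from machina.envs import GymEnv
from machina.samplers import EpiSampler

sim_env = GymEnv('HumanoidBulletEnv-v0')
real_env = RealRobotEnv()  # hypothetical gym-compatible wrapper around a physical robot

sim_sampler = EpiSampler(sim_env, pol, num_parallel=4)
real_sampler = EpiSampler(real_env, pol, num_parallel=1)

for it in range(max_iter):
    # the policy only ever consumes episodes, so which sampler produced them does not matter
    sampler = sim_sampler if it < switch_iter else real_sampler
    epis = sampler.sample(pol, max_epis=max_epis_per_iter)
    # ...build a Traj from epis and update pol as in the examples below...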

Base Merits

There are merits for all users, including beginners in Deep Reinforcement Learning.

  1. Readability
  2. Intuitive understanding of algorithmic differences
  3. Customizability

Advanced Merits

Using the principle of composability, we can easily implement the following configurations, which are difficult to realize in other RL libraries.

  1. Easy implementation of mixed environments (e.g. a simulated environment plus a real-world environment, or some meta-learning settings).
  2. Convenient combination of multiple algorithms (e.g. Q-Prop is a combination of TRPO and DDPG).
  3. Possibility of changing hyperparameters dynamically (e.g. meta-learning of hyperparameters); see the sketch after this list.
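As an illustration of the third point, a hyperparameter can simply be recomputed between iterations and passed anew to the training function. The sketch below assumes the PPO training loop shown later in this README; the linear decay schedule and the preprocess_on_policy helper are made up for illustration.

for it in range(max_iter):
    epis = sampler.sample(pol, max_epis=max_epis_per_iter)
    traj = preprocess_on_policy(epis)  # hypothetical helper; see the PPO preprocessing below

    # recompute the clip range every iteration, e.g. anneal it linearly (illustrative schedule)
    clip_param = 0.2 * (1.0 - it / max_iter)

    result_dict = ppo_clip.train(traj=traj, pol=pol, vf=vf, clip_param=clip_param,
                                 optim_pol=optim_pol, optim_vf=optim_vf,
                                 epoch=epoch_per_iter, batch_size=batch_size,
                                 max_grad_norm=max_grad_norm)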

1 Meta Reinforcement Learning example

We usually define meta-learning as a method for fast adaptation to tasks sampled from a task space. In meta RL, a task is defined as an MDP, and the RL agent has to adapt to a new MDP as fast as possible. To train a meta agent, we therefore have to sample episodes from different environments. With machina this can be implemented as easily as below.

# two tasks (MDPs) sampled from the task space
env1 = GymEnv('HumanoidBulletEnv-v0')
env2 = GymEnv('HumanoidFlagrunBulletEnv-v0')

# sample episodes from each environment with the same policy
epis1 = sampler1.sample(pol, max_epis=args.max_epis_per_iter)
epis2 = sampler2.sample(pol, max_epis=args.max_epis_per_iter)

traj1 = Traj()
traj2 = Traj()

# preprocess task 1's episodes: value estimates, returns, GAE advantages, hidden-state masks
traj1.add_epis(epis1)
traj1 = ef.compute_vs(traj1, vf)
traj1 = ef.compute_rets(traj1, args.gamma)
traj1 = ef.compute_advs(traj1, args.gamma, args.lam)
traj1 = ef.centerize_advs(traj1)
traj1 = ef.compute_h_masks(traj1)
traj1.register_epis()

# preprocess task 2's episodes in the same way
traj2.add_epis(epis2)
traj2 = ef.compute_vs(traj2, vf)
traj2 = ef.compute_rets(traj2, args.gamma)
traj2 = ef.compute_advs(traj2, args.gamma, args.lam)
traj2 = ef.centerize_advs(traj2)
traj2 = ef.compute_h_masks(traj2)
traj2.register_epis()

# merge both tasks' trajectories and update the policy on the mixed data
traj1.add_traj(traj2)

result_dict = ppo_clip.train(traj=traj1, pol=pol, vf=vf, clip_param=args.clip_param,
                             optim_pol=optim_pol, optim_vf=optim_vf, epoch=args.epoch_per_iter,
                             batch_size=args.batch_size, max_grad_norm=args.max_grad_norm)

You can see the full example code here.

2 Combination of Off-policy and On-policy algorithms

Deep RL algorithms can be roughly divided into two types: On-policy and Off-policy algorithms. On-policy algorithms use only the episodes collected by the current policy for updating the policy or value functions. Off-policy algorithms, on the other hand, can reuse all previously collected episodes. On-policy algorithms are more stable but need many episodes, while Off-policy algorithms are sample-efficient but less stable. Some algorithms like Q-Prop are a combination of On-policy and Off-policy methods. Below is an example of such a combination using PPO and SAC.

# collect on-policy episodes with the current policy
epis = sampler.sample(pol, max_steps=args.max_steps_per_iter)

on_traj = Traj()
on_traj.add_epis(epis)

# preprocess the on-policy trajectory: next observations, values, returns, GAE advantages, hidden-state masks
on_traj = ef.add_next_obs(on_traj)
on_traj = ef.compute_vs(on_traj, vf)
on_traj = ef.compute_rets(on_traj, args.gamma)
on_traj = ef.compute_advs(on_traj, args.gamma, args.lam)
on_traj = ef.centerize_advs(on_traj)
on_traj = ef.compute_h_masks(on_traj)
on_traj.register_epis()

# on-policy PPO update using only the freshly collected episodes
result_dict1 = ppo_clip.train(traj=on_traj, pol=pol, vf=vf, clip_param=args.clip_param,
                              optim_pol=optim_pol, optim_vf=optim_vf, epoch=args.epoch_per_iter,
                              batch_size=args.batch_size, max_grad_norm=args.max_grad_norm)

total_epi += on_traj.num_epi
step = on_traj.num_step
total_step += step

# append the fresh on-policy episodes to the off-policy trajectory (replay data)
off_traj.add_traj(on_traj)

# off-policy SAC update reusing all episodes stored so far
result_dict2 = sac.train(
    off_traj,
    pol, qf, targ_qf, log_alpha,
    optim_pol, optim_qf, optim_alpha,
    100, args.batch_size,
    args.tau, args.gamma, args.sampling,
)

You can see the full example code here.

To obtain this composability, machina's sampling method is deliberately restricted to be episode-based, because episode-based sampling is well suited to real-world environments. As a consequence, the step-by-step network updates used by some algorithms (e.g. DQN, DDPG) are not reproduced in machina.

Implemented Algorithms

The algorithm classes described below are useful for real-world Deep Reinforcement Learning.

CLASS | MERIT | ALGORITHM | SUPPORT
Model-Free On-Policy RL | stable policy learning | Proximal Policy Optimization | RNN
Model-Free On-Policy RL | stable policy learning | Trust Region Policy Optimization | RNN
Model-Free Off-Policy RL | high generalization | Soft Actor Critic | R2D2*
Model-Free Off-Policy RL | high generalization | QT-Opt |
Model-Free Off-Policy RL | high generalization | Deep Deterministic Policy Gradient |
Model-Free Off-Policy RL | high generalization | Stochastic Value Gradient |
Model-Based RL | high sample efficiency | Model Predictive Control | RNN
Imitation Learning | removal of the need for reward design | Behavior Cloning |
Imitation Learning | removal of the need for reward design | Generative Adversarial Imitation Learning | RNN
Imitation Learning | removal of the need for reward design | Adversarial Inverse Reinforcement Learning |
Policy Distillation | reduction of necessary computation resources during policy deployment | Teacher Distillation |
* R2D2-like burn-in and hidden-state saving methods

Installation

machina supports Ubuntu, Python 3.5, 3.6 and 3.7, and PyTorch 1.0.0+.

machina can be directly installed using PyPI.

pip install machina-rl

Or you can install machina from source code.

git clone https://github.com/DeepX-inc/machina
cd machina
pip install .

Quick Start

You can get started with machina by checking out this quickstart.

You can also check the already implemented algorithms in examples.

Documentation

You can check the documentation.

Web Page

You can check machina's web page.

machina's People

Contributors

farzadab, iory, jinbeizame007, kaorunasuno, mmisono, pwuethri, rarilurelo, swdr1904, takerfume, ven-kyoshiro


machina's Issues

Segmentation Fault (core dumped) occurs when using GAE_Data's preprocess method

When I run python example/run_trpo.py, using GAE_Data's preprocess method causes a Segmentation Fault (core dumped).
The segfault happens at the part of the preprocess method shown below; it seems that running inference with vf is what triggers it.

all_path_vs = [vf(torch.tensor(path['obs'], dtype=torch.float,
                               device=get_device())).cpu().numpy() for path in self.paths]

The following code also causes a segfault, so it seems safe to conclude that the segfault happens during inference with vf.

vf(torch.tensor(self.paths[0]['obs'], dtype=torch.float,
                device=get_device()))

Note that it worked fine on my laptop, but the error occurs when running on the server.

None Error in Categorical and rnn policy with cpu

cd example
python run_ppo.py --env_name CartPole-v0 --rnn --cuda -1

Then the following error occurs:

Traceback (most recent call last):
  File "run_ppo.py", line 153, in <module>
    kl_beta = result_dict['new_kl_beta']
  File "/home/rarilurelo/.pythons/Python-3.5.2/entity/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/raid/work/machina/machina/utils.py", line 47, in measure
    yield
  File "run_ppo.py", line 149, in <module>
    optim_pol=optim_pol, optim_vf=optim_vf, epoch=args.epoch_per_iter, batch_size=args.batch_size, max_grad_norm=args.max_grad_norm)
  File "/raid/work/machina/machina/algos/ppo_clip.py", line 58, in train
    pol_loss = update_pol(pol, optim_pol, batch, clip_param, ent_beta, max_grad_norm)
  File "/raid/work/machina/machina/algos/ppo_clip.py", line 30, in update_pol
    pol_loss = lf.pg_clip(pol, batch, clip_param, ent_beta)
  File "/raid/work/machina/machina/loss_functional.py", line 44, in pg_clip
    _, _, pd_params = pol(obs, h_masks=h_masks)
  File "/home/rarilurelo/.virtuals/py3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/raid/work/machina/machina/pols/categorical_pol.py", line 54, in forward
    ac = self.pd.sample(dict(pi=pi))
  File "/raid/work/machina/machina/pds/categorical_pd.py", line 30, in sample
    pi_sampled = Categorical(probs=pi).sample(sample_shape)
  File "/home/rarilurelo/.virtuals/py3/lib/python3.5/site-packages/torch/distributions/categorical.py", line 110, in sample
    sample_2d = torch.multinomial(probs_2d, 1, True)
RuntimeError: invalid argument 2: invalid multinomial distribution (encountering probability entry < 0) at /pytorch/aten/src/TH/generic/THTensorRandom.cpp:298

This happens because None is passed in as pi.

Managing number of steps in a batch

iterate_rnn in the Traj class creates an iterator over batches. The tails of the batches are zero-padded to align episode lengths. For this reason we cannot control the number of steps in a batch.
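For context, a small plain-PyTorch illustration (not machina code) of why zero padding changes the step count: episodes of different lengths are padded to the longest one, so a batch always holds max_len * batch_size slots regardless of the real number of steps.

import torch
from torch.nn.utils.rnn import pad_sequence

epi_lens = [3, 5, 2]
epis = [torch.ones(l, 4) for l in epi_lens]       # three episodes with obs dim 4
batch = pad_sequence(epis)                        # zero-padded to shape (5, 3, 4)
real_steps = sum(epi_lens)                        # 10 actual environment steps
padded_steps = batch.shape[0] * batch.shape[1]    # 15 slots in the padded batch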

Transparent environment

The environment is wrapped by many wrapper envs, so it is difficult to access the original environment.
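A possible workaround is sketched below; it relies only on the common gym convention that each wrapper exposes the wrapped environment as .env, and is not machina-specific code.

def unwrap(env):
    """Peel off nested wrappers until the original environment is reached."""
    while hasattr(env, 'env'):
        env = env.env
    return env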

Testing policy distillation

@takerfume
I tried to run nosetests -x tests, but it does not seem to work right now. Or did I do something wrong?

E
======================================================================
ERROR: Failure: ModuleNotFoundError (No module named 'tests')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/pierre/anaconda3/lib/python3.7/site-packages/nose/failure.py", line 39, in runTest
    raise self.exc_val.with_traceback(self.tb)
  File "/home/pierre/anaconda3/lib/python3.7/site-packages/nose/loader.py", line 406, in loadTestsFromName
    module = resolve_name(addr.module)
  File "/home/pierre/anaconda3/lib/python3.7/site-packages/nose/util.py", line 312, in resolve_name
    module = __import__('.'.join(parts_copy))
ModuleNotFoundError: No module named 'tests'

----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (errors=1)

Add N-distill

Adding N-distill according to https://arxiv.org/abs/1902.02186

  • Add next observation to trajectory data structure
  • Directly compute the gradient using the given update rule (this is the difference compared to Teacher distill, on-policy distill and entropy-regularised distillation)
  • Update nn parameters accordingly
  • Test whether policy distillation works using an available teacher policy

Inappropriate mean in loss_functional with rnn

When an RNN is used, the loss is averaged over (timestep, batchsize). However, steps after an episode terminates are masked out by out_masks, because episode lengths must be padded to the same length when using an RNN.

pol_loss = torch.mean(pol_loss * out_masks)

We have to calculate it like below instead.

timestep = torch.sum(out_masks, dim=0)
pol_loss = torch.sum(pol_loss * out_masks) / (timestep * batchsize)

Scaling of Gaussian Pol's agent_info

When deploying in a real environment, an output without Gaussian noise, scaled to the action space, is desirable.
Add mean_real to agent_info and output the scaled mean there.

More general hs (hidden state)

In the current machina implementation, hs must be a tuple of length 2, which is compatible with LSTM. But we should support more general memory architectures, such as Memory Augmented Networks or GRU (whose hidden state has length 1).
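A minimal sketch of what a more general interface could look like, using plain PyTorch recurrent modules; this is an illustration of the idea, not machina's implementation.

import torch
import torch.nn as nn

def init_hs(cell, batch_size, hidden_size):
    """Return an initial hidden state whose structure matches the recurrent cell."""
    h = torch.zeros(1, batch_size, hidden_size)
    if isinstance(cell, nn.LSTM):
        return (h, torch.zeros(1, batch_size, hidden_size))  # LSTM: length-2 tuple (h, c)
    return (h,)  # GRU-like cells: length-1 tuple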

Add Explanation about Imitation Learning

Explain the steps for making expert trajectories.
Where should I write this?

The contents would be something like this:

Download the expert model from here (link).
Store the expert model in data/expert_pols/.
Run python expert_epis_make.py.

Performance check

Currently, things like the indices inside traj are also placed on the GPU; depending on observed performance, we should also consider moving them to the CPU.

Unification of the Data types

Currently, different Data types are used for Off-Policy and On-Policy. Unify them into a single class, and change it so that things like the advantage function and returns are added to the Data via methods.

Write meanings of args

Write the meaning of each arg in the code of example/run_*.py.
Contributors should write comments in the code they themselves wrote.

  • Branch
    write_meaning_of_args

Request to check the adamw implementation

Regarding adamw: in

p.data.add_(-group['weight_decay'], p.data)

it seems to me that η (= step_size) is not multiplied into the -weight_decay part. What do you think??
In the original paper, I believe it is something like
-η * weight_decay * x.
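For reference, here is a hedged sketch of the decoupled weight-decay step as described in the AdamW paper (x ← x − η · weight_decay · x); the loop mirrors a typical PyTorch optimizer structure and is not machina's actual code.

for group in optimizer.param_groups:
    for p in group['params']:
        if p.grad is None:
            continue
        if group['weight_decay'] != 0:
            # decoupled weight decay scaled by the step size (lr), as in the paper
            p.data.add_(-group['lr'] * group['weight_decay'], p.data)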

Allocate Traj's tensor to cpu

Traj's tensors are currently allocated on the GPU for fast computation. However, it is difficult to fit all tensors of an Off-policy traj on the GPU.

Solution

  1. Allocate traj's tensors on the CPU.
  2. Set a max_length for traj's tensors.
