
neorl's Introduction

NeoRL


This repository is the interface for the offline reinforcement learning benchmark NeoRL: A Near Real-World Benchmark for Offline Reinforcement Learning.

The NeoRL repository contains datasets for training, tools for validation, and the corresponding environments for testing trained policies. The current datasets are collected from three open-source environments, i.e., CityLearn, FinRL, and IB, plus three Gym-MuJoCo tasks. We train SAC on each of these domains and then use policies at around 25%, 50%, and 75% of the highest episode return to generate three quality levels of datasets for each task. Since the action spaces of these domains are continuous, the policy output is the mean and stdev of a Gaussian distribution. During data collection, with 80% probability we take the mean of the Gaussian policy and with 20% probability we sample from the trained policy, to reflect the mistakes of human operators in real-world systems. The entire datasets can be reproduced with this repo. Besides, we also provide a sales promotion task.
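The 80/20 collection rule above can be sketched as follows (a minimal sketch only; the policy(obs) interface is hypothetical, and the actual collection scripts live in this repo):

import numpy as np

rng = np.random.default_rng(0)

def collect_action(policy, obs, greedy_prob=0.8):
    # Hypothetical interface: the trained SAC policy returns the mean and
    # stdev of a Gaussian over actions for the given observation.
    mean, std = policy(obs)
    if rng.random() < greedy_prob:
        return mean                   # 80%: act with the Gaussian mean
    return rng.normal(mean, std)      # 20%: sample, mimicking operator mistakes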

More about the NeoRL benchmark can be found at http://polixir.ai/research/neorl and in the following paper:

Rong-Jun Qin, Songyi Gao, Xingyuan Zhang, Xiong-Hui Chen, Zewen Li, Weinan Zhang, Yang Yu. NeoRL: A Near Real-World Benchmark for Offline Reinforcement Learning.

The paper is accessible at https://openreview.net/forum?id=jNdLszxdtra.

The benchmark is supported by two additional repos, i.e. OfflineRL for training offline RL algorithms and d3pe for offline evaluation. Details for reproducing the benchmark can be found here.

Install NeoRL interface

The NeoRL interface can be installed as follows:

git clone https://agit.ai/Polixir/neorl.git
cd neorl
pip install -e .

After installation, the CityLearn, Finance, industrial benchmark, and sales promotion environments will be available. If you want to use the MuJoCo tasks, you need to obtain a MuJoCo license, follow its setup instructions, and then run:

pip install -e .[mujoco]

So far "HalfCheetah-v3", "Walker2d-v3", and "Hopper-v3" are supported within MuJoCo.

Using NeoRL

NeoRL uses the OpenAI Gym API. Tasks are created via the neorl.make function. A full list of all tasks is available here.

import neorl

# Create an environment
env = neorl.make("citylearn")
env.reset()
env.step(env.action_space.sample())

# Get 100 trajectories of low level policy collection on citylearn task
train_data, val_data = env.get_dataset(data_type="low", train_num=100)

To facilitate setting different goals, users can provide a custom reward function to neorl.make() when creating an env. See the usage and examples of neorl.make() for more details.
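As a rough, non-authoritative sketch (both the keyword name reward_func and the callable's signature below are assumptions for illustration; check the linked usage of neorl.make() for the actual interface):

import neorl

def scaled_reward(data):
    # Hypothetical signature: rescale the dataset reward to emphasize a different goal.
    return 10.0 * data["reward"]

# `reward_func` is an assumed keyword name, used here only for illustration.
env = neorl.make("citylearn", reward_func=scaled_reward)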

As a benchmark, in order to test algorithms conveniently and quickly, each task is associated with a small training dataset and a validation dataset by default. They can be obtained by env.get_dataset(). Meanwhile, for flexibility, extra parameters can be passed to get_dataset() to obtain multiple pairs of datasets for benchmarking. The data for each task is collected by a low-, medium-, or high-quality policy, and we provide training data for at most 10,000 trajectories per task. See the usage of get_dataset() for more details about the parameters.
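For example, the following call requests a larger training set without a validation split (the keyword names below follow those used elsewhere on this page; see the linked usage of get_dataset() for the authoritative parameter list):

import neorl

env = neorl.make("citylearn")
# Training split only, collected by the low-quality policy, with 1000
# trajectories and the rewards stored in the dataset.
train_data, _ = env.get_dataset(
    data_type="low",
    train_num=1000,
    need_val=False,
    use_data_reward=True,
)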

Data in NeoRL

In NeoRL, the training data and validation data returned by the get_dataset() function are dicts with the same format:

  • obs: An N-by-obs_dim array of the current step's observations.

  • next_obs: An N-by-obs_dim array of the next step's observations.

  • action: An N-by-action_dim array of actions.

  • reward: An N-dimensional array of rewards.

  • done: An N-dimensional array of episode termination flags.

  • index: An array whose length equals the number of trajectories; each entry marks the position at which a trajectory begins.
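For instance, per-trajectory slices can be recovered from the flat arrays as follows (a minimal sketch, assuming index holds the absolute start offsets described above):

import numpy as np
import neorl

env = neorl.make("citylearn")
train_data, val_data = env.get_dataset(data_type="low", train_num=100)

# Pair each trajectory start with the next start (or the total length)
# to slice the flat arrays back into trajectories.
starts = np.asarray(train_data["index"], dtype=int)
ends = np.append(starts[1:], len(train_data["reward"]))
trajectories = [
    {key: train_data[key][s:e] for key in ("obs", "action", "reward", "next_obs", "done")}
    for s, e in zip(starts, ends)
]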

Reference

  • CityLearn: Vázquez-Canteli J R, Kämpf J, Henze G, et al. "CityLearn v1.0: An OpenAI Gym Environment for Demand Response with Deep Reinforcement Learning." Proceedings of the 6th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, pp. 356-357, 2019. paper code
  • FinRL: Liu X Y, Yang H, Chen Q, et al. "FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance." arXiv preprint arXiv:2011.09607, 2020. paper code
  • Industrial Benchmark: Hein D, Depeweg S, Tokic M, et al. "A Benchmark Environment Motivated by Industrial Control Problems." Proceedings of the 2017 IEEE Symposium Series on Computational Intelligence, pp. 1-8, 2017. paper code
  • MuJoCo: Todorov E, Erez T, Tassa Y. "MuJoCo: A Physics Engine for Model-based Control." Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026-5033, 2012. paper website

Licenses

All datasets are licensed under the Creative Commons Attribution 4.0 License (CC BY), and code is licensed under the Apache 2.0 License.

neorl's People

Contributors

icaruswizard, mzktbyjc2016


neorl's Issues

LSTM

How can a recurrent neural network (e.g., an LSTM) be used with this project?

Action space difference between dataset and environment

Hi, our team is training a model with NeoRL and we found that the action space differs between the dataset and the environment.

When executing the code below:

import numpy as np
import neorl

env = neorl.make('Citylearn')
low_dataset, _ = env.get_dataset(
    data_type="low",
    train_num=args.traj_num,
    need_val=False,
    use_data_reward=True,
)
action_low = env.action_space.low
action_high = env.action_space.high
print('action_low', action_low)
print('action_high', action_high)
print('dataset action_low', np.min(low_dataset['action'], axis=0))
print('dataset action_high', np.max(low_dataset['action'], axis=0))

the output is below, and the action range is clearly different between the dataset and the env, which confuses us.

action_low [-0.33333334 -0.33333334 -0.33333334 -0.33333334 -0.33333334 -0.33333334
 -0.33333334 -0.33333334 -0.33333334 -0.33333334 -0.33333334 -0.33333334
 -0.33333334 -0.33333334]
action_high [0.33333334 0.33333334 0.33333334 0.33333334 0.33333334 0.33333334
 0.33333334 0.33333334 0.33333334 0.33333334 0.33333334 0.33333334
 0.33333334 0.33333334]
dataset action_low [-3.5973904 -4.031006  -3.167992  -3.1832075 -3.4287922 -3.9067357
 -3.4079363 -3.3709202 -3.1863866 -4.1262846 -3.6601577 -4.087899
 -3.8954997 -3.312598 ]
dataset action_high [3.4334774 3.8551078 3.4849963 3.7777936 3.6103873 3.9329555 3.7596557
 3.7149396 4.0387006 3.3615265 3.946596  4.272308  3.4278386 3.3716872]


Question regarding the reward of sales promotion training dataset

Hi,

In the sales promotion environment the reward is computed as rew = (d_total_gmv - d_total_cost) / self.num_users, which means the operator observes a single reward signal aggregated over all users. However, in the offline training dataset the reward differs for each user across the 50 days, as in the user orders and reward graphs below.

[Figures: per-user orders and per-user rewards over 50 days]

As per my understanding, the reward should be the same each day for the three users and should gradually increase over the 50 days as sales increase. Could you kindly let me know how the reward in the training dataset was calculated?

Can I use NeoRL to generate a dataset in the d4rl format, e.g. for the finance environment?

Something similar to what is done here - https://github.com/rail-berkeley/d4rl/blob/master/scripts/generation/generate_ant_maze_datasets.py

I see that the get_dataset() call returns a dict with most of the relevant information for d4rl; I wanted to know how I may generate more than the 10000 data points that are provided.

I'd like to use NeoRL, specifically the finance env, to generate a dataset in the d4rl format. That would be highly useful.
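Concretely, something like the sketch below is what I have in mind (the key names follow the usual d4rl convention; the mapping itself is just my assumption, not an existing converter in NeoRL):

import numpy as np
import neorl

env = neorl.make("finance")
train_data, _ = env.get_dataset(data_type="low", train_num=100)

# Map the NeoRL dict onto d4rl-style keys (assumed convention, not an official API).
d4rl_data = {
    "observations": np.asarray(train_data["obs"]),
    "actions": np.asarray(train_data["action"]),
    "next_observations": np.asarray(train_data["next_obs"]),
    "rewards": np.asarray(train_data["reward"]).reshape(-1),
    "terminals": np.asarray(train_data["done"]).reshape(-1).astype(bool),
}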

Thanks,
Krishna

Question regarding the dynamics pre-training

Dear Authors,
I can't get my head around a particular line of code in pretrain_dynamics.py, line 56.
There, the number of hidden units for each hidden layer in the ensemble depends on the task:

    hidden_units = 1024 if config["task"] in ['ib', 'finance', 'citylearn'] else 256

Thus, if I understand it correctly, each hidden layer in the 7 models would have 256 units in the MuJoCo tasks, and 1024 otherwise.
However, the paper states:

... For model-based approaches, the transition model is represented by a local Gaussian distribution ... ... by an MLP with 4 hidden layers and 256 units per layer ...

On the contrary, following the paper, the algo_init function for MOPO (as an example of a model-based algorithm) sets the number of hidden_units in the ensemble to the provided config value, which defaults to 256. Nonetheless, this ensemble is ignored if a pre-trained one is given.

    transition = EnsembleTransition(obs_shape, action_shape, args['hidden_layer_size'], args['transition_layers'], args['transition_init_num']).to(args['device'])

All things considered, is there a particular reason why the pretrain_dynamics script instantiates the hidden layers in the ensemble with 1024 units instead of 256?
Or can I simply ignore this change, given that the results in the paper, as stated, were obtained with the latter value?

Kind regards

Unable to reproduce results for BCQ

Hi!
I was trying to reproduce the results for BCQ but failed. For example, in the maze2d-large-v1 environment, the d4rl score given by this repo is around 20, whereas the d4rl score given by the original BCQ code is around 30.

I tested the algorithm over three seeds and averaged the performance over the last 10 evaluations, so it does not seem to result from a bad seed selection or from large performance fluctuations. I also tried replacing the hyperparameters in benchmark/OfflineRL/offlinerl/algo/modelfree/bcq.py and benchmark/OfflineRL/offlinerl/config/algo/bcq_config.py with the original ones, but it still failed.

Could you please figure that out and fix it? Thanks a lot!

Next_Observation and Reward clamping in MOPO

Dear Authors,
In your MOPO implementation, when generating transitions from the ensemble, you take into account the min/max of the training batch, as follows:

obs_max = torch.as_tensor(train_buffer['obs'].max(axis=0)).to(self.device)
obs_min = torch.as_tensor(train_buffer['obs'].min(axis=0)).to(self.device)
rew_max = train_buffer['rew'].max()
rew_min = train_buffer['rew'].min()

So that, prior to computing the penalised reward and adding the experience tuple to the batch, you can clamp the observation and the reward between the min and max recovered above:

next_obs = torch.max(torch.min(next_obs, obs_max), obs_min)
reward = torch.clamp(reward, rew_min, rew_max)

Is there a particular reason behind this choice? I could not find a correspondence in the original MOPO implementation/publication. Or is it simply due to other re-implementation needs, considering the different framework used?

Kind regards

Update Aim version and add Aim running instructions on README

Hi, Gev here - the author of Aim.
Love your work on NeoRL and would love to contribute.

Changes Proposal

Aim 3.x was released a while ago and it is a much improved and more scalable version (especially for RL use cases).
I would love to update Aim to 3.x and add an instruction section to the README so it's easier to run the benchmarks.

Motivation

To provide an easier and smoother experience for NeoRL users with Aim.

Baseline Policies and Raw Results

Hi,

  • Do you plan to open-source raw results (especially for the newer version of your paper)? This could be very helpful for computing other relevant metrics.
  • Do you plan to open-source baseline policies?

This data could be extremely helpful for our research (it would considerably decrease the compute time we need).

Learning curves in IB

Hi,

I am executing the benchmark scripts with the IB datasets and I am not getting any results. The picture corresponds to a run of BCQ with the IB-Low-100 dataset. Every 10 training epochs I run 100 validation episodes and take their mean reward; the result is a horizontal line, i.e. no learning takes place.

Thank you for all your answers!

[Figure: ib_benchmark_training_curve_example, a flat BCQ training curve on the IB-Low-100 dataset]

Expert scores & random scores for normalization

I'm trying to get the normalized scores for the NeoRL MuJoCo tasks, but I could not find the expert and random scores in the codebase or in the paper.

Can you provide them, or point me to where I can find those scores?
(I think they should be somewhere, but I cannot find them...)
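For context, by "normalized score" I mean the usual d4rl-style normalization, roughly:

def normalized_score(score, random_score, expert_score):
    # Standard d4rl-style normalization: 0 corresponds to the random policy
    # and 100 to the expert policy.
    return 100.0 * (score - random_score) / (expert_score - random_score)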
