
osrl's People

Contributors

bobhuangc, henrylhh, howuhh, ja4822, liuzuxin


osrl's Issues

Using CDT for Multi-Agent Reinforcement Learning

Hi,

I was wondering if this framework can be used for a multi-agent RL problem.

If not, can you list the limitations that would stop this model from being used in a multi-agent RL problem?
If yes, can you explain the additional steps required for me to use this model in a multi-agent RL problem?

Thanks

Want to see the render results

Hello, I trained the model and want to see the rendered results during the eval phase. I made the following changes in BCQLTrainer.py.

    def rollout(self):
        """
        Evaluates the performance of the model on a single episode.
        """
        obs, info = self.env.reset()
        episode_ret, episode_cost, episode_len = 0.0, 0.0, 0
        for _ in range(self.model.episode_len):
            act, _ = self.model.act(obs)
            obs_next, reward, terminated, truncated, info = self.env.step(act)
            cost = info["cost"] * self.cost_scale
            obs = obs_next
            episode_ret += reward
            episode_len += 1
            episode_cost += cost
            if terminated or truncated:
                break

            self.env.render()

        return episode_ret, episode_len, episode_cost

In eval_bcql.py

    env = wrap_env(
        env=gym.make(cfg["task"], render_mode="human"),
        reward_scale=cfg["reward_scale"],
    )

But I got the following error:

C:\Users\s3424\anaconda3\envs\RLenv\python.exe D:/Code/Python/OSRL/examples/eval/eval_bcql.py
OApackage is not installed, can not use CDT.
load config from D:\Code\Python\OSRL\examples\train\logs\OfflineCarCircle1Gymnasium-v0-cost-10\BCQL_episode_len300-790e\BCQL_episode_len300-790e\config.yaml
load model from D:\Code\Python\OSRL\examples\train\logs\OfflineCarCircle1Gymnasium-v0-cost-10\BCQL_episode_len300-790e\BCQL_episode_len300-790e\checkpoint/model.pt
Traceback (most recent call last):
  File "D:\Code\Python\OSRL\examples\eval\eval_bcql.py", line 82, in <module>
    eval()
  File "C:\Users\s3424\anaconda3\envs\RLenv\lib\site-packages\pyrallis\argparsing.py", line 158, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "D:\Code\Python\OSRL\examples\eval\eval_bcql.py", line 74, in eval
    ret, cost, length = trainer.evaluate(args.eval_episodes)
  File "D:\Code\Python\OSRL\osrl\algorithms\bcql.py", line 327, in evaluate
    epi_ret, epi_len, epi_cost = self.rollout()
  File "C:\Users\s3424\anaconda3\envs\RLenv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "D:\Code\Python\OSRL\osrl\algorithms\bcql.py", line 353, in rollout
    self.env.render()
  File "C:\Users\s3424\anaconda3\envs\RLenv\lib\site-packages\gymnasium\core.py", line 418, in render
    return self.env.render()
  File "C:\Users\s3424\anaconda3\envs\RLenv\lib\site-packages\gymnasium\core.py", line 418, in render
    return self.env.render()
  File "C:\Users\s3424\anaconda3\envs\RLenv\lib\site-packages\gymnasium\core.py", line 418, in render
    return self.env.render()
  [Previous line repeated 1 more time]
  File "C:\Users\s3424\anaconda3\envs\RLenv\lib\site-packages\gymnasium\wrappers\order_enforcing.py", line 70, in render
    return self.env.render(*args, **kwargs)
  File "C:\Users\s3424\anaconda3\envs\RLenv\lib\site-packages\gymnasium\wrappers\env_checker.py", line 63, in render
    return env_render_passive_checker(self.env, *args, **kwargs)
  File "C:\Users\s3424\anaconda3\envs\RLenv\lib\site-packages\gymnasium\utils\passive_env_checker.py", line 391, in env_render_passive_checker
    result = env.render()
  File "C:\Users\s3424\anaconda3\envs\RLenv\lib\site-packages\gymnasium\core.py", line 418, in render
    return self.env.render()
  File "C:\Users\s3424\anaconda3\envs\RLenv\lib\site-packages\gymnasium\core.py", line 418, in render
    return self.env.render()
  File "C:\Users\s3424\anaconda3\envs\RLenv\lib\site-packages\gymnasium\wrappers\order_enforcing.py", line 70, in render
    return self.env.render(*args, **kwargs)
  File "C:\Users\s3424\anaconda3\envs\RLenv\lib\site-packages\gymnasium\wrappers\env_checker.py", line 63, in render
    return env_render_passive_checker(self.env, *args, **kwargs)
  File "C:\Users\s3424\anaconda3\envs\RLenv\lib\site-packages\gymnasium\utils\passive_env_checker.py", line 391, in env_render_passive_checker
    result = env.render()
  File "C:\Users\s3424\anaconda3\envs\RLenv\lib\site-packages\safety_gymnasium\builder.py", line 312, in render
    assert self.render_parameters.mode, 'Please specify the render mode when you make env.'
AssertionError: Please specify the render mode when you make env.

How should I solve this problem?

Problems with the evaluation part

Hi,

I am trying to run the evaluation using the commands provided in the README. It seems there is no config file in the checkpoints folder, just model.pt and model_best.pt. Here is the list of all files in the directory:

[screenshot of the directory listing]

When I try to run eval_cdt.py, I get an error saying that the config file location is not a directory. What am I doing wrong here?

The path to my model and best model is "/home/xubuntu/Desktop/homelocal/3rd Year/DT_RL/OSRL/logs/OfflineCarPush1Gymnasium-v0-cost-5/CDT_cost5-eea0/CDT_cost5-eea0/checkpoint/mode.pt"

I even tried passing the path argument as mentioned in the command in your README, but I still get the same issue. Please see the screenshot below.

[screenshot]
In this screenshot I entered the model path inside the eval_cdt.py file, so the path argument is not used here.

[screenshot]
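
For reference, the eval logs quoted in the render issue above ("load config from .../config.yaml", "load model from .../checkpoint/model.pt") suggest the scripts expect a run directory that contains config.yaml, with the weights stored under checkpoint/. A minimal sketch of that layout (the concrete path below is illustrative, not an actual run):

    # Illustrative only: the run-directory layout inferred from the eval logs quoted earlier
    # in this thread; the concrete path here is an example, not an actual run.
    from pathlib import Path

    model_path = Path("logs/OfflineCarPush1Gymnasium-v0-cost-5/"
                      "CDT_cost5-eea0/CDT_cost5-eea0/checkpoint/model.pt")
    run_dir = model_path.parent.parent       # .../CDT_cost5-eea0/CDT_cost5-eea0
    config_path = run_dir / "config.yaml"    # config.yaml is expected next to checkpoint/
    print(config_path)                       # should point at an existing file for a complete run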

missing CORL citation

Hi! Great work, surely, safety is a very important topic in offline RL. However, we are a little bit puzzled by the complete lack of citations for the CORL library, considering that:

  1. CORL is a library that is closer in spirit and implementation to OSRL, yet the article cites only d3rlpy. And more broadly, CORL is used in many papers and projects in offline RL, so we feel that a review of related work without mention of CORL is incomplete.
  2. CORL seems to have had a lot of influence on the design and code of this library, as one can find comments and code parts taken directly from CORL. For example, compare this code in OSRL [1, 2] with exactly the same code in CORL [1, 2].

We would like to remind you that CORL is licensed under the Apache License. If you borrow code like this, you should give credit to the source. Also, such misconduct could actually be a significant violation of the NeurIPS Datasets and Benchmarks Track rules.

Thus, we would highly appreciate if you credited CORL both in the code and publication. Thanks in advance!

The parameters in cdt_configs.py

I have successfully written a custom Gymnasium environment and used it with CDT. Here's the environment I created:
[screenshot of the environment code]

But I ran into two problems:

  1. When the data generated by my environment ('actions', 'rewards', etc.) lies between 0 and 1, CDT raises an error; it only works if the data range is roughly 0-100. I think I should modify the parameters in cdt_configs.py, but I don't know which parameters to modify or how.

  2. There are some parameters in cdt_configs.py whose effects I don't understand when I change them.

[screenshots of cdt_configs.py]
For example: should I change num_heads, target_returns, cost_limit, deg, max_rew_decrease, max_reward, or reward_scale? How should I change them, and what do they do? Is there any documentation for these parameters?
Could you help me? I'm really confused about these parameters!
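
For reference, a hedged summary of what these parameters appear to control, based on the inline comments in the CDTTrainConfig quoted later in this thread and on common transformer conventions. These annotations are an interpretation, not official documentation:

    # Interpretation only, based on the CDTTrainConfig comments quoted later in this thread
    # and on common decision-transformer conventions; not official OSRL documentation.
    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class CDTParamNotes:
        num_heads: int = 8             # attention heads in each transformer block
        reward_scale: float = 0.1      # multiplier applied to rewards before training
        cost_limit: int = 10           # cost threshold; should match the target cost return for CDT
        target_returns: Tuple[Tuple[float, ...], ...] = ((450.0, 10),)  # (reward, cost) targets used at evaluation
        deg: int = 4                   # degree of the polynomial fit to the cost-reward Pareto frontier (augmentation)
        max_reward: float = 1.0        # maximum absolute reward for augmented trajectories
        max_rew_decrease: float = 0.5  # max reward drop of an augmented traj w.r.t. the nearest Pareto-frontier traj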

How can I use this on Atari dataset?

Hi,

I would like this model to take visual observations as inputs (for example from Atari dataset) rather than trajectories. Is there a way I could do that?

Thanks

Questions about the paper details and CDT code bugs

Hi @liuzuxin, great work! I encountered the following bugs and problems when using the OSRL library:

  1. What are the specific values of the three distinct cost thresholds mentioned in Tables 3 and 6 of the paper?
  2. When using the example code in OSRL, changing the environment does not modify the default episode_len, which remains constant at 300. Is this a bug?
  3. When I run the CDT code:
python ./examples/train/train_cdt.py --task OfflineCarPush1Gymnasium-v0 --cost_limit 5 --device "cuda:3"

but I got:

/root/anaconda3/envs/osrl/lib/python3.10/site-packages/numpy/lib/polynomial.py:667: RuntimeWarning: invalid value encountered in divide
  lhs /= scale
 ** On entry to DLASCL parameter number  4 had an illegal value
Traceback (most recent call last):
  File "/root/zyn/ydj_data_collection/OSRL/./examples/train/train_cdt.py", line 226, in <module>
    train()
  File "/root/anaconda3/envs/osrl/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/root/zyn/ydj_data_collection/OSRL/./examples/train/train_cdt.py", line 127, in train
    dataset = SequenceDataset(
  File "/root/zyn/ydj_data_collection/OSRL/osrl/common/dataset.py", line 726, in __init__
    self.idx, self.aug_data, self.pareto_frontier, self.indices = augmentation(
  File "/root/zyn/ydj_data_collection/OSRL/osrl/common/dataset.py", line 369, in augmentation
    pareto_frontier = np.poly1d(np.polyfit(cost_ret_pareto, rew_ret_pareto, deg=deg))
  File "<__array_function__ internals>", line 200, in polyfit
  File "/root/anaconda3/envs/osrl/lib/python3.10/site-packages/numpy/lib/polynomial.py", line 668, in polyfit
    c, resids, rank, s = lstsq(lhs, rhs, rcond)
  File "<__array_function__ internals>", line 200, in lstsq
  File "/root/anaconda3/envs/osrl/lib/python3.10/site-packages/numpy/linalg/linalg.py", line 2285, in lstsq
    x, resids, rank, s = gufunc(a, b, rcond, signature=signature, extobj=extobj)
  File "/root/anaconda3/envs/osrl/lib/python3.10/site-packages/numpy/linalg/linalg.py", line 101, in _raise_linalgerror_lstsq
    raise LinAlgError("SVD did not converge in Linear Least Squares")
numpy.linalg.LinAlgError: SVD did not converge in Linear Least Squares
wandb: Waiting for W&B process to finish... (failed 1).

How should I solve this problem?
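
For context, here is a minimal sketch that produces the same warning and LinAlgError outside OSRL. np.polyfit normalizes the Vandermonde columns by their norms, so if the cost returns passed to the Pareto-frontier fit have essentially no spread (for example, all zeros), the normalization produces NaNs and lstsq fails just as in the traceback above. This is only a plausible reading of the error, not a confirmed diagnosis of the dataset:

    # A plausible reproduction of the failure above (an assumption about the cause, not a confirmed fix):
    # constant cost returns make several Vandermonde columns zero, their norms zero, and the in-place
    # normalization inside np.polyfit produces NaNs, after which lstsq cannot converge.
    import numpy as np

    cost_ret_pareto = np.zeros(10)               # degenerate: no spread in cost returns
    rew_ret_pareto = np.linspace(0.0, 1.0, 10)

    try:
        np.poly1d(np.polyfit(cost_ret_pareto, rew_ret_pareto, deg=4))
    except np.linalg.LinAlgError as err:
        print("polyfit failed:", err)            # "SVD did not converge in Linear Least Squares"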

Unable to install the OApackage

Hi,

I am trying to run train_cdt.py as per the instructions listed on the GitHub README page. However, when I try to install the OApackage I get the following error.

❯ pip install OApackage==2.7.6
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting OApackage==2.7.6
  Downloading OApackage-2.7.6.tar.gz (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 11.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [12 lines of output]
      /tmp/pip-install-1t4f2txe/oapackage_afdf46ec7c0d4ba1a2622e95ae5261fd/setup.py:128: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
        if LooseVersion(swig_version) >= LooseVersion(swig_minimum_version):
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-1t4f2txe/oapackage_afdf46ec7c0d4ba1a2622e95ae5261fd/setup.py", line 297, in <module>
          raise Exception('could not find a recent version if SWIG')
      Exception: could not find a recent version if SWIG
      swig_version 3.0.12, swig_executable /usr/bin/swig3.0
      Readthedocs environment: False
      checkZlib: compile and link
      find_packages: ['oapackage', 'tests']
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

I tried installing the dependencies within the conda environment as well as outside the environment, but the error persists. Any help in resolving the error will be appreciated.

The output of the training result is empty

I ran the command python examples/train/train_bcql.py --task OfflineCarCircle-v0 as illustrated in the example. However, I found that the training output is empty. I looked at the wandb web page, but it didn't generate any charts, just some empty folders. logs/OfflineCarCircle-v0-cost-10/BCQL-3d74/BCQL-3d74/progress.txt is also empty. There must be something wrong, but I don't know why. I'm looking forward to your help.
[screenshots of the empty wandb page and log folders]

Questions about cost_transform

Thank you for the wonderful work. I have read the Constrained Decision Transformer paper. Could you explain why you apply cost_transform when calculating the cost-to-go (50 - x in CDT's forward and 70 - x in the dataset's sample_prob)? Also, if this is mentioned in the paper, I would appreciate it if you could tell me where it is written. Thank you.
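
For readers skimming the thread, the two transforms being asked about are simple affine flips of the cost value, restated below. The constants are the values quoted in the question and the function names are illustrative, not the OSRL identifiers:

    # Illustrative restatement of the transforms mentioned above; names are not the OSRL identifiers,
    # and the constants are simply the values quoted in the question.
    def transform_cost_to_go(cost_to_go: float) -> float:
        return 50 - cost_to_go    # applied to the cost-to-go in CDT's forward pass, as described above

    def transform_for_sample_prob(cost_return: float) -> float:
        return 70 - cost_return   # used when computing the dataset's sample_prob, as described above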

After installing all the packages in a conda env, I got this error message!

Hi, I got an error after installing all the packages and running your code:

Traceback (most recent call last):
  File "/home/ubuntu/OSRL/examples/train/train_bc.py", line 15, in <module>
    from fsrl.utils import WandbLogger
  File "/home/ubuntu/anaconda3/envs/OSRL/lib/python3.9/site-packages/fsrl/utils/__init__.py", line 3, in <module>
    from fsrl.utils.logger import BaseLogger, DummyLogger, TensorboardLogger, WandbLogger
  File "/home/ubuntu/anaconda3/envs/OSRL/lib/python3.9/site-packages/fsrl/utils/logger/__init__.py", line 4, in <module>
    from fsrl.utils.logger.tb_logger import TensorboardLogger
  File "/home/ubuntu/anaconda3/envs/OSRL/lib/python3.9/site-packages/fsrl/utils/logger/tb_logger.py", line 5, in <module>
    from torch.utils.tensorboard import SummaryWriter
  File "/home/ubuntu/anaconda3/envs/OSRL/lib/python3.9/site-packages/torch/utils/__init__.py", line 4, in <module>
    from .throughput_benchmark import ThroughputBenchmark
  File "/home/ubuntu/anaconda3/envs/OSRL/lib/python3.9/site-packages/torch/utils/throughput_benchmark.py", line 2, in <module>
    import torch._C
ModuleNotFoundError: No module named 'torch._C'

As the number of training rounds increases, the values of cost and reward change irregularly.

In my env, as the number of training rounds increases, the values of cost and reward change irregularly. The training results are shown below; they don't look normal.
[training-curve screenshots]

However, when I choose another env (like OfflinePointCircle2Gymnasium-v0), the cost gradually declines and the return rises as the number of training rounds increases, which looks normal.

Here are the parameters in my cdt_config.py:

@dataclass
class CDTTrainConfig:
    # wandb params
    project: str = "OSRL-baselines"
    group: str = None
    name: Optional[str] = None
    prefix: Optional[str] = "CDT"
    suffix: Optional[str] = ""
    logdir: Optional[str] = "logs"
    verbose: bool = True
    # dataset params
    outliers_percent: float = None
    noise_scale: float = None
    inpaint_ranges: Tuple[Tuple[float, float], ...] = None
    epsilon: float = None
    density: float = 1.0
    # model params
    embedding_dim: int = 128
    num_layers: int = 3
    num_heads: int = 8
    action_head_layers: int = 1
    seq_len: int = 10
    episode_len: int = 300
    attention_dropout: float = 0.1
    residual_dropout: float = 0.1
    embedding_dropout: float = 0.1
    time_emb: bool = True
    # training params
    #task: str = "OfflinePointCircle2Gymnasium-v0"
    task: str = "Autobidding-v0"
    dataset: str = None
    learning_rate: float = 1e-4
    betas: Tuple[float, float] = (0.9, 0.999)
    weight_decay: float = 1e-4
    clip_grad: Optional[float] = 0.25
    batch_size: int = 8
    update_steps: int = 5000
    lr_warmup_steps: int = 200
    reward_scale: float = 0.1
    cost_scale: float = 1
    num_workers: int = 0
    # evaluation params
    target_returns: Tuple[Tuple[float, ...],
                          ...] = ((450.0, 10), (500.0, 20), (550.0, 50))  # reward, cost
    # The cost limit corresponds to the cost threshold for your problem; it should be the same as your target cost return for CDT.
    cost_limit: int = 100
    eval_episodes: int = 10
    eval_every: int = 100
    # general params
    seed: int = 0
    device: str = "cuda:0"
    threads: int = 6
    # augmentation param
    deg: int = 4
    pf_sample: bool = False
    beta: float = 1.0
    augment_percent: float = 0.2
    # maximum absolute value of reward for the augmented trajs
    max_reward: float = 1.0
    # minimum reward above the PF curve
    min_reward: float = 0.2
    # the max decrease of ret between the associated traj
    # w.r.t the nearest pf traj
    max_rew_decrease: float = 0.5
    # model mode params
    use_rew: bool = True
    use_cost: bool = True
    cost_transform: bool = True
    cost_prefix: bool = False
    add_cost_feat: bool = False
    mul_cost_feat: bool = False
    cat_cost_feat: bool = False
    loss_cost_weight: float = 0.02
    loss_state_weight: float = 0
    cost_reverse: bool = False
    # pf only mode param
    pf_only: bool = False
    rmin: float = 300
    cost_bins: int = 60
    npb: int = 5
    cost_sample: bool = True
    linear: bool = True  # linear or inverse
    start_sampling: bool = False
    prob: float = 0.2
    stochastic: bool = True
    init_temperature: float = 0.1
    no_entropy: bool = False
    # random augmentation
    random_aug: float = 0
    aug_rmin: float = 400
    aug_rmax: float = 500
    aug_cmin: float = -2
    aug_cmax: float = 25
    cgap: float = 5
    rstd: float = 1
    cstd: float = 0.2

@dataclass
class CDTCarCircleConfig(CDTTrainConfig):
    pass

@dataclass
class AutobiddingConfig(CDTTrainConfig):
    # model params
    seq_len: int = 10
    episode_len: int = 1000
    # training params
    task: str = "Autobidding-v0"
    target_returns: Tuple[Tuple[float, ...],
                          ...] = ((15.0, 20), (15.0, 40), (15.0, 80))# reward, cost
    # augmentation param
    deg: int = 1
    max_reward: float = 2
    min_reward: float = 1
    max_rew_decrease: float = 0.3
    device: str = "cuda:0"
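
As an aside, the pyrallis/argparsing.py frames in the tracebacks earlier in this thread suggest these config dataclasses are consumed through a pyrallis-wrapped entry point, so individual fields can also be overridden from the command line. A minimal sketch under that assumption, reusing the CDTTrainConfig defined above:

    # Minimal sketch, assuming the pyrallis-wrapped entry point implied by the tracebacks above.
    import pyrallis

    @pyrallis.wrap()
    def train(cfg: CDTTrainConfig):
        # e.g. `python train_cdt.py --cost_limit 20 --deg 1` overrides the dataclass defaults
        print(cfg.task, cfg.cost_limit, cfg.target_returns)

    if __name__ == "__main__":
        train()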

The format of my data is shown below:
[screenshot of the dataset format]

Could you help me? What do you think the problem is?

How can I re-use this code?

Although you mentioned that your code is inspired by CORL, I find differences other than the CTG as well. Can you help me understand your code better by answering the following:

  1. What are the major differences between your code and CORL, other than the CTG, and why?
  2. If I re-use this codebase for one of my projects, do I cite you or CORL?
  3. What steps will I need to take to include another parameter similar to the CTG in the existing architecture? Let's say, in addition to the RTG and CTG, I would like to add another feature such as distance-to-go; how can I include that? (A rough sketch of what I mean follows this list.)
  4. Which files (model, training, evaluation, visualization, results, etc.) make up just the CDT? I would like to isolate all the CDT code from this codebase to work on something like point 3.
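
The sketch below is a generic decision-transformer-style illustration of point 3, not OSRL's actual CDT code: an extra "to-go" conditioning stream such as distance-to-go (dtg) embedded alongside the return-to-go (rtg) and cost-to-go (ctg) tokens. All names are illustrative assumptions.

    # Generic decision-transformer-style sketch (not OSRL's actual CDT code) of adding a third
    # "to-go" conditioning stream; all names here are illustrative assumptions.
    import torch
    import torch.nn as nn

    class ToGoEmbedding(nn.Module):
        def __init__(self, embedding_dim: int = 128):
            super().__init__()
            self.rtg_emb = nn.Linear(1, embedding_dim)   # return-to-go token embedding
            self.ctg_emb = nn.Linear(1, embedding_dim)   # cost-to-go token embedding
            self.dtg_emb = nn.Linear(1, embedding_dim)   # new: distance-to-go token embedding

        def forward(self, rtg, ctg, dtg):
            # each input has shape [batch, seq_len]; the three embeddings become extra tokens
            # per timestep, to be interleaved with state/action tokens before the transformer
            tokens = torch.stack(
                [
                    self.rtg_emb(rtg.unsqueeze(-1)),
                    self.ctg_emb(ctg.unsqueeze(-1)),
                    self.dtg_emb(dtg.unsqueeze(-1)),
                ],
                dim=2,
            )
            return tokens  # [batch, seq_len, 3, embedding_dim]

    emb = ToGoEmbedding()
    rtg, ctg, dtg = torch.rand(4, 10), torch.rand(4, 10), torch.rand(4, 10)
    print(emb(rtg, ctg, dtg).shape)  # torch.Size([4, 10, 3, 128])

Beyond the model, the dataset/sequence code would need to produce the new to-go values and interleave them in the same token order, and the evaluation loop would need a target value for the new quantity.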

Thanks

Questions about cost loss and target cost

Thank you for your outstanding work. I have read the 'Constrained Decision Transformer for Offline Safe Reinforcement Learning'. I have two questions I'd like to kindly ask:

  • In the implementation, the cross-entropy loss between the cost prediction and the ground truth is included in the objective function with a weight of 0.02 (a rough sketch of what I mean follows this list). I could not find any mention of this in the paper. Is this as intended?
  • I am curious about the specific target cost return used in Table 1 and Figure 5. Also, in Figure 4, did you vary the target cost return discretely? If so, at what intervals did you change it?
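
A minimal sketch of the weighted auxiliary cost term described in the first bullet, using the loss_cost_weight = 0.02 default quoted earlier in this thread; the function signature and tensor names are assumptions, not the exact OSRL code:

    # Sketch of the weighted auxiliary cost loss described above; names and shapes are assumptions.
    import torch
    import torch.nn.functional as F

    def total_loss(action_pred, action_target, cost_logits, cost_target, loss_cost_weight=0.02):
        action_loss = F.mse_loss(action_pred, action_target)   # main action-prediction term
        cost_loss = F.cross_entropy(cost_logits, cost_target)  # auxiliary cost-prediction term
        return action_loss + loss_cost_weight * cost_loss

    # toy shapes: continuous actions, binary cost labels
    a_pred, a_tgt = torch.rand(8, 2), torch.rand(8, 2)
    c_logits, c_tgt = torch.rand(8, 2), torch.randint(0, 2, (8,))
    print(total_loss(a_pred, a_tgt, c_logits, c_tgt))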

Thank you.
(I also posed a question in issue #7, but I separated the issues for clarity and to make it easier for future readers to understand.)

DT-Cost file missing in examples

Hi,

I see that one of your baseline methods is a simple DT-Cost, but I am unable to find the files for DT-Cost in the examples folder. Could you kindly share those files for training?

Thanks
