aviralkumar2907 / bear
Code for Stabilizing Off-Policy RL via Bootstrapping Error Reduction
I keep getting this error due to some in-place changes to the variable `a` in `sample_multiple`:
```
[W python_anomaly_mode.cpp:60] Warning: Error detected in AddmmBackward. Traceback of forward call that caused the error:
  File "/home/7331215/wrappers/run_optimizer.py", line 211, in <module>
    main(sys.argv[1:])
  File "/home/7331215/wrappers/run_optimizer.py", line 152, in main
    RewPred.generate_knobs()
  File "/home/7331215//wrappers/../../rewardpredictor/rewardpredictor_base.py", line 431, in generate_knobs
    self.generate_knobs_BEAR()
  File "/home/7331215//wrappers/../../rewardpredictor/rewardpredictor_base.py", line 532, in generate_knobs_BEAR
    pol_vals = policy.train(replay_buffer, iterations = int(5e3))
  File "/home/7331215/wrappers/../../rl/Algos/BEAR/algos.py", line 440, in train
    actor_actions, raw_actor_actions = self.actor.sample_multiple(state, num_samples)# num)
  File "/home/7331215/../../rl/Algos/BEAR/algos.py", line 76, in sample_multiple
    log_std_a = self.log_std(a.clone())
  File "/home/7331215/virtenvs/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/7331215/virtenvs/lib64/python3.6/site-packages/torch/nn/modules/linear.py", line 91, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/7331215/virtenvs/lib64/python3.6/site-packages/torch/nn/functional.py", line 1674, in linear
    ret = torch.addmm(bias, input, weight.t())
 (function print_stack)
Traceback (most recent call last):
  File "/home/7331215/wrappers/run_optimizer.py", line 211, in <module>
    main(sys.argv[1:])
  File "/home/7331215/wrappers/run_optimizer.py", line 152, in main
    RewPred.generate_knobs()
  File "/home/7331215/wrappers/../../rewardpredictor/rewardpredictor_base.py", line 431, in generate_knobs
    self.generate_knobs_BEAR()
  File "/home/7331215/wrappers/../../rewardpredictor/rewardpredictor_base.py", line 532, in generate_knobs_BEAR
    pol_vals = policy.train(replay_buffer, iterations = int(5e3))
  File "/home/7331215//wrappers/../../rl/Algos/BEAR/algos.py", line 508, in train
    (-lagrange_loss).backward()
  File "/home/7331215/virtenvs/lib64/python3.6/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/7331215/virtenvs/lib64/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [300, 32]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
```
Any guidance for how to fix? I have edited main.py to adapt to my specific problem task but haven't edited algos.py except to try to debug this error.
Edit: Nothing happens.
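This error pattern typically appears when a network's weights are updated in place (for example by an optimizer step) between a forward pass and a later backward() through the same graph; the [300, 32] tensor in the message matches a Linear layer's transposed weight. A minimal sketch of the failure and one way out, using a hypothetical two-loss setup rather than the repo's actual training loop:

```python
import torch

# Minimal reproduction of the failure pattern (hypothetical setup):
# opt.step() updates the Linear weights in place, so a later backward()
# through the *old* forward graph fails with "modified by an inplace
# operation ... output 0 of TBackward".
net = torch.nn.Linear(32, 300)
opt = torch.optim.Adam(net.parameters())

x = torch.randn(8, 32)
out = net(x)
loss_a = out.pow(2).mean()
loss_b = out.abs().mean()       # shares the same graph as loss_a

loss_a.backward(retain_graph=True)
opt.step()                      # in-place weight update -> stale graph
# loss_b.backward()             # would raise the RuntimeError above

# Fix: run every backward() before any optimizer step, or recompute the
# forward pass after stepping.
opt.zero_grad()
out = net(x)
out.abs().mean().backward()
opt.step()
```

In algos.py this would mean making sure every backward() that depends on the actor's forward pass (including the Lagrange loss) runs before the actor's optimizer steps, or recomputing the forward pass afterwards.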
I am confused about the VAE network; could you please explain it? Is it just pre-trained to model the distribution of the behavior policy?
Lines 395 to 398 in f2e31c1
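For context, in BCQ/BEAR-style code the VAE is trained as a generative model of the dataset's (state, action) pairs, so that sampling its decoder approximates the behavior policy's action distribution; it is not part of the learned policy itself. A minimal sketch of such a training loss, with illustrative names that need not match the repo's exact API:

```python
import torch.nn.functional as F

# Hypothetical conditional VAE loss in the BCQ/BEAR style: reconstruct the
# logged action given the state, and regularize the diagonal-Gaussian
# encoder toward N(0, I). Sampling decoder(state, z) with z ~ N(0, I) then
# approximates the behavior policy's action distribution.
def vae_loss(vae, state, action):
    recon_action, mean, log_std = vae(state, action)
    recon = F.mse_loss(recon_action, action)
    # KL( N(mean, exp(log_std)^2) || N(0, I) ) for a diagonal Gaussian
    kl = -0.5 * (1 + 2 * log_std - mean.pow(2) - (2 * log_std).exp()).mean()
    return recon + 0.5 * kl
```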
Couldn't reproduce the results on the MuJoCo suite.
Setting: We run BEAR with the recommended settings (a sketch of the MMD objective they control appears after this list):
- mmd_sigma = 20.0
- kernel_type = gaussian
- num_samples_match = 5
- version = 0 or 2
- lagrange_thresh = 10.0
- mode = auto
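For reference, mmd_sigma and kernel_type parameterize the Gaussian-kernel MMD penalty between action samples from the learned policy and from the behavior model. A minimal sketch of that estimator, assuming BEAR's usual formulation (the repo's implementation may differ in details):

```python
import torch

# Hedged sketch of the Gaussian-kernel MMD controlled by these settings.
# x, y: [batch, num_samples, action_dim] action samples from the two policies.
def gaussian_mmd(x, y, sigma=20.0):
    def k(a, b):
        # mean Gaussian kernel over all pairs of samples
        d = (a.unsqueeze(2) - b.unsqueeze(1)).pow(2).sum(-1)
        return torch.exp(-d / (2.0 * sigma)).mean((1, 2))
    # MMD^2 = E[k(x,x)] - 2 E[k(x,y)] + E[k(y,y)], per batch element
    return (k(x, x) - 2.0 * k(x, y) + k(y, y)).clamp(min=1e-6).sqrt()
```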
The batch dataset is produced by training a DDPG agent for 1 million time steps; to reproduce it, we use the DDPG code from the BCQ repository.
We use the final-buffer setting from the BCQ paper.
Here are the full results.
Note that "behavioral" denotes the evaluation of the DDPG agent during training.
Sweetice 07/20 update: uploaded PNG versions of the plots.
For clearer reading:
Ant-v2.pdf
HalfCheetah-v2.pdf
Hopper-v2.pdf
InvertedDoublePendulum-v2.pdf
InvertedPendulum-v2.pdf
Reacher-v2.pdf
Swimmer-v2.pdf
Walker2d-v2.pdf
Thank you for your great work.
Lines 468 to 470 in f2e31c1
A default value of 0.1 is set for delta_conf in the code.
But I can't find what this parameter means in the paper.
Can you explain it?
It seems `done` is missing from the calculation of target_Q in BCQ.
https://github.com/aviralkumar2907/BEAR/blob/master/algos.py#L921
Should it be target_Q = reward + done * discount * target_Q?
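For reference, the standard masked Bellman target is sketched below; whether the line needs fixing depends on the stored flag's convention, since BCQ-style code often inverts done when sampling from the buffer, in which case the existing line is equivalent.

```python
import torch

# Dummy batch, just to make the two conventions concrete.
reward = torch.zeros(4, 1)
done = torch.tensor([[0.], [0.], [1.], [0.]])  # 1 at terminal transitions
target_Q = torch.ones(4, 1)
discount = 0.99

# Standard convention: zero out the bootstrap term at terminal transitions.
target_Q = reward + (1.0 - done) * discount * target_Q

# BCQ-style convention: the sampler hands back the inverted flag
# (not_done = 1 - done), in which case the line from the issue is equivalent:
# target_Q = reward + not_done * discount * target_Q
```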
Hi, Kumar! In the last issue, you mentioned that you don't test BEAR in the final-buffer setting and recommended using the d4rl datasets. Following your comments, I used the d4rl datasets and the code in d4rl_evaluation. Unfortunately, I cannot reproduce your results.
The results are here. :)
For clearer reading:
Offline_rl_results.pdf
Thank you for the code!
I tried to reproduce the results reported for the D4RL "walker2d-medium-v0" environment.
I run the code with the following command:

```
python main.py --eval_freq=1000 --algo_name=BEAR --env_name=walker2d-medium-v0 --log_dir=data_walker_BEAR/ --lagrange_thresh=10.0 --distance_type=MMD --mode=auto --num_samples_match=5 --lamda=0.0 --version=0 --mmd_sigma=20.0 --kernel_type=gaussian --use_ensemble_variance="False"
```
The results, averaged over four random seeds, are shown here:
However, the score is much lower than the 1526.7 reported for walker2d-medium-v0 in the D4RL paper.
Do you know how to solve this?
Thank you!
Best,
Rui
Hello, I have a problem with BEAR_IS in your algos.py file.
As we know, DDPG is essentially one-step Q-learning for continuous tasks, and BEAR uses the same architecture. Given that, importance sampling seems unnecessary in BEAR, because the difference between the current policy and the behavioral policy does not make the one-step estimate of the Q-value inaccurate.
So can you explain why you wrote an importance-sampling version of BEAR in your project?
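For concreteness, an importance-sampling correction in this setting would typically reweight per-sample losses by the likelihood ratio of the logged action under the current versus the behavior policy. A hypothetical sketch of such a weight (not claiming this is exactly what algos.py computes):

```python
import torch

# Hypothetical importance weight: exp(log pi(a|s) - log beta(a|s)),
# clipped for numerical stability.
def importance_weight(curr_log_prob, behavior_log_prob, max_weight=10.0):
    return torch.clamp((curr_log_prob - behavior_log_prob).exp(), max=max_weight)

w = importance_weight(torch.tensor([-1.0]), torch.tensor([-1.5]))  # exp(0.5)
```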
```
Traceback (most recent call last):
  File "/home/hq/code/remotepycharmfolder/BEAR-master/main.py", line 209, in <module>
    pol_vals = policy.train(replay_buffer, iterations=int(args.eval_freq))
  File "/home/hq/code/remotepycharmfolder/BEAR-master/algos.py", line 655, in train
    state_np, next_state_np, action, reward, done, mask, data_mean, data_cov = replay_buffer.sample(batch_size, with_data_policy=True)
  File "/home/hq/code/remotepycharmfolder/BEAR-master/utils.py", line 40, in sample
    data_mean = self.storage['data_policy_mean'][ind]
KeyError: 'data_policy_mean'
```
Running the BEAR_IS algorithm raises this error.
Note that the program doesn't save data_policy_mean and data_policy_logvar in the buffer. :)
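One possible workaround, sketched with a hypothetical helper (the key names come from the traceback; the buffer layout is assumed, not read from utils.py): record the behavior policy's Gaussian parameters for each transition when filling the buffer, under the keys that sample(..., with_data_policy=True) later reads.

```python
import numpy as np

# Hypothetical helper: store the transition as usual, then append the
# behavior policy's Gaussian parameters under the missing keys. Convert the
# lists to arrays once the buffer is full so that
# storage['data_policy_mean'][ind] supports index arrays.
def add_with_policy_stats(buffer, transition, mean, logvar):
    buffer.add(transition)
    buffer.storage.setdefault('data_policy_mean', []).append(np.asarray(mean))
    buffer.storage.setdefault('data_policy_logvar', []).append(np.asarray(logvar))
```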