microsoft / oac-explore
Code accompanying the paper "Better Exploration with Optimistic Actor Critic" (NeurIPS 2019)
License: MIT License
I've forked the repo and started making some changes to try something out, but the code appears to have some kind of clean-repo check: when I try to run main.py on my branch, the only output I get is the git diff and then the code exits. Any suggestions on how to disable that? Obviously I have verified that I can run main.py locally on the master branch; the issue only appears on my custom branch.
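My guess is that the script contains a guard roughly like the following hypothetical sketch (this is not the repo's actual code, and assert_clean_repo is my name for it); if so, commenting it out or gating it behind a flag would disable the check:

# Hypothetical sketch of a guard that prints the git diff and aborts
# when the working tree has uncommitted changes.
from git import Repo
import sys

def assert_clean_repo(path='./'):
    repo = Repo(path)
    diff = repo.git.diff()  # non-empty string if there are uncommitted changes
    if diff:
        print(diff)
        sys.exit(1)  # abort so that logged results stay reproducible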
Why is there a conflict in requirements.txt?
Would it be straightforward to implement a batched version of get_optimistic_exploration_action?
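A sketch of one way it could be batched (an assumption-laden sketch, not the repo's actual API: the signature, the defaults for beta_ub and delta, and the helper name batched_optimistic_shift are all illustrative). The key observation is that each observation's upper-bound Q depends only on its own pre-tanh mean, so the gradient of the summed upper bound recovers all per-sample gradients in a single backward pass:

import math
import torch

def batched_optimistic_shift(qf1, qf2, obs, pre_tanh_mu, std,
                             beta_ub=4.66, delta=23.53):
    # obs: (B, obs_dim); pre_tanh_mu, std: (B, act_dim)
    mu = pre_tanh_mu.detach().requires_grad_(True)
    acts = torch.tanh(mu)
    q1, q2 = qf1(obs, acts), qf2(obs, acts)
    mean_q = (q1 + q2) / 2.0
    sigma_q = torch.abs(q1 - q2) / 2.0
    q_ub = mean_q + beta_ub * sigma_q  # (B, 1) upper confidence bound
    # Summing over the batch gives per-sample gradients in one pass,
    # since the samples are independent.
    grad = torch.autograd.grad(q_ub.sum(), mu)[0]  # (B, act_dim)
    sigma_t = std * std  # diagonal covariance of the target policy
    denom = torch.sqrt((grad * grad * sigma_t).sum(dim=-1, keepdim=True)) + 1e-6
    # Shifted exploration mean, following the paper's closed-form solution
    return pre_tanh_mu + math.sqrt(2.0 * delta) * sigma_t * grad / denom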
The following code generates an error in some of the most recent versions of PyTorch:
oac-explore/trainer/trainer.py
Lines 146 to 159 in cbc0333
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
To solve it, it is necessary to move these lines
oac-explore/trainer/trainer.py
Lines 120 to 124 in cbc0333
so that they fall between the Q-network gradient steps and the policy-network step, like so:
"""
Update networks
"""
self.qf1_optimizer.zero_grad()
qf1_loss.backward(retain_graph=True)
self.qf1_optimizer.step()
self.qf2_optimizer.zero_grad()
qf2_loss.backward(retain_graph=True)
self.qf2_optimizer.step()
q_new_actions = torch.min(
self.qf1(obs, new_obs_actions),
self.qf2(obs, new_obs_actions),
)
policy_loss = (alpha * log_pi - q_new_actions).mean()
self.policy_optimizer.zero_grad()
policy_loss.backward(retain_graph=True)
self.policy_optimizer.step()
Be aware that if you simply use an old version of PyTorch to sidestep this error, the behaviour might not be what you expect, since policy_loss was computed against Q-network weights that no longer exist after the in-place optimizer steps.
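For context, here is a minimal repro of the failure mode (my own illustration, not the repo's code): building a loss through a network, stepping its optimizer, and only then calling backward() on the stale loss triggers exactly this RuntimeError in recent PyTorch versions.

import torch

net = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.1)

x = torch.randn(8, 4)
stale_loss = net(x).mean()  # graph saves the current weights for backward

other_loss = net(x).pow(2).mean()
opt.zero_grad()
other_loss.backward()
opt.step()  # in-place update bumps the saved tensors' version counters

stale_loss.backward()  # RuntimeError: one of the variables needed for
                       # gradient computation has been modified by an
                       # inplace operation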
Hi,
It seems the current code lacks documentation. I just want to implement OAC, but I do not know exactly how to put the code together to do so. I would appreciate it if you could make it clearer how people can use your code for OAC; there is currently little documentation on this.
Hi Quan,
I came across your paper and found it interesting. One of the doubts I have is with the implementation of the optimistic policies: why are you computing gradients of the upper bound w.r.t. the pre-tanh value of the policy? As per the paper, isn't it supposed to be w.r.t. the deterministic action (the output of the tanh policy)?
Regards,
Kartik
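For what it's worth, the two gradients differ only by the tanh Jacobian, so either choice carries the same information up to a per-dimension scaling. A small check of that chain-rule relationship (my own sketch, not the authors' answer):

import torch

u = torch.randn(3, requires_grad=True)  # pre-tanh value
a = torch.tanh(u)                       # deterministic (squashed) action
q = (a * a).sum()                       # stand-in for the upper-bound Q

(grad_u,) = torch.autograd.grad(q, u, retain_graph=True)
(grad_a,) = torch.autograd.grad(q, a)
# Chain rule: dQ/du = dQ/da * (1 - tanh(u)^2)
assert torch.allclose(grad_u, grad_a * (1 - torch.tanh(u) ** 2))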
oac-explore/plotting/plot_against_baseline.py
Line 130 in 715db5a
I ran the code with the Walker2d environment and got only around a 3,000 score at 1M steps, while the score presented in the paper is over 4,000.
Hello, in the SAC paper "Soft Actor-Critic Algorithms and Applications" the loss of alpha is calculated as:
J(alpha) = E[-alpha * (log(pi) + H)]
However, in your implementation (line 109 of trainer.py), the loss of alpha is instead:
J(alpha) = E[-log(alpha) * (log(pi) + H)]
I am curious why the loss is calculated in this way. I have searched GitHub for a couple of PyTorch-based SAC implementations and they all calculate the loss this way, but the TensorFlow-based SAC implementations calculate J(alpha) in the same way as the SAC paper (https://github.com/rail-berkeley/softlearning/blob/master/softlearning/algorithms/sac.py). The TensorFlow implementations still calculate the gradient with respect to log(alpha), but when calculating the loss J(alpha) they use exp(log(alpha)) (which is alpha) instead of log(alpha).
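A side-by-side sketch of the two parameterizations (my illustration; the target_entropy value and the stand-in log_pi are assumptions, not either repo's exact code). Since alpha = exp(log_alpha) > 0, both losses push log_alpha in the same direction; their gradients differ only by the positive factor alpha, so they share stationary points but not effective step size:

import torch

log_alpha = torch.zeros(1, requires_grad=True)
target_entropy = -6.0            # e.g. -action_dim (an assumption)
log_pi = torch.tensor([-3.0])    # stand-in for a batch of policy log-probs

# Variant used in this repo (PyTorch-style): multiply by log(alpha)
loss_log = -(log_alpha * (log_pi + target_entropy).detach()).mean()

# Variant matching the SAC paper (TF softlearning-style): multiply by alpha
loss_exp = -(log_alpha.exp() * (log_pi + target_entropy).detach()).mean()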
Hi, thank you for your code, it's perfect. I want to know how to reproduce Figure 7 in your paper.
Thank you for your code. Can you tell me how to deal with this error?
/home/f/anaconda3/envs/f/bin/python /home/f/Downloads/oac-explore-master/main.py
Traceback (most recent call last):
  File "/home/f/Downloads/oac-explore-master/main.py", line 219, in <module>
    variant['log_dir'] = get_log_dir(args)
  File "/home/f/Downloads/oac-explore-master/main.py", line 165, in get_log_dir
    get_current_branch('./'),
  File "/home/f/Downloads/oac-explore-master/main.py", line 35, in get_current_branch
    repo = Repo(dir)
  File "/home/f/anaconda3/envs/f/lib/python3.7/site-packages/git/repo/base.py", line 181, in __init__
    raise InvalidGitRepositoryError(epath)
git.exc.InvalidGitRepositoryError: /home/f/Downloads/oac-explore-master
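Reading the traceback, the likely cause (an interpretation, not a confirmed answer) is that the code was downloaded as a zip (note the oac-explore-master folder name), so there is no .git directory and GitPython's Repo('./') raises InvalidGitRepositoryError. Cloning the repo with git instead should avoid it; alternatively, a guard like the following sketch (the fallback label is hypothetical) would let main.py run from a plain download:

from git import Repo
from git.exc import InvalidGitRepositoryError

def get_current_branch(dir):
    try:
        return Repo(dir).active_branch.name
    except InvalidGitRepositoryError:
        return 'no-git'  # hypothetical fallback for non-repo checkouts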