pku-epic / unidexgrasp2

UniDexGrasp++: Improving Dexterous Grasping Policy Learning via Geometry-aware Curriculum and Iterative Generalist-Specialist Learning

License: MIT License

Python 99.91% Shell 0.09%

unidexgrasp2's Issues

The code

Hi, thanks for the great work. May I ask when the code will be released?

[Bug] DAggerValue, ppo_buffer adds next_obs instead of current_obs

# Record the transition
self.storage.add_transitions(current_obs, actions_expert, rews, dones)
current_obs.copy_(next_obs)

# value_net
if self.apply_value_net:
    self.ppo_buffer.add_transitions(
        current_obs, current_states, actions, rews, 
        dones, values, actions_log_prob, mu, sigma,
    )
    current_states.copy_(next_states)

The current_obs variable is overwritten (current_obs.copy_(next_obs)) before the transition is added to ppo_buffer, so ppo_buffer actually stores next_obs instead of current_obs.
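The ordering problem can be reproduced without the repository. The sketch below is a minimal plain-Python illustration, not code from the project: RolloutBuffer and collect are hypothetical names, lists stand in for tensors, and the slice assignment plays the role of the in-place copy_.

```python
class RolloutBuffer:
    """Hypothetical stand-in for ppo_buffer: keeps a snapshot of each observation."""

    def __init__(self):
        self.observations = []

    def add_transitions(self, obs):
        # Store a copy so later in-place writes do not change the record.
        self.observations.append(list(obs))


def collect(steps, record_before_update):
    """Roll out `steps` fake environment steps and return what the buffer stored."""
    current_obs = [0]  # observation at time t = 0
    buf = RolloutBuffer()
    for t in range(steps):
        next_obs = [t + 1]  # pretend env.step() returned observation t + 1
        if record_before_update:
            buf.add_transitions(current_obs)  # records observation t
            current_obs[:] = next_obs         # in-place overwrite, like copy_
        else:
            current_obs[:] = next_obs         # overwritten first (order reported in this issue)
            buf.add_transitions(current_obs)  # records observation t + 1
    return buf.observations


print(collect(3, record_before_update=True))
print(collect(3, record_before_update=False))
```

In this toy version, recording before the overwrite stores observations 0, 1, 2, while recording after it stores 1, 2, 3, which is the off-by-one pairing described above. Moving the add_transitions call ahead of the copy_ is one possible fix under that assumption.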

A Question About Dagger Value Algorithm

Hi, I have a question about the DAgger-value algorithm: when updating the value network, why do you use torch.max() to take the larger of the two losses?

What is the point of comparing these two losses? As I understand it, the clipped value loss is there to keep training stable, but in that case why take the max rather than the min, or why not just use value_losses_clipped directly?

UniDexGrasp2/dexgrasp/algorithms/rl/dagger_value/dagger.py

Lines 433 to 437 in a223e62

clip_range = self.value_loss_cfg['clip_range']
value_clipped = target_values_batch + (value_batch - target_values_batch).clamp(-clip_range, clip_range)
value_losses = (value_batch - returns_batch).pow(2)
value_losses_clipped = (value_clipped - returns_batch).pow(2)
value_loss = torch.max(value_losses, value_losses_clipped).mean()

In the code above, target_values_batch is the student critic's value from before the learning epochs, and value_batch is the student critic's value during the learning epochs.

UniDexGrasp2/dexgrasp/algorithms/rl/dagger_value/dagger.py

Line 272 in a223e62

 self.ppo_buffer.add_transitions(current_obs, current_states, actions, rews, dones, values, actions_log_prob, mu, sigma) 
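To make the comparison concrete, here is a scalar, plain-Python sketch of the arithmetic in the quoted lines. The function name clipped_value_loss and the sample numbers are invented for illustration, and floats stand in for tensors; it only reports which of the two terms torch.max would select and is not taken from the repository.

```python
def clamp(x, lo, hi):
    """Scalar equivalent of Tensor.clamp(lo, hi)."""
    return max(lo, min(hi, x))


def clipped_value_loss(value, old_value, ret, clip_range):
    """Return (loss, branch) for one sample, mirroring the quoted loss.

    value     -> value_batch (current critic output)
    old_value -> target_values_batch (critic output stored during rollout)
    ret       -> returns_batch
    """
    value_clipped = old_value + clamp(value - old_value, -clip_range, clip_range)
    loss_unclipped = (value - ret) ** 2
    loss_clipped = (value_clipped - ret) ** 2
    if loss_clipped > loss_unclipped:
        return loss_clipped, "clipped"
    return loss_unclipped, "unclipped"


# Old value 0.0, return 1.0, clip range 0.25 (illustrative numbers only).
for new_value in (0.5, -0.5, 0.125):
    print(new_value, clipped_value_loss(new_value, 0.0, 1.0, 0.25))
```

With these numbers, a new value that has moved past the clip range toward the return (0.5) selects the clipped term, which no longer depends on the new value once the clamp saturates and so contributes no gradient. A new value that has moved past the range away from the return (-0.5) selects the unclipped term, so the critic is still pulled back. Taking the min, or using value_losses_clipped alone, would lose that corrective signal in the second case, which is one common reading of why the max is used.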

