hjsuh94 / score_po
Score-Guided Planning
We can create the example in Drake to generate the dynamics data.
What do we want out of our experiments? In the setting of offline RL, we want our algorithm to (1) produce plans that actually work under the true dynamics, and (2) avoid being overly conservative, so that it still achieves good performance.
On low-dimensional examples like cartpole and keypoints, this could be a tricky balance. If the dynamics is trained "too well", distribution risk will make the optimizer more conservative and we don't achieve (2). On the other hand, if the dynamics is trained too badly, we don't achieve (1).
So in order to answer the question of "when does distribution risk help?", we will need to actively find cases where the learned dynamics is accurate near the training data but unreliable away from it, so that the distribution-risk term meaningfully steers the optimizer.
It seems that most diffusion papers use a U-Net (or a U-Net with FiLM structure for the conditional input) instead of an MLP for the diffusion model. We can consider adding our own U-Net, e.g. with FiLM blocks like the sketch below.
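For reference, here is a minimal sketch of a FiLM-conditioned block, assuming PyTorch; the names (`FiLMBlock`, `dim_cond`) are hypothetical, not the repo's API. The conditioning input produces a per-feature scale and shift that modulate the hidden activations, which is how conditional U-Nets typically consume the conditioning signal.

```python
# Hypothetical sketch of FiLM conditioning (not the repo's actual API).
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    def __init__(self, dim_hidden, dim_cond):
        super().__init__()
        self.fc = nn.Linear(dim_hidden, dim_hidden)
        # The conditioning vector is mapped to a per-feature (scale, shift).
        self.film = nn.Linear(dim_cond, 2 * dim_hidden)

    def forward(self, h, cond):
        scale, shift = self.film(cond).chunk(2, dim=-1)
        return torch.relu(scale * self.fc(h) + shift)
```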
I think when we use the "direct collocation" formulation

min c(x₁, ..., xₙ, u₁, ..., uₙ) − β Σᵢ log p(xᵢ, uᵢ, xᵢ₊₁)
although this objective function looks like Janner's approach, in practice our approach is easier, for the following reason:
In Janner's approach, they need to train a classifier to guide the diffusion process. Note that they cannot use the cost function exp(−c(x₁ᵗ, ..., xₙᵗ, u₁ᵗ, ..., uₙᵗ)) as this guiding classifier directly, but have to train a separate classifier. (Here the superscript t on x₁ᵗ denotes the t-th step of denoising, not the t-th step in the planning horizon.) The reason is that during the denoising stage, the trajectory x₁ᵗ, ..., xₙᵗ, u₁ᵗ, ..., uₙᵗ contains a lot of noise, and what the classifier needs to predict is the probability of the denoised trajectory x₁⁰, ..., xₙ⁰, u₁⁰, ..., uₙ⁰ being optimal, not the optimality of the noisy trajectory.
So to train this guiding classifier, they start with a noise-free trajectory x₁⁰, ..., xₙ⁰, u₁⁰, ..., uₙ⁰, inject noise into it for multiple steps, pair the resulting noisy trajectory x₁ᵗ, ..., xₙᵗ, u₁ᵗ, ..., uₙᵗ with the target probability exp(−c(x₁⁰, ..., xₙ⁰, u₁⁰, ..., uₙ⁰)), and train a classifier model through regression. This is extra effort to train the classifier, while we just use the cost function c(x₁, ..., xₙ, u₁, ..., uₙ) directly.
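To make the contrast concrete, here is a minimal sketch of what using the cost directly could look like during annealed Langevin denoising; `score_model`, `cost`, and `sigmas` are assumed interfaces, not the repo's actual API, and this is a sketch rather than a tested implementation.

```python
# Hypothetical sketch: guide denoising with the cost gradient directly,
# instead of training a separate noise-conditioned classifier.
import torch

def guided_denoise(score_model, cost, tau, sigmas, beta=1.0, step=1e-3, n_inner=10):
    # tau: flattened trajectory (x_1..x_N, u_1..u_N); sigmas: large -> small.
    for sigma in sigmas:
        alpha = step * (sigma / sigmas[-1]) ** 2   # NCSN-style step size scaling
        for _ in range(n_inner):
            tau = tau.detach().requires_grad_(True)
            grad_c = torch.autograd.grad(cost(tau), tau)[0]
            with torch.no_grad():
                s = score_model(tau, sigma)        # learned score of the data
                noise = torch.randn_like(tau)
                tau = tau + alpha * (s - beta * grad_c) + (2 * alpha) ** 0.5 * noise
    return tau.detach()
```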
This also raises the question of whether we should consider a classifier-free planning approach, such as https://arxiv.org/pdf/2211.15657.pdf.
I don't exactly understand the part
score_po/score_po/dynamical_system.py
Lines 70 to 84 in feaf67a
Also I think we can make the parent class DynamicalSystem an abstract class by using the abc module, something like this:

```python
import abc

class DynamicalSystem(abc.ABC):
    @abc.abstractmethod
    def dynamics(self, x, u):
        """Given state x and control u, return the next state."""
```
Dynamics Classes:
Policy Classes:
Policy Optimizer Classes:
Line 4 in feaf67a
Overall, we should try to focus our efforts towards what's necessary for the paper.
DataDistance vs. ScoreMatching - do we also want to show that optimal control w/ data distance penalty is empirically equivalent to modifying gradients with score matching? (A quick derivation sketch follows this list.)
NoiseConditionedEstimation - do we want to anneal variances or train with a single variance?
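On the first question: under a Gaussian-KDE model of the data distribution with a softmin definition of data distance (my own assumptions for this sketch, not the repo's definitions), the two are equivalent up to a factor of σ²:

```latex
% Sketch under a Gaussian-KDE assumption on the data.
% Smoothed data density and its score:
p_\sigma(x) = \frac{1}{N}\sum_i \mathcal{N}(x;\, x_i,\, \sigma^2 I),
\qquad
\nabla_x \log p_\sigma(x) = \frac{1}{\sigma^2}\sum_i w_i(x)\,(x_i - x),
\quad
w_i(x) = \operatorname{softmax}_i\!\Big({-\tfrac{\|x - x_i\|^2}{2\sigma^2}}\Big).
% Softmin data distance:
d_\sigma(x) = -\sigma^2 \log \sum_i \exp\!\Big({-\tfrac{\|x - x_i\|^2}{2\sigma^2}}\Big)
\;\;\Rightarrow\;\;
\nabla_x d_\sigma(x) = -\sigma^2\, \nabla_x \log p_\sigma(x).
% So a data-distance penalty modifies gradients exactly like a (scaled) score term.
```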
What's the minimum set of questions that we want to answer for the experiment? Goal: let's not try to do something too impressive or overly complicated, but have a crisp set of empirical experiments that prove the point.
3.1. Illustrative (and very analyzable) low-dimensional example where distribution risk is helpful. (Hongkai)
3.2. Comparison against existing methods on complex examples (Terry)
3.3 Scalability of our method to pixel-based control problems (Glen / Lu)
I stayed up all night and made a series of shocking revelations.
I think it's quite important to compare our method against other offline RL methods such as CQL / MOPO. One of the best ways to do that is to run on D4RL, since the original paper already provides scores for various methods on the benchmarks; instead of implementing all the baselines ourselves, we can just run our own method and record it on the scoreboard.
Specifically, let's focus on the half-cheetah environment. What I thought we needed to do is load the dataset from D4RL, train the dynamics and score function, and simply run our method. But there is a big catch: the Mujoco environments in gym give us the reward when we call env.step(action) by internally calling the simulator. But if we roll out the learned dynamics, there is no way to get this reward, since the gym wrapper doesn't give us the reward as a function of the state x.
I thought one way to get around this would be to simply go look at how the rewards are defined in Mujoco gym and just write our own reward function explicitly as a function of x. For this environment, it seems simple enough: it's a sum of quadratic cost over actions and a terminal reward on how far it reached, which should be covered by our QuadraticCost: https://github.com/openai/gym/blob/dcd185843a62953e27c2d54dc8c2d647d604b635/gym/envs/mujoco/half_cheetah.py#L30
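As a sanity check, here is a hedged sketch of what that explicit reward could look like, mirroring the linked half_cheetah.py; it assumes x[0] is the root x-position of the full MuJoCo state, and it is an illustration rather than the repo's QuadraticCost.

```python
# Hypothetical reconstruction of the HalfCheetah reward as a function of
# full state x and action u, following gym's half_cheetah.py.
import numpy as np

def half_cheetah_reward(x_before, x_after, u, dt=0.05, ctrl_weight=0.1):
    reward_run = (x_after[0] - x_before[0]) / dt     # forward progress per step
    reward_ctrl = -ctrl_weight * np.square(u).sum()  # quadratic action cost
    return reward_run + reward_ctrl
```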
But RL folks are inconceivably weird. Even though they claim the Mujoco environments are "fully observable", when we get observations from env.step, they purposely left out the x position from half-cheetah and included everything else: https://github.com/openai/gym/blob/dcd185843a62953e27c2d54dc8c2d647d604b635/gym/envs/mujoco/half_cheetah_v4.py#L15, which means there is no way to compute the reward from the observations they provide. Why did they do this?? Even beyond computing the reward, what does learning dynamics on this environment even mean if they don't give us the full state?
It seems previous works (MBPO / MOPO) get around this by also learning the reward as a function of (x,u): https://github.com/tianheyu927/mopo/blob/e5efe3eaa548850af109c920bfc2c1ec96bf2285/mopo/models/constructor.py#L53
https://github.com/jannerm/mbpo/blob/ac694ff9f1ebb789cc5b3f164d9d67f93ed8f129/mbpo/models/constructor.py#L27
and then computing the value function under the learned reward and the learned dynamics:
https://github.com/tianheyu927/mopo/blob/e5efe3eaa548850af109c920bfc2c1ec96bf2285/mopo/models/fake_env.py#L73
I think we can do the same thing: learn the reward function for the gym environment and still append the score-function gradients. But this has been very confusing from a robotics point of view.
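A minimal sketch of that idea, assuming PyTorch (the class and names are hypothetical, modeled loosely on the MBPO/MOPO constructors linked above): one network predicts both the next observation and the reward from (x, u), so rollouts of the learned model also produce rewards.

```python
# Hypothetical joint dynamics-and-reward model in the spirit of MBPO/MOPO.
import torch
import torch.nn as nn

class DynamicsAndReward(nn.Module):
    def __init__(self, dim_x, dim_u, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_u, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim_x + 1),  # [state delta, reward]
        )

    def forward(self, x, u):
        out = self.net(torch.cat([x, u], dim=-1))
        x_next = x + out[..., :-1]   # predict the state delta, as MBPO does
        reward = out[..., -1]
        return x_next, reward
```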
Remaining list of implementations that are not general to input dimensions:
- ScoreEstimator
- NoiseConditionedScoreEstimator
- DataDistanceEstimator
- PolicyOptimizer
If we frame our approach as doing "uncertainty-aware planning with learned dynamics", we can broadly classify different methods by gradient order (zeroth vs. first), trajectory parametrization (single shooting vs. direct collocation), and uncertainty metric (ensembles vs. data distance). We want to convincingly show the benefits of our method (first-order + direct collocation + data distance) as opposed to popular planning approaches like MPPI with ensemble variance (zeroth-order + shooting + ensembles). The different options are summarized as follows:
|  |  | Single Shooting | Direct Collocation |
|---|---|---|---|
| First-order | Ensembles |  |  |
|  | DataDistance | DRisk Trajopt | Diffusion Planning |
| Zeroth-order | Ensembles | MPPI w/ Ensembles |  |
|  | DataDistance |  |  |
We can try two versions.
How do we actually train a Noise Conditioned Score Estimator?
The most straightforward / naïve way would be to append one more dimension to the input of the network (i.e., do MLP(4, 3) instead of MLP(3, 3)), as in the sketch below.
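A minimal sketch of that naïve option, assuming PyTorch (names hypothetical): sigma is appended as one extra input feature per sample.

```python
# Hypothetical naive noise-conditioned score estimator: MLP(dim + 1, dim).
import torch
import torch.nn as nn

class NaiveNCSE(nn.Module):
    def __init__(self, dim=3, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z, sigma):
        # Broadcast the scalar noise level to one extra column of the input.
        s = torch.full((z.shape[0], 1), float(sigma), device=z.device)
        return self.net(torch.cat([z, s], dim=-1))
```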
It seems that for the diffusion papers, there were some interesting choices of architecture we should consider adopting. In the original implementation (this repo), it's interesting to see that the value of sigma is never really used as an input to the network. Instead, if we look at this implementation of anneal_dsm_score_estimation, the labels are defined as integer variables here. So this network takes integers as input, as opposed to the actual values of the sigmas.
After getting this integer label, data and label are both passed into ConditionalResidualBlock, which normalizes the (data, label) pairs. This normalization is done by using nn.Embedding to convert the integer token label into a feature vector. Then, this feature vector is multiplied with the batch-normalized data. This is repeated through multiple layers of the network.
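A minimal sketch of that conditioning trick, assuming PyTorch and modeled on (not copied from) NCSN's conditional normalization; names are hypothetical:

```python
# Hypothetical conditional normalization: the integer noise-level label indexes
# an nn.Embedding, and the embedded vector scales/shifts the normalized data.
import torch
import torch.nn as nn

class ConditionalNorm(nn.Module):
    def __init__(self, num_features, num_noise_levels):
        super().__init__()
        self.norm = nn.BatchNorm1d(num_features, affine=False)
        self.embed = nn.Embedding(num_noise_levels, 2 * num_features)

    def forward(self, h, label):
        # label: LongTensor of noise-level indices, shape (batch,)
        gamma, beta = self.embed(label).chunk(2, dim=-1)
        return gamma * self.norm(h) + beta
```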
TODOs:
Benchmarks:
sigma should be positive, but the default value is -3
score_po/score_po/score_matching.py
Line 131 in 646ab1b