
score_po's People

Contributors

hjsuh94, hongkai-dai, lujieyang


score_po's Issues

Experiments to try and include

  1. Toy problems that demonstrate distribution risk
     • Light vs. dark domain
  2. State-based problems
     • Cartpole dynamics
     • Keypoint dynamics
     • Yunzhu's GNN-style deformable objects
  3. Pixel-based problems
     • Carrots?
     • Image-based navigation (Glen's work)
     • Visual foresight with planar pushing

Thoughts on choosing experiments

What do we want out of our experiments? In the setting of offline RL, we want our algorithm to

  1. Achieve reasonable success on the task
  2. Show that adding distribution risk improves over vanilla RL with learned dynamics.

On low-dimensional examples like cartpole and keypoints, this could be a tricky balance. If the dynamics model is trained "too well", distribution risk only makes the optimizer more conservative and we don't achieve 2. On the other hand, if the dynamics model is trained too badly, we don't achieve 1.

So in order to answer the question of "when does distribution risk help?", we will need to actively find cases where:

  1. Interpolative regime: have a "reasonably bad" dynamics model where, if we land on a correct sequence of samples, we can still achieve the task.
  2. Extrapolative regime: create a tension between optimality and safety by forcing the optimal trajectory to leave the support of the data.

Adding U-Net for diffusion model

It seems that most diffusion papers use U-Net (or U-Net with FiLM structure for conditional input) instead of MLP for the diffusion model. We can consider adding our own U-Net.
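
As a reference point, here is a minimal sketch (hypothetical names, not the repo's API) of the FiLM-style conditioning block those U-Nets typically use: the conditional input produces a per-channel scale and shift applied to intermediate features.

```python
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    """Feature-wise linear modulation: condition vector -> per-channel (scale, shift)."""

    def __init__(self, num_channels, cond_dim):
        super().__init__()
        self.conv = nn.Conv1d(num_channels, num_channels, kernel_size=3, padding=1)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, x, cond):
        # x: (B, C, T) features, cond: (B, cond_dim) conditioning input
        h = self.conv(x)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return h * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)
```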

Compare against Janner's approach

I think when we use the "direct collocation" formulation

min c(x₁, ..., xₙ, u₁, ..., uₙ) + β Σᵢ log p(xᵢ, uᵢ, xᵢ₊₁)

although this objective function looks like Janner's approach, in practice our approach is easier, for the following reason:

In Janner's approach, they need to train a classifier to guide the diffusion process. Note that they cannot use the cost function exp(c(x₁ᵗ, ..., xₙᵗ, u₁ᵗ, ..., uₙᵗ)) directly as this guidance classifier, but have to train a separate classifier. (Here I use the superscript t on x₁ᵗ to denote the t-th step of denoising, not the t-th step of the planning horizon.) The reason is that during the denoising stage the trajectory x₁ᵗ, ..., xₙᵗ, u₁ᵗ, ..., uₙᵗ contains a lot of noise, and what the classifier needs to predict is the probability of the denoised trajectory x₁⁰, ..., xₙ⁰, u₁⁰, ..., uₙ⁰ being optimal, not the optimality of the noisy trajectory.
So to train this guidance classifier, they start from a noise-free trajectory x₁⁰, ..., xₙ⁰, u₁⁰, ..., uₙ⁰ and inject noise into it for multiple steps; they then pair the noisy trajectory x₁ᵗ, ..., xₙᵗ, u₁ᵗ, ..., uₙᵗ with the target probability exp(c(x₁⁰, ..., xₙ⁰, u₁⁰, ..., uₙ⁰)) and train a classifier model through regression. This is extra effort spent training the classifier, whereas we just use the cost function c(x₁, ..., xₙ, u₁, ..., uₙ) directly.

This also raises the question of whether we should consider a classifier-free planning approach, such as https://arxiv.org/pdf/2211.15657.pdf.
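
For concreteness, here is a minimal sketch (hypothetical names such as `guided_step`, `score_net`, and `cost_fn` are assumptions, not the repo's API) of the "use the cost directly" idea described above: each iteration differentiates the analytic cost and adds the learned score, so no separate guidance classifier is trained on noisy trajectories.

```python
import torch

def guided_step(z, score_net, cost_fn, beta, step_size=1e-2):
    """One gradient step on the objective c(z) + beta * log p(z) written above.
    z stacks (x_1..x_N, u_1..u_N); score_net(z) approximates grad_z log p(z)."""
    z = z.detach().requires_grad_(True)
    cost = cost_fn(z)                       # analytic trajectory cost, a scalar
    (cost_grad,) = torch.autograd.grad(cost, z)
    with torch.no_grad():
        objective_grad = cost_grad + beta * score_net(z)
        z_new = z - step_size * objective_grad
    return z_new
```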

Question about dynamics training

I don't exactly understand this part:

```python
def evaluate_dynamic_loss(self, data, labels, sigma):
    """
    Evaluate L2 loss.
    data_samples:
        data of shape (B, dim_x + dim_u + dim_x)
        sigma: vector of dim_x + dim_u used for data augmentation.
    """
    B = data.shape[0]
    noise = torch.normal(0, sigma, size=data.shape)
    databar = data + noise
    pred = self.dynamics_batch(
        databar[:, : self.dim_x], databar[:, self.dim_x :], eval=False
    )  # B x dim_x
    loss = 0.5 * ((labels - pred) ** 2).sum(dim=-1).mean(dim=0)
    return loss
```
Why do we want to inject noise here?

Also, I think we can make the parent class DynamicalSystem an abstract class using the abc module, something like this:

```python
import abc

class DynamicalSystem(abc.ABC):
    @abc.abstractmethod
    def dynamics(self, x, u):
        ...
```

Design Pattern Notes: Dynamics and Policy Class Coverage

Dynamics Classes:

  • Neural Network (Learned Dynamics)
  • Non-differentiable Physics Simulation
  • Custom differentiable Physics Simulation

Policy Classes:

  • Open-loop Policy (Trajectory Optimization)
  • Time-varying State-Feedback Policy
  • Neural Network Policy

Policy Optimizer Classes:

  • First-Order Policy Gradient
  • Zeroth-Order Policy Gradient w/ Parameter Perturbation (see the sketch below)
  • Zeroth-Order Policy Gradient w/ Output Perturbation
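
For the parameter-perturbation option listed above, a minimal sketch (hypothetical names; `rollout_cost` is an assumed callback) of the Gaussian-smoothing gradient estimator such an optimizer would implement:

```python
import torch

def zeroth_order_param_gradient(theta, rollout_cost, sigma=0.1, num_samples=32):
    """Monte-Carlo estimate of grad_theta E[rollout_cost(theta)] from Gaussian
    parameter perturbations; rollout_cost returns the scalar cost of rolling out
    the policy with parameters theta (no differentiation of the rollout needed)."""
    grad = torch.zeros_like(theta)
    for _ in range(num_samples):
        eps = sigma * torch.randn_like(theta)
        # antithetic (two-sided) samples reduce the variance of the estimate
        delta = rollout_cost(theta + eps) - rollout_cost(theta - eps)
        grad += delta * eps / (2.0 * sigma**2)
    return grad / num_samples
```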

Topics for Meeting 04/24

Overall, we should try to focus our efforts towards what's necessary for the paper.

  1. DataDistance vs. ScoreMatching - do we also want to show that optimal control w/ data distance penalty is empirically equivalent to modifying gradients with score matching?

  2. NoiseConditionedEstimation - do we want to anneal variances or train with a single variance?

  3. What's the minimum set of questions that we want to answer for the experiment? Goal: let's not try to do something too impressive or overly complicated, but have a crisp set of empirical experiments that prove the point.

    • 3.1. Illustrative (and very analyzable) low-dimensional example where distribution risk is helpful. (Hongkai)

      • We want to show that optimizing without distribution risk leads trajectories to go out of the data distribution.
      • We want to show that outside of this data distribution, the dynamics is inaccurate, forcing the planner to not do well.
      • We want to show that with distribution risk, the trajectories stay inside the data distribution, leading to better performance.
      • [Optional] using data distance leads to similar performance with score-matching.
      • The right experiment: Hongkai's corridor? perhaps even with single-integrator dynamics?
    • 3.2. Comparison against existing methods on complex examples (Terry)

      • We want to show that, compared to existing approaches (MOPO / CQL), our method achieves comparable (or better) performance on existing benchmarks. (For MOPO, comparable is acceptable since ensembles take a long time to train.)
      • The right experiment: D4RL MuJoCo / Adroit tasks
    • 3.3. Scalability of our method to pixel-based control problems (Glen / Lu)

      • We want to leverage the scalability of denoising score matching to show that we achieve good performance on pixel-based problems.
      • The right experiment: TBD

D4RL Mujoco Experiments

I spent all night making a series of shocking revelations.

I think it's quite important to compare our method against other offline RL methods such as CQL / MOPO. One of the best ways to do that is to run D4RL since the original paper already provides various methods for the benchmarks, so instead of implementing all the baselines, we can just run our own and record the scoreboard.

Specifically, let's focus on the half-cheetah environment. What I thought we needed to do is load the dataset from D4RL, train the dynamics and score function, and simply run our method. But there is a big catch: the MuJoCo environments in gym give us the reward when we call env.step(action) by internally calling the simulator. But if we roll out the learned dynamics, there is no way to get this reward, since the gym wrapper doesn't give us the reward as a function of the state x.

I thought one way to get around this would be to simply look at how the rewards are defined in MuJoCo gym and write our own reward function explicitly as a function of x. For this environment, it seems simple enough: it's a sum of a quadratic cost on the actions and a reward for how far it travels, which should be covered by our QuadraticCost: https://github.com/openai/gym/blob/dcd185843a62953e27c2d54dc8c2d647d604b635/gym/envs/mujoco/half_cheetah.py#L30

But RL folks are inconceivably weird. Even though they claim the MuJoCo environments are "fully observable", the observation returned by env.step purposely leaves out the x position of half-cheetah while including everything else: https://github.com/openai/gym/blob/dcd185843a62953e27c2d54dc8c2d647d604b635/gym/envs/mujoco/half_cheetah_v4.py#L15, which means there is no way to compute the reward from the observation they provide. Why did they do this?? Even beyond computing the reward, what does learning dynamics on this environment even mean if they don't give us the full state?

It seems previous works (MBPO / MOPO) get around this by also learning the reward as a function of (x,u): https://github.com/tianheyu927/mopo/blob/e5efe3eaa548850af109c920bfc2c1ec96bf2285/mopo/models/constructor.py#L53
https://github.com/jannerm/mbpo/blob/ac694ff9f1ebb789cc5b3f164d9d67f93ed8f129/mbpo/models/constructor.py#L27
and then computing the value function under the learned reward and the learned dynamics:
https://github.com/tianheyu927/mopo/blob/e5efe3eaa548850af109c920bfc2c1ec96bf2285/mopo/models/fake_env.py#L73

I think we can do the same thing by learning the reward function for the gym environment and still appending the score-function gradients, but this has been very confusing from a robotics point of view.
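
A minimal sketch (assumed names, not MOPO's actual code) of that workaround: learn the reward as a function of (x, u) by regression on the offline transitions, so rollouts of the learned dynamics can be scored without calling env.step().

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """MLP mapping (x, u) -> scalar reward, trained on logged D4RL transitions."""

    def __init__(self, dim_x, dim_u, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_u, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, u):
        return self.net(torch.cat([x, u], dim=-1)).squeeze(-1)

def reward_loss(model, x, u, r):
    """Simple regression onto the rewards recorded in the offline dataset."""
    return ((model(x, u) - r) ** 2).mean()
```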

Generalize to Multiple Dimensions

Remaining list of implementations that are not general to input dimensions:

  • ScoreEstimator
  • NoiseConditionedScoreEstimator
  • DataDistanceEstimator
  • PolicyOptimizer

Diffusion Planning Comparison

If we frame our approach as doing "uncertainty-aware planning with learned dynamics", we can broadly classify different methods along three axes:

  1. Choice of gradient estimation: First vs. Zeroth-order.
    • first-order methods have lower variance, as zeroth-order gradient estimates suffer from dimension-dependent variance.
    • zeroth-order potentially has smoothing effects and is robust against exploding gradients.
  2. Choice of transcription: Single shooting vs. Direct collocation.
    • single-shooting with learned dynamics suffers from compounding error of autoregressive rollouts, unless dynamics is directly trained with simulation error.
    • single-shooting also requires differentiation through a long trajectory, which might suffer from gradient explosion.
    • direct collocation potentially overcomes some of these limitations, but is often expensive to implement (a sketch contrasting the two transcriptions follows the table below).
  3. Choice of uncertainty measure: Ensembles vs. GPs vs. DataDistance.
    • ensembles underestimate uncertainty
    • ensembles are compute-intensive to train
    • ensembles have spurious local minima in the uncertainty landscape and are not friendly to gradient-based optimization.

We want to convincingly show the benefits of our method (first-order + direct collocation + DataDistance) as opposed to popular planning approaches like MPPI with ensemble variance (zeroth-order + shooting + ensembles). The different options are summarized as follows:

| Gradient     | Uncertainty  | Single Shooting   | Direct Collocation |
|--------------|--------------|-------------------|--------------------|
| First-order  | Ensembles    |                   |                    |
| First-order  | DataDistance | DRisk Trajopt     | Diffusion Planning |
| Zeroth-order | Ensembles    | MPPI w/ Ensembles |                    |
| Zeroth-order | DataDistance |                   |                    |
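
For concreteness on the transcription axis, a minimal sketch (hypothetical names; `f` is a learned dynamics model assumed to handle batched inputs) contrasting the two choices: single shooting backpropagates through an autoregressive rollout, while direct collocation treats the states as decision variables and only penalizes per-step dynamics defects.

```python
import torch

def shooting_cost(f, x0, u_seq, cost_fn):
    """Single shooting: roll the learned dynamics f forward from x0; gradients flow
    through the whole autoregressive chain (compounding error, long backprop paths)."""
    x, states = x0, []
    for u in u_seq:                                # u_seq: (N, dim_u)
        x = f(x, u)
        states.append(x)
    return cost_fn(torch.stack(states), u_seq)

def collocation_cost(f, x_seq, u_seq, cost_fn, rho=10.0):
    """Direct collocation: x_seq (N+1, dim_x) are free decision variables; the dynamics
    enter only through defect penalties, so gradients stay local to each knot point."""
    defects = x_seq[1:] - f(x_seq[:-1], u_seq)     # per-step dynamics violations
    return cost_fn(x_seq[1:], u_seq) + rho * (defects ** 2).sum()
```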

Architecture Study for Noise Conditioning

How do we actually train a Noise Conditioned Score Estimator?

The most straightforward / naïve way would be to append one more dimension to the input of the network. (do MLP(4, 3) instead of MLP(3,3)).

It seems that the diffusion papers made some interesting architecture choices that we should consider adopting. In the original implementation in this repo, it's interesting to see that the value of sigma is never really used as an input to the network.

Instead, if we look at this implementation of anneal_dsm_score_estimation, the labels are defined as integer variables. So the network takes integers as input, as opposed to the actual values of sigma.

After getting this integer label, data and label are both passed into ConditionalResidualBlock, which normalizes the (data, label) pairs.

This normalization is done by using nn.Embedding to convert the integer label into a feature vector. This feature vector is then multiplied with the batch-normalized data, and this is repeated multiple times through the deep layers.
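
A minimal sketch (hypothetical names, not the referenced repo's code) of that conditioning mechanism: the integer noise-level label is embedded and used to scale the normalized features.

```python
import torch
import torch.nn as nn

class ConditionalLinearBlock(nn.Module):
    """Sigma-index conditioning: embed the integer noise-level label and scale
    the batch-normalized features by the resulting vector."""

    def __init__(self, dim_in, dim_out, num_noise_levels):
        super().__init__()
        self.linear = nn.Linear(dim_in, dim_out)
        self.norm = nn.BatchNorm1d(dim_out)
        # one learned feature vector per noise level (integer label)
        self.embed = nn.Embedding(num_noise_levels, dim_out)

    def forward(self, x, noise_label):
        h = self.norm(self.linear(x))
        gamma = self.embed(noise_label)   # (B, dim_out)
        return torch.relu(gamma * h)      # feature-wise scaling by the label embedding
```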

Meeting w/ Abhishek

  1. The fundamental claim is that score matching / diffusion models can draw you to within-data-distribution regimes.
     • What is the extent to which I can get o.o.d. before diffusion models start to be less effective?
     • Especially in high dimensions, is this true?
     • People have thought about doing this for anomaly detection; maybe it's interesting to make connections:
       • https://openreview.net/forum?id=5tKhUU5WBi8
       • https://arxiv.org/pdf/2211.07740.pdf
     • This claim is generalizable beyond the setting of model-based offline RL, and would apply to any kind of model-based optimization? (where we have uncertainty over cost functions)
  2. Why use score-based generative models as opposed to ensembles?

TODOs:

  • Produce a simple example of a diffusion-based score field trying to pull samples within distribution.
  • Produce a simple example where diffusion-based score fields do better than ensembles.
  • Compare with CQL: why do model-based?

Benchmarks:

  • If we have some convincing cases where this beats ensembles, benchmarks might not be too necessary.

Baselines & Selling Points

  1. Why Model-Based?
     • It's possible to be more data-efficient, although model-free might have better asymptotic performance
     • Models allow easily injecting inductive biases
  2. What about other generative models?
     • VAE: taking gradients is not straightforward
     • Denoising AE
     • Normalizing Flows
  3. What about other planners / policy optimizers that use diffusion?
     • Janner's approach of directly doing diffusion on the trajectory level
     • MPPI
  4. What if we don't include distribution risk / uncertainty?
     • Policy gradient with vs. without distribution risk
     • MPPI / SGD for planning problems
  5. What about other approaches that tackle similar distribution risk problems?
     • Compare against MOPO, which includes ensembles and variance
