hjsuh94 / score_po
Score-Guided Planning
We can create the example in Drake to generate the dynamics data.
What do we want out of our experiments? In the setting of offline RL, we want our algorithm to (1) produce plans that actually work under the true dynamics, and (2) avoid being overly conservative, so that it still achieves good performance.
On low-dimensional examples like cartpole and keypoints, this could be a tricky balance. If the dynamics is trained "too well", distribution risk will make the optimizer more conservative and we don't achieve (2). On the other hand, if the dynamics is trained too badly, we don't achieve (1).
So in order to answer the question of "when does distribution risk help?", we will need to actively find cases where the learned dynamics is accurate near the training data but unreliable away from it, so that the distribution-risk term meaningfully steers the optimizer.
It seems that most diffusion papers use a U-Net (or a U-Net with FiLM structure for the conditional input) instead of an MLP for the diffusion model. We can consider adding our own U-Net, e.g. with FiLM blocks like the sketch below.
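For reference, here is a minimal sketch of a FiLM-conditioned block, assuming PyTorch; the names (`FiLMBlock`, `dim_cond`) are hypothetical, not the repo's API. The conditioning input produces a per-feature scale and shift that modulate the hidden activations, which is how conditional U-Nets typically consume the conditioning signal.

```python
# Hypothetical sketch of FiLM conditioning (not the repo's actual API).
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    def __init__(self, dim_hidden, dim_cond):
        super().__init__()
        self.fc = nn.Linear(dim_hidden, dim_hidden)
        # The conditioning vector is mapped to a per-feature (scale, shift).
        self.film = nn.Linear(dim_cond, 2 * dim_hidden)

    def forward(self, h, cond):
        scale, shift = self.film(cond).chunk(2, dim=-1)
        return torch.relu(scale * self.fc(h) + shift)
```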
I think when we use the "direct collocation" formulation

min c(x₁, ..., xₙ, u₁, ..., uₙ) − β Σᵢ log p(xᵢ, uᵢ, xᵢ₊₁)
although this objective function looks like Janner's approach, in practice our approach is easier, for the following reason:
In Janner's approach, they need to train a classifier to guide the diffusion process. Note that they cannot use the cost function exp(−c(x₁ᵗ, ..., xₙᵗ, u₁ᵗ, ..., uₙᵗ)) as this guiding classifier directly, but have to train a separate classifier. (Here the superscript t on x₁ᵗ denotes the t-th step of denoising, not the t-th step in the planning horizon.) The reason is that during the denoising stage, the trajectory x₁ᵗ, ..., xₙᵗ, u₁ᵗ, ..., uₙᵗ contains a lot of noise, and what the classifier needs to predict is the probability of the denoised trajectory x₁⁰, ..., xₙ⁰, u₁⁰, ..., uₙ⁰ being optimal, not the optimality of the noisy trajectory.
So to train this guiding classifier, they start with a noise-free trajectory x₁⁰, ..., xₙ⁰, u₁⁰, ..., uₙ⁰, inject noise into it for multiple steps, pair the resulting noisy trajectory x₁ᵗ, ..., xₙᵗ, u₁ᵗ, ..., uₙᵗ with the target probability exp(−c(x₁⁰, ..., xₙ⁰, u₁⁰, ..., uₙ⁰)), and train a classifier model through regression. This is extra effort to train the classifier, while we just use the cost function c(x₁, ..., xₙ, u₁, ..., uₙ) directly.
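To make the contrast concrete, here is a minimal sketch of what using the cost directly could look like during annealed Langevin denoising; `score_model`, `cost`, and `sigmas` are assumed interfaces, not the repo's actual API, and this is a sketch rather than a tested implementation.

```python
# Hypothetical sketch: guide denoising with the cost gradient directly,
# instead of training a separate noise-conditioned classifier.
import torch

def guided_denoise(score_model, cost, tau, sigmas, beta=1.0, step=1e-3, n_inner=10):
    # tau: flattened trajectory (x_1..x_N, u_1..u_N); sigmas: large -> small.
    for sigma in sigmas:
        alpha = step * (sigma / sigmas[-1]) ** 2   # NCSN-style step size scaling
        for _ in range(n_inner):
            tau = tau.detach().requires_grad_(True)
            grad_c = torch.autograd.grad(cost(tau), tau)[0]
            with torch.no_grad():
                s = score_model(tau, sigma)        # learned score of the data
                noise = torch.randn_like(tau)
                tau = tau + alpha * (s - beta * grad_c) + (2 * alpha) ** 0.5 * noise
    return tau.detach()
```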
This also raises the question of whether we should consider a classifier-free planning approach, such as https://arxiv.org/pdf/2211.15657.pdf.
I don't exactly understand the part
score_po/score_po/dynamical_system.py
Lines 70 to 84 in feaf67a
Also I think we can make the parent class DynamicalSystem an abstract class by using the abc module, something like this:

```python
import abc

class DynamicalSystem(abc.ABC):
    @abc.abstractmethod
    def dynamics(self, x, u):
        """Given state x and control u, return the next state."""
```
Dynamics Classes:
Policy Classes:
Policy Optimizer Classes:
Line 4 in feaf67a
Overall, we should try to focus our efforts towards what's necessary for the paper.
DataDistance vs. ScoreMatching - do we also want to show that optimal control w/ data distance penalty is empirically equivalent to modifying gradients with score matching? (A quick derivation sketch follows this list.)
NoiseConditionedEstimation - do we want to anneal variances or train with a single variance?
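On the first question: under a Gaussian-KDE model of the data distribution with a softmin definition of data distance (my own assumptions for this sketch, not the repo's definitions), the two are equivalent up to a factor of σ²:

```latex
% Sketch under a Gaussian-KDE assumption on the data.
% Smoothed data density and its score:
p_\sigma(x) = \frac{1}{N}\sum_i \mathcal{N}(x;\, x_i,\, \sigma^2 I),
\qquad
\nabla_x \log p_\sigma(x) = \frac{1}{\sigma^2}\sum_i w_i(x)\,(x_i - x),
\quad
w_i(x) = \operatorname{softmax}_i\!\Big({-\tfrac{\|x - x_i\|^2}{2\sigma^2}}\Big).
% Softmin data distance:
d_\sigma(x) = -\sigma^2 \log \sum_i \exp\!\Big({-\tfrac{\|x - x_i\|^2}{2\sigma^2}}\Big)
\;\;\Rightarrow\;\;
\nabla_x d_\sigma(x) = -\sigma^2\, \nabla_x \log p_\sigma(x).
% So a data-distance penalty modifies gradients exactly like a (scaled) score term.
```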
What's the minimum set of questions that we want to answer for the experiment? Goal: let's not try to do something too impressive or overly complicated, but have a crisp set of empirical experiments that prove the point.
3.1. Illustrative (and very analyzable) low-dimensional example where distribution risk is helpful. (Hongkai)
3.2. Comparison against existing methods on complex examples (Terry)
3.3 Scalability of our method to pixel-based control problems (Glen / Lu)
I stayed up all night and made a series of shocking revelations.
I think it's quite important to compare our method against other offline RL methods such as CQL / MOPO. One of the best ways to do that is to run on D4RL, since the original paper already provides scores for various methods on the benchmarks; instead of implementing all the baselines ourselves, we can just run our own method and record it on the scoreboard.
Specifically, let's focus on the half-cheetah environment. What I thought we needed to do is load the dataset from D4RL, train the dynamics and score function, and simply run our method. But there is a big catch: the Mujoco environments in gym give us the reward when we call env.step(action) by internally calling the simulator. But if we roll out the learned dynamics, there is no way to get this reward, since the gym wrapper doesn't give us the reward as a function of the state x.
I thought one way to get around this would be to simply go look at how the rewards are defined in Mujoco gym and just write our own reward function explicitly as a function of x. For this environment, it seems simple enough: it's a sum of quadratic cost over actions and a terminal reward on how far it reached, which should be covered by our QuadraticCost: https://github.com/openai/gym/blob/dcd185843a62953e27c2d54dc8c2d647d604b635/gym/envs/mujoco/half_cheetah.py#L30
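As a sanity check, here is a hedged sketch of what that explicit reward could look like, mirroring the linked half_cheetah.py; it assumes x[0] is the root x-position of the full MuJoCo state, and it is an illustration rather than the repo's QuadraticCost.

```python
# Hypothetical reconstruction of the HalfCheetah reward as a function of
# full state x and action u, following gym's half_cheetah.py.
import numpy as np

def half_cheetah_reward(x_before, x_after, u, dt=0.05, ctrl_weight=0.1):
    reward_run = (x_after[0] - x_before[0]) / dt     # forward progress per step
    reward_ctrl = -ctrl_weight * np.square(u).sum()  # quadratic action cost
    return reward_run + reward_ctrl
```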
But RL folks are inconceivably weird. Even though they claim the Mujoco environments are "fully observable", when we get observations from env.step, they purposely left out the x position from half-cheetah and included everything else: https://github.com/openai/gym/blob/dcd185843a62953e27c2d54dc8c2d647d604b635/gym/envs/mujoco/half_cheetah_v4.py#L15, which means there is no way to compute the reward from the observations they provide. Why did they do this?? Even beyond computing the reward, what does learning dynamics on this environment even mean if they don't give us the full state?
It seems previous works (MBPO / MOPO) get around this by also learning the reward as a function of (x,u): https://github.com/tianheyu927/mopo/blob/e5efe3eaa548850af109c920bfc2c1ec96bf2285/mopo/models/constructor.py#L53
https://github.com/jannerm/mbpo/blob/ac694ff9f1ebb789cc5b3f164d9d67f93ed8f129/mbpo/models/constructor.py#L27
and then computing the value function under the learned reward and the learned dynamics:
https://github.com/tianheyu927/mopo/blob/e5efe3eaa548850af109c920bfc2c1ec96bf2285/mopo/models/fake_env.py#L73
I think we can do the same thing: learn the reward function for the gym environment and still append the score-function gradients. But this has been very confusing from a robotics point of view.
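A minimal sketch of that idea, assuming PyTorch (the class and names are hypothetical, modeled loosely on the MBPO/MOPO constructors linked above): one network predicts both the next observation and the reward from (x, u), so rollouts of the learned model also produce rewards.

```python
# Hypothetical joint dynamics-and-reward model in the spirit of MBPO/MOPO.
import torch
import torch.nn as nn

class DynamicsAndReward(nn.Module):
    def __init__(self, dim_x, dim_u, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_u, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim_x + 1),  # [state delta, reward]
        )

    def forward(self, x, u):
        out = self.net(torch.cat([x, u], dim=-1))
        x_next = x + out[..., :-1]   # predict the state delta, as MBPO does
        reward = out[..., -1]
        return x_next, reward
```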
Remaining list of implementations that are not general to input dimensions:
- ScoreEstimator
- NoiseConditionedScoreEstimator
- DataDistanceEstimator
- PolicyOptimizer
If we frame our approach as doing "uncertainty-aware planning with learned dynamics", we can broadly classify different methods by gradient order (zeroth vs. first), trajectory parametrization (single shooting vs. direct collocation), and uncertainty metric (ensembles vs. data distance). We want to convincingly show the benefits of our method (first-order + direct collocation + data distance) as opposed to popular planning approaches like MPPI with ensemble variance (zeroth-order + shooting + ensembles). The different options are summarized as follows:
|  |  | Single Shooting | Direct Collocation |
|---|---|---|---|
| First-order | Ensembles |  |  |
|  | DataDistance | DRisk Trajopt | Diffusion Planning |
| Zeroth-order | Ensembles | MPPI w/ Ensembles |  |
|  | DataDistance |  |  |
We can try two versions.
How do we actually train a Noise Conditioned Score Estimator?
The most straightforward / naïve way would be to append one more dimension to the input of the network (i.e., do MLP(4, 3) instead of MLP(3, 3)), as in the sketch below.
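A minimal sketch of that naïve option, assuming PyTorch (names hypothetical): sigma is appended as one extra input feature per sample.

```python
# Hypothetical naive noise-conditioned score estimator: MLP(dim + 1, dim).
import torch
import torch.nn as nn

class NaiveNCSE(nn.Module):
    def __init__(self, dim=3, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z, sigma):
        # Broadcast the scalar noise level to one extra column of the input.
        s = torch.full((z.shape[0], 1), float(sigma), device=z.device)
        return self.net(torch.cat([z, s], dim=-1))
```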
It seems that for the diffusion papers, there were some interesting choices of architecture we should consider adopting. In the original implementation (this repo), it's interesting to see that the value of sigma is never really used as an input to the network. Instead, if we look at this implementation of anneal_dsm_score_estimation, the labels are defined as integer variables here. So this network takes integers as input, as opposed to the actual values of the sigmas.
After getting this integer label, data and label are both passed into ConditionalResidualBlock, which normalizes the (data, label) pairs. This normalization is done by using nn.Embedding to convert the integer token label into a feature vector. Then, this feature vector is multiplied with the batch-normalized data. This is repeated through multiple layers of the network.
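A minimal sketch of that conditioning trick, assuming PyTorch and modeled on (not copied from) NCSN's conditional normalization; names are hypothetical:

```python
# Hypothetical conditional normalization: the integer noise-level label indexes
# an nn.Embedding, and the embedded vector scales/shifts the normalized data.
import torch
import torch.nn as nn

class ConditionalNorm(nn.Module):
    def __init__(self, num_features, num_noise_levels):
        super().__init__()
        self.norm = nn.BatchNorm1d(num_features, affine=False)
        self.embed = nn.Embedding(num_noise_levels, 2 * num_features)

    def forward(self, h, label):
        # label: LongTensor of noise-level indices, shape (batch,)
        gamma, beta = self.embed(label).chunk(2, dim=-1)
        return gamma * self.norm(h) + beta
```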
TODOs:
Benchmarks:
sigma should be positive, but the default value is -3
score_po/score_po/score_matching.py
Line 131 in 646ab1b