Comments (11)
How far along is this implementation? Would contributions be appreciated? I may have some time to work on a TensorFlow version.
from probability.
@junpenglao You can always run it with batch size = 1... There would probably be some overhead that a direct unbatched implementation wouldn't impose; if that overhead becomes a problem, we (or you!) could look into ways to optimize it.
Yes, we are actively working on it. NUTS (unlike vanilla HMC) has data-dependent recursive control flow, which makes it a bit tricky to implement in graph-mode TF (i.e., without relying on TF Eager). There is therefore no definite timeline, but it is a priority.
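To illustrate what "data-dependent recursive control flow" means here, below is a toy sketch in plain Python. All names and numbers are hypothetical stand-ins ("state" is an int, `leapfrog` just increments it, and a "u-turn" fires at a fixed threshold); the point is only the control-flow shape, not NUTS itself. The recursion depth and the outer loop's trip count depend on the data, which is exactly what is awkward to express as a static TF graph:

```python
def leapfrog(state):
    # Stand-in for one leapfrog integration step.
    return state + 1

def u_turn(state):
    # Stand-in for the u-turn termination check.
    return state >= 6

def build_tree(state, depth):
    """Recursive doubling: a depth-d call takes up to 2**d leapfrog steps."""
    if depth == 0:
        state = leapfrog(state)
        return state, u_turn(state)
    state, stop = build_tree(state, depth - 1)
    if stop:
        return state, True
    return build_tree(state, depth - 1)

# The number of doublings is not known when the "graph" is built; it
# depends on when the u-turn check fires.
state, depth, stop = 0, 0, False
while not stop:
    state, stop = build_tree(state, depth)
    depth += 1
```

With the toy threshold above, the loop runs three doublings (1 + 2 + 3 leapfrog steps) before stopping.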
Awesome! Thanks for letting me know, and definitely thanks for working on this!
We are all happy to know you're interested in helping! You should know that we're going for an implementation that can run in graph mode, and can batch together multiple independent chains running in tandem (or at least their leapfrog steps). Since different chains can take different numbers of leapfrog steps during one update, the solution is proving surprisingly complicated: it basically amounts to writing a multi-stage compiler targeting TF graphs that implements the logic needed for batching and recursion (which should, however, be reusable).
Given the scope, off the top of my head I would guess the implementation roadmap will take at least another full-time engineer-month. If you are interested in diving in, we can look for a place that would make sense, but it may take a while to get you up to speed on the design and what has happened so far.
Thanks for the reply! That sounds quite involved, and I am not sure whether I have the time to spare for that right now, but I'll get in touch if I can.
@axch Sounds exciting. How do you mean different chains? Running separate NUTS chains in parallel? What does batching do and how does it help?
Yes, independent NUTS chains in parallel. For an example of how it can help, consider Bayesian linear regression, on, say, the cover type data set. That has 54 features (plus the intercept) and a half-million data points. The dominant cost of running NUTS on that will be the matrix-vector multiplies of a 500,000 x 55 matrix with a 55-vector that happen in the leapfrog steps.
But suppose we are able to run, say, 100 NUTS chains in parallel, and synchronize them on the leapfrog steps. Then, to run the leapfrogs in batch, we just need one (gradient of a) matrix-matrix multiply of a 500,000 x 55 matrix with a 55 x 100 matrix, and on modern accelerators that's much faster than 100 separate matrix-vector multiplies. (It's even faster on CPU, but not as dramatically.)
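A minimal NumPy sketch of that batching win (sizes here are shrunk from the thread's 500,000 x 55 example so it runs instantly; the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: the thread's example is a 500,000 x 55 design
# matrix with 100 chains; smaller here for speed.
n_data, n_features, n_chains = 2_000, 55, 100
X = rng.normal(size=(n_data, n_features))        # shared design matrix
theta = rng.normal(size=(n_features, n_chains))  # one parameter column per chain

# Unbatched: one matrix-vector product per chain.
per_chain = np.stack([X @ theta[:, k] for k in range(n_chains)], axis=1)

# Batched: a single matrix-matrix product covers every chain at once,
# which is what lets accelerators win on the leapfrog steps.
batched = X @ theta

assert np.allclose(per_chain, batched)
```

The two computations are numerically identical; the batched form just exposes all the work to one large, accelerator-friendly matmul.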
That's a really interesting idea. If I understand correctly, the gradient computation is performed once per leapfrog step across all the parallel chains, so you basically create a new Hamiltonian system that contains n copies of the original system? But how would that help synchronize the tree building? The recursive tree building would still be different in each chain.
@junpenglao I wouldn't call it one Hamiltonian system, but rather 100 Hamiltonian systems whose energy function we can evaluate in parallel with batching (to get 100 energies, or rather, 100 energy gradients).
You identified the difficulty exactly: the recurrent tree building is different in each chain, and that is why the design involves what I called a "compiler" earlier, though you can just think of it as an elaborate Python program for building the TensorFlow graph. What happens at the bottom is that, in order to synchronize the leapfrogs, we have to let the control logic at least partially de-synchronize: if chain A ends up needing 5 u-turn checks to get to its next leapfrog while chain B needs only 2, we have to do all 5 anyway, and waste some compute power while chain B waits. But, at least on the cover type example I gave earlier, that should be fine: those u-turn checks are dot products of 55-vectors, so it's ok to waste some in order to get the best batching for the operation involving the 500,000-row matrix.
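A toy sketch of that de-synchronized padding, in plain Python with hypothetical per-chain counts (a real implementation would express this as masked batched ops inside the graph):

```python
# Hypothetical per-chain counts: chain A needs 5 u-turn checks before
# its next leapfrog, chain B needs only 2.
checks_needed = [5, 2]
n_chains = len(checks_needed)

useful = [0] * n_chains  # checks that do real work
wasted = [0] * n_chains  # padded checks run only to stay in lockstep

# Every chain executes max(checks_needed) check steps; a mask marks
# which steps are live. The wasted checks are cheap (dot products of
# 55-vectors), the payoff is keeping the expensive batched matmul of
# the leapfrog steps synchronized.
for step in range(max(checks_needed)):
    for k in range(n_chains):
        if step < checks_needed[k]:
            useful[k] += 1
        else:
            wasted[k] += 1

assert useful == [5, 2]
assert wasted == [0, 3]
```

Here chain B idles through 3 of the 5 check steps, which is the compute traded away in exchange for batching the large matrix operation.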
OK, I see, thanks for the explanation! Would running the different chains out of sync also be a potential option (FYI, that's how PyMC3 does it)?