#### An alternative to MCMC?

I think of Markov chain Monte Carlo (MCMC) as a kind of directed staggering about, a random walk with a goal. (Sort of like driving in Boston.) It is conceptually simple to grasp as a way to explore the posterior probability distribution of the parameters of interest by sampling only where it is worth sampling. Thus, a major savings over brute-force Monte Carlo, and far more robust than downhill fitting programs. It also gives you the error bars on the parameters for free. What could be better?
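To make the "directed staggering" concrete, here is a minimal random-walk Metropolis sketch; the unit-Gaussian target, step size, starting point, and burn-in length are all illustrative assumptions, not anything from the papers discussed:

```python
import math
import random

def log_post(theta):
    # Hypothetical unnormalized log-posterior: a unit Gaussian.
    return -0.5 * theta ** 2

def metropolis(n_steps, step=1.0, theta0=5.0, seed=42):
    """Random-walk Metropolis: jitter the current point, accept with
    probability min(1, posterior ratio); normalization cancels in the ratio."""
    rng = random.Random(seed)
    theta, lp = theta0, log_post(theta0)
    samples = []
    for _ in range(n_steps):
        prop = theta + rng.gauss(0.0, step)
        lp_prop = log_post(prop)
        if math.log(rng.random()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        samples.append(theta)
    return samples

chain = metropolis(20000)
post = chain[5000:]  # discard burn-in
mean = sum(post) / len(post)
sd = (sum((t - mean) ** 2 for t in post) / len(post)) ** 0.5
# mean and sd recover the target's location and "error bar" from the chain
```

The chain never needs the normalization constant: only the ratio of posterior densities enters the accept/reject step.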

Feroz & Hobson (2007, arXiv:0704.3704) describe a technique called Nested Sampling (Skilling 2004), one that could give MCMC a run for its money. It takes the one inefficient part of MCMC — the burn-in phase — and turns that into a virtue. The way it seems to work is to keep track of how the parameter space is traversed as the model parameters {*theta*} reach the mode of the posterior, and to take the sequence of likelihoods thus obtained L(*theta*), and turn it around to get *theta*(L). Neat.

Two big (computational) problems that I see are (1) the calculation of *theta*(L), and (2) the sampling to discard the tail of L(*theta*). The former, it seems to me, becomes intractable exactly when the likelihood surface gets complicated. The latter, again, seems to require running through just as many iterations as MCMC to get a decent sample size. Of course, if you have a good *theta*(L), it does seem to be an improvement over MCMC in that you won’t need to run the chains multiple times to make sure you catch all the modes.
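For reference, the core nested-sampling loop can be sketched in a few lines. This toy version assumes a 1-D problem where drawing from the prior above a likelihood threshold can be done by brute-force rejection, which is precisely the step that becomes intractable in the hard cases:

```python
import math
import random

rng = random.Random(0)

def loglike(theta):
    # Toy likelihood: unit Gaussian, so with a Uniform(-5, 5) prior the
    # evidence is about sqrt(2*pi)/10 ~ 0.2507.
    return -0.5 * theta ** 2

def sample_prior():
    return rng.uniform(-5.0, 5.0)

def logaddexp(a, b):
    if a == -math.inf:
        return b
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def nested_sampling(n_live=200, n_iter=1200):
    live = [sample_prior() for _ in range(n_live)]
    logZ = -math.inf
    logX = 0.0  # log of the prior volume enclosed by the current contour
    for _ in range(n_iter):
        worst = min(live, key=loglike)
        logL = loglike(worst)
        # The enclosed volume shrinks by ~exp(-1/n_live) per iteration;
        # the shell between contours gets weight L * (X_old - X_new).
        logw = logL + logX + math.log(-math.expm1(-1.0 / n_live))
        logZ = logaddexp(logZ, logw)
        logX -= 1.0 / n_live
        # Replace the worst point by a prior draw above the threshold.
        # Brute-force rejection: exactly the step that is hard in general.
        while True:
            cand = sample_prior()
            if loglike(cand) > logL:
                live[live.index(worst)] = cand
                break
    # Add the contribution of the remaining live points.
    for theta in live:
        logZ = logaddexp(logZ, loglike(theta) + logX - math.log(n_live))
    return logZ

evidence = math.exp(nested_sampling())
```

The shell volumes are never computed geometrically; the exp(-1/n_live) shrinkage per step is a statistical estimate, which is where the order-statistics property comes in.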

I think the main advantage of MCMC is that it produces and keeps track of marginalized posteriors for each parameter, whereas in this case, you have to essentially keep a full list of samples from the joint posterior and then marginalize over it yourself. The larger the sample size, the harder this gets, and in fact it is a bit difficult to tell whether the nested sampling method is competing with MCMC or Monte Carlo integration.

Is there any reason why this should not be combined with MCMC? i.e., can we use nested sampling from the burn-in phase to figure out the proposal distributions for Metropolis or Metropolis-Hastings *in situ*, and get the best of both worlds?
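The burn-in-tuning half of that idea can be sketched directly: run a crude chain with a generic proposal, read off the per-parameter spreads, and rescale the Metropolis jumping rule. Everything here is illustrative (a hypothetical correlated 2-D Gaussian posterior; the 2.4/sqrt(d) factor is the usual rule of thumb for Gaussian targets):

```python
import math
import random

rng = random.Random(1)

def log_post(x, y):
    # Hypothetical correlated 2-D Gaussian posterior (rho = 0.9).
    rho = 0.9
    return -(x * x - 2 * rho * x * y + y * y) / (2 * (1 - rho * rho))

def metropolis(n, step_x, step_y, start):
    (x, y), lp = start, log_post(*start)
    out = []
    for _ in range(n):
        xp, yp = x + rng.gauss(0.0, step_x), y + rng.gauss(0.0, step_y)
        lpp = log_post(xp, yp)
        if math.log(rng.random()) < lpp - lp:
            (x, y), lp = (xp, yp), lpp
        out.append((x, y))
    return out

# Crude burn-in with a generic isotropic proposal, started off-mode...
burn = metropolis(2000, 1.0, 1.0, start=(4.0, -4.0))
tail = burn[1000:]
# ...then read off per-parameter spreads (likely over-dispersed relative to
# the true posterior, so the tuning is approximate at best).
mx = sum(p[0] for p in tail) / len(tail)
my = sum(p[1] for p in tail) / len(tail)
sx = (sum((p[0] - mx) ** 2 for p in tail) / len(tail)) ** 0.5
sy = (sum((p[1] - my) ** 2 for p in tail) / len(tail)) ** 0.5
# Rescale the jumping rule per coordinate; a full covariance proposal
# would also capture the correlation.
scale = 2.4 / math.sqrt(2)
chain = metropolis(5000, scale * sx, scale * sy, start=tail[-1])
```

A per-coordinate rescaling is the simplest version; estimating the full burn-in covariance and proposing correlated jumps is the natural next step.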

## David van Dyk:

You can definitely use the burn-in to set the Metropolis-Hastings jumping rules used after burn-in. You can learn about the posterior correlations and variances to optimize the jumping rule variances and blocking schemes. The hitch is that you have to take it with a grain of salt because the burn-in is just burn-in—it is likely more dispersed than the actual posterior—at least if you pick good starting values!

11-18-2007, 12:51 pm

## TomLoredo:

Vinay: First, thanks for posting about the Feroz & Hobson paper; I hadn’t seen it and I certainly will give it a look.

Second, I’m not sure you have correctly characterized MCMC or nested sampling. MCMC does not hunt for the mode; in fact, the MCMC sample with the highest prior*likelihood may not be a very good estimator for the mode, especially as dimension grows (but it may be a good starting point for input into an optimizer). MCMC doesn’t give you error bars “for free”—that’s what it’s aiming for in the first place! MCMC correctly wanders around the posterior via an algorithm that does not require one to normalize prior*likelihood; an unfortunate consequence is that it can’t tell you what the normalization constant is (though there are tricky ways to fudge around this, with various degrees of complexity and success depending on the problem). Nested sampling targets that normalization constant; it gives you the posterior samples (the “error bars”) “for free.”

A good way to understand nested sampling is to think of it as building an approximation to the *Lebesgue* integral of prior*likelihood, as opposed to its Riemann integral. For the familiar Riemann integral of f(x), you divide the x axis into “bins” and add up the areas of the resulting narrow vertical “bars” of height f(x). For the Lebesgue version, you divide the *ordinate* into bins, and for each bin, find the size (“measure”) of the x region corresponding to the intersection of the f-axis bins with f(x). This builds the integral out of a bunch of wide but vertically short *horizontal* bars. Nested sampling does this by using an algorithm for finding vertical bins (defined by likelihood values) that has a cute statistical property: the associated x measures are, statistically, distributed as order statistics. So even though you can’t expect to find those measures exactly (they would be the areas/volumes between neighboring likelihood contours, over the *whole* space), by construction, the algorithm lets you “guess” those measures reasonably accurately. It’s truly ingenious.
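The Riemann-vs-Lebesgue picture can be checked numerically with a toy sketch (the integrand and grid sizes here are arbitrary choices for illustration):

```python
import math

# Toy check that the "layer cake" (Lebesgue-style) sum matches the Riemann
# sum for f(x) = exp(-x^2 / 2) on [-5, 5]; both approach sqrt(2*pi) ~ 2.5066.
def f(x):
    return math.exp(-0.5 * x * x)

a, b, n = -5.0, 5.0, 100000

# Riemann: thin vertical bars over the x axis.
h = (b - a) / n
riemann = sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# Lebesgue: thin horizontal slabs over the ordinate.  For each level t,
# measure the x region where f > t, i.e. |x| < sqrt(-2 ln t).
m = 100000
dt = 1.0 / m  # f ranges over (0, 1]
lebesgue = 0.0
for j in range(m):
    t = (j + 0.5) * dt
    lebesgue += 2.0 * math.sqrt(-2.0 * math.log(t)) * dt
```

Here the measure at each level is known in closed form; nested sampling's trick is estimating those measures statistically when they cannot be computed.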

There is no theta(L) step involved; indeed, there can’t be—there is no unique solution (it’s a contour or surface or hypersurface). The algorithm works in theta space directly, and you can use the resulting thetas as posterior samples.

But you are right about this hard part: “sampling to discard the tail of L(theta).” There is no general way to do it (well, maybe the new paper has one!). The few problems that have had good success with NS have had a special structure that allowed this sampling to be done cleverly and accurately. The one previous cosmological example used an approximate method, and to my mind (I only quickly read it) did not make a compelling case that the results were accurate; and to the extent they were, it’s because the problem was not that challenging. We (my collaboration with some statisticians at Duke on exoplanet problems) had a grad student spend a semester trying to get NS to work on a simple exoplanet problem (with a complex, multimodal posterior), and it’s the sampling “above the tail” where we got hung up. He (Floyd Bullard) could get it to work, but it was not as efficient as more common, less clever methods (e.g., importance sampling).
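For comparison, importance sampling estimates the same normalization constant by averaging prior*likelihood over draws from a proposal density. A toy sketch, with an arbitrary Gaussian proposal and toy densities that are all illustrative assumptions:

```python
import math
import random

rng = random.Random(3)

def log_like(theta):
    # Toy likelihood: unit Gaussian in theta.
    return -0.5 * theta ** 2

def log_prior(theta):
    # Uniform(-5, 5) prior.
    return math.log(0.1) if -5.0 <= theta <= 5.0 else -math.inf

def log_q(theta):
    # Proposal: N(0, 2), chosen by hand to cover the posterior bulk.
    s = 2.0
    return -0.5 * (theta / s) ** 2 - math.log(s * math.sqrt(2.0 * math.pi))

# Z = E_q[ prior * likelihood / q ], estimated by a plain Monte Carlo average.
n = 50000
total = 0.0
for _ in range(n):
    t = rng.gauss(0.0, 2.0)
    total += math.exp(log_like(t) + log_prior(t) - log_q(t))
Z_hat = total / n
# Z_hat estimates the same normalization constant nested sampling targets
# (here about 0.1 * sqrt(2*pi) ~ 0.2507).
```

The catch, of course, is that a proposal this well matched to the posterior is rarely available in hard multimodal problems.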

Maybe the new paper will force us to re-examine NS.

-Tom

12-06-2007, 2:57 pm

## vlk:

Thanks for those very illuminating comments, Tom. You are of course spot on that what MCMC does is get error bars — in fact my favorite game with astronomers is to shock them by pointing out how you can’t ever really get a definitive “best-fit” with MCMC and yet you will not lack for realistic characterization of parameter uncertainty. Also thanks for clarifying the Lebesgue integral approach — that was what I was trying to say while flailing about with the theta(L) business.

btw, during one of our CHASC meetings recently, Xiao-li described a fairly general method to get comparative goodness-of-fits out of MCMC. So even though MCMC doesn’t give you a normalization constant, it is possible to work around that to compare nested models. More on that later.

12-07-2007, 4:08 pm