Visualising the Metropolis-Hastings algorithm

In a previous post, I demonstrated how to use my R package MHadapive to do general MCMC to estimate Bayesian models. The functions in this package are an implementation of the Metropolis-Hastings algorithm. In this post, I want to provide an intuitive way to picture what is going on ‘under the hood’ in this algorithm.

The main idea is to draw samples from a distribution which you can evaluate at any point, but not necessarily integrate (or at least, you don’t want to). The reason that integration is important come from Bayes theorem itself:

P(θ|D)=P(D|θ)P(θ)/P(D)

Where P(D) is the unconditional probability of observing the data. Since this is not dependant on the parameters of the model (θ) on which we wish to perform inference, P(D) is effectively a normalising constant which makes P(θ|D) a proper probability density function (ie. integrates to 1).

So, we have a non-normalised probability density function which we wish to estimate by taking random samples. The process of taking random samples itself is often difficult for complex models, so instead we explore the distribution using a Markov chain. We need a chain which, if run long enough, will consist as a whole of random samples from our distribution of interest (lets call that distribution π). This property of the Markov chain we are constructing is called ergodicity. The Metropolis-Hastings algorithm is a way to construct such a chain.

It works like this:

Choose some starting point in the parameter space k_X
Choose a candidate point k_Y ~ N(k_X, σ). This is often referred to as the proposal distribution.
Move to the candidate point with probability: min( π(k_Y)/π(K_X), 1)
Repeat.

The following code demonstrates this process in action for a simple normal target distribution.

################################################################
###     Metropolis-Hastings Visualization                #######
###      Created by Corey Chivers, 2012                #########
################################################################

mh_vis<-function(prop_sd=0.1,target_mu=0,
target_sd=1,seed=1,iter=5000,plot_file='MH.pdf')
{
plot_range=c(target_mu-3*target_sd,target_mu+3*target_sd)
track <- NULL
k_X = seed; ## set k_X to the seed position

pdf(plot_file)
par(mfrow=c(3,1),mgp = c(1, 0.5, 0))

for(i in 1:iter)
{
track<-c(track,k_X)    ## The chain
k_Y = rnorm(1,k_X,prop_sd) ## Candidate point

## -- plot kernal density estimation of the Chain
par(mar=c(0,3,2,3),xaxt='n',yaxt='n')

curve(dnorm(x,target_mu,target_sd),col='blue',lty=2,lwd=2,xlim=plot_range)
if(i > 2)
lines(density(track,adjust=1.5),col='red',lwd=2)

## -- plot the chain
par(mar=c(0,3,0,3))
plot(track,1:i,xlim=plot_range,main='',type='l',ylab='Trace')

pi_Y = dnorm(k_Y,target_mu,target_sd,log=TRUE)
pi_X = dnorm(k_X,target_mu,target_sd,log=TRUE)

## -- plot the target distribution and propsal distribution actions
par(mar=c(3,3,0,3),xaxt='s')
curve(dnorm(x,target_mu,target_sd),xlim=plot_range,col='blue',lty=2,ylab='Metropolis-Hastings',lwd=2)
curve(dnorm(x,k_X,prop_sd),col='black',add=TRUE)
abline(v=k_X,lwd=2)
points(k_Y,0,pch=19,cex=2)
abline(v=k_Y)

a_X_Y = (pi_Y)-(pi_X)
a_X_Y = a_X_Y

if (a_X_Y > 0)
a_X_Y = 0

## Accept move with a probability a_X_Y
if (log(runif(1))<=a_X_Y)
{
k_X = k_Y;
points(k_Y,0,pch=19,col='green',cex=2)
abline( v=k_X,col='black',lwd=2)
}
## Adapt the poposal
if(i>100)
prop_sd=sd(track[floor(i/2):i])

if(i%%100==0)
print(paste(i,'of',iter))
}
dev.off()
}

mh_vis()

A common problem in the implementation of this algorithm is the selection of σ. Efficient mixing (the rate at which the chain converges to the target distribution) occurs when σ approximates the standard deviation of the target distribution. When we don’t know this value in advance. we can allow σ to adapt based on the history of the chain so far. In the above example, σ is simply updated to take the value of the standard deviation of some previous points in the chain.

The output is a multipage pdf which you can scroll through to animate the MCMC.

The top panel shows the target distribution (blue dotted) and a kernel smoothed estimate of the target via the MCMC samples. The second panel shows a trace of the chain, and the bottom panel illustrates the steps of the algorithm itself.

Note: notice the first 100 or so iterations were a rather poor representation of the target distribution. In practice, we would ‘burn’ the first n iterations of the chain – typically the first 100-1000.

6 thoughts on “Visualising the Metropolis-Hastings algorithm”

Vincent says:

February 10, 2012 at 10:41 pm

This is awesome. Thanks a lot!

Niall Bolger says:

February 11, 2012 at 10:11 am

Very well done. Just one question: Can you clarify the difference between the green and black points? Is the color coding acceptance vs. rejection?

- Corey Chivers says:
  
  February 11, 2012 at 11:43 am
  
  That’s correct. The green point indicates an acceptance. If a proposal point is rejected, it is coloured black.
  
kenahoo says:

February 11, 2012 at 12:20 pm

A nice change would be to make ‘dnorm’ a parameter to the function, so that you can run M-H on whatever distribution you want.

itc says:

February 29, 2012 at 9:12 pm

Nice demo. Is it worth cautioning that step number 3, as written, requires that step number 2 involves a symmetric density?

vate01 says:

February 4, 2014 at 12:28 am

Hello. Congratulations for this post. It helps a lot. I would like to ask you something. You write: “When we don’t know this value in advance. we can allow σ to adapt based on the history of the chain so far. In the above example, σ is simply updated to take the value of the standard deviation of some previous points in the chain”. Where can I found a concrete reference pointing out this assertion? Thanks in advance.