Markov Chain Monte Carlo (MCMC)

Markov ChainsInformation TheoryData Science

Markov Chain Monte Carlo (MCMC) Explained with Proof and Intuition.

The crime story of a Mathematical Smuggler

A seasoned intelligence operative was assigned a difficult mission: track an arms smuggler who never stayed in one place long enough to be caught. The smuggler moved relentlessly across cities, completing deals and vanishing before anyone could react. Following him directly was impossible.

So the operative chose a different approach. Instead of chasing the smuggler, he decided to understand his movement.

He began keeping a diary.

Each page of the diary represented a step in the smuggler’s journey. On the ttht^{th} page, he wrote down a vector that captured his belief about the smuggler’s location. For every city kk, he recorded the probability that the smuggler would be there after tt moves. At the beginning, these probabilities were uncertain and scattered.

However, the operative had one powerful piece of information. Through intelligence reports, he had determined how the smuggler moved between cities. He represented this behavior as a matrix PP, where each entry PijP_{ij} described the probability of traveling from city ii to city jj.

With this, the diary became more than just notes. It became a system.

If the probabilities on page tt were given by xtx_t, then the next page followed a precise rule:

xt+1=xtPx_{t+1} = x_t P

Each new page was obtained by applying the same transition matrix. The operative did not need to guess anymore. He could compute how his belief evolved over time.

As the pages filled, something interesting happened. The probabilities began to settle. The changes between consecutive pages became smaller and smaller. Eventually, the distribution reached a stable form that barely changed with further updates.

The operative understood what this meant.

The smuggler’s movement was not arbitrary. It followed a Markov chain, where each step depended only on the current city. Because of this structure, repeated application of the transition matrix revealed a long-term pattern.

The diary no longer just described where the smuggler might be next. It revealed where he was most likely to be found in the long run.

Let us now look at the actual math behind it. Let us build the math analogously to the above said story.

Markov Chain Monte Carlo

Each state in a markov chain is a city the smuggler goes to. The amount he receives at each city is the value of the state, i.e., f(xi)\mathbf{f(x_i)}.

Let xix_i be the ithi^{th} state, and f(xi)f(x_i) be the value of the ithi^{th} state. The intelligence reports are nothing but the transition probabilities between the states. Let PijP_{ij} be the probability of moving from state ii to state jj. Not to mention, but the whole skill is hidden in figuring out the transition probabilities, and luckily, there are some methods to do that, such as Metropolis-Hastings algorithm, Gibbs sampling, etc.

Let us get into why we do this kind of Markov Chain Monte Carlo.

Let us say there is a function

f:RdR f: \mathbb{R}^d \to \mathbb{R}

We want to find the expected value of the function. Practically, if we see, even if the values each co-ordinate in the d-dimensions take 2 values, it becomes atleast 2d2^d possible values. That is a very very huge space. So, instead of doing the

E[f(x)]=xf(x)p(x)\mathbb{E}[f(x)] = \sum_{x} f(x) p(x)

We can approximate it as

E[f(x)]1ni=1nf(xi)\mathbb{E}[f(x)] \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i)

where xix_i are sampled from p(x)p(x). We now have some issues.

  1. We don't know how to sample from p(x)p(x).
  2. We don't know how many to sample from p(x)p(x) to get a good approximation.
  3. How can we use Markov Chains in this scenario?

To answer them, we shall see how the markov chain works in this scenario.

We need to build a markov chain such that it behaves like we sampled from p(x)p(x). So, we keep some nn no. of states, and then assign them a target probability p(xi)p(x_i), where i=1np(xi)=1\sum_{i=1}^{n} p(x_i) = 1. We are not completed here. We also need to define a transition probability martix, such that the walk converges to the target distribution.

We have just answered the first and third question.

To answer the second question, we can say, walk long enough that you reach a stationary probability that is close enough to the target probability.

So, to design the heart of the walk, i.e., the transition probability matrix, we need to use the afformentioned techniques such as Metropolis-Hastings algorithm, Gibbs sampling, etc.

Some math to get ourselves dirty to understand what we are doing actually by using markov chain method before going into the Metropolis-Hastings algorithm.

To estimate

E[f(x)]=i=1nf(xi)p(xi)\mathbb{E}[f(x)] = \sum_{i=1}^n f(x_i) p(x_i)

we shall estimate it by averaging some samples, which is

E[f(x)]1ni=1nf(xi)\mathbb{E}[f(x)] \approx \frac{1}{n} \sum_{i=1}^n f(x_i)

we shall recall that p(t)\mathbf{p(t)} is the probability distribution of being at a vertex at time tt. So, p(t)\mathbf{p(t)} contains as many components as the number of vertices in the graph. The long term t-step average is

a(t)=1t[p(0)+p(1)++p(t1)]\mathbf{a(t) = \frac{1}{t}[p(0) + p(1) + \dots + p(t-1)]}

The expected value of function is

E[f(x)]=if(xi)p(xi)\mathbb{E}[f(x)] = \sum_{i} f(x_i) p(x_i)

where, fif_{i} is the value of the function at the ithi^{th} state and this is equivalent to the sampling by walking long enough. Let us say this estimate as γ\gamma. We shall replace the probability pip_{i} with the average probability aia_{i}.

E[f(x)]=ifiai(t)\mathbb{E}[f(x)] = \sum_{i} f_{i} a_{i}(t)

where ai(t)a_{i}(t) is the average probability of being at the ithi^{th} state at time tt.

When we expand this, we get

E[γ]=ifi(1tj=1tProb(walk is in state i at time j))=ifiai(t).\mathbb{E}[\gamma] = \sum_i f_i \left( \frac{1}{t} \sum_{j=1}^{t} \mathrm{Prob}(\text{walk is in state } i \text{ at time } j) \right) = \sum_i f_i a_i(t).

Metropolis-Hastings Algorithm

This algorithm says, if you are wanting to get a target distribution p(x)\mathbf{p(x)}, then you can start with a bunch of states, with each being a lattice point, that is

each point in the d-dimensions where the oint is from (x1,x2,,xi,,xd)(x_1, x_2, \dots, x_i, \dots, x_d) where xiNx_i \in \mathbb{N}.

Each point has 2d2d co-ordinate edges. But let it is not always that there will be those many neighbours, sometimes, in markov chains, we may have only rr no. of edges connected to a vertex. So, let rr be the maximum degree a state has in that graph. So, we usually can think, there is at max 1r\frac{1}{r} possibility to move from one vertex to another, but rr is the maximum no. of egdes one vertex can have. So, there might be some vertices with less than rr edges, which means there is a possibility that we might not move to any vertex from that vertex. So, to go to a vertex jj from ii, we can move to jj if pjpip_{j} \ge p_{i}. If pj<pip_{j} < p_{i}, then we can move to jj with probability pjpi\frac{p_{j}}{p_{i}} and with probability 1pjpi1 - \frac{p_{j}}{p_{i}} we stay at the current vertex ii. This shows us that, this movement favours the states that have the high probability. Hence this walk ensures that we stay in the high probability states for a long time, which resembles the target distribution.

For ii adjacent to jj in the graph GG,

pij=1rmin(1,pjpi)\mathbf{p_{ij} = \frac{1}{r} \min\left(1, \frac{p_j}{p_i}\right)}

where rr is the maximum degree of any vertex in the graph GG, and

pii=1jipij\mathbf{p_{ii} = 1 - \sum_{j \neq i} p_{ij}}

Thus,

pipij=pirmin(1,pjpi)=1rmin(pi,pj)p_{i}p_{ij} = \frac{p_{i}}{r} \min\left(1, \frac{p_{j}}{p_{i}}\right) = \frac{1}{r} \min\left(p_{i}, p_{j}\right)

and

pjpji=pjrmin(1,pipj)=1rmin(pj,pi)p_{j}p_{ji} = \frac{p_{j}}{r} \min\left(1, \frac{p_{i}}{p_{j}}\right) = \frac{1}{r} \min\left(p_{j}, p_{i}\right)

Thus, we have

pipij=pjpjip_{i}p_{ij} = p_{j}p_{ji}

This is called the detailed balance condition.