Markov Chain Monte Carlo (MCMC)

The crime story of a Mathematical Smuggler

A seasoned intelligence operative was assigned a difficult mission: track an arms smuggler who never stayed in one place long enough to be caught. The smuggler moved relentlessly across cities, completing deals and vanishing before anyone could react. Following him directly was impossible.

So the operative chose a different approach. Instead of chasing the smuggler, he decided to understand his movement.

He began keeping a diary.

Each page of the diary represented a step in the smuggler’s journey. On the $t^{th}$ page, he wrote down a vector that captured his belief about the smuggler’s location. For every city $k$ , he recorded the probability that the smuggler would be there after $t$ moves. At the beginning, these probabilities were uncertain and scattered.

However, the operative had one powerful piece of information. Through intelligence reports, he had determined how the smuggler moved between cities. He represented this behavior as a matrix $P$ , where each entry $P_{ij}$ described the probability of traveling from city $i$ to city $j$ .

With this, the diary became more than just notes. It became a system.

If the probabilities on page $t$ were given by $x_t$ , then the next page followed a precise rule:

x_{t+1} = x_t P

Each new page was obtained by applying the same transition matrix. The operative did not need to guess anymore. He could compute how his belief evolved over time.

As the pages filled, something interesting happened. The probabilities began to settle. The changes between consecutive pages became smaller and smaller. Eventually, the distribution reached a stable form that barely changed with further updates.

The operative understood what this meant.

The smuggler’s movement was not arbitrary. It followed a Markov chain, where each step depended only on the current city. Because of this structure, repeated application of the transition matrix revealed a long-term pattern.

The diary no longer just described where the smuggler might be next. It revealed where he was most likely to be found in the long run.

Let us now look at the actual math behind it. Let us build the math analogously to the above said story.

Markov Chain Monte Carlo

Each state in a markov chain is a city the smuggler goes to. The amount he receives at each city is the value of the state, i.e., $\mathbf{f(x_i)}$ .

Let $x_i$ be the $i^{th}$ state, and $f(x_i)$ be the value of the $i^{th}$ state. The intelligence reports are nothing but the transition probabilities between the states. Let $P_{ij}$ be the probability of moving from state $i$ to state $j$ . Not to mention, but the whole skill is hidden in figuring out the transition probabilities, and luckily, there are some methods to do that, such as Metropolis-Hastings algorithm, Gibbs sampling, etc.

Let us get into why we do this kind of Markov Chain Monte Carlo.

Let us say there is a function

f: \mathbb{R}^d \to \mathbb{R}

We want to find the expected value of the function. Practically, if we see, even if the values each co-ordinate in the d-dimensions take 2 values, it becomes atleast $2^d$ possible values. That is a very very huge space. So, instead of doing the

\mathbb{E}[f(x)] = \sum_{x} f(x) p(x)

We can approximate it as

\mathbb{E}[f(x)] \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i)

where $x_i$ are sampled from $p(x)$ . We now have some issues.

We don't know how to sample from $p(x)$ .
We don't know how many to sample from $p(x)$ to get a good approximation.
How can we use Markov Chains in this scenario?

To answer them, we shall see how the markov chain works in this scenario.

We need to build a markov chain such that it behaves like we sampled from $p(x)$ . So, we keep some $n$ no. of states, and then assign them a target probability $p(x_i)$ , where $\sum_{i=1}^{n} p(x_i) = 1$ . We are not completed here. We also need to define a transition probability martix, such that the walk converges to the target distribution.

We have just answered the first and third question.

To answer the second question, we can say, walk long enough that you reach a stationary probability that is close enough to the target probability.

So, to design the heart of the walk, i.e., the transition probability matrix, we need to use the afformentioned techniques such as Metropolis-Hastings algorithm, Gibbs sampling, etc.

Some math to get ourselves dirty to understand what we are doing actually by using markov chain method before going into the Metropolis-Hastings algorithm.

To estimate

\mathbb{E}[f(x)] = \sum_{i=1}^n f(x_i) p(x_i)

we shall estimate it by averaging some samples, which is

\mathbb{E}[f(x)] \approx \frac{1}{n} \sum_{i=1}^n f(x_i)

we shall recall that $\mathbf{p(t)}$ is the probability distribution of being at a vertex at time $t$ . So, $\mathbf{p(t)}$ contains as many components as the number of vertices in the graph. The long term t-step average is

\mathbf{a(t) = \frac{1}{t}[p(0) + p(1) + \dots + p(t-1)]}

The expected value of function is

\mathbb{E}[f(x)] = \sum_{i} f(x_i) p(x_i)

where, $f_{i}$ is the value of the function at the $i^{th}$ state and this is equivalent to the sampling by walking long enough. Let us say this estimate as $\gamma$ . We shall replace the probability $p_{i}$ with the average probability $a_{i}$ .

\mathbb{E}[f(x)] = \sum_{i} f_{i} a_{i}(t)

where $a_{i}(t)$ is the average probability of being at the $i^{th}$ state at time $t$ .

When we expand this, we get

\mathbb{E}[\gamma] = \sum_i f_i \left( \frac{1}{t} \sum_{j=1}^{t} \mathrm{Prob}(\text{walk is in state } i \text{ at time } j) \right) = \sum_i f_i a_i(t).

Metropolis-Hastings Algorithm

This algorithm says, if you are wanting to get a target distribution $\mathbf{p(x)}$ , then you can start with a bunch of states, with each being a lattice point, that is

each point in the d-dimensions where the oint is from $(x_1, x_2, \dots, x_i, \dots, x_d)$ where $x_i \in \mathbb{N}$ .

Each point has $2d$ co-ordinate edges. But let it is not always that there will be those many neighbours, sometimes, in markov chains, we may have only $r$ no. of edges connected to a vertex. So, let $r$ be the maximum degree a state has in that graph. So, we usually can think, there is at max $\frac{1}{r}$ possibility to move from one vertex to another, but $r$ is the maximum no. of egdes one vertex can have. So, there might be some vertices with less than $r$ edges, which means there is a possibility that we might not move to any vertex from that vertex. So, to go to a vertex $j$ from $i$ , we can move to $j$ if $p_{j} \ge p_{i}$ . If $p_{j} < p_{i}$ , then we can move to $j$ with probability $\frac{p_{j}}{p_{i}}$ and with probability $1 - \frac{p_{j}}{p_{i}}$ we stay at the current vertex $i$ . This shows us that, this movement favours the states that have the high probability. Hence this walk ensures that we stay in the high probability states for a long time, which resembles the target distribution.

For $i$ adjacent to $j$ in the graph $G$ ,

\mathbf{p_{ij} = \frac{1}{r} \min\left(1, \frac{p_j}{p_i}\right)}

where $r$ is the maximum degree of any vertex in the graph $G$ , and

\mathbf{p_{ii} = 1 - \sum_{j \neq i} p_{ij}}

Thus,

p_{i}p_{ij} = \frac{p_{i}}{r} \min\left(1, \frac{p_{j}}{p_{i}}\right) = \frac{1}{r} \min\left(p_{i}, p_{j}\right)

and

p_{j}p_{ji} = \frac{p_{j}}{r} \min\left(1, \frac{p_{i}}{p_{j}}\right) = \frac{1}{r} \min\left(p_{j}, p_{i}\right)

Thus, we have

p_{i}p_{ij} = p_{j}p_{ji}

This is called the detailed balance condition.