Finite State Markov Chains

Author

David Evans

Published

March 30, 2026

1 Introduction

Many economic phenomena are inherently random and evolve over time. Stock prices fluctuate, economies move between expansions and recessions, and individual workers transition between employment and unemployment. A powerful and widely used tool for modeling these kinds of stochastic dynamics is the finite state Markov chain.

A Markov chain describes a random variable that takes on one of a finite number of values, where the probability of transitioning from one value to another depends only on the current state—not on the history of how you got there. This “memoryless” property makes Markov chains both tractable and remarkably useful. They appear throughout economics: in macroeconomic models of business cycles, in labor economics models of job search, in finance models of credit ratings, and in industrial organization models of firm entry and exit.

Before diving into Markov chains, we first need to understand the simpler building block they are constructed from: discrete random variables. In these notes we will:

Define discrete random variables and show how to simulate them in Julia.
Introduce finite state Markov chains and the transition matrix $P$.
Build a simulation function and apply it to a business cycle example.
Compute expected values and conditional expectations, culminating in the key result that the vector of conditional expectations is simply the matrix product $P\bar X$.
Verify our theoretical formulas using Monte Carlo simulation.
Estimate the transition matrix from data using maximum likelihood, both in closed form and numerically with Optim.jl.
Compute multi-step transition probabilities and show that $P^k$ converges as $k \to \infty$.
Find stationary distributions as eigenvectors and verify them by simulation.
State the Ergodic Theorem and connect time averages to ensemble averages.
Analyze mixing time and the role of eigenvalues in convergence speed.

using Distributions
using LinearAlgebra
using Plots

2 Discrete Random Variables

2.1 Definition

A discrete random variable is a random variable that can take on only a finite number of values. It is completely described by two objects:

A vector of possible values (sometimes called the state space or support):

\[ \bar x = \left(\begin{matrix}\bar x_1\\\bar x_2\\ \vdots\\ \bar x _n\end{matrix}\right) \]

A vector of probabilities:

\[ p = \left(\begin{matrix}p_1\\p_2\\ \vdots\\ p _n\end{matrix}\right) \]

Here $p_i$ represents the probability that the random variable $x$ takes the value $\bar x_i$. For this to be a valid probability distribution, we require that all probabilities are non-negative ($p_i \geq 0$ for all $i$) and that they sum to one:

\[ \sum_{i=1}^n p_i = 1 \]

2.2 Example: A Coin Toss

We have all experienced discrete random variables in our lives. Consider a bet on a coin toss: I bet you $1 that the coin will come up heads. We can describe the random variable $w$ representing your winnings by a vector of outcomes and a vector of probabilities:

\[ \bar w = \left(\begin{matrix}-1\\1\end{matrix}\right) \quad\text{and}\quad p = \left(\begin{matrix}0.5\\0.5\end{matrix}\right) \]

The first outcome represents the coin coming up heads and you losing a dollar, while the second outcome represents the coin coming up tails and you winning a dollar.

If I were cheating and using a loaded coin, the probability vector might look like:

\[ p = \left(\begin{matrix}0.8\\0.2\end{matrix}\right) \]

representing an 80% probability of the coin coming up heads (and you losing).

2.3 Representing Discrete Random Variables in Julia

Julia’s Distributions package provides the Categorical distribution, which is exactly what we need. A Categorical distribution takes a vector of probabilities and returns random indices (1, 2, …, $n$) according to those probabilities:

cointoss = Categorical([0.5, 0.5])   # Fair coin: 50/50 chance of 1 or 2

Distributions.Categorical{Float64, Vector{Float64}}(support=Base.OneTo(2), p=[0.5, 0.5])

We can draw a single random outcome using rand:

rand(cointoss)       # Returns either 1 or 2

Or simulate multiple draws at once:

rand(cointoss, 5)    # Simulates 5 random coin tosses

5-element Vector{Int64}:
 1
 1
 1
 1
 1

Note that rand returns indices (1 or 2), not the actual values (-1 or 1). To translate these indices into meaningful values, we use a vector of values and index into it:

wbar = [-1., 1.]
w = wbar[rand(cointoss)]    # Single toss: maps index to value

-1.0

For multiple tosses, we can separate the two steps to see the mapping clearly. First draw the state indices, then look up the corresponding values:

tosses   = rand(cointoss, 10)    # 10 random state indices (1s and 2s)
winnings = wbar[tosses]          # Map each index to its value (-1 or 1)
hcat(tosses, winnings)           # Display side by side

10×2 Matrix{Float64}:
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 2.0   1.0
 2.0   1.0
 1.0  -1.0
 2.0   1.0
 2.0   1.0
 1.0  -1.0

Each 1 in the left column (state index) maps to -1.0 in the right column (a loss), and each 2 maps to 1.0 (a win).

The same approach works for a loaded coin. We simply change the probability vector:

loadedcoin = Categorical([0.8, 0.2])
w = wbar[rand(loadedcoin, 10)]

10-element Vector{Float64}:
 -1.0
 -1.0
  1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0
 -1.0

With the loaded coin, you should see many more $-1$s (losses) than $1$s (wins), reflecting the 80/20 probability split.

3 Finite State Markov Chains

3.1 Definition and the Transition Matrix

A Markov chain extends the idea of a discrete random variable by allowing the probability distribution to change over time, depending on the current state. Formally, let $x_t$ be a random variable at time $t$. A finite state Markov chain is characterized by two objects:

A vector of possible values $\bar x = (\bar x_1, \bar x_2, \ldots, \bar x_n)'$, just as before.
A transition matrix of probabilities:

\[ P = \left(\begin{matrix}P_{1,1}&P_{1,2}&\cdots&P_{1,n}\\ P_{2,1}&P_{2,2}&\cdots&P_{2,n}\\ \vdots&\vdots&\ddots&\vdots\\ P_{n,1}&P_{n,2}&\cdots&P_{n,n}\end{matrix}\right) \]

The entry $P_{i,j}$ gives the probability that $x_t = \bar x_j$ conditional on $x_{t-1} = \bar x_i$. In other words, if we are currently in state $i$, then $P_{i,j}$ tells us the probability of moving to state $j$ in the next period. We often use $s_t$ to denote the index associated with $x_t$: if $x_t = \bar x_i$ then $s_t = i$.

Properties of the Transition Matrix

The transition matrix $P$ must satisfy two conditions:

All entries are non-negative: $P_{i,j} \geq 0$ for all $i, j$.
Each row sums to one: $\sum_{j=1}^n P_{i,j} = 1$ for all $i$.

Row $i$ of $P$ gives the complete conditional probability distribution of the next state, given that the current state is $i$.

The defining property of a Markov chain is that the transition probabilities depend only on the current state, not on the entire history of past states. This is sometimes called the Markov property or “memorylessness.” Mathematically: $\Pr(x_t = \bar x_j \mid x_{t-1}, x_{t-2}, \ldots) = \Pr(x_t = \bar x_j \mid x_{t-1}) = P_{s_{t-1}, j}$.

3.2 Why Markov Chains? Modeling Persistence

Why would we want to use a Markov chain rather than a simple discrete random variable? The key reason is persistence. Many economic variables exhibit serial correlation: when the economy is in a recession, it is more likely to remain in a recession next period than to suddenly jump to a boom. When a worker is unemployed, they are more likely to still be unemployed next month than someone who is currently employed.

A simple discrete random variable with fixed probabilities cannot capture this kind of dependence. But a Markov chain can: by choosing the transition matrix $P$ appropriately, we can make certain transitions more likely than others, building in the persistence we observe in the data.

3.3 Business Cycle Example

Let us model output growth as a two-state Markov chain. The economy is either in an expansion (output growing at 3%) or a recession (output declining at 1%):

# ḡ is typed as g\bar then tab
ḡ = [0.03, -0.01]    # State 1 = expansion (3%), State 2 = recession (-1%)

2-element Vector{Float64}:
  0.03
 -0.01

The transition matrix captures the persistence of each regime:

P = [0.9 0.1;
     0.3 0.7]

2×2 Matrix{Float64}:
 0.9  0.1
 0.3  0.7

How do we read this matrix? The first row says that if we are currently in an expansion, there is a 90% chance we remain in an expansion next period and a 10% chance we transition to a recession. The second row says that if we are in a recession, there is a 70% chance we stay in a recession and a 30% chance we recover to an expansion. Both regimes are persistent, but expansions are more persistent than recessions.

3.4 Reading the Transition Matrix in Julia

We can extract specific probabilities from the transition matrix using standard indexing:

P[1, 2]    # Probability of transitioning from boom (state 1) to recession (state 2)

0.1

To get the entire conditional distribution for next period, given that we are currently in a boom, we extract the first row:

P[1, :]    # Full conditional distribution given current state is boom

2-element Vector{Float64}:
 0.9
 0.1

We can use this row to draw a random value of output next period:

pboom = Categorical(P[1, :])   # Create distribution from row 1 of P
ḡ[rand(pboom)]                 # Draw a state and look up its value

0.03

4 Simulating Markov Chains

4.1 The `drawDiscrete` Function

The pattern of creating a Categorical distribution and mapping the drawn index to a value comes up repeatedly, so let us wrap it in a reusable function:

"""
    drawDiscrete(p, X̄)

Draws a discrete random variable with probability vector `p` over values `X̄`.

# Arguments
- `p`: a probability vector (must sum to 1)
- `X̄`: a vector of values corresponding to each state

# Returns
- `(s, X)`: a tuple where `s` is the drawn state index and `X = X̄[s]` is the corresponding value
"""
function drawDiscrete(p, X̄)
    dist = Categorical(p)         # Create the distribution from p
    s = rand(dist)                # Draw a random state index
    X = X̄[s]                     # Look up the value for that state
    return s, X                   # Return both the index and value
end

Main.Notebook.drawDiscrete

This function takes a probability vector p and a value vector X̄, draws a random state index from the categorical distribution defined by p, maps it to the corresponding value, and returns both the index and the value as a tuple.

Let us verify it works with our coin toss examples:

drawDiscrete([0.5, 0.5], [-1, 1])    # Fair coin

(1, -1)

drawDiscrete([0.8, 0.2], [-1, 1])    # Loaded coin

(1, -1)

4.2 The `simulateMarkov` Function

Since the simulation pattern—pre-allocate arrays, set the initial state, loop forward drawing from the appropriate row of $P$—will be used repeatedly, let us wrap it in a reusable function:

"""
    simulateMarkov(P, X̄, T; s₁=1)

Simulate a Markov chain for `T` periods.

# Arguments
- `P`: an `n × n` transition matrix
- `X̄`: a vector of values corresponding to each state
- `T`: number of periods to simulate
- `s₁`: initial state index (default: 1)

# Returns
- `(s, X)`: a tuple where `s` is the vector of state indices and `X` is the vector of corresponding values
"""
function simulateMarkov(P, X̄, T; s₁=1)
    X = zeros(T)                          # Pre-allocate values
    s = zeros(Int, T)                     # Pre-allocate state indices
    s[1] = s₁
    X[1] = X̄[s₁]
    for t in 2:T
        s[t], X[t] = drawDiscrete(P[s[t-1], :], X̄)
    end
    return s, X
end

Main.Notebook.simulateMarkov

At each time step, the function extracts row $s_{t-1}$ from the transition matrix $P$ to get the conditional probability vector for the next state, then passes it to drawDiscrete. The result is a pair of vectors: s containing the state indices (1 or 2 in our example) and X containing the corresponding values (growth rates).

Pre-allocation

Notice that the function pre-allocates the arrays X and s with zeros before the loop, rather than growing them with push!. This is much more efficient because Julia can allocate a single contiguous block of memory up front. For numerical simulations, always pre-allocate when you know the size of the output.

4.3 Simulating a Markov Chain

With simulateMarkov in hand, running a simulation is a one-liner:

s, g = simulateMarkov(P, ḡ, 100)

([1, 1, 1, 1, 1, 1, 2, 2, 1, 1  …  1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0.03, 0.03, 0.03, 0.03, 0.03, 0.03, -0.01, -0.01, 0.03, 0.03  …  0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03])

This returns the state sequence s (a vector of 1s and 2s) and the corresponding growth rates g.

4.4 Plotting the Simulation

We can visualize the simulated path of output growth:

plot(g, linewidth=2, color=:steelblue, label="Output Growth",
     title="Simulated Business Cycle", size=(700, 350))
hline!([0.0], color=:gray, linestyle=:dash, label="")
xlabel!("Time")
ylabel!("Output Growth")

The plot shows the economy alternating between expansion and recession regimes. The dashed line at zero separates positive and negative growth. You can see the persistence at work: the economy tends to stay in each regime for several consecutive periods before transitioning. The expansions (at 3%) tend to last longer than the recessions (at -1%), which is consistent with the transition matrix—the probability of staying in an expansion (0.9) is higher than the probability of staying in a recession (0.7).

4.5 Animating the Markov Chain

An animation can help build intuition for how the Markov chain evolves over time. Using the @animate macro from Plots.jl, we create a GIF that shows the growth path being drawn period by period, with the current state highlighted in green (expansion) or red (recession):

[ Info: Saved animation to /var/folders/ms/mjkz2xp96zgblq028jcm5y0r0000gr/T/jl_AtYZMD6nGj.gif

Watch how the economy tends to stay in each regime for stretches before transitioning. The green dots (expansions) tend to cluster together, as do the red dots (recessions)—this is persistence in action.

5 Expectations

5.1 Expected Value of a Discrete Random Variable

The expected value (or mean) of a discrete random variable is the probability-weighted average of all possible values:

\[ \mathbb{E}[X] = p_1 \bar X_1 + p_2 \bar X_2 + \ldots + p_n \bar X_n = \sum_{j=1}^n p_j \bar X_j \]

This formula has a natural interpretation: each possible value $\bar X_j$ is weighted by how likely it is to occur ($p_j$), and we sum up all these weighted values. The expected value tells us the long-run average outcome if we were to repeat the random experiment many times.

Notice that the expected value formula is exactly a dot product between the probability vector $p$ and the value vector $\bar X$. In Julia, we can compute it directly using the dot function from LinearAlgebra:

# Fair coin toss
p = [0.5, 0.5]
w̄ = [-1., 1.]
Ew = dot(p, w̄)   # Expected winnings

0.0

For a fair coin, the expected winnings are zero—which makes intuitive sense, since you are equally likely to win or lose a dollar.

# Loaded coin toss (80% heads)
p = [0.8, 0.2]
Ew = dot(p, w̄)   # Expected winnings

-0.6000000000000001

With the loaded coin, the expected winnings are $-0.6$: on average, you lose 60 cents per toss. This is the “house edge” when playing against a cheater.

We can verify that dot is doing the same thing as the manual calculation:

p[1]*w̄[1] + p[2]*w̄[2]

-0.6000000000000001

5.2 Monte Carlo Verification

One of the most powerful ideas in computational statistics is the Monte Carlo method: if you want to know the expected value of a random variable, simply simulate it many times and take the average. By the law of large numbers, the sample average converges to the true expected value as the number of draws $N \to \infty$.

Let us verify our expected value formula using this approach:

N = 100000                          # Number of draws
w_simulations = zeros(N)            # Pre-allocate results

for i in 1:N                        # Simulate N coin tosses
    s, w = drawDiscrete(p, w̄)      # Toss coin, get winnings
    w_simulations[i] = w            # Store the result
end
mean(w_simulations)                 # Sample average ≈ true expectation

-0.60514

The simulated average should be very close to $-0.6$, confirming our formula. The small discrepancy is due to sampling variability, which shrinks as $N$ increases.

Monte Carlo Method

The Monte Carlo method is invaluable in economics. Many quantities of interest—expected utilities, option prices, welfare effects of policy changes—are difficult or impossible to compute analytically. But if you can simulate the underlying random process, you can always estimate expectations through averaging. The key requirement is a sufficiently large number of draws $N$.

6 Conditional Expectations

6.1 Conditional Expectations for Markov Chains

For finite state Markov chains, we are often interested in conditional expectations: the expected value of the random variable next period, given information about the current state:

\[ \mathbb{E}[X_{t+1} | X_t = \bar X_i] \]

Since the conditional distribution of $X_{t+1}$ given $X_t = \bar X_i$ is determined by row $i$ of the transition matrix $P$, we have:

\[ \mathbb{E}[X_{t+1} | X_t = \bar X_i] = \sum_{j=1}^n P_{i,j} \bar X_j \]

This is again a dot product—this time between row $i$ of $P$ and the value vector $\bar X$. We can compute the conditional expectations for our business cycle example:

# Expected growth rate conditional on currently being in a boom
dot(P[1, :], ḡ)

0.026

# Expected growth rate conditional on currently being in a recession
dot(P[2, :], ḡ)

0.0019999999999999996

Conditional on being in a boom, expected output growth next period is about 2.6%—close to 3% because there is a high probability of remaining in the expansion. Conditional on being in a recession, expected growth is about 0.2%—slightly positive because there is a 30% chance of transitioning to an expansion.

6.2 Conditional Expectations as a Matrix Product

Each conditional expectation is a dot product of $\bar X$ with a row of $P$:

\[ \mathbb{E}[X_{t+1} | X_t = \bar X_i] = P_{i,\cdot} \cdot \bar X \]

If we stack all $n$ conditional expectations into a vector, something elegant happens. Each entry is a dot product of $\bar X$ with a different row of $P$, and by definition, stacking all these row-wise dot products is exactly matrix multiplication:

\[ \left(\begin{matrix}\mathbb{E}[X_{t+1}|X_t = \bar X_1]\\\mathbb{E}[X_{t+1}|X_t = \bar X_2]\\\vdots\\\mathbb{E}[X_{t+1}|X_t = \bar X_n]\end{matrix}\right) = \left(\begin{matrix} P_{1,\cdot}\cdot \bar X \\ P_{2,\cdot}\cdot \bar X \\ \vdots \\ P_{n,\cdot}\cdot \bar X\end{matrix}\right) = P\bar X \]

Key Result

The vector of all conditional expectations equals $P\bar X$—a single matrix multiplication. This is one of the most important computational results for working with Markov chains, and it generalizes: the $k$-step-ahead conditional expectations are given by $P^k \bar X$.

In Julia:

P * ḡ

2-element Vector{Float64}:
 0.026
 0.0019999999999999996

This matches the individual dot products we computed above, confirming that matrix multiplication is doing exactly what we expect.

6.3 Verifying with Simulation

Just as we used Monte Carlo simulation to verify the expected value formula, we can verify the conditional expectation formula by running a long simulation and computing empirical conditional averages.

First, simulate a long Markov chain using our simulateMarkov function:

s, g = simulateMarkov(P, ḡ, 5000)
T = length(g)

Next, we need to sort the simulated output growth values by the preceding state. For each period $t$, we look at what state the economy was in ($s_t$) and collect the next period’s growth rate ($g_{t+1}$). This gives us empirical samples from each conditional distribution:

gboom = []                         # Will hold growth after booms
grec = []                          # Will hold growth after recessions

for t in 1:T-1                    # Loop over all periods
    if s[t] == 1                   # Was period t a boom?
        push!(gboom, g[t+1])      #   → save next period's growth
    else                           # Was period t a recession?
        push!(grec, g[t+1])       #   → save next period's growth
    end
end

Now we can compare the empirical conditional expectations to the theoretical values:

[mean(gboom), mean(grec)]    # Empirical conditional expectations

2-element Vector{Float64}:
 0.025987124463519452
 0.0017702596380802529

P * ḡ                         # Theoretical conditional expectations

2-element Vector{Float64}:
 0.026
 0.0019999999999999996

The two should be very close, with any small differences due to sampling variability. As we increase $T$, the empirical values converge to the theoretical ones.

6.4 List Comprehensions: A Concise Alternative

The loop above can be written much more concisely using Julia’s list comprehensions with filtering conditions. A list comprehension of the form [expression for variable in iterable if condition] creates an array by evaluating expression for each element that satisfies condition:

gboom = [g[t+1] for t in 1:T-1 if s[t] == 1]
grec  = [g[t+1] for t in 1:T-1 if s[t] == 2]

[mean(gboom), mean(grec)]

2-element Vector{Float64}:
 0.025987124463519275
 0.0017702596380802505

This produces the same result as the explicit loop but in just two lines. List comprehensions are a powerful Julia feature that combines filtering and mapping into a single, readable expression. They are especially useful for extracting subsets of data based on conditions—a task that arises frequently in empirical work.

7 Estimating the Transition Matrix

7.1 The Estimation Problem

So far we have taken the transition matrix $P$ as given. In practice, however, the transition probabilities are unknown and must be estimated from data. Suppose we observe a time series of states $s_1, s_2, \ldots, s_T$—for example, a classification of each quarter as “expansion” or “recession” based on GDP growth. How should we estimate the transition matrix $P$ that generated this sequence?

The answer comes from maximum likelihood estimation (MLE): choose the transition matrix that makes the observed data as probable as possible.

7.2 The Likelihood Function

Given a sequence of states $s_1, s_2, \ldots, s_T$ and a candidate transition matrix $P$, the probability (likelihood) of observing that particular sequence is the product of each one-step transition probability:

\[ \mathcal{L}(P) = \prod_{t=2}^{T} P_{s_{t-1}, s_t} \]

Each factor $P_{s_{t-1}, s_t}$ is the probability of transitioning from the state at time $t-1$ to the state at time $t$. We multiply these because, conditional on the current state, the transitions are independent (this follows from the Markov property).

Taking the natural logarithm gives the log-likelihood, which is easier to work with both analytically and numerically:

\[ \log \mathcal{L}(P) = \sum_{t=2}^{T} \log P_{s_{t-1}, s_t} \]

7.3 The Closed-Form MLE

To find the MLE, it is helpful to reorganize the log-likelihood in terms of transition counts. Define $n_{ij}$ as the number of times we observe a transition from state $i$ to state $j$ in the data:

\[ n_{ij} = \sum_{t=2}^{T} \mathbf{1}(s_{t-1} = i,\; s_t = j) \]

where $\mathbf{1}(\cdot)$ is the indicator function that equals 1 when the condition is true and 0 otherwise. Let $n_{i\cdot} = \sum_j n_{ij}$ denote the total number of transitions out of state $i$.

Using these counts, the log-likelihood can be rewritten as:

\[ \log \mathcal{L}(P) = \sum_{i=1}^{n} \sum_{j=1}^{n} n_{ij} \log P_{ij} \]

We want to maximize this subject to the constraint that each row of $P$ sums to one: $\sum_j P_{ij} = 1$ for all $i$. Using Lagrange multipliers (or recognizing that each row can be optimized independently), we obtain the MLE:

\[ \hat{P}_{ij} = \frac{n_{ij}}{n_{i\cdot}} = \frac{n_{ij}}{\sum_{k=1}^{n} n_{ik}} \]

MLE of Transition Probabilities

The maximum likelihood estimate of $P_{ij}$ is simply the fraction of all transitions from state $i$ that go to state $j$. This is both intuitive and easy to compute: just count transitions and normalize each row.

This estimator has several desirable properties. It is consistent: as $T \to \infty$, $\hat{P} \to P$ almost surely. It is also asymptotically efficient: among all consistent estimators, it achieves the smallest possible variance in large samples.

7.4 Julia Implementation

7.4.1 Counting Transitions

Let us write a function that counts the number of transitions between each pair of states:

"""
    countTransitions(s, n_states)

Count the number of transitions between each pair of states in a state sequence.

# Arguments
- `s`: a vector of state indices (integers from 1 to `n_states`)
- `n_states`: the number of distinct states

# Returns
- An `n_states × n_states` matrix `N` where `N[i,j]` is the number of transitions from state `i` to state `j`
"""
function countTransitions(s, n_states)
    N = zeros(Int, n_states, n_states)   # Initialize count matrix
    for t in 2:length(s)
        N[s[t-1], s[t]] += 1            # Increment the (i,j) count
    end
    return N
end

Main.Notebook.countTransitions

The function loops through the state sequence and, for each consecutive pair $(s_{t-1}, s_t)$, increments the corresponding entry in the count matrix.

7.4.2 Estimating the Transition Matrix

To go from counts to probabilities, we normalize each row by its sum:

"""
    estimateTransitionMatrix(s, n_states)

Estimate a transition matrix from a state sequence using maximum likelihood.

# Arguments
- `s`: a vector of state indices (integers from 1 to `n_states`)
- `n_states`: the number of distinct states

# Returns
- An `n_states × n_states` transition matrix where entry `(i,j)` is the MLE estimate of the probability of transitioning from state `i` to state `j`
"""
function estimateTransitionMatrix(s, n_states)
    N = countTransitions(s, n_states)
    P̂ = N ./ sum(N, dims=2)              # Divide each row by its sum
    return P̂
end

Main.Notebook.estimateTransitionMatrix

The expression sum(N, dims=2) sums along dimension 2 (columns), producing a column vector of row totals. Broadcasting with ./ then divides each element by its row total.

7.4.3 Testing on Simulated Data

Let us test our estimator on the simulated business cycle data from earlier. Recall that we generated $T = 5{,}000$ observations from a Markov chain with the true transition matrix:

2×2 Matrix{Float64}:
 0.9  0.1
 0.3  0.7

Applying our estimator to the simulated state sequence:

P̂ = estimateTransitionMatrix(s, 2)

2×2 Matrix{Float64}:
 0.899678  0.100322
 0.294256  0.705744

We can compare the estimate to the truth:

println("True P:")
display(P)
println("\nEstimated P̂:")
display(P̂)
println("\nDifference (P̂ - P):")
display(round.(P̂ .- P, digits=4))

True P:

Estimated P̂:

Difference (P̂ - P):

2×2 Matrix{Float64}:
 0.9  0.1
 0.3  0.7

2×2 Matrix{Float64}:
 0.899678  0.100322
 0.294256  0.705744

2×2 Matrix{Float64}:
 -0.0003  0.0003
 -0.0057  0.0057

The differences should be small—on the order of a few percentage points or less—confirming that the MLE works well even with a moderately sized sample.

7.4.4 How Sample Size Affects Accuracy

To see how estimation accuracy improves with more data, we can estimate the transition matrix from simulations of different lengths and track the maximum absolute error:

sample_sizes = [50, 200, 1000, 5000, 20000]

errors = map(sample_sizes) do T_sim
    s_sim, _ = simulateMarkov(P, ḡ, T_sim)
    P̂_sim = estimateTransitionMatrix(s_sim, 2)
    return maximum(abs.(P̂_sim .- P))   # Max absolute error across all entries
end

hcat(sample_sizes, round.(errors, digits=4))

5×2 Matrix{Float64}:
    50.0  0.0273
   200.0  0.1412
  1000.0  0.0234
  5000.0  0.0173
 20000.0  0.0166

We can visualize this convergence on a log-log scale:

plot(sample_sizes, errors, marker=:circle, linewidth=2,
     color=:steelblue, label="Max absolute error",
     xscale=:log10, yscale=:log10,
     title="MLE Accuracy vs. Sample Size", size=(700, 350))
xlabel!("Sample size (T)")
ylabel!("Max |P̂ᵢⱼ - Pᵢⱼ|")

The roughly linear relationship on the log-log plot reflects the $1/\sqrt{T}$ convergence rate that is typical of maximum likelihood estimators: doubling the precision requires quadrupling the sample size.

7.5 The Log-Likelihood Function

7.5.1 Computing the Log-Likelihood

It is often useful to evaluate the log-likelihood directly—for example, to compare competing models or to perform numerical optimization. Here is a function that computes the log-likelihood of a state sequence given a transition matrix:

"""
    loglikelihood_mc(P, s)

Compute the log-likelihood of state sequence `s` given transition matrix `P`.

# Arguments
- `P`: an `n × n` transition matrix
- `s`: a vector of state indices

# Returns
- The log-likelihood (a scalar)
"""
function loglikelihood_mc(P, s)
    ll = 0.0
    for t in 2:length(s)
        ll += log(P[s[t-1], s[t]])
    end
    return ll
end

Main.Notebook.loglikelihood_mc

7.5.2 Comparing Log-Likelihoods

The MLE should achieve a higher log-likelihood than any other valid transition matrix. Let us verify this by comparing three candidates: our MLE $\hat{P}$, the true $P$, and a naive matrix with equal transition probabilities:

P_wrong = [0.5 0.5;
           0.5 0.5]

println("Log-likelihood at P̂ (MLE):    ", round(loglikelihood_mc(P̂, s), digits=2))
println("Log-likelihood at true P:      ", round(loglikelihood_mc(P, s), digits=2))
println("Log-likelihood at P = 0.5 I:   ", round(loglikelihood_mc(P_wrong, s), digits=2))

Log-likelihood at P̂ (MLE):    -1984.67
Log-likelihood at true P:      -1984.77
Log-likelihood at P = 0.5 I:   -3465.04

As expected, the MLE achieves the highest log-likelihood. The true $P$ comes in a close second (it would match the MLE exactly with infinite data), while the naive equal-probability matrix does substantially worse because it ignores the persistence in the data.

7.5.3 Visualizing the Likelihood Surface

For our $2 \times 2$ transition matrix, the constraint that rows sum to one means there are only two free parameters: $P_{11}$ (the probability of staying in a boom) and $P_{22}$ (the probability of staying in a recession). We can visualize the log-likelihood as a function of these two parameters:

p11_grid = 0.01:0.01:0.99
p22_grid = 0.01:0.01:0.99

ll_surface = [loglikelihood_mc([p11 1-p11; 1-p22 p22], s)
              for p11 in p11_grid, p22 in p22_grid]

contourf(p11_grid, p22_grid, ll_surface',
         xlabel="P₁₁ (boom persistence)",
         ylabel="P₂₂ (recession persistence)",
         title="Log-Likelihood Surface",
         color=:viridis, size=(600, 450), levels=20)
scatter!([P̂[1,1]], [P̂[2,2]], marker=:star5, markersize=12,
         color=:red, label="MLE")
scatter!([P[1,1]], [P[2,2]], marker=:diamond, markersize=8,
         color=:white, label="True P")

The contour plot shows that the log-likelihood surface has a single, well-defined peak near the true parameter values. The MLE (red star) sits right at this peak, while the true $P$ (white diamond) is very close by. The concavity of the surface confirms that the maximum is unique—there is no ambiguity about the best-fitting transition matrix.

7.6 Numerical MLE with `Optim.jl`

7.6.1 Why Numerical Optimization?

For the simple case of estimating a transition matrix, we derived a closed-form MLE. But many econometric models do not have closed-form solutions. In those cases, we need to use numerical optimization to find the parameter values that maximize the likelihood. Julia’s Optim.jl package provides a variety of optimization algorithms for this purpose.

Even when a closed form exists, it is good practice to verify it numerically. Let us use Optim.jl to find the MLE and confirm that it matches our closed-form answer.

using Optim

7.6.2 Setting Up the Objective Function

For numerical optimization, we need to express the problem as minimizing a function of a parameter vector. Since Optim.jl minimizes by default, we minimize the negative log-likelihood (which is equivalent to maximizing the log-likelihood).

We parameterize the $2 \times 2$ transition matrix by the two free parameters $\theta = (P_{11}, P_{22})$ and reconstruct the full matrix inside the objective function:

function neg_loglik(θ, s)
    p11, p22 = θ
    P_candidate = [p11 1-p11;
                   1-p22  p22]
    return -loglikelihood_mc(P_candidate, s)
end

neg_loglik (generic function with 1 method)

7.6.3 Running the Optimizer

Since probabilities must lie between 0 and 1, we use box-constrained optimization with the Fminbox wrapper around the L-BFGS algorithm. We start from a uniform initial guess of $\theta_0 = (0.5, 0.5)$:

θ₀ = [0.5, 0.5]                          # Initial guess
result = optimize(θ -> neg_loglik(θ, s),  # Objective function
                  [0.01, 0.01],           # Lower bounds
                  [0.99, 0.99],           # Upper bounds
                  θ₀,                     # Starting point
                  Fminbox(LBFGS()))        # Bounded L-BFGS algorithm

 * Status: success

 * Candidate solution
    Final objective value:     1.984667e+03

 * Found with
    Algorithm:     Fminbox with L-BFGS

 * Convergence measures
    |x - x'|               = 0.00e+00 ≤ 0.0e+00
    |x - x'|/|x'|          = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00
    |g(x)|                 = 9.57e-07 ≰ 1.0e-08

 * Work counters
    Seconds run:   12  (vs limit Inf)
    Iterations:    141
    f(x) calls:    85737
    ∇f(x) calls:   85737

The optimize function returns a results object containing the minimizer, the minimum function value, and convergence information.

7.6.4 Extracting and Comparing Results

We extract the optimal parameters and reconstruct the estimated transition matrix:

θ_mle = Optim.minimizer(result)
P̂_optim = [θ_mle[1] 1-θ_mle[1];
            1-θ_mle[2]  θ_mle[2]]

println("Optim MLE:")
display(P̂_optim)
println("\nClosed-form MLE:")
display(P̂)
println("\nTrue P:")
display(P)

Optim MLE:

Closed-form MLE:

True P:

2×2 Matrix{Float64}:
 0.899678  0.100322
 0.294256  0.705744

2×2 Matrix{Float64}:
 0.899678  0.100322
 0.294256  0.705744

2×2 Matrix{Float64}:
 0.9  0.1
 0.3  0.7

The numerically optimized estimate should match the closed-form MLE to high precision. Any tiny differences are due to the optimizer’s convergence tolerance, not a fundamental disagreement.

When to Use Numerical Optimization

For transition matrices, the closed-form MLE is simpler and more efficient. But Optim.jl becomes essential when:

The model has no closed-form solution (e.g., hidden Markov models, structural estimation).
You want to impose additional constraints beyond row-summing-to-one.
The likelihood involves latent variables or complex functional forms.

The pattern demonstrated here—define an objective function, choose an algorithm, extract the minimizer—is the same regardless of model complexity.

8 Multi-Step Transition Probabilities

8.1 The $k$-Step Transition Matrix

So far we have studied one-step transitions: the probability of moving from state $i$ to state $j$ in a single period is given by $P_{i,j}$. But many economic questions involve longer horizons. What is the probability that an economy currently in recession will be in an expansion five years from now? What is the distribution of states ten periods ahead?

The answer comes from a simple and elegant result: the $k$-step transition matrix is just the $k$-th power of $P$:

\[ P^{(k)} = P^k \]

The entry $P^k_{i,j}$ gives the probability of going from state $i$ to state $j$ in exactly $k$ steps. The intuition is straightforward. To go from state $i$ to state $j$ in two steps, the chain must pass through some intermediate state $m$ after one step. The probability of this particular two-step path is $P_{i,m} \cdot P_{m,j}$. Summing over all possible intermediate states gives:

\[ P^{(2)}_{i,j} = \sum_{m=1}^n P_{i,m} \cdot P_{m,j} \]

This is precisely the $(i,j)$ entry of the matrix product $P \cdot P = P^2$. The same logic extends inductively: $P^{(k)} = P^{k-1} \cdot P = P^k$.

8.2 Computing Multi-Step Transitions in Julia

In Julia, matrix powers are computed with the ^ operator, just as for scalars. Let us compute the two-step, ten-step, and hundred-step transition matrices for our business cycle model:

P^2     # 2-step transition matrix

2×2 Matrix{Float64}:
 0.84  0.16
 0.48  0.52

P^10    # 10-step transition matrix

2×2 Matrix{Float64}:
 0.751512  0.248488
 0.745465  0.254535

P^100   # 100-step transition matrix

2×2 Matrix{Float64}:
 0.75  0.25
 0.75  0.25

Notice something striking about $P^{100}$: both rows are nearly identical. This means that after 100 periods, the probability of being in each state is essentially the same regardless of the starting state. The chain has “forgotten” where it started. This is our first glimpse of the stationary distribution, which we will study in detail in the next section.

8.3 Example: Recession to Expansion in 5 Periods

Suppose the economy is currently in a recession (state 2). What is the probability of being in an expansion (state 1) in exactly 5 periods?

P5 = P^5
prob_rec_to_exp = P5[2, 1]
println("Probability of expansion in 5 periods, starting from recession: ", round(prob_rec_to_exp, digits=4))

Probability of expansion in 5 periods, starting from recession: 0.6917

We can also compute the entire distribution over states 5 periods ahead, starting from a recession:

P5[2, :]   # Full distribution in 5 periods, starting from recession

2-element Vector{Float64}:
 0.69168
 0.3083199999999999

8.4 Convergence of $P^k$

As $k$ grows, the rows of $P^k$ converge to a common vector. We can visualize this convergence by plotting $P^k_{1,1}$ (the probability of being in an expansion $k$ periods from now, given that we start in an expansion) and $P^k_{2,1}$ (the same probability starting from a recession):

K = 50
prob_exp_from_exp = [(P^k)[1, 1] for k in 1:K]
prob_exp_from_rec = [(P^k)[2, 1] for k in 1:K]

plot(1:K, prob_exp_from_exp, linewidth=2, color=:steelblue,
     label="Start in expansion", title="Convergence of Pᵏ",
     size=(700, 350), ylims=(0, 1))
plot!(1:K, prob_exp_from_rec, linewidth=2, color=:firebrick,
      label="Start in recession")
hline!([0.75], color=:gray, linestyle=:dash, label="Stationary (π₁ = 0.75)")
xlabel!("Periods ahead (k)")
ylabel!("Pr(expansion at period k)")

Both curves converge to the same value—approximately 0.75—regardless of the starting state. This limiting probability is the stationary distribution’s weight on the expansion state. The convergence is rapid: after about 15 periods, the two curves are virtually indistinguishable. The speed of this convergence is governed by the second-largest eigenvalue of $P$, which we will examine later.

8.5 Verifying Multi-Step Probabilities by Simulation

We can verify the multi-step transition probabilities using Monte Carlo simulation. The idea is simple: start many independent chains in state 2 (recession), run each for 5 periods, and count how many end up in state 1 (expansion). The fraction should converge to $P^5_{2,1}$:

N_sims = 50_000
k_steps = 5
count_expansion = 0

for i in 1:N_sims
    s_sim, _ = simulateMarkov(P, ḡ, k_steps + 1; s₁=2)   # Start in recession
    if s_sim[end] == 1
        count_expansion += 1
    end
end

prob_sim = count_expansion / N_sims
println("Simulated Pr(expansion in $k_steps steps | start in recession): ", round(prob_sim, digits=4))
println("Theoretical P⁵[2,1]: ", round(P5[2, 1], digits=4))

Simulated Pr(expansion in 5 steps | start in recession): 0.6896
Theoretical P⁵[2,1]: 0.6917

The simulated probability closely matches the theoretical value, confirming that matrix powers correctly compute multi-step transition probabilities.

9 Stationary Distributions

9.1 Definition

A probability vector $\pi = (\pi_1, \pi_2, \ldots, \pi_n)'$ is a stationary distribution (or invariant distribution) of a Markov chain with transition matrix $P$ if it satisfies two conditions:

Invariance: $\pi' P = \pi'$ — applying the transition matrix leaves the distribution unchanged.
Valid probability: $\sum_{i=1}^n \pi_i = 1$ and $\pi_i \geq 0$ for all $i$.

The interpretation is powerful: if the chain starts with the distribution $\pi$ over states, then after one period the distribution is still $\pi$. And therefore after two periods it is still $\pi$, and so on forever. The distribution is self-perpetuating—the chain has reached a kind of statistical equilibrium.

9.2 Economic Interpretation

The stationary distribution has a natural economic interpretation: $\pi_i$ is the long-run fraction of time the economy spends in state $i$. If $\pi_1 = 0.75$ for the expansion state in our business cycle model, that means the economy spends about 75% of its time in expansion and 25% in recession in the long run. This is a crucial quantity for economic analysis—it tells us the “average” state of the economy, the frequency of recessions, and the typical duration of business cycles.

9.3 Finding $\pi$ as an Eigenvector Problem

The invariance condition $\pi' P = \pi'$ can be rewritten as $P' \pi = \pi$. This says that $\pi$ is an eigenvector of the transpose $P'$ with eigenvalue 1. We can therefore find the stationary distribution by computing the eigendecomposition of $P'$ and extracting the eigenvector corresponding to eigenvalue 1.

A fundamental theorem in the theory of Markov chains (the Perron–Frobenius theorem) guarantees that every transition matrix has 1 as its largest eigenvalue, and the corresponding eigenvector can be chosen to have all non-negative entries. After normalizing so the entries sum to 1, we obtain $\pi$.

vals, vecs = eigen(P')
println("Eigenvalues of P': ", vals)

Eigenvalues of P': [0.6, 1.0]

We need the eigenvector corresponding to eigenvalue 1. Since eigenvalues are not always returned in a predictable order, we find the index of the eigenvalue closest to 1:

idx = argmin(abs.(vals .- 1.0))          # Index of eigenvalue closest to 1
π_raw = real.(vecs[:, idx])               # Extract the corresponding eigenvector
π_stat = π_raw ./ sum(π_raw)              # Normalize to sum to 1
println("Stationary distribution: π = ", round.(π_stat, digits=4))

Stationary distribution: π = [0.75, 0.25]

9.4 Alternative: Solving the Linear System

An equivalent approach is to solve the linear system $(P' - I)\pi = 0$ subject to $\sum_i \pi_i = 1$. We replace one of the equations in $(P' - I)\pi = 0$ (which are linearly dependent) with the normalization constraint:

n = size(P, 1)
A = copy(P') - I                        # P'π = π  ⟹  (P' - I)π = 0
A[end, :] .= 1.0                        # Replace last equation with Σπᵢ = 1
b = zeros(n)
b[end] = 1.0                            # Right-hand side: last entry = 1

π_linear = A \ b                         # Solve the system
println("Stationary distribution (linear system): π = ", round.(π_linear, digits=4))

Stationary distribution (linear system): π = [0.75, 0.25]

Both methods give the same answer. The eigenvector approach is more general and extends naturally to larger state spaces, while the linear system approach is slightly more direct for small problems.

9.5 Example: Business Cycle Stationary Distribution

For our business cycle model with $P = \begin{pmatrix} 0.9 & 0.1 \\ 0.3 & 0.7 \end{pmatrix}$, the stationary distribution tells us the long-run composition of the economy:

println("Long-run probability of expansion: ", round(π_stat[1], digits=4))
println("Long-run probability of recession: ", round(π_stat[2], digits=4))
println("Long-run average growth rate: ", round(dot(π_stat, ḡ), digits=4))

Long-run probability of expansion: 0.75
Long-run probability of recession: 0.25
Long-run average growth rate: 0.02

The economy spends 75% of its time in expansion and 25% in recession. The long-run average growth rate is the probability-weighted average of the two growth rates—a single number that summarizes the economy’s typical performance.

A related quantity is the expected duration of each state. If the economy is currently in state $i$, the probability of staying in state $i$ next period is $P_{ii}$. The number of consecutive periods in state $i$ follows a geometric distribution, so the expected duration is $1 / (1 - P_{ii})$:

dur_expansion = 1 / (1 - P[1, 1])
dur_recession = 1 / (1 - P[2, 2])
println("Expected duration of an expansion: ", round(dur_expansion, digits=1), " periods")
println("Expected duration of a recession:  ", round(dur_recession, digits=1), " periods")

Expected duration of an expansion: 10.0 periods
Expected duration of a recession:  3.3 periods

Expansions last on average 10 periods and recessions about 3.3 periods. This is consistent with the NBER business cycle dating for the U.S., where expansions have historically been much longer than recessions.

9.6 Verification by Simulation

We can verify the stationary distribution by running a very long simulation and comparing the empirical state frequencies to $\pi$. By running the chain for many periods, the fraction of time spent in each state should converge to the stationary probabilities:

T_long = 100_000
s_long, g_long = simulateMarkov(P, ḡ, T_long)

freq_expansion = sum(s_long .== 1) / T_long
freq_recession = sum(s_long .== 2) / T_long

println("Simulated frequency of expansion: ", round(freq_expansion, digits=4))
println("Simulated frequency of recession: ", round(freq_recession, digits=4))
println("Stationary distribution:           π = ", round.(π_stat, digits=4))

Simulated frequency of expansion: 0.7441
Simulated frequency of recession: 0.2559
Stationary distribution:           π = [0.75, 0.25]

The simulated frequencies and the analytical stationary distribution should be very close, confirming our calculations. The small discrepancy shrinks as T_long increases.

Existence and Uniqueness

Not every Markov chain has a unique stationary distribution. Two conditions guarantee existence and uniqueness:

Irreducibility: it is possible to reach every state from every other state (possibly in multiple steps). In our business cycle model, you can go from expansion to recession and back, so the chain is irreducible.
Aperiodicity: the chain does not cycle deterministically through a fixed sequence of states. Since both $P_{1,1} > 0$ and $P_{2,2} > 0$ (the chain can stay in any state), it is aperiodic.

If a Markov chain is both irreducible and aperiodic, it has exactly one stationary distribution, and $P^k$ converges to a matrix with identical rows equal to $\pi'$ as $k \to \infty$. If either condition fails, the long-run behavior can be more complex: multiple stationary distributions may exist, or the chain may cycle without settling down.

10 The Ergodic Theorem

10.1 Statement

The stationary distribution tells us the long-run fraction of time the chain spends in each state. The Ergodic Theorem formalizes this intuition and extends it to functions of the state. For an irreducible, aperiodic Markov chain with stationary distribution $\pi$, the time average of any function $f$ converges almost surely to its expectation under $\pi$:

\[ \frac{1}{T}\sum_{t=1}^{T} f(X_t) \xrightarrow{a.s.} \sum_{i=1}^{n} \pi_i \, f(\bar x_i) \quad \text{as } T \to \infty \]

In words: if you run the chain long enough and average the values of $f$ along the path, the result converges to the probability-weighted average of $f$ under the stationary distribution. Time averages equal ensemble averages.

10.2 Intuition

Why does this work? Because an irreducible, aperiodic chain visits each state $i$ with long-run frequency $\pi_i$. So in a sample of $T$ periods, state $i$ appears approximately $\pi_i \cdot T$ times. The time average of $f$ is therefore approximately:

\[ \frac{1}{T}\sum_{t=1}^{T} f(X_t) \approx \sum_{i=1}^{n} \frac{\pi_i T}{T} f(\bar x_i) = \sum_{i=1}^{n} \pi_i \, f(\bar x_i) \]

This is the same result we would get by drawing $T$ independent samples from $\pi$ and averaging—even though the Markov chain samples are dependent (each depends on the previous one). The ergodic theorem says that, for the purpose of computing long-run averages, this dependence does not matter.

Connection to the Law of Large Numbers

The ergodic theorem is the Markov chain generalization of the Law of Large Numbers (LLN). The classical LLN says that the average of i.i.d. random variables converges to their common mean. The ergodic theorem extends this to dependent sequences generated by a Markov chain. The i.i.d. case is actually a special case: an i.i.d. sequence is a Markov chain where every row of $P$ is identical (the next state does not depend on the current state at all).

10.3 The Theoretical Foundation for Monte Carlo Simulation

The ergodic theorem is the theoretical justification for the Monte Carlo simulations we have been running throughout these notes. Every time we simulated a long chain and computed a sample average—to verify expected values, conditional expectations, or state frequencies—we were implicitly relying on the ergodic theorem. It guarantees that our simulation-based estimates converge to the true values as the simulation length grows.

10.4 Example: Long-Run Average Growth Rate

Let us use the ergodic theorem to compute the long-run average GDP growth rate. The theoretical value, using the stationary distribution, is:

\[ \bar g_{\infty} = \sum_{i=1}^{n} \pi_i \, \bar g_i = \pi_1 \cdot 0.03 + \pi_2 \cdot (-0.01) \]

g_longrun_theory = dot(π_stat, ḡ)
println("Ergodic theorem (theoretical): long-run average growth = ", round(g_longrun_theory, digits=4))

Ergodic theorem (theoretical): long-run average growth = 0.02

Now let us compare this to the time average from our long simulation:

g_longrun_sim = mean(g_long)
println("Monte Carlo (simulated):       long-run average growth = ", round(g_longrun_sim, digits=4))
println("Difference: ", round(abs(g_longrun_theory - g_longrun_sim), digits=6))

Monte Carlo (simulated):       long-run average growth = 0.0198
Difference: 0.000235

The two values are nearly identical, confirming the ergodic theorem in action. The small remaining difference is sampling variability, which shrinks as $T$ increases.

10.5 Beyond Averages: Long-Run Variance and Recession Probability

The ergodic theorem applies to any function of the state, not just the identity. This means we can use it to compute any long-run statistic. For example, let $f(X_t) = X_t^2$. Then the ergodic theorem gives us $\frac{1}{T}\sum X_t^2 \to \sum \pi_i \bar x_i^2$, which lets us compute the long-run variance of growth:

# Long-run second moment
E_g2_theory = dot(π_stat, ḡ.^2)

# Long-run variance = E[g²] - (E[g])²
var_theory = E_g2_theory - g_longrun_theory^2
var_sim = var(g_long)

println("Theoretical long-run variance of growth: ", round(var_theory, digits=6))
println("Simulated variance of growth:            ", round(var_sim, digits=6))

Theoretical long-run variance of growth: 0.0003
Simulated variance of growth:            0.000305

We can also compute the long-run probability of negative growth by setting $f(X_t) = \mathbf{1}(X_t < 0)$—the indicator function for recession. The ergodic theorem tells us this converges to $\sum \pi_i \mathbf{1}(\bar x_i < 0) = \pi_2$, since only state 2 has negative growth:

# Fraction of periods with negative growth
frac_negative_sim = mean(g_long .< 0)
println("Simulated fraction of negative growth periods: ", round(frac_negative_sim, digits=4))
println("Theoretical (π₂):                              ", round(π_stat[2], digits=4))

Simulated fraction of negative growth periods: 0.2559
Theoretical (π₂):                              0.25

This illustrates the power of the ergodic theorem: a single long simulation provides consistent estimates of all long-run statistics simultaneously.

10.6 Convergence of the Time Average

How quickly does the time average converge to the ergodic limit? We can track the running average as $T$ grows to see the convergence in real time:

running_avg = cumsum(g_long) ./ (1:T_long)

plot(running_avg[1:10_000], linewidth=1.5, color=:steelblue,
     label="Running average of g", title="Ergodic Convergence",
     size=(700, 350), alpha=0.8)
hline!([g_longrun_theory], color=:firebrick, linewidth=2, linestyle=:dash,
       label="Ergodic limit (π'ḡ)")
xlabel!("Simulation length (T)")
ylabel!("Time average of growth rate")

The running average fluctuates considerably in the early periods when the sample is small, but it steadily settles down toward the ergodic limit (dashed red line) as $T$ grows. By $T = 5{,}000$ or so, the running average has essentially stabilized.

10.7 How Long Must the Chain Run?

The ergodic theorem guarantees convergence, but it does not tell us how fast convergence occurs. In practice, this matters: if you need a simulation of length $T = 10{,}000{,}000$ to get a reliable estimate, that is much more costly than if $T = 1{,}000$ suffices.

The convergence rate depends on how quickly the chain “mixes”—that is, how rapidly it forgets its initial state. A chain that switches between states frequently mixes fast, while a chain that lingers in each state for long stretches mixes slowly. The next section makes this precise by connecting the mixing speed to the eigenvalues of $P$.

11 Mixing Time and Convergence Speed

11.1 The Role of Eigenvalues

How fast does $P^k$ converge to the stationary distribution? The answer lies in the eigenvalues of $P$. Every $n \times n$ transition matrix has $n$ eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$. For an irreducible, aperiodic chain, the largest eigenvalue is always $\lambda_1 = 1$ (corresponding to the stationary distribution). All other eigenvalues satisfy $|\lambda_i| < 1$.

The convergence rate is controlled by the second-largest eigenvalue in absolute value, which we denote $|\lambda_2|$. The distance between $P^k$ and the limiting matrix shrinks approximately like $|\lambda_2|^k$:

\[ \| P^k - \mathbf{1}\pi' \| \approx C \cdot |\lambda_2|^k \]

where $\mathbf{1}\pi'$ is the matrix with every row equal to $\pi'$ and $C$ is a constant. The smaller $|\lambda_2|$ is, the faster the convergence.

11.2 Computing Eigenvalues in Julia

Let us compute the eigenvalues of our business cycle transition matrix:

λ = eigvals(P)
println("Eigenvalues of P: ", round.(λ, digits=4))
println("Second-largest eigenvalue: |λ₂| = ", round(abs(λ[1]), digits=4))

Eigenvalues of P: [0.6, 1.0]
Second-largest eigenvalue: |λ₂| = 0.6

λ_sorted = sort(abs.(λ), rev=true)
println("Eigenvalues sorted by magnitude: ", round.(λ_sorted, digits=4))
println("Spectral gap: 1 - |λ₂| = ", round(1 - λ_sorted[2], digits=4))

Eigenvalues sorted by magnitude: [1.0, 0.6]
Spectral gap: 1 - |λ₂| = 0.4

The spectral gap $1 - |\lambda_2|$ measures how much “room” there is between the dominant eigenvalue and the rest. A larger spectral gap means faster mixing.

11.3 Comparing Fast and Slow Mixing

To build intuition, let us compare two extreme parameterizations. A slowly mixing chain has high persistence—entries close to the identity matrix—meaning the economy rarely switches states. A fast mixing chain has entries closer to uniform, meaning switches happen frequently:

P_slow = [0.99 0.01;    # Very persistent: states almost never switch
          0.01 0.99]

P_fast = [0.5 0.5;      # No persistence: uniform transitions
          0.5 0.5]

λ_slow = sort(abs.(eigvals(P_slow)), rev=true)
λ_fast = sort(abs.(eigvals(P_fast)), rev=true)

println("Slow chain: |λ₂| = ", round(λ_slow[2], digits=4), "  spectral gap = ", round(1 - λ_slow[2], digits=4))
println("Fast chain: |λ₂| = ", round(λ_fast[2], digits=4), "  spectral gap = ", round(1 - λ_fast[2], digits=4))

Slow chain: |λ₂| = 0.98  spectral gap = 0.02
Fast chain: |λ₂| = 0.0  spectral gap = 1.0

The slowly mixing chain has $|\lambda_2| = 0.98$, very close to 1, so convergence is glacial. The fast mixing chain has $|\lambda_2| = 0$, so convergence is instant—after just one step, the chain has already reached the stationary distribution.

11.4 Visualizing Convergence Speed

We can see the difference dramatically by plotting how quickly $P^k_{1,1}$ converges to the stationary value for each parameterization:

K = 100
conv_original = [(P^k)[1,1] for k in 1:K]
conv_slow = [(P_slow^k)[1,1] for k in 1:K]
conv_fast = [(P_fast^k)[1,1] for k in 1:K]

plot(1:K, conv_original, linewidth=2, color=:steelblue,
     label="Original (|λ₂| = 0.60)", title="Convergence Speed and Eigenvalues",
     size=(700, 350), ylims=(0.4, 1.05))
plot!(1:K, conv_slow, linewidth=2, color=:firebrick,
      label="Slow mixing (|λ₂| = 0.98)")
plot!(1:K, conv_fast, linewidth=2, color=:green,
      label="Fast mixing (|λ₂| = 0.00)", linestyle=:dash)
xlabel!("Periods (k)")
ylabel!("Pᵏ₁₁")

The fast-mixing chain (green dashed) converges immediately. Our original business cycle model (blue) converges within about 15 periods. The slowly mixing chain (red) barely budges even after 100 periods—it would take hundreds of steps to approach its stationary distribution.

11.5 Economic Interpretation

The second-largest eigenvalue has a direct economic interpretation: it measures the persistence of the Markov chain. A chain with $|\lambda_2|$ close to 1 represents an economy where states are highly persistent—recessions last a long time and expansions last a long time. A chain with $|\lambda_2|$ close to 0 represents an economy where the current state provides little information about the future.

For our business cycle model, $|\lambda_2| = 0.60$ indicates moderate persistence: the economy’s current state is informative about the near future but not the distant future. After about 15 periods, the current state is essentially irrelevant for forecasting.

Practical Implications for Simulation

The mixing time has important consequences for Monte Carlo simulation:

Burn-in period: When simulating a Markov chain, the early observations are influenced by the arbitrary initial state. A common practice is to discard an initial “burn-in” period of several mixing times before collecting data. For a chain with $|\lambda_2| = 0.98$, the burn-in must be very long.
Simulation length: The ergodic theorem guarantees convergence of time averages, but slowly mixing chains require much longer simulations to achieve a given level of accuracy. If $|\lambda_2|$ is close to 1, you may need $T$ in the hundreds of thousands or millions.
Effective sample size: Even with $T$ observations from a slowly mixing chain, the effective number of independent observations is much smaller than $T$ because consecutive observations are highly correlated. The effective sample size is roughly $T \cdot (1 - |\lambda_2|) / (1 + |\lambda_2|)$.

When calibrating economic models, always check the eigenvalues of your transition matrix. If $|\lambda_2|$ is close to 1, plan for longer simulations and larger burn-in periods.

12 Summary

In these notes we covered the theory and Julia implementation of finite state Markov chains:

Discrete random variables are the foundation. They are defined by a vector of possible values $\bar x$ and a probability vector $p$, and can be simulated in Julia using Categorical from the Distributions package.

Finite state Markov chains extend discrete random variables by allowing the probability distribution to depend on the current state through a transition matrix $P$. The entry $P_{i,j}$ gives the probability of moving from state $i$ to state $j$. The Markov property—that only the current state matters, not the history—is what makes these models tractable.

Simulation is straightforward once you have a drawDiscrete function: at each time step, extract the appropriate row of $P$ and draw the next state. Pre-allocating arrays before the simulation loop is important for performance.

Expected values of discrete random variables are dot products: $\mathbb{E}[X] = p \cdot \bar X$. Conditional expectations for Markov chains are also dot products, one for each row of $P$:

\[ \mathbb{E}[X_{t+1} | X_t = \bar X_i] = P_{i,\cdot} \cdot \bar X \]

Stacking these gives the elegant matrix formula $P\bar X$.

Monte Carlo verification provides a way to check analytical formulas: simulate the process many times and compare sample averages to theoretical values. As the number of simulations grows, the empirical and theoretical values converge.

Maximum likelihood estimation gives us a principled way to estimate the transition matrix from observed data. The MLE has an elegant closed form: $\hat{P}_{ij} = n_{ij} / n_{i\cdot}$, the fraction of transitions from state $i$ that go to state $j$. This estimator is consistent and asymptotically efficient.

Numerical optimization with Optim.jl provides a general-purpose alternative when closed-form solutions are unavailable. The pattern of defining a negative log-likelihood, choosing an algorithm, and extracting the minimizer carries over to far more complex models.

Multi-step transition probabilities are computed by matrix powers: $P^k_{i,j}$ gives the probability of going from state $i$ to state $j$ in $k$ steps. As $k \to \infty$, the rows of $P^k$ converge to the stationary distribution.

Stationary distributions satisfy $\pi' P = \pi'$ and represent the long-run fraction of time the chain spends in each state. They can be found as eigenvectors of $P'$ or by solving a linear system. Existence and uniqueness require irreducibility and aperiodicity.

The Ergodic Theorem guarantees that time averages converge to ensemble averages: $\frac{1}{T}\sum f(X_t) \to \sum \pi_i f(\bar x_i)$. This is the theoretical foundation for Monte Carlo simulation with Markov chains.

Key Things to Remember

A Markov chain is defined by values $\bar x$ and a transition matrix $P$.
Row $i$ of $P$ gives the conditional distribution of the next state given current state $i$.
The vector of conditional expectations equals $P\bar X$.
The MLE of $P_{ij}$ is the fraction of transitions from $i$ to $j$: $\hat{P}_{ij} = n_{ij}/n_{i\cdot}$.
The $k$-step transition matrix is $P^k$; as $k \to \infty$ it converges to $\mathbf{1}\pi'$.
The stationary distribution $\pi$ is the left eigenvector of $P$ with eigenvalue 1.
The Ergodic Theorem: time averages equal expectations under $\pi$ for irreducible, aperiodic chains.
The second-largest eigenvalue $|\lambda_2|$ controls convergence speed; check it before running long simulations.
Use Categorical from Distributions to draw from discrete distributions.
Use Optim.jl for numerical MLE when no closed form exists.
Pre-allocate arrays with zeros(T) for efficient simulation.
Monte Carlo simulation is a powerful tool for verifying analytical results.

With these tools in hand, you are ready to apply Markov chains to economic models such as the McCall search model, the Diamond–Mortensen–Pissarides matching model, and the Kydland–Prescott real business cycle model.

1 Introduction

2 Discrete Random Variables

2.1 Definition

2.2 Example: A Coin Toss

2.3 Representing Discrete Random Variables in Julia

3 Finite State Markov Chains

3.1 Definition and the Transition Matrix

3.2 Why Markov Chains? Modeling Persistence

3.3 Business Cycle Example

3.4 Reading the Transition Matrix in Julia

4 Simulating Markov Chains

4.1 The drawDiscrete Function

4.2 The simulateMarkov Function

4.3 Simulating a Markov Chain

4.4 Plotting the Simulation

4.5 Animating the Markov Chain

5 Expectations

5.1 Expected Value of a Discrete Random Variable

5.2 Monte Carlo Verification

6 Conditional Expectations

6.1 Conditional Expectations for Markov Chains

6.2 Conditional Expectations as a Matrix Product

6.3 Verifying with Simulation

6.4 List Comprehensions: A Concise Alternative

7 Estimating the Transition Matrix

7.1 The Estimation Problem

7.2 The Likelihood Function

7.3 The Closed-Form MLE

7.4 Julia Implementation

7.4.1 Counting Transitions

7.4.2 Estimating the Transition Matrix

7.4.3 Testing on Simulated Data

7.4.4 How Sample Size Affects Accuracy

7.5 The Log-Likelihood Function

7.5.1 Computing the Log-Likelihood

7.5.2 Comparing Log-Likelihoods

7.5.3 Visualizing the Likelihood Surface

7.6 Numerical MLE with Optim.jl

7.6.1 Why Numerical Optimization?

7.6.2 Setting Up the Objective Function

7.6.3 Running the Optimizer

7.6.4 Extracting and Comparing Results

8 Multi-Step Transition Probabilities

8.1 The \(k\)-Step Transition Matrix

8.2 Computing Multi-Step Transitions in Julia

8.3 Example: Recession to Expansion in 5 Periods

8.4 Convergence of \(P^k\)

8.5 Verifying Multi-Step Probabilities by Simulation

9 Stationary Distributions

9.1 Definition

9.2 Economic Interpretation

9.3 Finding \(\pi\) as an Eigenvector Problem

9.4 Alternative: Solving the Linear System

9.5 Example: Business Cycle Stationary Distribution

9.6 Verification by Simulation

10 The Ergodic Theorem

10.1 Statement

10.2 Intuition

10.3 The Theoretical Foundation for Monte Carlo Simulation

10.4 Example: Long-Run Average Growth Rate

10.5 Beyond Averages: Long-Run Variance and Recession Probability

10.6 Convergence of the Time Average

10.7 How Long Must the Chain Run?

11 Mixing Time and Convergence Speed

11.1 The Role of Eigenvalues

11.2 Computing Eigenvalues in Julia

11.3 Comparing Fast and Slow Mixing

11.4 Visualizing Convergence Speed

11.5 Economic Interpretation

12 Summary

4.1 The `drawDiscrete` Function

4.2 The `simulateMarkov` Function

7.6 Numerical MLE with `Optim.jl`