Target audience: Intermediate
Estimated reading time: 5'
Sampling sits at the core of data science and analysis. This article explores a category of numerical sampling techniques, known as Markov Chain Monte Carlo, and how they can be implemented via reusable design patterns, using the Metropolis-Hastings model as an example.
What you will learn: How to build modular Python Classes for MCMC with Bridge Pattern: A Metropolis-Hastings Implementation.
Notes:
- Environments: Python 3.10.10, Numpy 1.25.1, Matplotlib 3.7.2
- To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.
- Source code available at Github.com Data_Exploration/Stats
Markov Chain Monte Carlo
Markov Chain Monte Carlo (MCMC) is a class of algorithms used in computational statistics to sample from probability distributions based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The core idea is to use a Markov chain to model the complex probability distribution you're interested in, where the chain's states represent the possible outcomes in the distribution. Over time, as the chain progresses, the states it visits are used to generate samples that are representative of the desired distribution.
MCMC methods are particularly useful in situations where direct sampling is difficult due to the complexity of the distribution or high-dimensional spaces. They are widely used in Bayesian statistics, where one needs to compute posterior distributions that are not analytically tractable [ref 1].
The result of a Markov Chain Monte Carlo (MCMC) procedure is a joint posterior probability distribution, or the combined probability of parameters theta, given certain data and a model/hypothesis, denoted as p(theta | Data, Model).
The generic Naive Bayes rule is defined as \[posterior = \frac{likelihood\ .\ prior}{evidence}\]
In the case of MCMC, the log of the posterior is computed as\[for\ model\ \mathfrak{M},\ a\ sampled\ value\ \theta\ and\ data \ \mathfrak{D}\]
\[log(p(\theta |\mathfrak{D},\mathfrak{M}))=log(p(\mathfrak{D} |\theta,\mathfrak{M})) + log(p(\theta | \mathfrak{M})) - log(p(\mathfrak{D} |\mathfrak{M}))\]
Essentially, MCMC can be understood as a method for estimating a credible or plausible range of model parameters that align with the observed data. The process of fitting a probability distribution emerges as a secondary outcome.
Metropolis-Hastings algorithm
The Metropolis-Hastings (MH) algorithm is one of the most famous MCMC techniques. It allows for sampling from a distribution by generating a sequence of sample points in a way that, even if you can't directly sample from the distribution itself, you can still obtain a set of samples that approximates the distribution after a sufficient number of iterations [ref 2]. Other popular MCMC methods include the Gibbs sampler and Hamiltonian Monte Carlo.
Theory
For this study, we consider a single variable probability distribution.
Generic algorithm:
The procedure at step i is as follows:
1 Sample the proposal distribution q \[(1)\ \ x^{*} \leftarrow q(x_{i}|x_{i-1})\] 2 Compute the probability u of accepting the sample \[(2)\ \ u(\theta^{*} | \theta_{i})=min\left ( 1, \frac{\widetilde{p}(\theta^{*} ).q\left ( \theta_{i} | \theta^{*}\right )}{\widetilde{p}(\theta_{i} ).q\left ( \theta^{*} | \theta_{i}\right )}\right )\]
3 Select the next value \[(3)\ \ IF\ [ urand_{[0, 1]} < u\left (\theta ^{*} | \theta_{i} \right )]\ \theta_{i+1} = \theta^{*}\ \ \ ELSE\ \theta_{i+1} = \theta_{i}\]
Random walk:
This is the case where the proposal distribution is symmetric. Therefore the probability of accepting the sample is simplified as \[u(\theta^{*} | \theta_{i})=min\left ( 1, \frac{\widetilde{p}(\theta^{*} )}{\widetilde{p}(\theta_{i} )}\right )\]
Reusable design pattern
Reusable software design patterns are standardized solutions to common problems in software design and architecture. They represent best practices evolved through collective experience, providing a template or blueprint for tackling recurring design challenges. Design patterns help in making the software development process more efficient and the software itself more maintainable and scalable by offering proven solutions to typical issues [ref 3].
- Creational Patterns deal with object creation mechanism.
- Structural Patterns concern class and object composition to ensure a flexible architecture.
- Behavioral Patterns are all about class's objects interactions.
The bridge pattern is a structural pattern. It aims to separate an abstraction from its implementation, making them independent of each other. In this context, the structure of the proposal distribution is kept separate from the structure of the Markov Chain Monte Carlo, allowing any algorithm to choose any distribution freely.
Bridge pattern for Markov Chain Monte Carlo algorithms
The two abstract classes, ProposalDistribution and MCMC define the signature of method to be overridden by each specialized class.
Proposal distribution implementation
First let's define the signature of the proposal distribution
class ProposalDistribution(object):
from abc import abstractmethod
@abstractmethod
def log_prior(self, theta: float) -> float:
pass
@abstractmethod
def log_likelihood(self, theta: float) -> float:
pass
@abstractmethod
def step(self, theta: float, sigma: float) -> float:
pass
def log_posterior(self, theta: float, prior: float) -> float:
pass
Beta distribution:
The beta distribution is most appropriate for discrete distributions
The log likelihood for the Beta distribution is \[log(\mathfrak{L})=log\left ( \binom{N}{x} h^{\theta} h^{N-\theta} \right )\]
class ProposalBeta(ProposalDistribution):
pi_2_inv = np.sqrt(2 * np.pi)
def __init__(self, alpha: int, beta: int, num_trials: int, h: int):
self.alpha = alpha
self.beta = beta
self.num_trials = num_trials
self.h = h
def log_prior(self, theta: float) -> float:
x = stats.beta(self.alpha, self.beta).pdf(theta)
return x if x > 0.0 else 1e-5
def log_likelihood(self, theta: float) -> float:
return math.log(stats.binom(self.num_trials, theta).pmf(self.h))
def step(self, theta: float, sigma: float) -> float:
return theta + stats.norm(0.0, sigma).rvs()
def log_posterior(self, theta: float, prior: float) -> float:
return self.log_likelihood(theta) + np.log(prior)
Normal distribution:
The log likelihood for the normal distribution is \[log(\mathfrak{L})=-\frac{N}{2}log(2\pi)-\frac{1}{2}\sum_{i=1}^{N}\left [ 2.log(\sigma_i) -{\frac{\theta_i^2}{\sigma_i^2}} \right ]\]
Metropolis-Hastings implementation
First, let's outline the signature or interface common, MCMC, to all Markov Chain Monte Carlo algorithms. The method 'sample' accepts the most recently sampled value, and produces a new sample.
class MCMC(object):
@abstractmethod
def sample(self, theta: float) -> float:
pass
The MetropolisHastings class encapsulates the implementation of the Metropolis-Hastings (MH) algorithm. Its constructor sets up the algorithm's parameters:
- proposal: The proposal distribution.
- num_iterations: The total number of iterations or samples to be gathered.
- burn_in_ratio: The proportion of samples allocated to the burn-in phase.
- sigma_delta: Variance of Normal distribution used as step size for computing next sample.
Note: It's important to note that the initial or seed value for the samples may not be precise. In fact, it's anticipated that the early samples generated before the algorithm stabilizes are unreliable and ought to be disregarded [ref 4]. These discarded samples are commonly known as the burn-in phase.
class MetropolisHastings(MCMC):
def __init__(self,
proposal: ProposalDistribution,
num_iterations: int,
burn_in_ratio: float,
sigma_delta: float = 0.2):
self.proposed_distribution = proposed_distribution
self.num_iterations = num_iterations
self.sigma_delta = sigma_delta
self.burn_ins = int(num_iterations*burn_in_ratio)
def sample(self, theta_0: float) -> (np.array, float):
The sampling method, sample, initiates the process with an initial, approximate sample. It outputs a tuple consisting of the samples gathered post-burn-in phase and the acceptance rate.
def sample(self, theta_0: float) -> (np.array, float):
num_valid_thetas = self.num_iterations - self.burn_ins. #1
theta_walk = np.zeros(num_valid_thetas)
accepted_count = 0
theta = theta_0
theta_walk[0] = theta #2
j = 0
for i in range(self.num_iterations):
theta_star = self.proposal_dist.step(theta, self.sigma_delta) #3
try:
cur_prior = self.proposal_dist.log_prior(theta) #4
new_prior = self.proposal_dist.log_prior(theta_star)
if cur_prior > 0.0 and new_prior > 0.0:
cur_log_posterior = self.proposal_dist.posterior(theta, cur_prior) #5
new_log_posterior = self.proposal_dist.posterior(theta_star, new_prior)
# 6 Apply the acceptance/rejection criteria
if MetropolisHastings.__acceptance_rule(cur_log_posterior, new_log_posterior):
theta = theta_star
if i > self.burn_ins: #7
accepted_count += 1
theta_walk[j + 1] = theta_star
j += 1
else:
if i > self.burn_ins:
theta_walk[j + 1] = theta_walk[j]
j += 1
except ArithmeticError as e:
print(f'Error {e}')
return theta_walk, float(accepted_count) / num_valid_thetas
Theta values to be recorded must omit those sampled during the initial burn-in phase (1). The first sample essentially serves as the starting point for Theta (2). At every step, the process involves:
- Calculating the subsequent Theta value with the stepping function tied to the proposal distribution (3).
- Determining the logarithm of the prior for both the last and the proposed, current sample (4).
- Calculating the posterior by applying the logarithm to the Naive Bayes formula (5).
- The new posterior probability undergoes the acceptance criteria test (6).
- Provided the new sample isn't part of the burn-in phase, it is then recorded (7).
The acceptance probability, denoted as u, is determined by calculating the logarithms of the prior, likelihood, and posterior probability. Consequently, the acceptance test outlined in equation (3) necessitates transforming the difference (proposed theta minus previous theta) into an exponential form.
@staticmethod
def __acceptance_rule(current: float, new: float) -> bool:
residual = new - current
return True if new > current else np.random.uniform(0, 1) < np.exp(residual)
Evaluation
The goal is to assess how different elements influence both the acceptance rate and the results produced by the Metropolis-Hastings (MH) algorithm. The elements under consideration include:
- Num iterations: The total number of iterations performed in the Monte Carlo simulation.
- Proposal: Utilizes the Beta distribution with shapes 4 and 6 as the proposal distribution. The evaluation focuses solely on the success rate of the Binomial distribution applied in calculating the likelihood.
- Initial Theta: The initially sampled value for the theta variable.
- Theta step: The increment size used when sampling the subsequent value of theta.
Num iterations | Proposal | Initial Theta | Theta step | Acceptance rate |
5000 | 4,6,0.5 | 0.5 | 0.10 | 0.480 |
5000 | 4,6,0.5 | 0.5 | 0.95 | 0.061 |
20000 | 4,6,0.5 | 0.5 | 0.95 | 0.077 |
1000 | 4,6,0.5 | 0.5 | 0.95 | 0.052 |
12000 | 4,6,0.5 | 0.1 | 0.50 | 0.121 |
12000 | 4,6,0.5 | 0.9 | 0.50 | 0.134 |
12000 | 4,6,0.5 | 0.5 | 0.50 | 0.125 |
12000 | 4,6,0.9 | 0.5 | 0.50 | 0.080 |
12000 | 4,6,0.1 | 0.5 | 0.50 | 0.079 |
Impact of proposal distribution
In this first test we evaluate the impact of the proposal distribution on the prediction of the output theta. We consider 5000 iterations with no burn-in, initial value of 0.8 and compare two proposal distributions
- Beta(4, 6) with success rate 0.88
- Beta(4, 6) with success rate 0.12
Impact of step size
This test evaluates the impact of the step size for computing the next sample as: \[\theta^{*} = \theta_{i-1}+ N(0, \triangle \sigma)\]
Output MH with Beta(4,6) proposal 20% burn-in phase, initial value 0.8, delta sigma 0.05
Output MH with Beta(4,6) proposal 20% burn-in phase, initial value 0.8, delta sigma 0.95
Impact of initial value
This second test evaluates the impact of the initial value on the output of the MH algorithm. We use a 10% burn-in ratio with 12,000 iteration and a Beta(4, 2) proposal distribution with a 16% success rate.
Limitations
The primary challenges associated with the Metropolis-Hastings algorithm are computational in nature. When proposals are generated randomly, it often requires a significant number of iterations to reach regions of higher (or more probable) posterior densities. Moreover, even the more efficient versions of the Metropolis-Hastings algorithm may accept fewer than 25% of the proposed samples.
Finally, the Metropolis-Hastings algorithm dependents heavily on the selection of the proposal distribution. An inappropriate choice for the proposal distribution can result in suboptimal performance of the algorithm.
References
- [1] Pattern Recognition and Machine Learning [sec 11.2] - C. Bishop - Springer 2006
- [2] Bayesian Stats - Markov Chain Monte Carlo
- [3] Design Patterns: Element of Reusable Object-Oriented Software; 151-161 - E. Gamma, R. Helm, R. Johnson, J. Vlissides - Addison-Wesley 1995
- [4] Machine Learning: A Probabilistic Perspective [sec 24.4.1] - K. Murphy - MIT Press 2012
- Github.com Data_Exploration/Stats
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning.
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning", Packt Publishing ISBN 978-1-78712-238-3