Friday, December 1, 2023

Modular Markov Chain Monte Carlo in Python

Target audience: Intermediate
Estimated reading time: 5'

Sampling sits at the core of data science and analysis. This article explores a category of numerical sampling techniques, known as Markov Chain Monte Carlo, and how they can be implemented via reusable design patterns, using the Metropolis-Hastings algorithm as an example.
Table of contents
  • Markov Chain Monte Carlo
  • Metropolis-Hastings algorithm
  • Reusable design pattern
  • Bridge pattern for Markov Chain Monte Carlo algorithms
  • Evaluation
  • Limitations
What you will learn: How to build modular Python classes for MCMC using the bridge pattern, illustrated with a Metropolis-Hastings implementation.

Notes:
  • Environments: Python 3.10.10, Numpy 1.25.1, Matplotlib 3.7.2
  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.
  • Source code available at Github.com Data_Exploration/Stats

Markov Chain Monte Carlo 

Markov Chain Monte Carlo (MCMC) is a class of algorithms used in computational statistics to sample from probability distributions based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The core idea is to use a Markov chain to model the complex probability distribution you're interested in, where the chain's states represent the possible outcomes in the distribution. Over time, as the chain progresses, the states it visits are used to generate samples that are representative of the desired distribution.

MCMC methods are particularly useful in situations where direct sampling is difficult due to the complexity of the distribution or high-dimensional spaces. They are widely used in Bayesian statistics, where one needs to compute posterior distributions that are not analytically tractable [ref 1].


The result of a Markov Chain Monte Carlo (MCMC) procedure is a joint posterior probability distribution: the probability of parameters theta given the data and a model/hypothesis, denoted as p(theta | Data, Model). 
The generic Bayes rule is defined as \[posterior = \frac{likelihood\ \cdot\ prior}{evidence}\]
In the case of MCMC, the log of the posterior is computed as
\[log(p(\theta |\mathfrak{D},\mathfrak{M}))=log(p(\mathfrak{D} |\theta,\mathfrak{M})) + log(p(\theta | \mathfrak{M})) - log(p(\mathfrak{D} |\mathfrak{M}))\]
\[for\ model\ \mathfrak{M},\ a\ sampled\ value\ \theta\ and\ data \ \mathfrak{D}\]
Essentially, MCMC can be understood as a method for estimating a credible or plausible range of model parameters that align with the observed data. The process of fitting a probability distribution emerges as a secondary outcome.
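To make the formula concrete, here is a minimal sketch (the helper function below is ours, not part of the implementation presented later) that computes the un-normalized log-posterior for a Beta prior and a Binomial likelihood. Note that the evidence term log(p(D|M)) is constant with respect to theta, so it cancels in the Metropolis-Hastings acceptance ratio and can be omitted.

import math
from scipy import stats

def log_posterior_beta_binomial(theta: float, alpha: int, beta: int, n: int, h: int) -> float:
    # Log of the Beta(alpha, beta) prior density evaluated at theta
    log_prior = math.log(stats.beta(alpha, beta).pdf(theta))
    # Log of the Binomial likelihood of h successes in n trials
    log_likelihood = math.log(stats.binom(n, theta).pmf(h))
    # The evidence log(p(D|M)) is omitted: it cancels in the acceptance ratio
    return log_prior + log_likelihood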

Metropolis-Hastings algorithm

The Metropolis-Hastings (MH) algorithm is one of the most famous MCMC techniques. It allows for sampling from a distribution by generating a sequence of sample points in a way that, even if you can't directly sample from the distribution itself, you can still obtain a set of samples that approximates the distribution after a sufficient number of iterations [ref 2]. Other popular MCMC methods include the Gibbs sampler and Hamiltonian Monte Carlo.

Theory

For this study, we consider a single-variable probability distribution.

Generic algorithm:
The procedure at step i is as follows:
1. Sample the proposal distribution q \[(1)\ \  \theta^{*} \leftarrow q(\theta|\theta_{i})\]
2. Compute the probability u of accepting the sample \[(2)\ \  u(\theta^{*} | \theta_{i})=min\left ( 1, \frac{\widetilde{p}(\theta^{*} )\ .\ q\left ( \theta_{i} | \theta^{*}\right )}{\widetilde{p}(\theta_{i} )\ .\ q\left ( \theta^{*} | \theta_{i}\right )}\right )\]
3. Select the next value \[(3)\ \  IF\ [ urand_{[0, 1]} < u\left (\theta ^{*} | \theta_{i} \right )]\ \theta_{i+1} = \theta^{*}\ \ \ ELSE\ \theta_{i+1} = \theta_{i}\]

Random walk:
This is the case where the proposal distribution is symmetric. Therefore the probability of accepting the sample is simplified as \[u(\theta^{*} | \theta_{i})=min\left ( 1, \frac{\widetilde{p}(\theta^{*} )}{\widetilde{p}(\theta_{i} )}\right )\]
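As a minimal sketch of a single random-walk iteration (the function and argument names below are illustrative, not the classes introduced later), where p_tilde is the un-normalized target density:

import numpy as np

def random_walk_mh_step(theta: float, p_tilde, sigma: float) -> float:
    # Symmetric Gaussian proposal: q(theta*|theta) = q(theta|theta*)
    theta_star = theta + np.random.normal(0.0, sigma)
    # The q terms cancel, leaving the ratio of un-normalized densities
    u = min(1.0, p_tilde(theta_star) / p_tilde(theta))
    # Accept theta* with probability u; otherwise keep the current sample
    return theta_star if np.random.uniform(0.0, 1.0) < u else theta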

Reusable design pattern

Reusable software design patterns are standardized solutions to common problems in software design and architecture. They represent best practices evolved through collective experience, providing a template or blueprint for tackling recurring design challenges. Design patterns help in making the software development process more efficient and the software itself more maintainable and scalable by offering proven solutions to typical issues [ref 3]. They fall into three categories:
  • Creational Patterns deal with object creation mechanisms.
  • Structural Patterns concern class and object composition to ensure a flexible architecture.
  • Behavioral Patterns deal with the interactions and responsibilities among objects.

The bridge pattern is a structural pattern. It aims to separate an abstraction from its implementation, making them independent of each other. In this context, the structure of the proposal distribution is kept separate from the structure of the Markov Chain Monte Carlo, allowing any algorithm to choose any distribution freely.
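In its generic form, the bridge pattern has an abstraction that delegates to an implementor it references but does not own. The skeleton below (illustrative names only) shows the shape that the MCMC and ProposalDistribution classes in the next sections will follow:

class Implementor:                  # plays the role of ProposalDistribution
    def operation_impl(self) -> str:
        return 'concrete behavior'

class Abstraction:                  # plays the role of MCMC
    def __init__(self, implementor: Implementor):
        # The abstraction holds a reference to an interchangeable implementor
        self.implementor = implementor

    def operation(self) -> str:
        return self.implementor.operation_impl()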

Bridge pattern for Markov Chain Monte Carlo algorithms

The two abstract classes, ProposalDistribution and MCMC, define the signatures of the methods to be overridden by each specialized class.

Proposal distribution implementation

First, let's define the signature of the proposal distribution:

from abc import ABC, abstractmethod

class ProposalDistribution(ABC):

    @abstractmethod
    def log_prior(self, theta: float) -> float:
        pass

    @abstractmethod
    def log_likelihood(self, theta: float) -> float:
        pass

    @abstractmethod
    def step(self, theta: float, sigma: float) -> float:
        pass

    @abstractmethod
    def log_posterior(self, theta: float, prior: float) -> float:
        pass


Beta distribution:
The Beta distribution, defined over the interval [0, 1], is the conjugate prior for discrete distributions such as the Binomial, which makes it an appropriate choice of prior here.
The log-likelihood of h successes out of N Binomial trials is \[log(\mathfrak{L})=log\left ( \binom{N}{h}\ \theta^{h}\ (1-\theta)^{N-h} \right )\]
class ProposalBeta(ProposalDistribution):

    def __init__(self, alpha: int, beta: int, num_trials: int, h: int):
        self.alpha = alpha              # Shape parameters of the Beta prior
        self.beta = beta
        self.num_trials = num_trials    # Number of Binomial trials N
        self.h = h                      # Number of successes

    def log_prior(self, theta: float) -> float:
        # Returns the Beta prior density at theta, floored to avoid zero values;
        # the logarithm is applied in log_posterior
        x = stats.beta(self.alpha, self.beta).pdf(theta)
        return x if x > 0.0 else 1e-5

    def log_likelihood(self, theta: float) -> float:
        # Log of the Binomial probability of h successes in num_trials trials
        return math.log(stats.binom(self.num_trials, theta).pmf(self.h))

    def step(self, theta: float, sigma: float) -> float:
        # Random-walk step drawn from a Normal(0, sigma) distribution
        return theta + stats.norm(0.0, sigma).rvs()

    def log_posterior(self, theta: float, prior: float) -> float:
        return self.log_likelihood(theta) + np.log(prior)


Normal distribution:
The log likelihood for the Normal distribution is \[log(\mathfrak{L})=-\frac{N}{2}log(2\pi)-\frac{1}{2}\sum_{i=1}^{N}\left [ 2\ log(\sigma_i) +{\frac{\theta_i^2}{\sigma_i^2}} \right ]\]
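The original post does not list a concrete class for this case, but a plausible sketch, assuming unit-variance observations and a Normal prior (the class and attribute names are ours), follows the same contract as ProposalBeta:

import numpy as np
from scipy import stats

class ProposalNormal(ProposalDistribution):

    def __init__(self, mu: float, data: np.array):
        self.mu = mu            # Mean of the Normal prior
        self.data = data        # Observations, assumed to have unit variance

    def log_prior(self, theta: float) -> float:
        # Same convention as ProposalBeta: returns the prior density value,
        # floored to avoid zeros; the logarithm is applied in log_posterior
        x = stats.norm(self.mu, 1.0).pdf(theta)
        return x if x > 0.0 else 1e-5

    def log_likelihood(self, theta: float) -> float:
        # Gaussian log-likelihood with unit variance:
        # -N/2 log(2 pi) - 1/2 sum (x_i - theta)^2
        n = len(self.data)
        return -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((self.data - theta) ** 2)

    def step(self, theta: float, sigma: float) -> float:
        return theta + stats.norm(0.0, sigma).rvs()

    def log_posterior(self, theta: float, prior: float) -> float:
        return self.log_likelihood(theta) + np.log(prior)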

Metropolis-Hastings implementation

First, let's outline the interface, MCMC, common to all Markov Chain Monte Carlo algorithms. The method sample accepts the most recently sampled value and produces a new sample.
class MCMC(object):

    @abstractmethod
    def sample(self, theta: float) -> float:
        pass

The MetropolisHastings class encapsulates the implementation of the Metropolis-Hastings (MH) algorithm. Its constructor sets up the algorithm's parameters:
  • proposal: The proposal distribution.
  • num_iterations: The total number of iterations or samples to be gathered.
  • burn_in_ratio: The proportion of samples allocated to the burn-in phase.
  • sigma_delta: Standard deviation of the Normal distribution used as the step size for computing the next sample.
Note: The initial, or seed, value for the samples may not be accurate. The early samples generated before the algorithm stabilizes are expected to be unreliable and ought to be discarded [ref 4]. These discarded samples are commonly known as the burn-in phase.

class MetropolisHastings(MCMC):

    def __init__(self,
                 proposal: ProposalDistribution,
                 num_iterations: int,
                 burn_in_ratio: float,
                 sigma_delta: float = 0.2):

        self.proposal_dist = proposal
        self.num_iterations = num_iterations
        self.sigma_delta = sigma_delta
        self.burn_ins = int(num_iterations * burn_in_ratio)

    def sample(self, theta_0: float) -> (np.array, float):

The sampling method, sample, initiates the process with an initial, approximate sample. It outputs a tuple consisting of the samples gathered post-burn-in phase and the acceptance rate.
def sample(self, theta_0: float) -> (np.array, float):
    num_valid_thetas = self.num_iterations - self.burn_ins   # 1
    theta_walk = np.zeros(num_valid_thetas)
    accepted_count = 0
    theta = theta_0
    theta_walk[0] = theta       # 2

    j = 0
    for i in range(self.num_iterations):
        theta_star = self.proposal_dist.step(theta, self.sigma_delta)   # 3

        try:
            cur_prior = self.proposal_dist.log_prior(theta)             # 4
            new_prior = self.proposal_dist.log_prior(theta_star)

            if cur_prior > 0.0 and new_prior > 0.0:
                cur_log_posterior = self.proposal_dist.log_posterior(theta, cur_prior)        # 5
                new_log_posterior = self.proposal_dist.log_posterior(theta_star, new_prior)

                # 6 Apply the acceptance/rejection criteria
                if MetropolisHastings.__acceptance_rule(cur_log_posterior, new_log_posterior):
                    theta = theta_star
                    if i > self.burn_ins:   # 7
                        accepted_count += 1
                        theta_walk[j + 1] = theta_star
                        j += 1
                else:
                    if i > self.burn_ins:
                        theta_walk[j + 1] = theta_walk[j]
                        j += 1
        except ArithmeticError as e:
            print(f'Error {e}')

    return theta_walk, float(accepted_count) / num_valid_thetas


The theta values to be recorded must exclude those sampled during the initial burn-in phase (1). The first sample serves as the starting point of the walk (2). At every step, the process involves:
  1. Computing the next candidate theta value with the stepping function tied to the proposal distribution (3).
  2. Computing the logarithm of the prior for both the current and the proposed sample (4).
  3. Computing the log-posterior by applying the logarithm to the Bayes formula (5).
  4. Submitting the new posterior probability to the acceptance criteria test (6).
  5. Recording the new sample, provided it is not part of the burn-in phase (7).
The acceptance probability u is computed from the logarithms of the prior, likelihood, and posterior probability. Consequently, the acceptance test outlined in equation (3) requires exponentiating the difference between the proposed and previous log-posteriors.
@staticmethod
def __acceptance_rule(current: float, new: float) -> bool:
    # Always accept a proposal with a higher log-posterior; otherwise accept it
    # with probability exp(new - current), the ratio of posterior probabilities
    residual = new - current
    return True if new > current else np.random.uniform(0, 1) < np.exp(residual)
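The bridge now pays off: any proposal distribution can be paired with any MCMC implementation without changing either class. A short usage sketch (the parameter values are arbitrary, chosen only for illustration):

proposal = ProposalBeta(alpha=4, beta=6, num_trials=100, h=50)
mh = MetropolisHastings(proposal, num_iterations=5000, burn_in_ratio=0.1)

samples, acceptance_rate = mh.sample(theta_0=0.5)
print(f'Acceptance rate: {acceptance_rate:.3f}')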


Evaluation

The goal is to assess how different elements influence both the acceptance rate and the results produced by the Metropolis-Hastings (MH) algorithm. The elements under consideration include:
  • Num iterations: The total number of iterations performed in the Monte Carlo simulation.
  • Proposal: Utilizes the Beta distribution with shape parameters 4 and 6 as the proposal distribution. The evaluation varies only the success rate of the Binomial distribution used to compute the likelihood.
  • Initial Theta: The initially sampled value for the theta variable.
  • Theta step: The increment size used when sampling the subsequent value of theta.

Num iterations   Proposal (alpha, beta, success rate)   Initial theta   Theta step   Acceptance rate
 5,000           4, 6, 0.5                              0.5             0.10         0.480
 5,000           4, 6, 0.5                              0.5             0.95         0.061
20,000           4, 6, 0.5                              0.5             0.95         0.077
 1,000           4, 6, 0.5                              0.5             0.95         0.052
12,000           4, 6, 0.5                              0.1             0.50         0.121
12,000           4, 6, 0.5                              0.9             0.50         0.134
12,000           4, 6, 0.5                              0.5             0.50         0.125
12,000           4, 6, 0.9                              0.5             0.50         0.080
12,000           4, 6, 0.1                              0.5             0.50         0.079
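A sweep of this kind can be scripted as in the sketch below. The configurations are taken from the first rows of the table; the burn-in ratio and trial count are our assumptions (num_trials=100, h=50 matches the 0.5 success rate), and the exact acceptance rates will vary from run to run.

proposal = ProposalBeta(alpha=4, beta=6, num_trials=100, h=50)   # success rate 0.5

for num_iterations, theta_0, sigma_delta in [(5000, 0.5, 0.10), (5000, 0.5, 0.95), (20000, 0.5, 0.95)]:
    mh = MetropolisHastings(proposal, num_iterations, burn_in_ratio=0.1, sigma_delta=sigma_delta)
    _, acceptance_rate = mh.sample(theta_0)
    print(f'{num_iterations} iterations, step={sigma_delta}: acceptance rate={acceptance_rate:.3f}')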


Impact of proposal distribution
In this first test, we evaluate the impact of the proposal distribution on the prediction of the output theta. We consider 5,000 iterations with no burn-in, an initial value of 0.8, and compare two proposal distributions:
  • Beta(4, 6) with success rate 0.88
  • Beta(4, 6) with success rate 0.12
[Figure: Output of MH with Beta(4,6) proposal, success rate 0.88, no burn-in, initial value 0.8]
[Figure: Output of MH with Beta(4,6) proposal, success rate 0.12, no burn-in, initial value 0.8]

Impact of step size
This test evaluates the impact of the step size used to compute the next sample: \[\theta^{*} = \theta_{i-1}+ N(0, \Delta \sigma)\]
[Figure: Output of MH with Beta(4,6) proposal, 20% burn-in, initial value 0.8, delta sigma 0.05]
[Figure: Output of MH with Beta(4,6) proposal, 20% burn-in, initial value 0.8, delta sigma 0.95]


Impact of initial value
This test evaluates the impact of the initial value on the output of the MH algorithm. We use a 10% burn-in ratio with 12,000 iterations and a Beta(4, 2) proposal distribution with a 16% success rate.
[Figure: Output of MH with Beta(4,2) proposal, success rate 0.16, 10% burn-in, initial value 0.1]
[Figure: Output of MH with Beta(4,2) proposal, success rate 0.16, 10% burn-in, initial value 0.1]


Limitations

The primary challenges associated with the Metropolis-Hastings algorithm are computational in nature. When proposals are generated randomly, it often requires a significant number of iterations to reach regions of higher (or more probable) posterior densities. Moreover, even the more efficient versions of the Metropolis-Hastings algorithm may accept fewer than 25% of the proposed samples.

Finally, the Metropolis-Hastings algorithm depends heavily on the selection of the proposal distribution. An inappropriate choice of proposal distribution can result in suboptimal performance of the algorithm.


References

  • [1] Pattern Recognition and Machine Learning, sec 11.2 - C. Bishop - Springer, 2006
  • [2] Bayesian Stats - Markov Chain Monte Carlo
  • [3] Design Patterns: Elements of Reusable Object-Oriented Software, pp 151-161 - E. Gamma, R. Helm, R. Johnson, J. Vlissides - Addison-Wesley, 1995
  • [4] Machine Learning: A Probabilistic Perspective, sec 24.4.1 - K. Murphy - MIT Press, 2012
  • Github.com Data_Exploration/Stats


Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning", Packt Publishing ISBN 978-1-78712-238-3
