Sunday, October 20, 2019

Contextual Thompson Sampling

Target audience: Advanced
Estimated reading time: 5'

In this article, we delve into the notion of the multi-armed bandit, also known as the K-armed bandit, with a particular focus on Thompson sampling as applied to the contextual bandit.


What you will learn: How to add contextual information as a prior to the Thompson sampling algorithm.

Multi-armed bandit problem

The multi-armed bandit problem is a well-known reinforcement learning technique, widely used for its simplicity. Consider a slot machine with n arms (bandits), each arm having its own biased probability distribution of success. In its simplest form, pulling any one of the arms gives you a reward of either 1 for success or 0 for failure.

Our objective is to pull the arms sequentially so as to maximize the total reward collected in the long run.
Data scientists have developed several solutions to the multi-armed bandit problem, of which the 3 most common algorithms are:
  • Epsilon-Greedy
  • Upper confidence bounds
  • Thompson sampling

Epsilon-greedy

The Epsilon-greedy algorithm stands as the most straightforward method for navigating the exploration-exploitation dilemma. In the exploitation phase, it consistently chooses the lever with the highest recorded payoff. Nevertheless, occasionally (with a frequency of ε, where ε is less than 0.1), it opts for a random lever to 'explore'—this allows for the investigation of levers whose payouts are not well-known. Levers with established or confirmed high payouts are selected the majority of the time, specifically (1-ε) of the instances.
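The following minimal Scala sketch illustrates the idea; the class and its fields (counts, values, epsilon) are hypothetical names introduced for this example and not part of any library.

import scala.util.Random

// Epsilon-greedy selection: 'values' holds the running average reward per arm
final class EpsilonGreedy(numArms: Int, epsilon: Double) {
  private val counts = Array.fill(numArms)(0)
  private val values = Array.fill(numArms)(0.0)

  // Explore a random arm with probability epsilon, otherwise exploit the best arm so far
  def selectArm: Int =
    if (Random.nextDouble < epsilon) Random.nextInt(numArms)
    else values.indices.maxBy(values(_))

  // Update the running average reward of the arm that was pulled
  def update(arm: Int, reward: Double): Unit = {
    counts(arm) += 1
    values(arm) += (reward - values(arm)) / counts(arm)
  }
}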

Upper Confidence Bound (UCB)

The Upper Confidence Bound (UCB) algorithm is arguably the most popular approach to solving the multi-armed bandit problem. It is occasionally described by the principle of 'Optimism in the Face of Uncertainty': It operates under the assumption that the uncertain average rewards of each option will be on the higher end, according to past outcomes.
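As an illustration, here is a minimal Scala sketch of the UCB1 variant; the exploration bonus sqrt(2.ln(t)/n) is the standard UCB1 term, and all names are introduced for this example only.

// UCB1 sketch: pull each arm once, then play the arm maximizing the
// empirical mean reward plus an optimistic exploration bonus
final class UCB1(numArms: Int) {
  private val counts = Array.fill(numArms)(0)
  private val values = Array.fill(numArms)(0.0)
  private var t = 0

  def selectArm: Int = counts.indexOf(0) match {
    case -1 =>
      values.indices.maxBy(j => values(j) + math.sqrt(2.0*math.log(t)/counts(j)))
    case untried => untried   // Play any arm that has never been tried
  }

  def update(arm: Int, reward: Double): Unit = {
    t += 1
    counts(arm) += 1
    values(arm) += (reward - values(arm)) / counts(arm)
  }
}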

Thompson sampling

The Thompson sampling algorithm is essentially a Bayesian optimization method. Its central tenet, known as the probability matching strategy, can be distilled to the concept of 'selecting a lever based on its likelihood of being the optimal choice.' This means the frequency of choosing a specific lever should correspond to its actual chances of being the best lever available.
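For the simple Bernoulli-reward case, the probability matching strategy reduces to sampling from a Beta posterior per arm, as in the sketch below. It relies on the Apache Commons Math BetaDistribution; class and field names are illustrative only.

import org.apache.commons.math3.distribution.BetaDistribution

// Beta-Bernoulli Thompson sampling: each arm keeps a Beta(successes+1, failures+1)
// posterior; draw one sample per arm and play the arm with the largest draw
final class ThompsonSampler(numArms: Int) {
  private val successes = Array.fill(numArms)(0)
  private val failures = Array.fill(numArms)(0)

  def selectArm: Int =
    (0 until numArms).maxBy(j =>
      new BetaDistribution(successes(j) + 1.0, failures(j) + 1.0).sample)

  def update(arm: Int, reward: Int): Unit =
    if (reward > 0) successes(arm) += 1 else failures(arm) += 1
}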

The following diagram illustrates the difference between the upper confidence bound and Thompson sampling approaches.


Fig. 1 Confidence Bounds for Multi-Armed bandit problem

Context anyone?

The methods discussed previously do not presume any specifics about the environment or the agent that is choosing the options. There are two scenarios:
  • Context-Free Bandit: In this scenario, the choice of the option with the highest reward is based entirely on historical data, which includes prior outcomes and the record of rewards (both successes and failures). This data can be represented using a probability distribution like the Bernoulli distribution. For example, the chance of rolling a '6' with a die is not influenced by who is rolling the die.
  • Contextual Bandit: Here, the current state of the environment, also known as the context, plays a role in the decision-making process for selecting the option with the highest expected reward. The algorithm takes into account a new context (state), makes a choice (selects an arm), and then observes the result (reward). For instance, in online advertising, the choice of which ad banner to display on a website is influenced by the historical reward data linked to the user's demographic information.
All of the strategies suited to the multi-armed bandit problem can be adapted for use with or without context. The following sections will concentrate on the problem of the contextual multi-armed bandit.

Contextual Thompson sampling (CTS)

Let's dive into the key characteristics of Thompson sampling:
  • We assume a prior distribution on the parameters of the (unknown) reward distribution of each arm.
  • At each step t, the arm is selected according to its posterior probability of being the optimal arm.
The components of contextual Thompson sampling are:
1. Model of parameters w
2. A prior distribution p(w) over the model parameters
3. History H consisting of past contexts x and rewards r
4. Likelihood p(r|x, w) of a reward r given a context x and parameters w
5. Posterior distribution computed using Bayes' rule \[p\left( \tilde{w} | H\right ) = \frac{p\left( H | \tilde{w} \right ) p(\tilde{w})}{p(H)}\]

But how can we model a context?

The process that selects the most rewarding arm is in fact a predictor, or learner. The predictor takes the context, defined as a vector of features, and predicts which arm will generate the highest reward.
The predictor is a model that can be defined as one of the following:
- Linear model
- Generalized linear model (e.g. logistic regression)
- Neural network

The algorithm implements a linear model (weights w) that estimates the reward for a context x as \[w^{T}.x\]
The ultimate objective of any reinforcement learning model is to extract a policy, which quantifies the behavior of the agent. Given a set X of contexts x_i and a set A of arms, the policy π is defined by the selection of an arm given a context x:
\[\pi : X\rightarrow A\]
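In Scala, such a policy can be sketched as a simple trait; the linear implementation below follows the w^T.x scoring above, with all names chosen for illustration only.

// A policy maps the per-arm context vectors to the index of the arm to play
trait Policy {
  def selectArm(armContexts: Array[Array[Double]]): Int
}

// A linear policy plays the arm j that maximizes the score w.x(j)
final class LinearPolicy(w: Array[Double]) extends Policy {
  override def selectArm(armContexts: Array[Array[Double]]): Int =
    armContexts.indices.maxBy { j =>
      w.zip(armContexts(j)).map { case (wi, xi) => wi * xi }.sum
    }
}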
Variables initialization \[V_{0} = \lambda I_{d},\ \ \hat{w}_{0}=0,\ \ b_{0}=0\]
Iteration
FOR t = 1, ... T \[\begin{matrix} 1: \ \ \tilde{w}_{t} \sim \mathbb{N}(\hat{w}_{t},\ v^{2}\ V_{t}^{-1}) \\ 2: \ \ a_{t}^{*} = \arg \max _{j}\ x_{j,t}^{T}\ \hat{w}_{t} \\ 3: \ \ a_{t} = \arg \max _{j}\ x_{j,t}^{T}\ \tilde{w}_{t} \\ 4: \ \ \text{observe the reward } r_{t} \\ 5: \ \ V_{t+1} = V_{t} + x_{a_{t},t}\ x_{a_{t},t}^{T} \\ 6: \ \ b_{t+1} = b_{t} + x_{a_{t},t}\ r_{t} \\ 7: \ \ \hat{w}_{t+1} = V_{t+1}^{-1}\ b_{t+1} \\ 8: \ \ R_{t} = x_{a^{*},t}^{T}\ \tilde{w}_{t} - x_{a_{t},t}^{T}\ \tilde{w}_{t} \end{matrix}\]
The sampling of the normal distribution (line 1) is described in detail in the post Multivariate Normal Sampling with Scala. The algorithm samples the candidate weights from the normal distribution (line 1), computes the arm with the highest estimated reward under the current weight estimate (line 2), and plays the arm with the highest sampled reward (line 3).
The parameters V and b are updated with the observed context and reward (lines 5 and 6), the weight estimate is recomputed (line 7), and the regret Rt is evaluated (line 8).
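As a minimal sketch (not a reference implementation), the selection and the update steps 5 to 7 can be written with ND4j as follows. The class name, the lambda parameter and the matrix inversion through InvertMatrix are assumptions introduced for this illustration; the sampling of the weights (line 1) would be delegated to a multivariate normal sampler such as the one described in the next post.

import org.nd4j.linalg.api.ndarray.INDArray
import org.nd4j.linalg.factory.Nd4j
import org.nd4j.linalg.inverse.InvertMatrix

// Contextual Thompson sampling state: V, b and the weight estimate
final class ContextualThompsonState(dim: Int, lambda: Double) {
  private var V: INDArray = Nd4j.eye(dim).mul(lambda)  // V0 = lambda.I
  private var b: INDArray = Nd4j.zeros(dim, 1)         // b0 = 0

  // Step 3: play the arm maximizing x(j)^T.wTilde for a sampled weight vector wTilde
  def selectArm(armContexts: Array[INDArray], wTilde: INDArray): Int =
    armContexts.indices.maxBy(j => armContexts(j).transpose.mmul(wTilde).getDouble(0L))

  // Steps 5-7: update V and b with the played context x and observed reward r,
  // then recompute the weight estimate w = V^-1.b
  def update(x: INDArray, reward: Double): INDArray = {
    V = V.add(x.mmul(x.transpose))
    b = b.add(x.mul(reward))
    InvertMatrix.invert(V, false).mmul(b)
  }
}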

Thank you for reading this article. For more information ...


---------------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3


Monday, July 22, 2019

Multivariate Normal Sampling with Scala and ND4j

Target audience: Intermediate
Estimated reading time: 6'

The multivariate normal sampling function plays a pivotal role in several machine learning methods, including Markov chain Monte Carlo (MCMC) and the contextual multi-armed bandit. The foundation of this sampling is data vectorization, a technique familiar to both Python developers and data scientists.
In this article, we outline the process of implementing multivariate normal sampling with the aid of ND4j.


Notes: 
  • The post requires some knowledge of data vectorization (NumPy, DataVec, ND4j, ...) as well as the Scala programming language.
  • The code associated with this article is written using Scala 2.11.8

Vectorization

Python's NumPy is a well-known and reliable vectorized linear algebra library which is a foundation of scientific (SciPy) and machine learning (scikit-learn) libraries. No serious data science project can reasonably succeed without the power and efficiency of NumPy.
The vectorization of datasets is the main reason behind the performance of machine learning models (training and prediction) built in Python.

Is there a similar linear algebra library, supporting vectorization, available to Scala and Spark developers? Yes, ND4j


ND4j, BLAS and LAPACK

The ND4j library replicates the functionality of NumPy for Java developers. ND4j can be downloaded as an independent library or as a component of the deep learning library deeplearning4j. It leverages the BLAS and LAPACK libraries.
From a Java developer's perspective, the data represented by an NDArray is stored outside of the Java Virtual Machine. This design has the following benefits:
  • It avoids taxing the garbage collector
  • It interoperates with high-performance BLAS libraries such as OpenBLAS
  • It allows the number of array elements to exceed Int.MaxValue
The BLAS (Basic Linear Algebra Subprograms) are functions performing basic vector and matrix operations. The library is divided into 3 levels:
  • Level 1 BLAS perform scalar, vector and vector-vector operations,
  • Level 2 BLAS perform matrix-vector operations
  • Level 3 BLAS perform matrix-matrix operations. 
OpenBLAS is an optimized Basic Linear Algebra Subprograms (BLAS) library based on GotoBLAS2, a C library of linear algebra routines supporting a large variety of microprocessors. Its usage is governed by the BSD license.
LAPACK is a collection of Fortran routines for solving systems of simultaneous linear equations, least-squares problems, eigenvalue and singular value problems, and matrix factorizations.
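As a purely illustrative mapping of the three BLAS levels to ND4j operations (the arrays and shapes below are arbitrary):

import org.nd4j.linalg.factory.Nd4j

val x = Nd4j.create(Array[Double](1.0, 2.0, 3.0), Array[Int](3, 1))
val y = Nd4j.create(Array[Double](4.0, 5.0, 6.0), Array[Int](3, 1))
val m = Nd4j.rand(3, 3)

val level1 = x.transpose.mmul(y)   // Level 1: vector-vector (dot product as a 1x1 matrix)
val level2 = m.mmul(x)             // Level 2: matrix-vector product
val level3 = m.mmul(m.transpose)   // Level 3: matrix-matrix product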

Implicit ND4j array conversion

The first step is to create an implicit conversion between ND4j and Scala data types.
The following code converts an array of doubles into an INDArray using the org.nd4j.linalg.factory.Nd4j Java class and its factory method create(double[] values, int[] shape):
  • In case of a vector, the shape is defined in Scala as  Array[Int](size_vector)
  • In case of a matrix, the shape is defined as Array[Int](numRows, numCols)
The following snippet implements a very simple conversion from a Scala array to an INDArray:


import org.nd4j.linalg.api.ndarray.INDArray
import org.nd4j.linalg.factory.Nd4j

@throws(clazz = classOf[IllegalArgumentException])
implicit def double2NDArray(values: Array[Double]): INDArray = {
  require(values.nonEmpty, "ERROR: ND4j conversion ...")

  // A Scala array is laid out as a row vector of shape (1, values.length)
  Nd4j.create(values, Array[Int](1, values.length))
}
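
With the implicit conversion in scope, a plain Scala array can be used wherever an INDArray is expected, as in this hypothetical snippet:

val features: INDArray = Array[Double](0.5, 1.5, 2.5)
val scaled = features.mul(2.0)   // Element-wise multiplication on the converted array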


Multivariate Normal distribution

The density of the multivariate normal distribution is defined by the formula
\[\mathbb{N}(\mu, \Sigma )=\frac{1}{\sqrt{(2\pi)^n \left | \Sigma \right |}}\ e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)}\] where Σ is the covariance matrix and μ is the mean vector. A generalized definition adds a very small random perturbation factor r, |r| <= 1, to the variance values (diagonal of the covariance matrix) \[\Sigma ' = \Sigma + r.I\]

The computation of the multivariate normal sampling can be approximated by the Cholesky decomposition. In a nutshell, the Cholesky algorithm decomposes a positive-definite matrix into the product of two matrices:
  • a lower triangular matrix
  • the transpose of its conjugate
It serves the same purpose as the ubiquitous LU decomposition, with less computation. \[\mathbb{N}(\mu, \Sigma) \sim \mu + Chol(\Sigma).Z^T\] where \[L=Chol(\Sigma)\] and \[L.L^T=\Sigma\]. The vector Z is multivariate normal noise \[Z= \{ z_i\ |\ z_i=N(0, 1)\}\]
The following implementation relies on the direct invocation of the LAPACK function potrf. The LAPACK functionality is accessed through the BLAS wrapper getBlasWrapper.


final def choleskyDecomposition(matrix: INDArray): INDArray = {
  // Duplicate the input matrix: potrf factorizes its argument in place
  val matrixDup = matrix.dup
  Nd4j.getBlasWrapper.lapack.potrf(matrixDup, false)
  matrixDup
}

Note that the matrix is duplicated prior to the LAPACK function call as we do not want to alter the input matrix. 
Let's implement the multivariate Normal sampler with perturbation using the Cholesky decomposition.

@throws(clazz = classOf[IllegalArgumentException])
@throws(clazz = classOf[IllegalStateException])
final def multiVariateNormalSampling(
   mean: INDArray,
   cov: INDArray,
   perturbation: Double = 0.0): INDArray = {

    import scala.util.Random
    require(cov.size(0) == mean.length, s"Sigma shape ${cov.size(0)} should be mean size ${mean.length}")
    require(cov.size(1) == mean.length, s"Sigma shape ${cov.size(1)} should be ${mean.length}")

    try {
      // Add a perturbation epsilon value to the covariance matrix
      // if it is defined
      val perturbMatrix =
         if(perturbation > 0.0)
            cov.add( squareIdentityMatrix(cov.size(0), 
                           perturbation*(Random.nextDouble-0.5)))
         else
            cov

      // Apply the Cholesky decomposition
      val perturbed: INDArray = choleskyDecomposition(perturbMatrix)
      // Instantiate a normal distribution
      val normalPdf = new NormalDistribution(
           new DefaultRandom, 
           0.0, 
           1.0)

      val normalDensity = Array.fill(mean.size(0))(normalPdf.sample)
      val normalNDArray: INDArray = normalDensity

      // Multiply the Cholesky factor by the noise vector (matrix-vector product)
      val normalCov = perturbed.mmul(normalNDArray.transpose)
      // Add the correlated noise to the vector of mean values
      mean.add(normalCov)
   }
   catch {
       case e: org.nd4j.linalg.api.blas.BlasException =>
           throw new IllegalStateException(e.getMessage)
       case e: org.apache.commons.math3.exception.NotStrictlyPositiveException =>
           throw new IllegalStateException(e.getMessage)
       case e: Throwable =>
           throw new IllegalStateException(e.getMessage)
   }
}

Let's look at the full implementation of the multivariate normal sampling.
The first step validates the shapes of the mean and covariance input matrices [line 8, 9]. As mentioned earlier, the generalized normal distribution introduces an optional random perturbation of small magnitude (~1e-4) [line 14-17], which is useful for applications that require some stochasticity.
The 'perturbed' covariance matrix is factorized using the Cholesky decomposition [line 22]. The normal probability density function (mean 0.0 and standard deviation 1.0) is used to generate random values [line 24-30], which are then multiplied by the Cholesky factor of the covariance matrix [line 33].
The resulting correlated noise is added to the vector of mean values [line 35].

For the sake of convenience, the multivariate normal density function uses the Apache Commons Math 3.5 API [line 24].
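
For illustration, a hypothetical invocation could look as follows; the mean is laid out as a column vector so that its length matches the dimension of the covariance matrix.

// Hypothetical 2-dimensional example
val mean = Nd4j.create(Array[Double](0.5, 1.0), Array[Int](2, 1))
val cov = Nd4j.create(Array[Double](1.0, 0.2, 0.2, 0.8), Array[Int](2, 2))

// Draw one sample from N(mean, cov) with a small random perturbation
val sample = multiVariateNormalSampling(mean, cov, 1e-4)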



---------------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3