Friday, July 1, 2016

Monte Carlo Integration in Scala

Target audience: Intermediate
Estimated reading time: 5'


This post introduces an overlooked numerical integration method leveraging the ubiquitous Monte Carlo simulation.


Table of contents
Follow me on LinkedIn

What you will learn: How to apply Monte Carlo method to compute intractable integral.

Notes:
  • This article assumes the reader has some basic knowledge of the Scala programming language
  • Scala version: 2.12.4

Introduction

Some functions do not have a closed-form solution for calculating a definite integral, a process called symbolic integration. For such cases, numerous numerical integration techniques are available for continuous functions. Examples include Simpson's formula, the Newton-Cotes quadrature rule, and Gaussian quadrature. These approaches are deterministic in their execution.

On the other hand, the Monte Carlo method of numerical integration adopts a stochastic approach. It uses randomly chosen values to approximate the area under a curve, across a surface, or within any multidimensional space [ref 1]. Demonstrations of the Monte Carlo integration technique often involve applying a uniform random distribution of data points to estimate the integral of a single-variable function.

Basic concept

Let's consider the single variable function illustrated in the following line plot.
\[f(x)=\frac{1}{x}-0.5\]
Definition of outer area for Monte Carlo random simulation

Our goal is to calculate the integral of a function across the range [1, 3]. The Monte Carlo method for numerical integration involves three key steps [ref 2]:
  1. Establish the bounding area for the function based on the minimum and maximum values of x and y. In this instance, the x-values range from [1, 3], and the y-values span from [0.5, -0.12].
  2. Produce random data points distributed within this outer area.
  3. Determine the proportion of these random data points that fall within the area defined by the function, relative to the total number of generated points.

Scala implementation

This implementation relies on Scala features available in 2.11 and later version [ref 3].
Let's define a generic class,
MonteCarloIntegrator to encapsulate these 3 steps:
The class has two arguments (line 1)
  • The function f to sum
  • The number of random data points used in the methodology
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
class MonteCarloIntegrator(f: Double => Double, numPoints: Int) {

  def integrate(from: Double, to: Double): Double = {
    val (min, max) = getBounds(from, to)
    val width = to - from
    val height = if (min >= 0.0) max else max - min
    val outerArea = width * height
    val randomx = new Random(System.currentTimeMillis)
    val randomy = new Random(System.currentTimeMillis + 42L)

    def randomSquare(randomx: Random, randomy: Random): Double = {
      val numInsideArea = Range(0, numPoints).foldLeft(0)(
        (s, n) => {
        val ptx = randomx.nextDouble * width + from
        val pty = randomy.nextDouble * height
        randomx.setSeed(randomy.nextLong)
        randomy.setSeed(randomx.nextLong)

        s + (if (pty > 0.0 && pty < f(ptx)) 1 
            else if (pty < 0.0 && pty > f(ptx)) -1 
            else 0)
      }
     )
      numInsideArea.toDouble * outerArea / numPoints
    }
    randomSquare(randomx, randomy)
  }
}

The method integrate implements the sum of the function over the interval [from, to] (line 3). The first step is to extract the bounds getBounds of the outer area (line 4) which size is computed on line 7. Each coordinate is assigned a random generator randomx and randomy (lines 8 & 9).
The nested method randomSquare records the number of data points, numInsideArea that falls into the area delimited by the function (line 13 - 21). The sum is computed as the relative number of data points inside the area delimited by the function (line 24).


The method getBounds is described in the following code snippet. It is a simple, although not particularly efficient approach to extract the boundary of the integration. It breaks down the interval into of steps (lines 2 & 3) and collects the minimum and maximum values of the function (lines 7 - 12). 

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
def getBounds(from: Double, to: Double): (Double, Double) = {
  val numSteps = Math.sqrt(numPoints).floor.toInt
  val stepSize = (to - from) / numSteps

  (0 to numSteps).foldLeft(Double.MaxValue, -Double.MaxValue))(
    (minMax, n) => {
      val y = f(n * stepSize + from)
      updateBounds(y, minMax) match {
        case 0x01 => (y, minMax._2)
        case 0x02 => (minMax._1, y)
        case 0x03 => (y, y)
        case _ => minMax
      }
    }
  )
}


def updateBounds(y: Double, minMax: (Double,Double)): Int = {
   var flag = 0x00
 
   if (y < minMax._1) flag += 0x01
   if (y > minMax._2) flag += 0x02
   flag
}


Precision

You may wonder about the accuracy of the Monte Carlo method and how many randomly generated data points are needed for a decent accuracy. Let's consider the same function \[f(x)=\frac{1}{x}-0.5\] and its indefinite integral, that is used to generate the expected sum for the function f \[\int f(x)dx=log(x)-\frac{1}{2x}+C\] The simple test consists of computing the error between the value produced by the definite integral and the sum from the Monte Carlo method as implemented in the following code snippet. 

val fct =(x: Double) => 2 * x - 1.0
val integral = (x: Double, c: Double) => x*x - x + c)

final val numPoints=10000
val integrator = new MonteCarloIntegrator(fct, numPoints)

val predicted = monteCarloIntegrator.integrate(1.0, 2.0)
val expected = integral(2.0, 0.0) - integral(1.0, 0.0)

Let's run the test 100 times for number of random data points varying between 100 and 50,000.


The Monte Carlo method can achieve a reasonable level of accuracy even with a limited number of data points. In this scenario, the marginal gain in precision doesn't warrant generating a large quantity of random data points beyond 1000.

A more efficient strategy would be to use a termination condition that concludes the summation process as quickly as possible, eliminating the need to guess the ideal number of random data points. In each iteration, a new batch of random data points, totaling 'numPoints', is introduced. One straightforward method for determining convergence is to compare the difference in the sum between two successive iterations:
  sum(existing data points + new data points) - sum(existing data points) < eps
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
def randomSquare(randomx: Random, randomy: Random): Double = {
  var oldValue = 0.0

  Range(0, numPoints).takeWhile(
      _ => {
         val numInsideArea = ....
          // ...
            s + ....
          })
   
       val newValue = numInsideArea.toDouble * outerArea / numPoints
       val diff = Math.abs(newValue - oldValue) 
    
      oldValue = newValue
      diff < eps
    }
  )
  oldValue
}
In each iteration, we randomly generate a fresh batch of 'numPoints' data points to improve the accuracy of the overall summation. The termination of this process is governed by the 'takeWhile' method, a higher-order function used in Scala collections (as seen in lines 3 & 12).
It's important to note that this implementation of Monte Carlo integration is straightforward and serves to demonstrate the application of stochastic methods in calculus. For functions with significant inflection points (characterized by extreme second-order derivatives), recursive stratified sampling has proven to be more precise in calculating definite integrals.

References

[1Monte Carlo Integration Dartmouth College - C. Robert, G. Casella
[2]  The Clever Machine: Monte Carlo Approximations Dustin Stansbury
[3] Programming in Scala - 3rd edition M. Odersky, L. Spoon, B. Venners
github.com/patnicolas



-------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning", Packt Publishing ISBN 978-1-78712-238-3

Thursday, June 16, 2016

Analyze Scala Performance Using Javap

Target audience: Beginner
Estimated reading time: 4'

Table of contents
Overview
Follow me on LinkedIn

Overview

As mentioned in a previous post, iterators in Scala, such as foreach, for & while loop have different performance characteristics. A recurrent question between developers is the relative performance of Scala and Java regarding for and while.

The "for comprehension" closure in Scala is indeed a syntactic sugar wraps around a sequence of flatMap and map methods, Monad in Practice. It is expected to be significantly slower and should not be used as an iterator.
We select the higher method foreach to implement the for iterator in Scala.


Note: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exception, validation of class and method arguments, scoping qualifiers or import is omitted.

Evaluation

The simple evaluation consists of processing a large number of iterations of a very simple statement which execution does not interfere with the actual performance of the iterator. The code is also is easy to understand. The code for the while loop is described below.

def testRun(numIterations: Int) : Unit = {
  val startTime = System.currentTimeMillis
  var sum: Long = 0
  var i: Int = 0

  while(i < numIterations) {
    var j = 0
    while(j  < 1000) {
      sum += 1
      j += 1
    }
     i += 1
  }
  Console.printf(s"${(System.currentTimeMillis-startTime)}")
}

The test is executed on a 4 cores Intel i7 CPU with java 1.7.02 64-bit and Scala 2.10.2. The chart below, compares the duration of the execution of the for iterator for Java and Scala.


The performance of Scala and Java for executing the while loop are very similar.

The second test consists of comparing the relative performance of Java for loop and Scala foreach higher order method.

def testRun(numIterations: Int) : Unit = {
    var sum: Long = 0
    (0 until numIterations.foreach( sum += _)
}




Analysis using Javap

The first step is to analyze the number of byte-code instructions generated by the Scala compiler. We use the Java Class File Disassembler,  javap, to print out the actual instructions processed by the Java virtual machine.
            javap -c -verbose xxx.class
The sequence of instructions generated for the execution of the while loops are displayed below.

0:   lconst_0
1:   lstore_1
2:   iconst_0
3:   istore_3
4:   iload_3
5:   sipush  1000
8:   if_icmpge       42
11:  iconst_0
12:  istore  4
14:  iload   4
16:  sipush  1000
19:  if_icmpge       35
22:  lload_1
23:  lconst_1
24:  ladd
25:  lstore_1
26:  iload   4
28:  iconst_1
29:  iadd
30:  istore  4
32:  goto    14
35:  iload_3
36:  iconst_1
37:  iadd
38:  istore_3
39:  goto    4
42:  return

Disassembling the Java class for the code with the for iterators produces the following print out:

0:   new     #12; //class scala/runtime/LongRef
3:   dup
4:   lconst_0
5:   invokespecial   #16; //Method scala/runtime/LongRef."<init>":(J)V
8:   astore_1
9:   getstatic       #22; //Field scala/runtime/RichInt$.MODULE$:Lscala/runtime/RichInt$;
12:  getstatic       #27; //Field scala/Predef$.MODULE$:Lscala/Predef$;
15:  iconst_0
16:  invokevirtual   #31; //Method scala/Predef$.intWrapper:(I)I
19:  sipush  1000
22:  invokevirtual   #35; //Method scala/runtime/RichInt$.until$extension0:
                            (II Lscala/collection/immutable/Range;
25:  new             #37; //class JavaScala$$anonfun$testRun$1
28:  dup
29:  aload_0
30:  aload_1
31:  invokespecial   #40; //Method JavaScala$$anonfun$testRun$1."<init>
                          (LJavaScala;Lscala/runtime/LongRef;)V
34:  invokevirtual   #46; //Method scala/collection/immutable/Range.foreach$mVc$sp:
                         (Lscala/Function1;)
37:  return


Although the number of instructions for the for loop is smaller than the number of instructions for while, most of those instructions are function calls:
- conversion of counter to long
- static conversion of int to RichInt to wrap Java Integer:
            @inline implicit def intWrapper(x: Int)= new runtime.RichInt(x)
- ultimately the foreach method is invoked to execute the loop.

Interestingly enough, the foreach method for collection is implemented using the while loop not the for loop. Scala compiler plug-in such as ScalaCL to optimize the execution of iteration on Scala collections, Arrays, Lists,... have been introduced to get around this issue. The reader can also take comfort in using the Java Class File Dissambler to understand how a method or piece of code translate into efficient, or in our case, inefficient byte-code. Quite a few methods in Scala such as foldLeft, reduceLeft uses tail recursion to traverse collections. It would be interesting to compare the relative performance of those methods with alternatives using iterators.. stay tune.

References

Thursday, May 5, 2016

Bootstrapping With Replacement in Scala

Target audience: Intermediate
Estimated reading time: 6'

Bootstrapping is a method in statistics where random sampling of a dataset is done with replacement. It allows data scientists to determine the sampling distribution for a broad range of probability distributions through this technique.

Background

A primary goal of bootstrapping is to evaluate the precision of various statistical measures like mean, standard deviation, median, mode, or error. These measures, sf, often termed as estimators, serve to approximate a distribution. The most frequently used approximation is called the empirical distribution function. When the data points x are independent and identically distributed (iid), this empirical or approximate distribution can be assessed by employing resampling techniques.

The following diagram captures the essence of bootstrapping by resampling.

Generation of bootstrap replicates by resampling

Each of the B bootstrap samples has the same number of observations or data points as the original data set from which the samples are created. Once the samples are created, a statistical function sf such as mean, mode, median or standard deviation is computed for each sample.
The standard deviation for the B statistics should be similar to the standard deviation of the original data set.

Implementation in Scala

The purpose of this post is to illustrate some basic properties of bootstrapped sampling
  • Profile of the distribution of statistics sf for a given probability distribution
  • Comparison of the standard deviation of the statistics sf with the standard deviation of the original dataset
Let's implement a bootstrap by resampling in Scala, starting with a class Bootstrap.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
class Bootstrap(
    numSamples: Int = 1000,
    sf: Vector[Double] => Double,
    inputDistribution: Vector[Double],
    randomizer: Random
) {

  lazy val bootstrappedReplicates: Array[Double] =
     (0 until numSamples)./:( new mutable.ArrayBuffer[Double] )(
        ( buf, _ ) => buf += createBootstrapSample
      ).toArray

  def createBootstrapSample: Double {}

  lazy val mean = bootstrappedReplicates.reduce( _ + _ )/numSamples

  def error: Double = {}
}

The class Bootstrap is instantiated with a pre-defined number of samples, numSamples (line 2), a statistic function sf (line 3), a data set generated by a given distribution inputDistribution (line 4) and a randomizer (line 5).
The computation of the bootstrap replicates, bootstrappedReplicates is central to resampling (lines 8 - 11). As described in the introduction, a replicate, s is computed from a sample of the original data set with the method createBootstrapSample (line 10).

Let's implement the method createBootstrapSample.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
def createBootstrapSample: Double =
    sf(
       (0 until inputDistribution.size)./:( new mutable.ArrayBuffer[Double] )(
         ( buf, _ ) => {
            randomizer.setSeed( randomizer.nextLong )
            val randomValueIndex = randomizer.nextInt( inputDistribution.size )
            buf += inputDistribution( randomValueIndex )
         }
       ).toVector
    )

The method createBootstrapSample
- Samples the original dataset using a uniform random function (line 6)
- Applies the statistic function sf to this sample dataset (line 1 & 11)

The last step consists of computing the error (deviation) on the bootstrap replicates

1
2
3
4
5
6
7
8
  def error: Double = {
      val sumOfSquaredDiff = bootstrappedReplicates.reduce(
        (s1: Double, s2: Double) => (s1 - mean) (s1 - mean) +  (s2 - mean)*(s2 - mean)
      )

      Math.sqrt(sumOfSquaredDiff / (numSamples - 1))
  }


Evaluation

The first evaluation consists of comparing the distribution of replicates with the original distribution. To this purpose, we generate an input dataset using
  • Normal distribution
  • LogNormal distribution

Setup

Let's create a method, bootstrapEvaluation to compare the distribution of the bootstrap replicates with the dataset from which the bootstrap samples are generated.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
def bootstrapEvaluation( 
     dist: RealDistribution, 
     random: Random, 
     gen: (Double, Double) 
): (Double, Double) = {

    val inputDistribution = (0 until 5000)./:(new ArrayBuffer[(Double, Double)])
     (
       ( buf, _ ) => {
          val x = gen._1 * random.nextDouble - gen._2
          buf += ( ( x, dist.density( x ) ) )
      }
    ).toVector

    val mean = (x: Vector[Double]) => x.sum/x.length
    val bootstrap = new Bootstrap(
        numReplicates,
        mean, 
        inputDistribution.map( _._2 ), 
        new Random( System.currentTimeMillis)
    )

    val meanS = bootstrap.bootstrappedReplicates.sum / 
          bootstrap.bootstrappedReplicates.size
    val sProb = bootstrap.bootstrappedReplicates.map(_ - meanS)
         // .. plotting histogram of distribution sProb
    (bootstrap.mean, bootstrap.error)
  }

We are using the normal and log normal probability density function defined in the Apache Commons Math Java library. These probability density functions are defined in the org.apache.commons.math3.distribution package.

The comparative method bootstrapEvaluation has the following argument:
  • dist: A probability density function used to generate the data set upon which sampling is performed (line 2).
  • random: A random number generator (line 3)
  • gen: A pair of parameters for the linear transform for the generation of random values (a.r + b) (line 4).
The input distribution inputDistribution { (x, pdf(x)} is generated for 5000 data points (lines 7 - 13).
Next the bootstrap is created with the appropriate number of replicates, numReplicates, the mean of the input data set as the statistical function s, the input distribution and the generic random number generator of Scala library, as arguments (lines 16 -20).
Let's plot the distribution the input data set generated from a normal density function.

val (meanNormal, errorNormal) = bootstrap(
    new NormalDistribution, 
    new scala.util.Random, 
    (5.0, 2.5)
)

Normally distributed dataset

The first graph plots the distribution of the input dataset using the Normal distribution.


The second graph illustrates the distribution (histogram) of the replicates s - mean.


The bootstrap replicates sf(x) are also normally distributed. The mean value for the bootstrap replicates is 0.1978 and the error is 0.001691

Dataset with a log normal distribution

We repeat the same process for the lognormal distribution. This time around the dataset to sample from follows a log-normal distribution.

val (meanLogNormal, errorLogNormal) = bootstrap(
    new LogNormalDistribution, 
    new scala.util.Random, 
    (2.0, 0.0)
)



Although the original dataset used for generated the bootstrap samples is normally distributed, the bootstrap replicates sf(x) are normally distributed. The mean for the bootstrap replicates is 0.3801 and the error is 0.002937

The error for the bootstrap resampling from a log normal distribution is twice as much as the error related to the normal distribution
The result is not surprising: The bootstrap replicates follow a normal distribution which matches closely the original dataset created using the same probability density function. 

References

  • Programming in Scala - 3rd edition M Odersky, L. Spoon, B. Venners - Artima - 2016
  • Elements of Statistics Learning: Data mining, Inference and Prediction - 7.11 Bootstrap method Springer - 2001
  • github.com/patnicolas