Thursday, May 5, 2016

Bootstrapping With Replacement in Scala

Target audience: Intermediate
Estimated reading time: 6'

Bootstrapping is a method in statistics where random sampling of a dataset is done with replacement. It allows data scientists to determine the sampling distribution for a broad range of probability distributions through this technique.

Background

A primary goal of bootstrapping is to evaluate the precision of various statistical measures like mean, standard deviation, median, mode, or error. These measures, sf, often termed as estimators, serve to approximate a distribution. The most frequently used approximation is called the empirical distribution function. When the data points x are independent and identically distributed (iid), this empirical or approximate distribution can be assessed by employing resampling techniques.

The following diagram captures the essence of bootstrapping by resampling.

Generation of bootstrap replicates by resampling

Each of the B bootstrap samples has the same number of observations or data points as the original data set from which the samples are created. Once the samples are created, a statistical function sf such as mean, mode, median or standard deviation is computed for each sample.
The standard deviation for the B statistics should be similar to the standard deviation of the original data set.

Implementation in Scala

The purpose of this post is to illustrate some basic properties of bootstrapped sampling
  • Profile of the distribution of statistics sf for a given probability distribution
  • Comparison of the standard deviation of the statistics sf with the standard deviation of the original dataset
Let's implement a bootstrap by resampling in Scala, starting with a class Bootstrap.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
class Bootstrap(
    numSamples: Int = 1000,
    sf: Vector[Double] => Double,
    inputDistribution: Vector[Double],
    randomizer: Random
) {

  lazy val bootstrappedReplicates: Array[Double] =
     (0 until numSamples)./:( new mutable.ArrayBuffer[Double] )(
        ( buf, _ ) => buf += createBootstrapSample
      ).toArray

  def createBootstrapSample: Double {}

  lazy val mean = bootstrappedReplicates.reduce( _ + _ )/numSamples

  def error: Double = {}
}

The class Bootstrap is instantiated with a pre-defined number of samples, numSamples (line 2), a statistic function sf (line 3), a data set generated by a given distribution inputDistribution (line 4) and a randomizer (line 5).
The computation of the bootstrap replicates, bootstrappedReplicates is central to resampling (lines 8 - 11). As described in the introduction, a replicate, s is computed from a sample of the original data set with the method createBootstrapSample (line 10).

Let's implement the method createBootstrapSample.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
def createBootstrapSample: Double =
    sf(
       (0 until inputDistribution.size)./:( new mutable.ArrayBuffer[Double] )(
         ( buf, _ ) => {
            randomizer.setSeed( randomizer.nextLong )
            val randomValueIndex = randomizer.nextInt( inputDistribution.size )
            buf += inputDistribution( randomValueIndex )
         }
       ).toVector
    )

The method createBootstrapSample
- Samples the original dataset using a uniform random function (line 6)
- Applies the statistic function sf to this sample dataset (line 1 & 11)

The last step consists of computing the error (deviation) on the bootstrap replicates

1
2
3
4
5
6
7
8
  def error: Double = {
      val sumOfSquaredDiff = bootstrappedReplicates.reduce(
        (s1: Double, s2: Double) => (s1 - mean) (s1 - mean) +  (s2 - mean)*(s2 - mean)
      )

      Math.sqrt(sumOfSquaredDiff / (numSamples - 1))
  }


Evaluation

The first evaluation consists of comparing the distribution of replicates with the original distribution. To this purpose, we generate an input dataset using
  • Normal distribution
  • LogNormal distribution

Setup

Let's create a method, bootstrapEvaluation to compare the distribution of the bootstrap replicates with the dataset from which the bootstrap samples are generated.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
def bootstrapEvaluation( 
     dist: RealDistribution, 
     random: Random, 
     gen: (Double, Double) 
): (Double, Double) = {

    val inputDistribution = (0 until 5000)./:(new ArrayBuffer[(Double, Double)])
     (
       ( buf, _ ) => {
          val x = gen._1 * random.nextDouble - gen._2
          buf += ( ( x, dist.density( x ) ) )
      }
    ).toVector

    val mean = (x: Vector[Double]) => x.sum/x.length
    val bootstrap = new Bootstrap(
        numReplicates,
        mean, 
        inputDistribution.map( _._2 ), 
        new Random( System.currentTimeMillis)
    )

    val meanS = bootstrap.bootstrappedReplicates.sum / 
          bootstrap.bootstrappedReplicates.size
    val sProb = bootstrap.bootstrappedReplicates.map(_ - meanS)
         // .. plotting histogram of distribution sProb
    (bootstrap.mean, bootstrap.error)
  }

We are using the normal and log normal probability density function defined in the Apache Commons Math Java library. These probability density functions are defined in the org.apache.commons.math3.distribution package.

The comparative method bootstrapEvaluation has the following argument:
  • dist: A probability density function used to generate the data set upon which sampling is performed (line 2).
  • random: A random number generator (line 3)
  • gen: A pair of parameters for the linear transform for the generation of random values (a.r + b) (line 4).
The input distribution inputDistribution { (x, pdf(x)} is generated for 5000 data points (lines 7 - 13).
Next the bootstrap is created with the appropriate number of replicates, numReplicates, the mean of the input data set as the statistical function s, the input distribution and the generic random number generator of Scala library, as arguments (lines 16 -20).
Let's plot the distribution the input data set generated from a normal density function.

val (meanNormal, errorNormal) = bootstrap(
    new NormalDistribution, 
    new scala.util.Random, 
    (5.0, 2.5)
)

Normally distributed dataset

The first graph plots the distribution of the input dataset using the Normal distribution.


The second graph illustrates the distribution (histogram) of the replicates s - mean.


The bootstrap replicates sf(x) are also normally distributed. The mean value for the bootstrap replicates is 0.1978 and the error is 0.001691

Dataset with a log normal distribution

We repeat the same process for the lognormal distribution. This time around the dataset to sample from follows a log-normal distribution.

val (meanLogNormal, errorLogNormal) = bootstrap(
    new LogNormalDistribution, 
    new scala.util.Random, 
    (2.0, 0.0)
)



Although the original dataset used for generated the bootstrap samples is normally distributed, the bootstrap replicates sf(x) are normally distributed. The mean for the bootstrap replicates is 0.3801 and the error is 0.002937

The error for the bootstrap resampling from a log normal distribution is twice as much as the error related to the normal distribution
The result is not surprising: The bootstrap replicates follow a normal distribution which matches closely the original dataset created using the same probability density function. 

References

  • Programming in Scala - 3rd edition M Odersky, L. Spoon, B. Venners - Artima - 2016
  • Elements of Statistics Learning: Data mining, Inference and Prediction - 7.11 Bootstrap method Springer - 2001
  • github.com/patnicolas