Overview
- Craft an implementation of the Kullback-Leibler divergence in Scala, leveraging the Apache Spark framework for distributed computation.
- Showcase how the Kullback-Leibler divergence can be utilized to contrast various continuous probability density functions.
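For reference, the implementation below accumulates the discrete form of the Kullback-Leibler divergence of an observed distribution Q from a candidate distribution P:

D_{KL}(P \,\|\, Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}

Each data point contributes one term p(x) log(p(x)/q(x)) to the sum.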
Scala implementation
import scala.math._

object KullbackLeibler {
  final val EPS = 1e-10
  type DATASET = Iterator[(Double, Double)]

  def execute( // #1
      xy: DATASET,
      f: Double => Double): Double = {
    // Discard the data points for which the reference value y is ~0
    val z = xy.filter { case (_, y) => abs(y) > EPS }
    z.foldLeft(0.0) { case (s, (x, y)) =>
      val px = f(x)
      s + px*log(px/y)
    }
  }

  def execute( // #2
      xy: DATASET,
      fs: Iterable[Double => Double]): Iterable[Double] = {
    // Materialize the iterator: it can be traversed only once,
    // but each pdf requires a full pass over the data
    val seq = xy.toSeq
    fs.map(f => execute(seq.iterator, f))
  }
}
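As a minimal sanity check of the first execute method (the names pts and kl0 below are illustrative, not part of the original listing), the divergence of a distribution from itself should be zero:

// Uniform density 1.0 on (0, 1) compared against data drawn from that same density
val pts = (1 until 100).map(n => (n / 100.0, 1.0))
val kl0 = KullbackLeibler.execute(pts.iterator, (_: Double) => 1.0) // 0.0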
// Normalization constant for the standard normal density
final val INV_SQRT_2PI = 1.0/sqrt(2.0*Pi)

// Gaussian distribution (mean 0, variance 1)
val gauss = (x: Double) => INV_SQRT_2PI*exp(-x*x/2.0)

// Uniform distribution over [0, 1]
val uniform = (x: Double) => 1.0

// Log normal distribution (mu = 0, sigma = 1)
val logNormal = (x: Double) => {
  val lx = log(x)
  INV_SQRT_2PI/x*exp(-lx*lx/2.0)
}
// Gamma (Erlang) distribution with integer shape n + 1 and unit scale
val gamma = (x: Double, n: Int) =>
  exp(-x)*pow(x, n)/fact(n)
// Log Gamma distribution
val logGamma = (x: Double, alpha: Int, beta: Int) =>
exp(beta*x)*exp(-exp(x)/alpha)/(pow(alpha, beta)*fact(beta-1))
// Simple computation of m! (used by the gamma, beta and chi-square normalizations)
def fact(m: Int): Int = if (m < 2) 1 else m*fact(m-1)
// Normalization factor for Beta
val cBeta = (n: Int, m: Int) => {
val f = if(n < 2) 1 else fact(n-1)
val g = if(m < 2) 1 else fact(m-1)
f*g/fact(n + m - 1).toDouble
}
// Beta distribution
val beta = (x: Double, alpha: Int, beta: Int) =>
  pow(x, alpha-1)*pow(1.0-x, beta-1)/cBeta(alpha, beta)
// Chi-Square distribution (even degrees of freedom k)
val chiSquare = (x: Double, k: Int) => {
  val k_2 = k >> 1
  pow(x, k_2-1)*exp(-0.5*x)/((1 << k_2)*fact(k_2-1))
}
val gamma2 = gamma(_: Double, 2)
val beta25 = beta(_: Double, 2, 5)
val chiSquare4 = chiSquare(_: Double, 4)
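The broadcast map further below also references gamma1, gamma4, logGamma12, logGamma22, beta22 and chiSquare2. Their definitions do not appear in the listing, but they are presumably the analogous partial applications:

// Presumed partial applications for the remaining entries of the pdfs map
val gamma1 = gamma(_: Double, 1)
val gamma4 = gamma(_: Double, 4)
val logGamma12 = logGamma(_: Double, 1, 2)
val logGamma22 = logGamma(_: Double, 2, 2)
val beta22 = beta(_: Double, 2, 2)
val chiSquare2 = chiSquare(_: Double, 2)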
- The implementations of the probability density functions in the code snippet are not optimized for performance.
- Please refer to your favorite Statistics handbook to learn more about these probability distributions.
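Before moving to Spark, the second execute method can be exercised locally. A quick sketch, where the evaluation grid over (0, 1) and the choice of the Gaussian as the reference density are assumptions:

// Sample the reference density on a grid over (0, 1)
val sample = (1 until 1000).map { n =>
  val x = n / 1000.0
  (x, gauss(x))
}
// Divergence of each candidate density from the reference data
val kls = KullbackLeibler.execute(sample.iterator, Seq(logNormal, gamma2, beta25, chiSquare4))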
Spark to the rescue
- Analytics: Spark's ability to produce responses quickly enables interactive data exploration, rather than relying solely on predefined queries.
- Data Integration: Often, the data from various systems is inconsistent and cannot be combined for analysis directly. To obtain consistent data, processes like Extract, Transform, and Load (ETL) are employed. Spark streamlines this ETL process, making it more cost-effective and time-efficient.
- Streaming: Managing real-time data, such as log files, is challenging. Spark excels in processing these data streams and can identify and block potentially fraudulent activities.
- Machine Learning: The growing volume of data has made machine learning techniques more viable and accurate. Spark's ability to store data in memory and execute repeated queries swiftly facilitates the use of machine learning algorithms.
The distributed computation of the divergence proceeds in four steps, implemented in the snippets that follow:
- Segment the primary dataset.
- Distribute the initial sequence of probability density functions.
- Apply the Kullback-Leibler formula to each segment using mapPartitions.
- Retrieve and combine the divergence value from every segment.
final val NUM_DATA_POINTS = 10000000 // #1
val numTasks: Int = 128 // #2
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf() // #3
  .setAppName("Kullback-Leibler")
  .setMaster(s"local[$numTasks]")
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
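The snippets below operate on master_rdd, which is not defined in the original listing. A plausible construction, assuming the input pairs each point x with the value of a reference density (here gauss) and is sliced into numTasks partitions:

import org.apache.spark.rdd.RDD

// Hypothetical input RDD of (x, q(x)) pairs; the evaluation grid and the
// choice of gauss as the reference density are assumptions
val data = (0 until NUM_DATA_POINTS).map { n =>
  val x = n.toDouble/NUM_DATA_POINTS
  (x, gauss(x))
}
val master_rdd: RDD[(Double, Double)] = sc.makeRDD(data, numTasks)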
Next, let's implement the broadcasting mechanism for the sequence of probability density functions.
lazy val pdfs = Map[Int, Double => Double](
1 -> uniform,
2 -> logNormal,
3 -> gamma1,
4 -> gamma2,
5 -> gamma4,
6 -> logGamma12,
7 -> logGamma22,
8 -> beta22,
9 -> beta25,
10 -> chiSquare2,
11 -> chiSquare4
)
val pdfs_broadcast = sc.broadcast[Iterable[Int]](pdfs.keys)
// execute and DATASET are defined in the KullbackLeibler object
import KullbackLeibler._

val kl_rdd = master_rdd.mapPartitions(
  (it: DATASET) => {
    // Rebuild the list of pdfs from the broadcast keys; the pdfs map
    // itself is serialized with the closure
    val pdfsList = pdfs_broadcast.value.map(n => pdfs(n))
    execute(it, pdfsList).iterator
  }
)
val kl_master = kl_rdd.collect
// Accumulate the partial divergence computed by each partition for
// each probability density function, then average over the partitions
val divergences = (0 until kl_master.size by pdfs.size)
  .foldLeft(Array.fill(pdfs.size)(0.0))( (s, n) => {
    (0 until pdfs.size).foreach(j =>
      s.update(j, s(j) + kl_master(n + j)))
    s
  }).map(_ / (kl_master.size / pdfs.size))
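The resulting divergences can then be associated with their distributions; pdfs.keys iterates in the same order used to build the broadcast list, so keys and values line up. A small reporting sketch:

pdfs.keys.zip(divergences).foreach { case (k, kl) =>
  println(f"pdf #$k%2d KL divergence = $kl%.6f")
}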
Conclusion
This article showed how to implement the Kullback-Leibler divergence in Scala and how Spark's partitioning, broadcast variables, and mapPartitions method distribute its computation across a large dataset of probability density values.
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design, and end-to-end deployment and support, with extensive knowledge in machine learning.
He has been director of data engineering at Aideo Technologies since 2017 and is the author of Scala for Machine Learning (Packt Publishing, ISBN 978-1-78712-238-3).