## Tuesday, October 22, 2013

### Type erasure, manifest, specialization in Scala

Overview
The Scala and Java compilers use type erasure to compile away generics: a type parameter [U <: T] is removed and replaced by its upper bound T (or Any when unbounded). The process involves boxing and un-boxing primitive types used as type parameters, which degrades performance. This post describes two approaches to work around type erasure: Manifest and type specialization.
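A quick way to observe erasure at work (this snippet is my own illustration, not from the original post): after compilation, a List[String] and a List[Int] share the same runtime class, so the element type cannot be recovered from the list itself.

```scala
val strings: List[String] = List("a", "b")
val ints: List[Int] = List(1, 2)

// Type arguments are erased: both lists have the same runtime class.
assert(strings.getClass == ints.getClass)

// The element type can only be recovered by inspecting an element itself.
assert(strings.head.isInstanceOf[String])
```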

Description
Let's consider a class ListCompare that compares lists of a parametric type U bounded by the type Item. Note that ancillary code, such as returned Option types or checks on method and class arguments, is omitted for the sake of simplicity.

```scala
case class Item()

class ListCompare[U <: Item](val xs: List[U])(implicit f: U => Ordered[U]) {
  def compare(xso: List[U]): Boolean = xso match {
    case str: List[String] =>
      if (xs.size == xso.size) xs.zip(xso).exists(x => x._1.compareTo(x._2) != 0) else false
    case n: List[Int] =>
      if (xs.size == xso.size) xs.zip(xso).exists(x => x._1 != x._2) else false
    case _ => false
  }
}
```

The class requires an implicit conversion U => Ordered[U] to support the comparison of elements. The code above generates the following warning: "non-variable type argument String in type pattern List[String] is unchecked since it is eliminated by erasure". The compiler emits this warning because the two parameterized types, List[String] and List[Int], are not available to the JVM during the execution of the pattern matching.
One solution is to use a Manifest. A Manifest[T] is an opaque descriptor for the type T. It gives access to the erasure of the type as a Class instance. The most common use of manifests is the creation of native arrays when the class is not known at compile time, as illustrated in the code snippet below.

```scala
def myArray[T] = new Array[T](0)                          // won't compile
def myArray[T](implicit m: Manifest[T]) = new Array[T](0) // implicit parameter
def myArray[T: Manifest] = new Array[T](0)                // context bound
```

The first line of code won't compile. The second function preserves the erased type by passing the manifest as an implicit parameter. The last function defines the manifest as a context bound. The solution to our problem is to pass a Manifest as an implicit parameter to the compare method.
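As a small, hedged illustration of the context-bound version (the extra size parameter is mine), the manifest lets the runtime build a genuine primitive array:

```scala
def myArray[T: Manifest](size: Int): Array[T] = new Array[T](size)

val xs = myArray[Int](4)
// The manifest allows the runtime to build a true int[] array.
assert(xs.length == 4)
assert(xs.getClass.getComponentType == classOf[Int])
```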

```scala
class ListCompare[U <: Item](val xs: List[U])(implicit f: U => Ordered[U]) {
  def compare(xso: List[U])(implicit u: Manifest[List[U]]): Boolean = {
    if (u <:< manifest[List[String]])
      if (xs.size == xso.size) xs.zip(xso).exists(x => x._1.compareTo(x._2) != 0) else false
    else if (u <:< manifest[List[Int]])
      if (xs.size == xso.size) xs.zip(xso).exists(x => x._1 != x._2) else false
    else
      false
  }
}
```

A second option is to generate separate classes for primitive types using the @specialized annotation. The annotation forces the compiler to generate byte code for each primitive listed as a specialized type. For instance, the declaration @specialized(Int, Double) T generates two extra, efficient classes for the primitive types Int and Double. Note that specialization applies only to value types, so a reference type such as String cannot be specialized. The original ListCompare class is rewritten using the annotation as follows:

```scala
class ListComp[@specialized(Int) U <: Item](val xs: List[U])(implicit f: U => Ordered[U]) {
  def compare(xso: List[U]): Boolean =
    if (xs.size == xso.size) xs.zip(xso).exists(x => x._1.compareTo(x._2) != 0) else false
}
```

The code above does not generate a warning or error. However, there is no such thing as a free lunch: the compiler generates extra byte code for the methods associated with each specialized primitive type. The objective in this case is to trade higher memory consumption for a performance improvement.
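As a minimal sketch of that trade-off (Box is my own example, not the post's class), the compiler emits an extra specialized variant for each listed primitive, so a primitive type argument avoids boxing:

```scala
class Box[@specialized(Int, Double) T](val value: T) {
  def map[@specialized(Int, Double) U](f: T => U): Box[U] = new Box(f(value))
}

// For primitive type arguments the compiler selects a specialized subclass,
// avoiding the boxing of the Int value.
val b = new Box(3)
assert(b.map(_ * 2).value == 6)
```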

References
Programming in Scala - M. Odersky, L. Spoon, B. Venners - Artima 2008
Scala for the Impatient - C. Horstmann - Addison-Wesley 2012

## Monday, October 14, 2013

### Curried and partial functions in Scala

Introduction
Although most Scala developers have some knowledge of curried and partial functions, many struggle to grasp the use cases to which either of those functional programming techniques applies and their relative benefits.

Partial Functions
Partially defined functions are commonly used to restrict the domain of applicability of function arguments. The restriction can apply either to the type of the argument or to its values. Let's consider the computation of the square root of a floating point value: the argument has to be positive. A simple implementation relies on the Option monad.
```scala
def _sqrt(x: Double): Option[Double] =
  if (x < 0.0) None else Some(Math.sqrt(x))
```

The same method can be implemented with a partial function by pattern matching on the argument as follows.
```scala
def _sqrt: PartialFunction[Double, Double] = {
  case x: Double if x >= 0.0 => Math.sqrt(x)
}
```

A similar restriction can be applied to the type of the argument. Let's consider the increment method below.
```scala
class Value(val x: Int) {
  def +=(any: Any): Any = {
    if (any.isInstanceOf[Int]) {
      val value = any.asInstanceOf[Int]
      (x + value).asInstanceOf[Any]
    }
    else if (any.isInstanceOf[Double]) {
      val value = any.asInstanceOf[Double].floor.toInt
      (x + value).asInstanceOf[Any]
    }
    // .... other cases
  }
}
```

The implementation is cumbersome to say the least. It makes sense to replace it with pattern matching on the type and define a partial function to handle any type of argument.
```scala
class Value(val x: Int) {
  def += : PartialFunction[Any, Int] = {
    case n: Int    => x + n
    case y: Double => x + y.floor.toInt
  }
}
```
In the example above, we do not have to handle the case in which the argument has an improper type: the partial function simply discards it.
The Actor.receive methods that define the message loop of an actor, consuming messages from its mailbox, are indeed partial functions.
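The domain of a partial function can also be queried or converted into a total function, which is often cleaner than catching a MatchError. A brief sketch (the name sqrtPf is mine):

```scala
val sqrtPf: PartialFunction[Double, Double] = {
  case x if x >= 0.0 => Math.sqrt(x)
}

// isDefinedAt tests whether an argument belongs to the function's domain.
assert(sqrtPf.isDefinedAt(4.0))
assert(!sqrtPf.isDefinedAt(-1.0))

// lift converts the partial function into a total Double => Option[Double].
assert(sqrtPf.lift(9.0) == Some(3.0))
assert(sqrtPf.lift(-9.0) == None)
```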

Currying
Currying is the transformation of a function with multiple arguments into a chain of functions each taking a single argument: if f: (x, y) -> f(x, y), then curry(f): x -> (y -> f(x, y)).
Let's take the simple example of the sum of two floating point values. The original two-argument function (1) can be converted into a single-argument function returning an anonymous function that takes the second argument as parameter (2). Scala provides developers with simple syntactic sugar to define the cascade of function calls (3).
```scala
def sum(x: Double, y: Double): Double = x + y                       // (1)
def sumCurried(x: Double): Double => Double = (y: Double) => x + y  // (2)
def sumSugar(x: Double)(y: Double): Double = x + y                  // (3)
```
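One practical benefit of the curried form (3) is partial application: fixing the first argument yields a new single-argument function. A short sketch:

```scala
def sum(x: Double)(y: Double): Double = x + y

// Fixing the first argument produces a Double => Double function.
val addTwo: Double => Double = sum(2.0) _
assert(addTwo(3.0) == 5.0)

// A conventional two-argument function can also be curried after the fact.
def sum2(x: Double, y: Double): Double = x + y
val curried: Double => Double => Double = (sum2 _).curried
assert(curried(2.0)(3.0) == 5.0)
```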

Most higher-order methods on collections are curried. The following example illustrates the commonly used foldLeft.
```scala
class Collection[T](private val values: Array[T]) {
  def foldLeft[U](u: U, op: (U, T) => U): U =
    values.foldLeft[U](u)((u, t) => op(u, t))
  def foldLeft[U](u: U)(op: (U, T) => U): U = this.foldLeft(u, op)
}

val myCollection = new Collection[Int](Array[Int](3, 5, 8))
// Seed the product with 1 (seeding with 0 would always return 0)
val product = myCollection.foldLeft[Int](1)((prod, x) => prod * x)
```

Is there any benefit to using a curried function instead of a function or method with multiple arguments? Yes, when the type inferencer can propagate information from the first argument list to the second. Let's consider the foldLeft method above:

```scala
def foldLeft[U](u: U)(op: (U, T) => U): U = this.foldLeft(u, op)
```

The type inferencer determines the type U from the first argument and uses it subsequently to type the binary operator op: (U, T) => U.

References
Scala for the Impatient - C. Horstmann - Addison-Wesley 2012

## Thursday, October 3, 2013

### Bloom Filter in Scala

Overview
The Bloom filter has become a popular probabilistic data structure for membership queries (does object x belong to set Y?). Its main benefit is a much lower memory footprint: unlike a HashSet or hash table, it does not store the objects themselves. The compact representation comes with a trade-off: although the filter produces no false negatives, it does not guarantee the absence of false positives. In other words, a query returns either:
- a very high probability that an object belongs to the set
- the certainty that an object does not belong to the set
A Bloom filter is quite often used as a front end to a regular, deterministic algorithm.

Theory
Let's consider a set A = {a0, ..., an-1} of n elements against which membership queries are executed. The data structure consists of a bit vector V of m bits and k completely independent hash functions, each of which maps an element to a position in the bit vector. The assignment (or mapping) of hash functions to bits has to follow a uniform distribution. The diagram below illustrates the basic mechanism behind the Bloom filter. The set A is defined by the pair a1 and a2. The hash functions h1 and h2 map the elements to bit positions (bits set to 1) in the bit vector. The element b has one of its positions set to 0 and therefore does not belong to the set. The element c belongs to the set because its associated positions all have bits set to 1.

However, the algorithm does not prevent false positives: a bit may have been set to 1 during the insertion of previous elements, so a query may erroneously report that an element belongs to the set.
The insertion of an element depends only on the k hash functions, so the time needed to add a new element is proportional to k and independent of the size of the bit vector: asymptotic insertion time = O(k). However, the filter requires k bit positions per element and is less effective than a traditional bit array for small sets.
The probability of false positives decreases as the number n of inserted elements decreases and the size m of the bit vector increases. The number of hash functions that minimizes the probability of false positives is k = (m/n)·ln 2.
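As a quick numerical sanity check of the formula (the values m = 1000 and n = 100 are mine, chosen for illustration):

```scala
val m = 1000.0                 // bits in the vector
val n = 100.0                  // inserted elements
val k = (m / n) * math.log(2)  // optimal number of hash functions

// With the optimal k, the false-positive probability is about (1/2)^k.
val p = math.pow(0.5, k)

assert(k > 6.9 && k < 7.0)     // k ≈ 6.93
assert(p < 0.01)               // p ≈ 0.008
```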

Implementation in Scala
The implementation relies on the Java library class MessageDigest to generate the hash values. Ancillary methods and conditions on method arguments are omitted for the sake of clarity.

```scala
import java.math.BigInteger
import java.security.{MessageDigest, NoSuchAlgorithmException}

class BloomFilter(private val length: Int,
                  private val numHashs: Int,
                  private val algorithm: String) {

  private[this] val set = new Array[Byte](length)
  private[this] var numElements = 0

  private[this] val digest = {
    try { MessageDigest.getInstance(algorithm) }
    catch { case e: NoSuchAlgorithmException => null }
  }

  // Add an array of elements to the filter
  def add(anyArray: Array[Any]): Boolean = {
    if (digest != null) {
      anyArray.foreach(getSet(_).foreach(set(_) = 1))
      numElements += anyArray.size
    }
    digest != null
  }

  @inline
  def add(any: Any): Boolean = this.add(Array[Any](any))

  @inline
  def contains(any: Any): Boolean =
    if (digest != null) !getSet(any).exists(set(_) != 1) else false

  private[this] def hash(value: Int): Int = {
    digest.reset
    digest.update(value)  // invokes the implicit conversion from Int to bytes
    Math.abs(new BigInteger(1, digest.digest).intValue) % (set.size - 1)
  }

  private[this] def getSet(any: Any): Array[Int] = {
    val newSet = new Array[Int](numHashs)
    newSet.update(0, hash(any.hashCode))
    getSet(newSet, 1)
    newSet
  }

  @scala.annotation.tailrec
  private[this] def getSet(values: Array[Int], index: Int): Boolean = {
    if (index >= values.size) true
    else {
      values.update(index, hash(values(index - 1)))
      getSet(values, index + 1)
    }
  }
}
```

The MessageDigest class generates a hash value using either the MD5 or the SHA-1 algorithm. The only two public methods are add, which inserts an array of objects, and contains, which queries whether an object belongs to the set. Tail recursion is used to generate the set of numHashs hash values. The last code snippet implements a very simple Int to Array[Byte] conversion and some test code.

```scala
// Implicit conversion from Int to a big-endian Array[Byte]
implicit def int2Bytes(value: Int): Array[Byte] =
  Array.tabulate(4)(i => ((value >>> ((3 - i) << 3)) & 0xFF).toByte)

val filter = new BloomFilter(100, 100, "SHA")
final val newValues = Array[Any](57, 97, 91, 23, 67, 33)
filter.add(newValues)
println(filter.contains(22))
println(filter.contains(23))
```

References
github.com/prnicolas

## Saturday, September 28, 2013

### Performance of tail recursion in Scala

Overview
In Scala, tail recursion is a commonly used technique to apply a transformation to the elements of a collection. The purpose of this post is to evaluate the performance of tail recursion relative to iterative methods.

Test Benchmark
For our test, we select a transformation applied to a slice, or sliding window, of an array. Such algorithms are widely used in signal processing and in technical analysis of financial markets (i.e. moving averages, filters). For the sake of simplicity, we create a simple polynomial transform on an array of values {X0, ..., Xn, ..., Xp} defined as
f(Xn) = (n-1)·Xn-1 + (n-2)·Xn-2 + (n-3)·Xn-3

```scala
final val WINDOW_SIZE = 3

def polynomial(values: Array[Int]): Int = {
  val arraySlice =
    if (values.size < WINDOW_SIZE) values else values.takeRight(WINDOW_SIZE)
  arraySlice.foldLeft[Int](0)((sum, x) => sum + x * values.indexOf(x))
}
```

The first implementation of the polynomial transform is a tail recursion over each element Xn of the array. The transform computes f(values(cursor)) from the sub-array values[0, ..., cursor] as described in the code snippet below.

```scala
class Evaluation(val values: Array[Int]) {

  def recurse(f: Array[Int] => Int): Array[Int] = {
    val results = new Array[Int](values.size)
    recurse(f, 0, results)
    results
  }

  @scala.annotation.tailrec
  private[this] def recurse(f: Array[Int] => Int,
                            cursor: Int,
                            results: Array[Int]): Boolean = {
    // exit condition
    if (cursor >= values.size) true
    else {
      results.update(cursor, f(values.take(cursor + 1)))
      recurse(f, cursor + 1, results)
    }
  }
}
```

The second implementation relies on the scanLeft method, which returns the cumulative results of the transformed values f(Xn).

```scala
def iterScan(f: Array[Int] => Int): Array[Int] =
  values.scanLeft(0)((sum, x) =>
    f(values.take(values.indexOf(x) + 1))).takeRight(values.size)
```

Finally, we implement the polynomial transform on the sliding array window with a simple for loop.
```scala
def iterLoop(f: Array[Int] => Int): Array[Int] = {
  val results = new Array[Int](values.size)
  for (i <- 0 until values.size)
    results.update(i, f(values.take(i + 1)))
  results
}
```

Results
For the test, each of the three methods is executed 1,000 times on a dual-core i7 with 8 GB of RAM running Mac OS X Mountain Lion 10.8. The first test executes the three methods while varying the size of the array from 10 to 90 elements. The duration is measured in milliseconds.
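The benchmark harness itself is not listed in the post; a minimal wall-clock version (my own sketch, ignoring JVM warm-up and JIT effects) could look like this:

```scala
// Runs a block n times and returns the elapsed time in milliseconds.
def time(n: Int)(block: => Unit): Long = {
  val start = System.nanoTime
  var i = 0
  while (i < n) { block; i += 1 }
  (System.nanoTime - start) / 1000000
}

val xs = Array.tabulate(90)(identity)
val elapsed = time(1000)(xs.map(_ * 2))
assert(elapsed >= 0)
```

A production-grade benchmark would also need warm-up iterations and multiple runs to account for JIT compilation.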

The tail recursion is significantly faster than the two iterative methods for small arrays. The scan methods (scan, scanLeft, scanRight) have a significant overhead that cannot be amortized over a small array. However, the implementation using scanLeft easily outperforms the other two methods as the size of the array increases. It is worth noticing that the performance of the for loop and the tail recursion are oddly similar: as described in earlier posts, the for loop in Scala has a significant overhead (i.e. javap analysis). The relative performance of the three methods is confirmed when testing with large arrays (from 100 to 900 items).

References
The Scala Programming Language - M. Odersky, L. Spoon, B. Venners - Artima 2007
github.com/prnicolas

## Wednesday, September 18, 2013

### Performance evaluation of Scala Map*

Overview
The Scala API provides developers with a set of short immutable maps of predefined size, Map.Map1 through Map.Map4 (and similarly for small sets). I thought it would be worthwhile to evaluate the performance of those short, dedicated collections compared to their generic counterparts.

Benchmarking
The benchmark consists of very simple method calls on immutable Map and Map.Map4 instances as well as a mutable HashMap, as defined in the code snippet below.
```scala
import scala.collection.mutable.HashMap

val map4 = new Map.Map4("0", 0, "1", 1, "2", 2, "3", 3)
val map0 = Map[String, Int]("0" -> 0, "1" -> 1, "2" -> 2, "3" -> 3)
val hMap = HashMap[String, Int]("0" -> 0, "1" -> 1, "2" -> 2, "3" -> 3)
```

We evaluate the time needed to execute 10 million iterations of the map, foldLeft and get methods as follows.
```scala
aMap map { kv => kv._2 + 2 }
aMap get "2"
aMap.foldLeft[Int](0)((sum, kv) => sum + kv._2)
```
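The dedicated small-map classes are selected automatically by the Map factory; a brief check (the class-name assertions below assume a Scala 2.x runtime):

```scala
// Immutable maps with up to four entries are backed by dedicated classes.
val m4 = Map("0" -> 0, "1" -> 1, "2" -> 2, "3" -> 3)
assert(m4.getClass.getSimpleName == "Map4")

// Adding a fifth entry switches to a generic hash-based representation.
val m5 = m4 + ("4" -> 4)
assert(m5.getClass.getSimpleName != "Map4")
```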

The results of the performance test are shown in the table below. As expected, the "short" collection is faster than the generic immutable Map and the mutable HashMap for get and foldLeft, although the improvement is not very significant. Interestingly, the mutable HashMap is by far the fastest for the map method.

| Method   | immutable.Map.Map4 | immutable.Map | mutable.HashMap |
|----------|--------------------|---------------|-----------------|
| get      | 0.835 s            | 0.879 s       | 1.091 s         |
| map      | 7.462 s            | 7.566 s       | 1.106 s         |
| foldLeft | 3.444 s            | 3.621 s       | 4.782 s         |

References
Scala for the Impatient - C. Horstmann - Addison-Wesley 2012
github.com/prnicolas

## Tuesday, September 10, 2013

### Variance annotations in Scala

Overview
The purpose of this post is to introduce the basics of parameterized classes, upper and lower bounds and variance subtyping in Scala. Scala is a statically typed language similar to Java and C++. Most of these concepts were introduced in Java with generics and type-parameter wildcards. However, Scala adds some useful innovations regarding subtyping covariance and contravariance. API designers and library developers face the dilemma of choosing between parameterized types and abstract types, and of deciding which parameters should be invariant, covariant or contravariant.

Sub-classing
As in Java, generic or parameterized types were introduced to overcome the limitations of polymorphism. Moreover, understanding the concepts behind the Scala type system helps developers decipher and fix the occasional compilation failure related to typing.
Let's consider a simple container class that adds and retrieves an array of items, as implemented in the code snippet below.
```scala
class Item
case class SubItem() extends Item

class Container {
  def add(items: Array[Item]): Unit = items.foreach(add(_))
  def retrieve(items: Array[Item]): Unit =
    for (i <- 0 until items.size) items.update(i, retrieve)

  private[this] def add(item: Item): Unit = {}
  private[this] def retrieve: Item = new SubItem
}
```

Although client code can invoke the add and retrieve methods on an array of type Item, passing an array of elements of type SubItem generates a compile-time error.
```scala
val subItemsContainer = new Container
val array = Array[SubItem](new SubItem)

subItemsContainer.add(array)  // Compile error
subItemsContainer.retrieve(Array[SubItem](new SubItem))
```

The Scala compiler throws an error because the class Container does not have adequate information to treat an Array[SubItem] as an Array[Item] in the add method, or an Array[Item] as an Array[SubItem] in the retrieve method. As with Java, where variance is expressed through the wildcards ? extends T and ? super T, Scala provides developers with upper and lower type bounds for subtyping.

Variance to the Rescue
The upper type bound on the parameterized type, T <: Item, tells the Scala compiler that any subtype of Item can be processed by 'producer' operations on Item, such as adding to a collection. The lower type bound on the parameterized type, U >: SubItem, indicates that any supertype of SubItem can be processed by 'consumer' operations on SubItem, such as retrieving elements from a collection.
```scala
class Container[T <: Item] {
  def add(items: Array[T]): Unit = items.foreach(add(_))
  def retrieve[U >: SubItem](items: Array[U]): Unit =
    for (i <- 0 until items.size) items.update(i, retrieve)

  private[this] def add(item: T): Unit = {}
  private[this] def retrieve: SubItem = new SubItem
}
```

Container[T] is bounded above by the type Item, and the method retrieve[U] is bounded below by the type SubItem. This time the compiler has adequate information to subtype Item and supertype SubItem, so the following code snippet does not generate a compile error.
```scala
val itemsContainer = new Container[SubItem]
itemsContainer.add(Array[SubItem](new SubItem))
itemsContainer.retrieve[Item](Array[Item](new SubItem))
```

Scala has dedicated annotations for invariance ('T'), covariance ('+T') and contravariance ('-T'). Using these annotations, the container class can be sketched as follows.
```scala
class Container[+T](items: List[T]) {
  def retrieve(i: Int): T = items(i)

  // A covariant T cannot appear as a method parameter type;
  // a lower bound restores type safety for add.
  def add[U >: T](item: U): Container[U] = new Container(item :: items)
}
```

In a nutshell, variance can be summarized as follows:
- Covariance: if B is a subtype of A, then Container[B] is a subtype of Container[A]
- Contravariance: if B is a subtype of A, then Container[A] is a subtype of Container[B]
- Invariance: no subtyping rule applies.
The subtyping rules above can be represented graphically
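The standard library itself illustrates both rules; this sketch (reusing the Item/SubItem hierarchy from above) compiles precisely because List is covariant and Function1 is contravariant in its argument:

```scala
class Item
class SubItem extends Item

// Covariance: List[+A], so a List[SubItem] is a List[Item].
val subs: List[SubItem] = List(new SubItem)
val items: List[Item] = subs

// Contravariance: Function1[-T, +R], so an Item => String
// can be used where a SubItem => String is expected.
val describeItem: Item => String = _ => "item"
val describeSub: SubItem => String = describeItem

assert(items.nonEmpty && describeSub(new SubItem) == "item")
```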

References
The Scala Programming Language - M. Odersky, L. Spoon, B. Venners - Artima 2007
Effective Java, 2nd edition - J. Bloch - Addison-Wesley 2008
github.com/prnicolas

## Friday, August 30, 2013

### Breakable loops in Scala

Overview
Contrary to C++ and Java, Scala does not allow developers to exit an iterative execution prematurely with a break statement.
```scala
var sum = 0
for (i <- 0 until 100) {
  sum += i
  if (sum > 400) break  // won't compile!
}
```
This post reviews the different options to 'break' out of a loop according to a predefined condition.

Breakable control
Although breaking out of a loop may not be appealing to "pure" functional developers, the Scala library provides former Java and C++ developers with the option to use break and continue statements. The breakable statement defines the scope within which break is valid.

```scala
import scala.util.control.Breaks._

var sum = 0
breakable {
  for (i <- 0 until 100) {
    sum += i
    if (sum > 55) break
  }
}
```

Any experienced developer would be quite reluctant to use such an idiom: besides the fact that the accumulator is defined as a variable, the implementation is unnecessarily convoluted, adding a wrapper around the loop. Let's try to find a more functional approach to breaking out of a loop.

scan & fold to the rescue
Luckily, Scala collections provide functional traversal patterns that can be chained to break out of a loop elegantly. The following code snippets apply some of those traversal patterns to an array and an associative map to illustrate the overall "functional" approach to iteration.
Let's consider the problem of extracting the elements of an array or a map up to the first element that meets a predefined condition; for instance, extracting elements until one element has the value 0. The Scala standard library provides the takeWhile method, which stops traversing a collection as soon as the condition fails and returns the elements visited so far.

```scala
val values = Array[Int](9, 45, 11, 0, 67, 33)
val mappedValues =
  Map[String, Int]("A" -> 2, "B" -> 45, "C" -> 67, "D" -> 0, "E" -> 56)

// Extract a sub-array of values until the condition is met
values takeWhile (_ != 0)           // -> Array(9, 45, 11)

// Extract a sub-map until the condition is met
mappedValues takeWhile (_._2 != 0)  // -> Map(E -> 56, A -> 2, B -> 45, C -> 67)
```

The second case consists in accumulating values until a condition on the accumulator is met. Contrary to the fold and reduce methods, which apply a function (e.g. summation) to all the elements of a collection and return a single value, the scan, scanLeft and scanRight methods return the collection of intermediate values produced by the function.
In the following example, scanLeft generates an array or a map of cumulative values, similar to a map method. The takeWhile method is then applied to the resulting collection to return the cumulative values until the accumulator exceeds 56. The same invocation can be used to locate the element that pushes the accumulator beyond the threshold.

```scala
values.scanLeft(0)(_ + _).takeWhile(_ < 56)  //> Array(0, 9, 54)
mappedValues.scanLeft(0)(_ + _._2).takeWhile(_ < 56)

val indexTargetEl = values.scanLeft(0)(_ + _).takeWhile(_ < 56).size - 1
val targetEl = values(indexTargetEl)
```

Tail recursion
A third alternative for exiting a loop is tail recursion, supported natively by the Scala compiler. The first method, findValue, exits the loop when a condition on the current value is met. The second method, findCumulative, is implemented as a closure and exits when the sum of the elements exceeds a predefined value.
```scala
val values = Array[Int](9, 45, 11, 0, 67, 33)

@scala.annotation.tailrec
def findValue(values: Array[Int], index: Int): Int =
  if (index >= values.size || values(index) == 0) index
  else findValue(values, index + 1)

val newValues = values.slice(0, findValue(values, 0))

val MAX_VALUE: Int = 86

@scala.annotation.tailrec
def findCumulative(sum: Int, index: Int): Int =
  if (index >= values.size || sum >= MAX_VALUE) index - 1
  else findCumulative(sum + values(index), index + 1)

val newValues2 = values.slice(0, findCumulative(0, 0))
```

References
github.com/prnicolas

## Sunday, August 18, 2013

### Symbolic Regression for Data Modeling

Overview
Symbolic regression allows domain experts to create, add or modify rules or policies extracted from data. The algorithms most commonly used in symbolic regression are:
- Genetic Algorithms
- Learning Classifiers Systems

Symbolic regression is used in many applications, ranging from network performance optimization and failure prediction (MTBF) to streaming-data analysis and the detection of security breaches.

Presentation
The following presentation describes the main components, benefits and drawbacks of symbolic regression.

References
Genetic Programming: on the Programming of Computers by Means of Natural Selection - J. Koza - MIT Press 1992
Reinforcement Learning: An introduction (Adaptive Computation and Machine learning) - R. Sutton, A. Barto - MIT Press 1998