Thursday, October 30, 2014

Scala Higher-Order Methods: Collect & Partition

Target audience: Beginner
Estimated reading time: 4'

This post describes the use cases and typical implementations of the Scala collect, collectFirst and partition higher order methods.



The Scala higher order methods collect, collectFirst and partition are not commonly used, even though these collection methods provide developers with a higher degree of flexibility than any combination of map, find and filter.
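As a quick, self-contained illustration (values chosen arbitrarily), collect filters and transforms a collection in a single pass, where the equivalent filter and map combination requires two:

val xs = List(-2.0, 1.0, 4.0, -9.0, 16.0)

// collect applies its partial function only where it is defined
val viaCollect = xs.collect { case x if x >= 0.0 => Math.sqrt(x) }

// the equivalent two-pass combination of filter and map
val viaFilterMap = xs.filter(_ >= 0.0).map(Math.sqrt(_))

assert(viaCollect == viaFilterMap)   // List(1.0, 2.0, 4.0)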

TraversableLike.collectFirst

The method returns the first element of a traversable collection (such as an array, list or map) for which a partial function is defined, after applying that function to the element. Its signature is
    def collectFirst[B](pf: PartialFunction[A, B]): Option[B]

The use case is to validate K sets (or samples) of data extracted from a dataset. Once validated, these K sets are used in the K-fold validation of a model generated by training a machine learning algorithm: K-1 sets are used for training and the remaining set for validation. The validation consists of extracting K sample arrays from a generic array, then testing that none of these samples is too noisy (i.e. its standard deviation does not exceed a given threshold).

The first step is to create the two generic functions of the validation: breaking the dataset into K sets, then computing the standard deviation of each set. This is accomplished by the ValidateSample trait.

val sqr = (x: Double) => x*x

trait ValidateSample {
  type DVector = Array[Double]

    // Split a vector into sub-vectors of similar size
  def split(xt: DVector, nSegments: Int): Iterator[DVector] =
    xt.grouped(Math.ceil(xt.size.toDouble/nSegments).toInt)

    // Standard deviation, computed once, on demand
  lazy val stdDev = (xt: DVector) => {
    val mu = xt.sum/xt.size
    val variance = (xt.map(_ - mu)
                      .map(sqr(_))
                      .reduce(_ + _))/(xt.size - 1)
    Math.sqrt(variance)
  }

  def isValid(x: DVector, nSegments: Int): Boolean
}

The first method, split, breaks down the initial array xt into an indexed sequence of segments or sub-arrays of roughly equal size. The standard deviation stdDev is computed from the sum of the squared deviations from the mean. It is declared as a lazy value so it is computed once, on demand. For instance, split and stdDev behave as follows on a small vector (a minimal, self-contained sketch):
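// Anonymous instance, just to exercise the trait's concrete members
val validator = new ValidateSample {
  def isValid(x: DVector, nSegments: Int): Boolean = true
}

val xt = Array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
validator.split(xt, 3).foreach(v => println(v.mkString(", ")))   // 3 segments of 2 values
println(validator.stdDev(xt))   // ~0.187

The first validation class, ValidateWithMap, uses a sequence of map and find to test that all the data segments extracted from the dataset have a standard deviation below the threshold passed to its constructor.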

class ValidateWithMap(threshold: Double) extends ValidateSample {
  override def isValid(x: DVector, nSegs: Int): Boolean =
    split(x, nSegs).map( stdDev(_) ).find( _ > threshold) == None
}

The second implementation of the validation, ValidateWithCollect, uses the higher-order method collectFirst to test that none of the data segments (validation folds) is too noisy. collectFirst requires a PartialFunction defined with a condition on the standard deviation.

class ValidateWithCollect(threshold: Double) extends ValidateSample {
  override def isValid(x: DVector, nSegs: Int): Boolean =
    split(x, nSegs).collectFirst { 
      case xt if stdDev(xt) > threshold => xt 
    } == None
}

There are two main differences between the first implementation, which combines map and find, and the collectFirst implementation:
  • The second version relies on a single higher-order method, collectFirst, instead of the combination of map and find.
  • The partial function in the second version is defined only for the noisy segments; applying such a partial function outside collectFirst would throw a MatchError, which is why the driver below guards the calls with Try.
These two implementations can be evaluated using a simple driver application that takes a ValidateSample as argument.

import scala.util.{Failure, Random, Success, Try}

final val NUM_VALUES = 10000
val rValues = Array.fill(NUM_VALUES)(Random.nextDouble)

Try ( 
  new ValidateWithMap(0.8).isValid(rValues, 2) 
).getOrElse(false)

Try ( 
  new ValidateWithCollect(0.8).isValid(rValues, 2) 
) match {
  case Success(isValid) => { /* process the result */ }
  case Failure(e) => e match {
    case ex: MatchError => { /* partial function applied outside its domain */ }
    case _ => { /* handle any other exception */ }
  }
}
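For completeness, here is a minimal, standalone sketch of how applying a partial function outside collectFirst raises the MatchError guarded against above:

val pf: PartialFunction[Double, Double] = { case x if x > 0.5 => x }

pf(0.7)   // 0.7
pf(0.2)   // throws scala.MatchError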


TraversableLike.collect

The method collect behaves like collectFirst, applied to the entire collection: as collectFirst is the "partial function" version of find, collect is the "partial function" version of filter. Its signature is
    def collect[B](pf: PartialFunction[A, B]): Traversable[B]
The two methods below extract the noisy data segments (standard deviation above a threshold ratio), using collect with a partial function pf and, equivalently, filter:

  // pf: partial function defined only for the noisy segments,
  // with ratio the noise threshold (e.g. 0.8)
val pf: PartialFunction[DVector, DVector] = { case xt if stdDev(xt) > ratio => xt }

def filter1(x: DVector, nSegments: Int): Iterator[DVector] = 
  split(x, nSegments).collect(pf)
  
def filter2(x: DVector, nSegments: Int): Iterator[DVector] = 
  split(x, nSegments).filter( stdDev(_) > ratio )
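For a self-contained illustration of this equivalence, independent of the trait above (values arbitrary):

val values = Seq(0.3, 1.2, 0.7, 2.5)

// collect filters and transforms in one pass
val scaled = values.collect { case x if x > 1.0 => x*10 }   // Seq(12.0, 25.0)

assert(scaled == values.filter(_ > 1.0).map(_*10))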



TraversableLike.partition

The higher-order method partition splits a collection into two collections according to a boolean condition (or predicate):
def partition(p: (A) ⇒ Boolean): (Repr, Repr)
The test case consists of segmenting an array of random values along the mean value 0.5, then comparing the sizes of the two data segments. The two segments, segs, should have similar sizes.

final val NUM_VALUES = 10000
val rValues = Array.fill(NUM_VALUES)(Random.nextDouble)
 
val segs = rValues.partition( _ >= 0.5)
val ratio = segs._1.size.toDouble/segs._2.size
println(s"Relative size of segments $ratio")

The test is executed with arrays of different sizes:

NUM_VALUES    ratio
        50   0.9371
      1000   1.0041
     10000   1.0002
As expected, the relative size of the two data segments converges toward 1.0 (i.e. the difference between their sizes converges toward zero) as the size of the original dataset increases (law of large numbers).

Sunday, September 14, 2014

Mixin Constraint on Self-typed Methods

Target audience: Intermediate
Estimated reading time: 4'

This post illustrates the appropriate use of self-type to restrict the composition of (stackable) traits (mixins) in relation to an existing class or trait.


Overview

Mixing in traits with a self-type restriction is a common Scala idiom. Dependency injection, and the Cake pattern in particular, rely on the constraint imposed by a trait that it can be mixed in only with subclasses of predefined types. The same approach can be used to constrain a trait to be mixed in only with classes that support one or several specific methods.

Note: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exception handling, validation of class and method arguments, scoping qualifiers or imports is omitted.

Mixin constraint on self-type

Let's consider the case of a class taxonomy (or hierarchy) that classifies machine learning algorithms as either supervised learning or unsupervised learning.

 1  trait Learning {
 2    def predict(x: Double): Double
 3  }
 4  
 5  trait SupervisedLearning {
 6    def train(x: Array[Double]): Int = { ... }
 7  }
 8  
 9  trait Validation {
10    self: SupervisedLearning => 
11      def validate(x: Array[Double]): Double 
12  }
13  
14  class SVM 
15    extends SupervisedLearning 
16      with Validation {
17  
18    override def train(x: Array[Double]): Int = { ... }
19  }

The support vector machine SVM is a supervised learning algorithm, and therefore extends the SupervisedLearning trait (lines 5 & 15). The code snippet compiles because the class SVM (line 14) complies with the restriction imposed by the trait Validation on sub-types of SupervisedLearning (line 10).

trait UnsupervisedLearning {
  def group(x: Array[Double]): Int
}
 
    // Fails to compile: EM does not satisfy the self-type
    // restriction of the Validation trait (SupervisedLearning)
class EM 
  extends UnsupervisedLearning 
    with Validation { 
  
  def train(x: Array[Double]): Int = { ... }
}

This code snippet does not compile because the expectation-maximization algorithm EM is an unsupervised learning algorithm, and therefore not a sub-class of SupervisedLearning.

Mixin constraint on self-typed method

Constraining a trait to be mixed in only with classes that implement predefined method(s) works the same way as constraining it to subclasses of a predefined type: the self-type is simply expressed as a structural type.
Let's reuse the class hierarchy introduced in the previous section.

trait Validation { 
  self: { def train(x: Array[Double]): Int } =>
     def validate(x: Array[Double]): Double = -1.0
}

This time around, the Validation trait can be mixed in with any class that implements the method train.
As previously seen, the class SVM complies with the restriction imposed by Validation; a minimal, self-contained sketch of a compliant mixin (re-declaring SVM against this version of Validation) is shown below.
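// Sketch: compiles because SVM structurally provides train
class SVM extends Validation {
  def train(x: Array[Double]): Int = 0   // placeholder body satisfying the self-type
}

new SVM().validate(Array(0.5, 1.5))   // returns -1.0, the default value

However, the declaration of the reinforcement learning algorithm QLearning below generates a compilation error, because it does not implement the method train as required.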

// No training needed for Q-Learning: train is not defined,
// so mixing in Validation fails to compile
class QLearning 
  extends Learning 
     with Validation { 
   
  override def predict(x: Double): Double = { ... } 
}

Although brief, this introduction to self-referential conditions should help you start considering this technique to protect your code from unintentional, erroneous sub-typing and trait mixing.


Friday, August 22, 2014

Akka Master-Worker Design

Target audience: Intermediate
Estimated reading time: 5'

This article describes and implements the Master-Workers design for distributed computing using Akka actors and Scala programming language.


Overview

Traditional multi-threaded applications rely on accessing data located in shared memory. The mechanism relies on synchronization monitors such as locks, mutexes or semaphores to avoid deadlocks and inconsistent mutable states. Those applications are difficult to debug because of race conditions, and they incur the cost of a large number of context switches.
The Actor model addresses those issues by using immutable data structures (messages) and asynchronous (non-blocking) communication. The actor model has already been described in the previous post "Scala Share-nothing Actors".

Note: This post focuses on the simple Master-worker model using Akka framework 2.3.4 and Scala 2.11.7

Master-workers model

In this design, the "worker" actors are initialized and managed by the "master" actor, which is responsible for controlling the iterative process, state, and termination condition of the algorithm. The orchestration of the distributed tasks (or steps) executing the algorithm is performed through message passing:
* Activate, sent by the master to the workers to launch the execution of the distributed tasks
* Completed, sent by the workers to the master to notify the completion of the tasks and return the results
* Terminated, received by the master (through its death watch on the workers) to shut the system down once the workers are stopped


The first step is to define the immutable messages.

sealed abstract class Message(val id: Int)

  // The payload xt is a partition (sub-series) of the original dataset
case class Activate(i: Int, xt: Array[Array[Double]]) extends Message(i)
case class Completed(i: Int, xt: Array[Array[Double]]) extends Message(i)
case class Start(i: Int = 0) extends Message(i)


The Start message is sent to the master by the client code (external to the master-workers communication) to launch the computation.
[Sequence diagram from the original post: the master managing the workers' tasks through immutable, asynchronous messages]


The next step is to define the key attributes of the master. Its constructor takes four arguments:
* a time series xt (line 5)
* a transformation function fct (line 6)
* a data partitioner (line 7)
* a method aggr to aggregate the results from all the worker actors (line 8)

 1  type DblSeries = Array[Array[Double]]
 2  type DblVector = Array[Double]
 3  
 4  abstract class Master(
 5    xt: DblSeries, 
 6    fct: DblSeries => DblSeries, 
 7    partitioner: Partitioner, 
 8    aggr: List[DblSeries] => immutable.Seq[DblVector]) extends Actor {
 9  
10    val workers = List.tabulate(partitioner.numPartitions)(n => 
11      context.actorOf(Props(new Worker(n, fct)), 
12        name = s"worker_${String.valueOf(n)}"))
13  
14    workers.foreach( context.watch(_) )
15    // ...
16  }

The master actor creates the list of worker actors, workers, using the higher-order method tabulate (line 10). The master registers each worker with context.watch so as to be notified of its termination (line 14).
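The Partitioner type is referenced but not defined in the post; a minimal sketch, consistent with its usage here (a numPartitions count and a split of the time series), might be:

// Hypothetical partitioner: breaks a time series into numPartitions segments
final class Partitioner(val numPartitions: Int) {
  def split(xt: Array[Array[Double]]): Iterator[Array[Array[Double]]] =
    xt.grouped(Math.ceil(xt.size.toDouble/numPartitions).toInt)
}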

Event handler

In the implementation of the master's event handler receive below, the Start message triggers the partitioning of the original dataset through a split method (line 3).
Upon completion of their tasks, the workers emit a Completed message to the master (line 6). The master counts the workers that have completed their tasks; once all of them have, which is detected with the condition aggregator.size >= partitioner.numPartitions-1, the master computes the aggregated value with aggr (line 8), then stops all the workers through its context, workers.foreach( context.stop(_) ) (line 9). A sketch of the aggregator buffer and the split method follows the listing.

 1  override def receive = {
 2      // Sent by the client to the master to initiate the computation
 3    case s: Start => split
 4  
 5      // Sent by the workers upon completion of their computation
 6    case msg: Completed => {
 7      if (aggregator.size >= partitioner.numPartitions-1) {
 8        val results = aggr(aggregator.toList).take(MAX_NUM_DATAPOINTS).toArray
 9        workers.foreach( context.stop(_) )
10      }
11      aggregator.append(msg.xt)
12    }
13  
14      // Triggered by the death watch when a worker terminates
15    case Terminated(sender) => {
16        // Wait until the current execution of all workers completes
17      if (aggregator.size >= partitioner.numPartitions-1) {
18        context.stop(self)
19        context.system.shutdown
20      }
21    }
22  }

The Terminated message (line 15), delivered by the death watch when a worker stops, shuts down the master and the global context for all the actors with context.system.shutdown (lines 18 & 19).
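The aggregator buffer and the split method invoked by the handler are not shown in the post; a plausible sketch, assuming one partition is dispatched to each worker, is:

import scala.collection.mutable.ListBuffer

// Hypothetical members of Master assumed by the receive handler above
val aggregator = new ListBuffer[Array[Array[Double]]]

def split: Unit =
  partitioner.split(xt).zipWithIndex.foreach {
    case (partition, n) => workers(n) ! Activate(n, partition)
  }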

The next step consists of defining the tasks of the worker actors. A worker actor is fully specified by its id and the data transformation fct (lines 2 & 3).

 1  final class Worker(
 2    id: Int, 
 3    fct: DblSeries => DblSeries) extends Actor {
 4  
 5    override def receive = {
 6        // Sent by the master to start execution
 7      case msg: Activate => {
 8        val msgId = msg.id + id
 9        val output = fct(msg.xt)
10        sender ! Completed(msgId, output)
11      }
12    }
13  }

The event loop processes a single type of message, Activate (line 7), which executes the data transformation fct (lines 8 & 9).

The last step is the implementation of the test application. Let's consider the cancellation of noise on a very large dataset xt, executed across multiple worker actors. The dedicated master actor of type NoiseRemover partitions the dataset using an instance of Partitioner and distributes the noise-cancellation algorithm cancelNoise to its worker actors. The results-aggregation function aggr has to be defined for this specific operation.

def cancelNoise(xt: DblSeries): DblSeries
 
class NoiseRemover(
    xt: DblSeries,
    partitioner: Partitioner,
    aggr: List[DblSeries] => immutable.Seq[DblVector])
 extends Master(xt, cancelNoise, partitioner, aggr)
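The body of cancelNoise is not provided in the post; a hypothetical implementation, smoothing each series with a simple 3-point moving average, might look like:

// Hypothetical noise cancellation: smooth each series
// with a 3-point moving average
def cancelNoise(xt: DblSeries): DblSeries =
  xt.map(series => series.sliding(3).map(_.sum/3).toArray)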


The Akka actor context ActorSystem is initialized (line 1). The test driver implements a very simple results-aggregation function, aggregate, passed as a parameter to the noise remover master actor, controller (line 4). The reference to the controller is generated by the Akka actor factory method ActorSystem.actorOf (line 8).

 1  val actorSystem = ActorSystem("System")
 2  
 3    // Aggregator used in the master: element-wise sum of the partial results
 4  def aggregate(aggr: List[DblSeries]): Seq[DblVector] =
 5    aggr.transpose.map(_.reduce((x, y) => x.zip(y).map(z => z._1 + z._2))).toSeq
 6  
 7    // Create the Akka master actor
 8  val controller = actorSystem.actorOf(
 9    Props(new NoiseRemover(xt, partitioner, aggregate)), "Master"
10  )
11  
12  controller ! Start(1)

Finally, the execution is started with a "fire and forget" Start message (line 12).


Tuesday, July 22, 2014

Akka Actors Blocking on Futures

Target audience: Intermediate
Estimated reading time: 6'

This is a brief introduction to distributed computing using blocking Scala/Akka futures.



Overview

Akka is an actor-based, message-driven framework for building concurrent, reliable applications. As such, Akka supports futures, for which the requester never blocks while waiting for a response: a message or request is sent for execution and the expected response (success or failure/exception) is delivered asynchronously, in the future. The code snippets in the two examples below omit condition tests and supporting methods for clarity's sake. The code compiles and executes with Akka 2.2.4 and 2.3.6.

Note: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exception handling, validation of class and method arguments, scoping qualifiers or imports is omitted.

Concurrent processing

One simple and common usage of Akka/Scala futures is to perform some computation concurrently without going through the pain of creating actors.
Not every computation requires sequential execution in which the input of one task depends on the output of another. Let's consider the evaluation of several models or functions against a predefined time series or dataset.

The first step is to create a controller to manage the concurrent tasks, FuturesController (line 3). The controller takes the input time series xt (line 4) and a list of model candidates xs, each defined as a function Double => Double (line 5). The time series has a single variable (dimension 1), so each model is simply a one-variable function (x: Double) => f(x).

 1  case class Start()
 2  
 3  class FuturesController(
 4    xt: Array[Double],
 5    xs: List[Double => Double])
 6    (implicit timeout: Timeout) extends Actor {
 7  
 8    override def receive = {
 9      case s: Start => {
10        val futures = createFutures
11        val results = futures.map(
12          Await.result(_, timeout.duration)
13        )
14  
15        sender ! results.zipWithIndex.minBy( _._1 )._2
16      }
17      case _ => println("Message not recognized")
18    }
19  
20    def createFutures: List[Future[Double]] // implemented below
21  }

The event handler receive (line 8) for the message Start creates as many futures as there are models to evaluate (line 10). The controller/actor blocks by invoking Await until each of the future tasks completes (line 12). Finally, the controller returns the result of the computation, in this case the index of the fittest of the models xs (line 15). The handler reports a simple error message if a message other than Start is received (line 17).

The futures are created through the method createFutures, whose implementation completes the FuturesController class above. It computes the least-squares error of each model relative to the input dataset, using a map transformation and a fold aggregator (lines 4 - 7).

 1  def createFutures: List[Future[Double]] = 
 2  
 3    xs.map(f => Future[Double] { 
 4      val yt = Range(0, xt.size).map( f(_) )
 5      val r = xt.zip(yt)./:(0.0)((sum, z) => {
 6        val diff = z._1 - z._2
 7        sum + diff*diff
 8      })
 9  
10      Math.sqrt(r)
11    })

Evaluation

Let's consider an input dataset generated with the following model:

    y = m·(m - 1) + r,    r ∈ [0, 1]

where r is a random value between 0 and 1, representing noise.



val TIME_SERIES_LEN = 200
val xt = Array.tabulate(TIME_SERIES_LEN)(
    m => m*(m - 1.0) + Random.nextDouble
)
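The list of candidate models, approx, passed to the controller in the driver below is not shown in the post; a plausible definition, consistent with the generating model above, might be:

// Hypothetical candidate models evaluated against the time series;
// the first is the generating model without the noise term
val approx = List[Double => Double](
  (x: Double) => x*(x - 1.0),
  (x: Double) => x*x,
  (x: Double) => x*(x - 0.5)
)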

The following code snippet lists the key packages to import for most common applications using Akka actors and futures (lines 1 to 7).

The driver program instantiates the actor system (line 12) and creates the controller actor, master, using the Akka actor factory actorOf (lines 13, 14). It sends an ask request (line 18) and returns the best model (line 22) if the controller completes its tasks within a predefined timeout (line 10). The code prints the stack trace in case of failure (line 23).

 1  import akka.actor.{Actor, Props, ActorSystem}
 2  import akka.util.Timeout
 3  import scala.concurrent.{Await, Future}
 4  import scala.concurrent.ExecutionContext.Implicits.global
 5  import scala.concurrent.duration.Duration
 6  import akka.pattern.ask
 7  import scala.util.{Failure, Success, Try}
 8  
 9  private val duration = Duration(2000, "millis")
10  implicit val timeout = new Timeout(duration)
11  
12  val actorSystem = ActorSystem("system")
13  val master = actorSystem.actorOf(
14    Props(new FuturesController(xt, approx)), "Function_eval"
15  )
16  
17  Try {
18    val future = master ? Start()
19    Await.result(future, timeout.duration)
20  } 
21  match {
22    case Success(n) => println(s"Best fit is $n")
23    case Failure(e) => e.printStackTrace
24  }
25  
26  actorSystem.shutdown


It is critical that the application shuts down the Akka actor system before it exits (line 26).
