Friday, August 22, 2014

Akka Master-Worker Design

Target audience: Intermediate
Estimated reading time: 5'

This article describes and implements the Master-Workers design for distributed computing using Akka actors and Scala programming language.

Follow me on LinkedIn

Overview

Traditional multi-threaded applications rely on accessing data located in shared memory. The mechanism relies on synchronization monitors such as locks, mutexes or semaphores to avoid deadlocks and inconsistent mutable states. Those applications are difficult to debug because of race condition and incur the cost of a large number of context switches.
The Actor model addresses those issues by using immutable data structures (messages) and asynchronous (non-blocking) communication. The actor model has already been described in the previous post "Scala Share-nothing Actors". 

Note: This post focuses on the simple Master-worker model using Akka framework 2.3.4 and Scala 2.11.7

Master-workers model

In this design, the "worker" actors are initialized and managed by the "master" actor which is responsible for controlling the iterative process, state, and termination condition of the algorithm. The orchestration of the distributed tasks (or steps) executing the algorithm is performed through message passing:
* Activate from master to workers to launch the execution of distributed tasks
* Complete from workers to master to notify completion of tasks and return results.
* Terminate from master to terminate the worker actors.


The first step is to defined the immutable messages.

sealed abstract class Message(val id: Int)

case class Activate(i: Int, xt: Array[Double]) extends Message(i)
case class Completed(i: Int, xt: Array[Double]) extends Message(i)
case class Start(i: Int =0) extends Message(i)


The Start message is sent to the master by the client code, (external to the master-workers communication) to launch the computation.
The following sequence diagram illustrates the management of worker' tasks by the master actor through immutable, asynchronous messages.


The next step is to define the key attributes of the master. The constructor takes 4 arguments:
* A time series xt (line 5)
* A transformation function fct (line 6)
* A data partitioner (line 7)
* A method to aggregate the results from all the worker actors aggr (line 8)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
type DblSeries = Array[Array[Double]]
type DblVector = Array[Double]
 
abstract class Master(
  xt: DblSeries, 
  fct: DblSeries => DblSeries, 
  partitioner: Partitioner, 
  aggr: (List[DblSeries]) => immutable.Seq[DblVector]) extends Actor{
 
   val workers = List.tabulate(partitioner.numPartitions)(n => 
       context.actorOf(Props(new Worker(n, fct)), 
           name = s"${worker_ String.valueOf(n)}"))
   
    workers.foreach( context.watch ( _ ) )
     // ...
}

The master actor creates list of worker actors, workers using the higher order method tabulate (line 10). The master registers the worker actor to be notified of their termination context.watch (line 14).

Events handler

In the implementation of the event handler receive for the master below, the Start message triggers the partitioning of the original dataset through a split function (line 3).
Upon completion of their tasks, the workers emit a Completed message to the master (line 6). The master counts the number of workers which have completed their tasks. Once all the workers have completed their tasks with the condition aggregator.size >= partitioner.numPartitions-1, the master computes the aggregated value (line 8), aggr then stop all the workers through its context workers.foreach( context.stop(_) ) (line 9).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
override def receive = {
    // Sent by client to master to initiate the computation
   case s: Start => split
 
    // Sent by workers on completion of their computation
   case msg: Completed => {
      if(aggregator.size >= partitioner.numPartitions-1) {
         val aggr = aggregate.take(MAX_NUM_DATAPOINTS).toArray
         workers.foreach( context.stop(_) )
      }
      aggregator.append(msg.xt)
   }
   
     // Sent by client to shutdown master and workers
   case Terminated(sender) => {
       // wait the current execution of workers completes
      if( aggregator.size >= partitioner.numPartitions-1) {
         context.stop(self)
         context.system.shutdown
      }
   }
}

The message Terminated (line 15) shuts down the master and the global context for all the actors, context.system.shutdown (lines 18 & 19).

The next step consists of defining the tasks for the worker actors. A worker actors is fully specified by its id and the data transformation fct (lines 2 & 3).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
final class Worker(
     id: Int, 
     fct: DblSeries => DblSeries) extends Actor {
   
  override def receive = {
     // Sent by master to start execution
     case msg: Activate => {
        val msgId = msg.id+id
        val output = fct(msg.xt)
        sender ! Completed(msgId, output)
  }
}

The event loop processes only one type of message, Activate, (line 7) which executes the data transformation fct (lines 8 & 9).

The last step is the implementation of the test application. Let's consider the case of the cancellation of noise on a very large dataset xt executed across multiple worker actors. The dedicated master actor of type NoiseRemover partitions the dataset using an instance of Partitioner distributed the cancellation algorithm cancelNoise to its worker (or slave) actors. The results aggregation function aggr has to be defined for this specific operation.

def cancelNoise(xt: DblSeries): DblSeries
 
class NoiseRemover(
    xt: DblSeries,
    partitioner: Partitioner,
    aggr: List[DblSeries] => immutable.Seq[DblVector])
 extends Master(xt, cancelNoise, partitioner, aggr)


The Akka actor context ActorSystem is initialized (line 1). The test driver implements a very simple results aggregation function aggregate passed as parameter of the noise remover master actor, controller (line 4). The reference to the controller is generated by the Akka actor factory method ActorSystem.actorOf (line 8).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
val actorSystem = ActorSystem("System")
   
  // Specifies the aggregator used in the master
def aggregate(aggr: List[DblSeries]): Seq[DblVector] =
    aggr.transpose.map( _.sum).toSeq
 
  // Create the Akka master actor
val controller = actorSystem.actorOf(
   Props(new NoiseRemover(xt, partitioner, aggregate)), "Master"
)

controller ! Start(1)

Finally the execution is started with a "fire and forget" message Start (line 12)


Tuesday, July 22, 2014

Akka Actors Blocking on Futures

Target audience: Intermediate
Estimated reading time: 6'

This is a brief introduction to distributed computing using blocking Scala/Akka futures.


Table of contents
Follow me on LinkedIn

Overview

Akka is actor based and message-driven framework for building concurrent and reliable applications. As such Akka supports futures for which the requester never block waiting for a response. A message or request is sent for a execution and the expect response (success or failure/exception) is delivered asynchronously in the future. The code snippets in our two examples omits condition test or supporting methods for clarity sake. The code compiles and executes with Akka 2.2.4 and 2.3.6

Note: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exception, validation of class and method arguments, scoping qualifiers or import is omitted 

Concurrent processing

One simple and common usage of Akka/Scala futures is to have some computation performed concurrently without going through the pain of creating actors.
Not every computation requires a sequential execution for which the input of one task depends on the output of another tasks. Let's consider the case of the evaluation of a model or function against a predefined time series or dataset.

The first step is to create a controller to manage the concurrent tasks, FuturesController (line 3). The controller takes the input time series xt (line 4) and a list of model candidates, xs as function of time Function1[Double] (line 5). The time series uses a single variable (dimension 1), so the models are simply defined as a one variable function (x: Double) => f(x).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
case class Start()
 
final class FuturesController(
  xt: Array[Double],
  xs: List[Double => Double])
 (implicit timeout: Timeout) extends Actor {
   
  override def receive = {
    case s: Start => {
      val futures = createFutures
      val results = futures.map(
         Await.result(_, timeout.duration)
      )

      sender ! results.zipWithIndex.minBy( _._1 )._2
    }
    case _ => logger.error("Message not recognized")
  }
 
  def createFutures: List[Future[Double]]
}

The event handler receive (line 8) for the message Start creates as many future as needed (equal to the number of models to evaluate) (line 10). The controller/actor blocks by invoking Await until each of the future tasks completes (line 12). Finally, the controller returns the result of the computation (line 15), in this case the fittest of the models xs. The handler logs a simple error message in case a message other than Start is received (line 17).

The futures are created through the method createFutures. The implementation of createFutures consists of computing the least squared error for each model relative to the input dataset using a map transformation and a fold aggregator (lines 4 - 7).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
def createFutures: List[Future[Double]] =
  
  xs.map(f => Future[Double] { 
    val yt = Range(0, xt.size).map( f(_) )
    val r = (xt, yt).zipped./:(0.0)((sum, z) => {
       val diff = z._1 - z._2
       sum + diff*diff
    })

    Math.sqrt(r)
  })
}

Evaluation

Let's consider an input data set generated with the following model
              y = m(m-1) + r // r [0. 1]
where r is a random value between 0 and 1, representing noise.



val TIME_SERIES_LEN = 200
val xt = Array.tabulate(TIME_SERIES_LEN)(
    m => m*(m - 1.0) + Random.nextDouble
)

The following code snippet lists all key packages to be imported for most common applications using Akka actors and futures (lines 1 to 6).

The driver program instantiates the Actor system (line 12), creates the controller actor master using the Akka actor factory actorOf (lines 13, 14). It sends a ask request (line 18) and returns the best model (line 22) if the controller completes it tasks within a predefined timeout (line 10). The code print the stack trace in case of failure (line 23).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
import akka.actor.{Actor, Props, ActorSystem}
import akka.util.Timeout
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import akka.pattern.ask
 

private val duration = Duration(2000, "millis")
implicit val timeout = new Timeout(duration)
   
val actorSystem = ActorSystem("system")
val master = actorSystem.actorOf(
  Props(new FuturesController(xt, approx)), "Function_eval"
)
 
Try {
  val future = master ? Start()
  Await.result(future, timeout.duration)
} 
match {
  case Success(n) => logger.info(s"Best fit is $n")
  case Failure(e) => e.printStackTrace
}

actorSystem.shutdown


It is critical that the application shuts down the the Akka system before it exits (line 26).

References

Tuesday, June 10, 2014

Introduction to Scala Discretized Streams

Target audience: Beginner
Estimated reading time: 5'


How can we leverage Scala Streams to manage very large data sets with limited computing resources?


Table of contents
Follow me on LinkedIn

Overview

A Stream instance can be regarded as lazy list, or more accurately a list with lazy elements. The elements are allocated only when accessed. Stream allows Scala developers to write infinite sequences. Elements can be removed from memory (to be handled by the GC) defined by eliminating any reference to its elements once no longer needed.

Performance Evaluation

It is easy to understand the benefit of Stream in term of memory management. But what about the performance?
Let's compare Stream and List using 3 simple operations:
  • Allocating elements
  • Retrieving a random element
  • Traversing the entire collection
Let's start by defining these operations in the context of computing the mean value of a very large sequence of double values.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
val NUM = 10000000 
 
   // Allocation test
val lst = List.tabulate(NUM)( _.toDouble)

   // Reading test
var y = 0.0
Range(0, 10000).foreach( _ => 
     {y = lst(Random.nextInt(NUM-1)}
)
   // Reducer test
lst.reduce( _ + _ )/lst.size

The operation of reading a value at a random index is repeated 10,000 times in order to make the performance evaluation more reliable (line 8, 9). The mean is computed using a simple reduce method (line 12)

Let's implement the same sequence of operations using Stream class.

1
2
3
4
5
6
7
8
val strm = Stream.tabulate(NUM)( _.toDouble)
   // Reading test
var y = 0.0
Range(0, 10000).foreach( _ => 
  {y = strm(Random.nextInt(NUM-1)}
)
   // Reducer test
strm.reduceRight( _ + _ )/strm.size

The implementation of the generation of random values using Stream is very similar to the implementation using List (line 4, 5). The mean of the stream is also computed with a reducer (line 8).

The test is run 20 times to avoid distortion of the initialization of the JVM. 


The allocation of the elements in the stream is slightly faster than the creation of the list.
The main difference is the time required by the List and Stream to traverse the entire collection using the reduceRight method as a reducer. In this code snippet above, the Stream has to allocate all its elements at once. This scenario is very unlikely as Streams are usually needed to process section or slices of a very large sequence of values or objects, as demonstrated in the next section.

Use case: moving average

The most common application of Scala Stream is iterative or recursive application of a function/transform or sequence of functions to a very large data set, in this case, the mean value.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
val strm = Stream.fill(NUM)( Random.nextDouble )
  
val STEP = 5
val sum = strm.take(STEP).sum
val avStrm = strm.drop(STEP)

 // Apply the updating formula 
 // Sum(n, n+STEP) = Sum(n -1, STEP) - x(n-1) + x(n)
avStrm.zip(avStrm.tail)
      .map(x => sum - x._1 + x._2)
      .map( _ /STEP)


First, the code creates a reference strm of a stream of NUM random values (line 1). Then it computes the sum of the first STEP elements of the stream (line 4). Once the sum is computed, these elements are dropped from the stream (line 5). The mean value is updated for each new batch of new STEP elements (line 9-11).

Here is an alternative implementation of the computation of the moving average on a stream of floating point values using the tail recursion.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
def average(strm: Stream[Double], window: Int): Stream[Double] = {
  
  @scala.annotation.tailrec
  def average(
    src: Stream[Double], 
    target: Stream[Double]): Unit = {
    
    if( !src.isEmpty ) {
      val tailSrc = src.tail
      val newSum = sum - src.head + tailSrc.head
       average(strm.tail, target :+ newSum)
    }
  }
   
  val _strm = Stream.empty[Double] :+ strm.take(window).sum
  average(strm.drop(window), _strm)
  _strm.map( _/ window) 
}

The recursive call average (line 4) has two arguments: the stream src traversed through the recursion (line 5), and the stream that collects the average (mean) values (line 6). The method recurses as long as the source stream src is not empty (line 8).
The performance of the computation of the mean can be greatly improved by parallel its execution,
Stream.par

References

Monday, May 5, 2014

Performance Scala Parallel Collections

Target audience: Beginner
Estimated reading time: 3'

This post evaluates the performance improvement of Scala parallel collections ovr mutable and immutable collections.


Table of contents
Follow me on LinkedIn

Overview

The Scala standard library includes some parallel collections which purpose is to shield developers from the intricacies of concurrent thread execution and race condition. The parallel collections are a very convenient approach to encapsulate concurrency into a high level abstraction similar to the traditional data workflow, scientists are familiar with.
Parallel computing is supported for some collection using the par method as listed below.

  • List[T].par: ParSeq[T]
  • Array[T].par: ParArray[T]
  • Map[K,V].par: ParMap[K,V]
  • HashMap[K,V].par: ParHashMap[K,V]
  • Set[T].par: ParSet[T]
  • ParVector, ParRange and ParIterable
Note: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exception, validation of class and method arguments, scoping qualifiers or import is omitted

Benchmark for parallel arrays

The main purpose of parallel collections is to improve the performance of execution through concurrency. Let’s consider a map and reduce function applied to an array of arbitrary length

final val sz = 100000
val data = Array.tabulate(sz) ( _ << 1)
data.par.map( x => f(x))
data.par.reduceLeft( _ + _)

The next step is to create a set of benchmark test classes, ParArrayBenchmark and ParMapBenchmark that automates the performance evaluation of parallel arrays and maps over an arbitrary number of tasks, nTasks.
The first step is to define a timing function (line 1), that executes a function g for times iterations (line 4).

1
2
3
4
5
6
def timing(g: Int => Unit, times: Int): Long = {
   // Measure duration of 'times' execution of g
   val startTime = System.currentTimeMillis
   Range(0, times).foreach(g)
   System.currentTimeMillis - startTime
}

The benchmark is parameterized for the type U of elements in an array. The constructor takes an Array (line 2) and a parallelizable array ParArray (line 3) as arguments.
The benchmark ParArrayBenchmark evaluate and compare the performance of an array and a parallel array for the most commonly used higher order methods: map (line 6), filter (line 14) and reduce (line 22).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
class ParArrayBenchmark[U](
  u: Array[U], 
  v: ParArray[U], 
  times: Int) {

  def map(f: U => U)(nTasks: Int): Double = {
    v.tasksupport = new ForkJoinTaskSupport(
      new ForkJoinPool(nTasks)
    )
    val duration = timing(_ => u.map(f), times).toDouble
    timing( _ => v.map(f), times )/duration
  }
 
  def filter(f: U => Boolean)(nTasks: Int): Double = {
     v.tasksupport = new ForkJoinTaskSupport(
       new ForkJoinPool(nTasks)
     )
     val duration = timing(_ => u.filter(f), times).toDouble
     timing( _ => v.filter(f), times )/duration
  }

  def reduce(f: (U,U) => U)(nTasks: Int): Double = {
     v.tasksupport = new ForkJoinTaskSupport(
       new ForkJoinPool(nTasks)
     )
     val duration = timing(_ => u.reduceLeft(f), times).toDouble
     timing( _ => v.reduceLeft(f), times )/duration
  }
}

The benchmark is flexible enough to support any kind of method argument f with any type U for all three methods; map, filter and reduce.
The Scala classes ForkJoinTaskSupport and ForkJoinPool are wrappers around the Java classes, ForkJoinTask and ForkJoinPool. ForkJoinPool (lines 8, 16 and 24) provides Scala developers with a very efficient way to manage threads pool: It executes nTasks tasks that are potentially created by other tasks.
The tasks are implemented using Java threads, managed by an executor service, familiar to most Java developers.

Benchmark for parallel maps

Let's create a benchmark for evaluating the performance of parallel maps, similar to the benchmark on parallel arrays.
Once again, the evaluation methods map (line 7) and filter (line 21) are flexible enough to accommodate any function argument f of any type U. The implementation of these two methods follows the same pattern as the one use for the parallel array. The duration of the execution of map and filter is computed through the timing method, introduced in the previous paragraph.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
class ParMapBenchmark[U](
  u: immutable.Map[Int, U], 
  v: ParMap[Int, U], 
  times: Int) {
      
   //Define the map operator for the performance benchmark of map
  def map(f: U => U)(nTasks: Int): Double = {
     v.tasksupport = new ForkJoinTaskSupport(
       new ForkJoinPool(nTasks)
     )

     val duration = timing(_ => u.map{
      case (e1, e2) => (e1, f(e2))
     }, times).toDouble
     timing( _ => v.map{ 
       case (e1, e2) => (e1, f(e2))
     }, times )/duration
   }
 
    //Define the filter operator for the performance benchmark of Scala map
  def filter( f: U => Boolean)(nTasks: Int): Double = {
    v.tasksupport = new ForkJoinTaskSupport(
      new ForkJoinPool(nTasks)
    )

    val duration = timing(_ => u.filter{ 
      case (e1, e2) => f(e2)
    }, times).toDouble
    timing( _ => v.filter{ 
      case (e1, e2) => f(e2)
    }, times)/duration
   }
}


Performance Results

The objective of the performance test is to evaluate the efficiency of the Scala parallel collection according to
  • The number of available CPU cores
  • The complexity of the computation
  • The size of the collection
Let’s look at the relative performance of the map task on a single threaded Array and a parallel array ParArray.
Let's use fairly simple mathematical functions mapF (line 2) (resp. filterF (line 5) and reduceF (line 8) for evaluating the map (resp. filter and reduce) functions on array and parallel arrays. The arrays are create and populated with random values (lines 10 & 11).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
  // Arbitrary map function
val mapF = (x: Double) => Math.sin(x*0.01) + Math.exp(-x)
    
  // Arbitrary filter function
val filterF = (x: Double) => (x > 0.8)
     
  // Arbitrary reduce function
val reduceF = (x:Double, y:Double) => (x+y)*x

val data = Array.fill(SZ)(Random.nextDouble)
val pData = ParArray.fill(SZ)(Random.nextDouble)
 
   // Initialized and execute the benchmark for the parallel array
val benchmark = new ParArrayBenchmark[Double](data, pData, TIMES)

benchmark.map(mapF)(n)
benchmark.filter(filterF)(n)


The results are not surprising in the following respects:
  • The reducer doesn't take advantage of the parallelism of the array. The reduction of ParArray has a small overhead in the single-task scenario and then matches the performance of Array.
  • The performance of the map function benefits from the parallelization of the array. The performance levels off when the number of tasks allocated equals or exceeds the number of CPU core.
The second test consists of comparing the behavior of two parallel collections, ParArray and ParHashMap, on two methods, map and filter, using a configuration identical to the fist test as follows:

We reuse the mathematical functions for evaluate the map, filter and reduce functions used for arrays in the test client code.

1
2
3
4
5
6
7
8
val mapData = new HashMap[Int, Double]
Range(0, SZ).foreach(n => mapData.put(n, Random.nextDouble) )

val parMapData = new ParHashMap[Int, Double]
Range(0, SZ).foreach(n => parMapData.put(n, Random.nextDouble) )

benchmark.map(mapF)(n)
benchmark.filter(filterF)(n)



The impact of the parallelization of collections is very similar across methods and across collections. It's important to notice that the performance of the parallel collections levels off at around four times the single thread collections for fie concurrent tasks and above.


References