Tuesday, December 10, 2013

Reinforcement Learning in Scala: States & Policies

Target audience: Advanced
Estimated reading time: 9'

This post describes a very common reinforcement learning methodology: the temporal difference update, as implemented in Scala. This first section introduces and implements the concepts of states and policies.



Overview

There are many different approaches to implementing reinforcement learning. One of the most commonly used methods is searching the value function space using the temporal difference method.
All known reinforcement learning methods share the same objective of solving sequential decision tasks. In a sequential decision task, an agent interacts with a dynamic system by selecting actions that affect the transition between states in order to optimize a given reward function.


At any given step i, the agent selects an action a(i) on the current state s(i). The dynamic system responds by rewarding the agent for its optimal selection of the next state: \[s_{i+1}=V(s_{i})\]
The learning agent infers the policy that maps the set of states {s} to the set of available actions {a}, using a value function \[V(s_{i})\] The policy is defined as \[\pi :\,\{s_{i}\} \mapsto \{a_{i}\}, \left \{ s_{i}|s_{i+1}=V(s_{i}) \right \}\]


Temporal difference

The most common approach to learning a value function V is the temporal difference method (TD). The method uses observations of prediction differences from consecutive states s(i) and s(i+1). If we denote by r the reward for selecting the action that transitions from state s(i) to s(i+1), and by η the learning rate, then the value V is updated as \[V(s_{i})\leftarrow V(s_{i})+\eta .(V(s_{i+1}) -V(s_{i}) + r_{i})\]
The goal of the temporal difference method is therefore to learn the value function for the optimal policy. The 'action-value' function represents the expected value of action a on state s and is defined as \[Q(s_{i},a_{i}) = r(s_{i}) + V(s_{i})\] where r is the reward value for the state.
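A minimal sketch of this update in Scala may help fix the notation; the state space size, the learning rate and the representation of V as a mutable array indexed by state id are assumptions made purely for illustration:

val numStates = 10                      // assumed size of the state space
val eta = 0.1                           // learning rate
val v = Array.fill(numStates)(0.0)      // value function V, indexed by state id

  // One temporal difference update for a transition s(i) -> s(i+1)
  // with reward r(i), following V(s) <- V(s) + eta.(V(s') - V(s) + r)
def tdUpdate(si: Int, siNext: Int, reward: Double): Unit =
  v(si) += eta * (v(siNext) - v(si) + reward)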


On-policy vs. off-policy

The temporal difference method relies on an estimate of the final reward, computed for each state. There are two flavors of the temporal difference algorithm: on-policy and off-policy.
  • The on-policy method learns the value of the policy used to make decisions. The value function is derived from the execution of actions using the same policy, based on history
  • The off-policy method learns potentially different policies. Therefore the estimate is computed using actions that have not been executed yet

The most common formula for the temporal difference approach is the Q-learning formula. It introduces a discount rate to reduce the impact of the first few states on the optimization of the policy, and it does not need a model of its environment. The exploitation of the action-value approach consists of selecting the next state by computing the action with the maximum reward, while the exploration approach focuses on the total anticipated reward. The update equation for Q-learning is \[Q(s_{i},a_{i}) \leftarrow Q(s_{i},a_{i}) + \eta .(r_{i+1} +\alpha .max_{a_{i+1}}Q(s_{i+1},a_{i+1}) - Q(s_{i},a_{i}))\] \[Q(s_{i},a_{i}): \mathrm{expected\,value\,of\,action\,a\,on\,state\,s}\,\,\eta : \mathrm{learning\,rate}\,\,\alpha : \mathrm{discount\,rate}\]
One of the most commonly used on-policy methods is Sarsa, which does not necessarily select the action that offers the most value. Its update equation is defined as \[Q(s_{i},a_{i}) \leftarrow Q(s_{i},a_{i}) + \eta .(r_{i+1} +\alpha .Q(s_{i+1},a_{i+1}) - Q(s_{i},a_{i}))\]
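The contrast between the two update rules can be sketched in Scala. Here the Q matrix is indexed by (state, state) transitions, anticipating the design used later in this post; the constants and matrix size are illustrative assumptions, not part of the library developed below:

val numStates = 10
val eta = 0.1                                   // learning rate
val alpha = 0.8                                 // discount rate
val q = Array.fill(numStates, numStates)(0.0)   // Q-values per transition

  // Q-learning (off-policy): bootstraps on the best possible next action
def qLearning(si: Int, siNext: Int, reward: Double): Unit = {
  val maxQNext = q(siNext).max
  q(si)(siNext) += eta * (reward + alpha*maxQNext - q(si)(siNext))
}

  // Sarsa (on-policy): bootstraps on the action actually selected next
def sarsa(si: Int, siNext: Int, siNext2: Int, reward: Double): Unit =
  q(si)(siNext) += eta * (reward + alpha*q(siNext)(siNext2) - q(si)(siNext))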

States and actions

Functional languages are particularly suitable for iterative computation. We use Scala for the implementation of the temporal difference algorithm, and allow the user to specify any variant of the learning formula using local functions or closures.
First, we define a state class QLState (line 1) that contains the list of actions of type QLAction (line 3) that can be executed from this state. The only purpose of this class is to connect a list of actions to a source state. The parameterized class argument property (line 4) is used to "attach" extra characteristics to this state.

1 class QLState[T](
2   val id: Int, 
3   val actions: List[QLAction[T]] = List.empty, 
4   property: T) {
5     
6   @inline
7   def isGoal: Boolean = actions.isEmpty  // a goal state has no outgoing action
8 }

As described in the introduction, an action of class QLAction has a source state from and a destination state to (the state reached by following the action). Any state except a goal state has one or more actions, but an action has only one destination or resulting state.

case class QLAction[T](from: Int, to: Int)


The states and actions can be loaded, generated and managed by a directed graph, or search space, of type QLSpace. The search space contains the list of all the possible states available to the agent.
One or more of these states can be selected as goals; the algorithm does not restrict the agent to a single goal state. The process ends when any one of the goal states is reached (OR logic). The algorithm does not support combined goals (AND logic).
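As an illustration, a handful of states and actions can be wired together as follows; the state ids, the transitions and the use of a plain Int as the property are arbitrary choices for this sketch:

  // State 3 is a goal: it has no outgoing action
val s3 = new QLState[Int](3, List.empty, 30)
val s2 = new QLState[Int](2, List(QLAction[Int](2, 3)), 20)
val s1 = new QLState[Int](1, List(QLAction[Int](1, 2), QLAction[Int](1, 3)), 10)
val s0 = new QLState[Int](0, List(QLAction[Int](0, 1), QLAction[Int](0, 2)), 0)

assert(s3.isGoal && !s1.isGoal)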

Let's implement the basic components of the search space QLSpace. The class lists all available states (line 2) and one or more final or goal states goalIds (line 3). Although you might expect the search space to contain a single final or goal state, it is not uncommon for online training to use more than one goal state.

 1 class QLSpace[T](
 2    states: Array[QLState[T]], 
 3    goalIds: Array[Int]) {
 4 
 5     // Indexed map of states 
 6   val statesMap: immutable.Map[Int, QLState[T]] = 
 7     states.map(st => (st.id, st)).toMap
 8     // Set of one or more goal states
 9   val goalStates = new immutable.HashSet[Int]() ++ goalIds
10  
11     // Compute the maximum Q value for a given state and policy
12   def maxQ(st: QLState[T], policy: QLPolicy[T]): Double = { 
13     val best = states.filter( _ != st)
14        .maxBy(_st => policy.EQ(st.id, _st.id))
15     policy.EQ(st.id, best.id)
16   }
17  
18     // Retrieve the destination states of the actions of state st
19   def nextStates(st: QLState[T]): List[QLState[T]] =
20      st.actions.map(ac => statesMap(ac.to))
21  
22   def init(r: Random): QLState[T] = 
23     states(r.nextInt(states.size))
24 }

A hash map statesMap maintains a dictionary of all the possible states, with the state id as unique key (lines 6, 7). The class QLSpace has three important methods:
  • init initializes the search with a random state for each training epoch (lines 22, 23)
  • nextStates returns the list of destination states associated with the state st (lines 19, 20)
  • maxQ returns the maximum Q-value for the state st given the current policy (lines 12-15). The method filters the state itself out of the search for the next best action, then computes the maximum reward, or Q(state, action) value, according to the given policy (a short sketch below exercises these methods)
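Using the hypothetical states s0 to s3 wired earlier, the search space can be exercised as follows; maxQ is left out here because it requires the policy introduced in the next section:

import scala.util.Random

  // Search space over the four illustrative states, with state 3 as the goal
val space = new QLSpace[Int](Array(s0, s1, s2, s3), Array(3))

val start = space.init(new Random(42L))   // random initial state of an epoch
val successors = space.nextStates(s1)     // List containing states 2 and 3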
The next step is to define a policy.

Learning policy

A policy is defined by three components:
  • A reward, collected after transitioning from one state to another (line 2). The reward is provided by the user
  • A Q(state, action) value associated with a transition state and an action (line 4)
  • A probability (with a default value of 1.0) that models the obstacles or hindrance to migrating from one state to another (line 3)
The estimate combines the Q-value (incentive to move to the best next state) and the probability (hindrance to moving to any particular state) (line 7).

1 class QLData {
2   var reward: Double = 1.0
3   var probability: Double = 1.0
4   var value: Double = 0.0
5   
6   @inline
7   final def estimate: Double = value*probability
8 }

The policy of type QLPolicy is a container for the state transition attributes: rewards, Q-values and probabilities.

 1 class QLPolicy[T](numStates: Int, input: Array[QLInput]) {
 2  
 3   val qlData = {
 4     val data = Array.tabulate(numStates)(
 5       _ => Array.fill(numStates)(new QLData)
 6     )
 7  
 8     input.foreach(in => {  
 9       data(in.from)(in.to).reward = in.reward
10       data(in.from)(in.to).probability = in.prob
11     })
12     data
13   }
14   
15   def setQ(from: Int, to: Int, value: Double): Unit =
16      qlData(from)(to).value = value
17  
18   def Q(from: Int, to: Int): Double = qlData(from)(to).value
19 
20     // Estimate used by QLSpace.maxQ: the Q-value weighted
21     // by the transition probability (QLData.estimate)
22   def EQ(from: Int, to: Int): Double = qlData(from)(to).estimate
23 }

The constructor for QLPolicy takes two arguments:
  • The number of states numStates (line 1)
  • The sequence of inputs of type QLInput to the policy
The constructor creates a numStates x numStates matrix of transitions of type QLData (lines 3-12) from the input. The method EQ (lines 20-22) exposes the estimate, the Q-value weighted by the transition probability, that QLSpace.maxQ relies on.

The type QLInput wraps the input data (index of the input state from, index of the output state to, and the reward and probability associated with the state transition) into a single convenient class.

case class QLInput(
   from: Int, 
   to: Int, 
   reward: Double = 1.0, 
   prob: Double = 1.0
)
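A short, hypothetical wiring shows the flow from inputs to Q-values; the rewards and probabilities below are arbitrary:

  // One QLInput per transition of a 4-state space
val input = Array(
  QLInput(0, 1, reward = 0.8),
  QLInput(1, 2, reward = 0.5, prob = 0.9),
  QLInput(1, 3, reward = 1.0),
  QLInput(2, 3, reward = 1.0)
)
val policy = new QLPolicy[Int](4, input)

policy.setQ(1, 3, 0.45)      // update a Q-value during training
val q13 = policy.Q(1, 3)     // 0.45
val eq13 = policy.EQ(1, 3)   // 0.45, weighted by probability(1, 3) = 1.0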

The next post will dig into the generation of a model through Q-learning training.



Sunday, November 3, 2013

Breakable Loops in Scala

Target audience: Beginner
Estimated reading time: 4'




Introduction

Contrary to C++ and Java, Scala does not allow developers to exit an iterative execution prematurely using a syntax equivalent to break.

var sum = 0

for( i <- 0 until 100) {
  sum += i
  if( sum > 400) 
    break   // won't compile!                                  
}

Scala purists tend to stay away from this type of construct and use higher order collection methods such as:
  • exists( p: (T) => Boolean )
  • find( p: (T) => Boolean )
  • takeWhile( p: (T) => Boolean )
However, these methods are not available outside collections. There are cases where a break construct may be a simpler solution.
This post reviews the different options to 'break' from a loop according to a predefined condition.
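As a taste of the functional alternative, here is a small, illustrative sketch in which find and exists locate the first qualifying element and stop the traversal there, exactly where an imperative loop would break:

val xs = List(12, 45, 8, 71, 3)

val firstLarge: Option[Int] = xs.find(_ > 50)   // Some(71); traversal stops at 71
val anyLarge: Boolean = xs.exists(_ > 50)       // true, without visiting 3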

Breakable control

Although breaking out of a loop may not be appealing to "pure" functional developers, the Scala language provides former Java and C++ developers with the option to use break and continue statements. The breakable statement defines the scope within which break is valid.

import scala.util.control.Breaks._                            

var sum = 0
breakable { 
  for (i <- 0 until 100 ) {
    sum += i
    if( sum > 55) 
      break
  }
}

Any experienced developer would be quite reluctant to use such an idiom: besides the fact that the accumulator is defined as a variable, the implementation is unnecessarily convoluted, adding a wrapper around the loop. Let's try to find a more functional approach to breaking out of a loop.

scan & fold to the rescue

Luckily, Scala collections provide functional traversal patterns that can be used and chained to break out of a loop elegantly. The following code snippets apply some of these traversal patterns to an array and an associative map to illustrate the overall "functional" approach to iteration.
Let's consider the problem of extracting the elements of an array or map up to the first element that meets a predefined condition; for instance, extracting the elements of an array or hash map until one element has the value 0. The takeWhile method (lines 7 & 10) ends the traversal of a collection and returns the elements visited so far once the condition is met.

 1 val values = Array[Int](9, 45, 11, 0, 67, 33)
 2 
 3 val mappedValues = Map[String, Int](
 4   "A"->2, "B"->45, "C"->67, "D"->0, "E"->56
 5 )
 6      // Extract a subarray of values until the condition is met
 7 values takeWhile( _ != 0) // -> Array(9, 45, 11)
 8  
 9      // Extract a submap until the condition is met
10 mappedValues takeWhile( _._2 != 0) 
11      // -> Map(E -> 56, A -> 2, B -> 45, C -> 67)

The second case consists of accumulating values until a condition on the accumulator is met. Contrary to the fold and reduce methods, which apply a function (e.g. summation) to all the elements of a collection and return a single value, the scan, scanLeft and scanRight methods return a collection of the values produced by the function.

In the following example, the invocation of the higher order method scanLeft (lines 1, 4 & 7) generates an array and a hash map of the cumulative values. The takeWhile method (lines 5 & 8) is then applied to the resulting array or map to return the collection of cumulative values until the accumulator exceeds 56. The same chain can also be used to return the element that pushes the accumulator beyond the threshold (lines 7-9).

1 values.scanLeft(0)(_ + _).takeWhile( _ < 56 ) 
2    // -> Array(0, 9, 54)
3  
4 mappedValues.scanLeft(0)( _ + _._2 )
5    .takeWhile( _ < 56 )
6  
7 val indexTargetEl = values.scanLeft(0)(_ + _)
8       .takeWhile( _ < 56 ).size - 1
9 val targetEl = values(indexTargetEl)



Tail recursion

A third, elegant alternative for exiting a loop is tail recursion, which is supported natively by the Scala language. The first method, findValue, exits the traversal when a condition on the value is reached (line 5). The second method, findCumulative, is implemented as a closure and exits when the sum of elements exceeds a predefined value (line 16).

 1 val values = Array[Int](9, 45, 11, 0, 67, 33)
 2  
 3 @scala.annotation.tailrec
 4 def findValue(values: Array[Int], index: Int): Int = {
 5   if( index >= values.size || values(index) == 0 )  // check bounds first
 6     index
 7   else
 8     findValue(values, index + 1)
 9 }
10 val newValues = values.slice(0, findValue(values, 0))
11 
12 val MAX_VALUE: Int = 86
13 
14 @scala.annotation.tailrec
15 def findCumulative(sum: Int, index: Int): Int = {
16   if( index >= values.size || sum >= MAX_VALUE )
17     index - 1
18   else
19     findCumulative(sum + values(index), index + 1)
20 }
21 
22 val newValues2 = values.slice(0, findCumulative(0, 0))

This concludes our review of the three Scala constructs available to developers to exit prematurely from a loop:
  • the breakable construct
  • higher order methods such as exists, find and takeWhile
  • tail recursion


Tuesday, October 8, 2013

Symbolic Regression for Data Modeling

Target audience: Beginner
Estimated reading time: 6'




Overview

Symbolic Regression allows domain experts to create, add or modify rules or policies extracted from data. The algorithms most commonly used in Symbolic Regression are:
  • Genetic Algorithms
  • Learning Classifier Systems
Symbolic regression is used in many applications, ranging from network performance optimization, failure prediction (MTBF) and streaming data analysis to the detection of security breaches.


Presentation

The following presentation describes the main components, benefits and drawbacks of symbolic regression.




References

  • Genetic Programming: On the Programming of Computers by Means of Natural Selection - J. Koza - MIT Press, 1992
  • Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning) - R. Sutton, A. Barto - MIT Press, 1998

Friday, September 13, 2013

Performance of Error Handling in Scala

Target audience: Beginner
Estimated reading time: 4'




Overview

The Scala programming language supports three different approaches to handling errors:
  • Error codes
  • Exceptions
  • The Option monadic pattern
Java and C++ programmers are familiar with the first two approaches. Scala introduces the Option type, which is defined as a monad.
Let's take a simple example: the computation of the function sin(sqrt(x)).
The client code unwraps the return type, Option[Double], to handle the error.


def sqrt(x: Double): Option[Double] = {
  if(x < 0.0) None
  else Some(Math.sqrt(x))
}

sqrt(a) match {
  case Some(y) => Math.sin(y)   // bind a new name to avoid shadowing a
  case None => Console.println(s"argument $a < 0.0")
}

An alternative is to use a default value with getOrElse in case of failure.

sqrt(a).map( Math.sin(_) ).getOrElse { Console.println(s"argument $a < 0.0"); 0.0 }


Caution: You should never unsafely unwrap an option using the method get.
sqrt(-3.0).get throws a java.util.NoSuchElementException: None.get exception.


The Option type has a few important benefits:

  • The return value None represents the absence of a returned value or reference, which is safer for the client code to process than a returned null (i.e. stray pointers)
  • The Option type allows developers to create their own error handler: case None => f(do whatever you want or need to do)
  • The returned value(s) and the error handler are encapsulated in the same entity, the Option class.
However, there is no "free lunch", and I was curious to find out whether the benefits of the Option type come with a performance cost. Let's compare the relative performance of the Option type, exception handling and a basic error code on a very simple example.

Note: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers or imports is omitted.

Evaluation

I selected the division of double precision floating point values as our simple test. The test code below is compiled and run with Scala 2.10.2 and Java JDK 1.7.0_45 on 64-bit Windows.

     // Handling divide by zero using the error code NaN
def divErrorCode(x: Double, y: Double): Double =
  if( Math.abs(y) < 1e-10) Double.NaN else x/y

       // Handling divide by zero using an ArithmeticException
def divException(x: Double, y: Double): Double = {
  if( Math.abs(y) < 1e-10) 
    throw new ArithmeticException("Cannot divide by 0")
  x/y
}

     // Handling divide by zero using an Option[Double] return type
def divOption(x: Double, y: Double): Option[Double] =
  if( Math.abs(y) < 1e-10) None else Some(x/y)


The code invoking the three error handling methods follows:

import scala.util.Try

  // Option error handling
divOption(x, y).getOrElse(-1.0)

  // Exception error handling 
Try( divException(x, y) ).getOrElse(-1.0)

  // Return value error handling
val result = divErrorCode(x, y)
if( result.isNaN ) -1.0 else result

Positive test

The test consists of running each of those local functions through a large number of iterations, varying from 2,000,000 to 18,000,000; the graph further below summarizes the results.
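A simplified sketch of such a timing harness follows; the nanoTime-based measurement, the operand values and the single-run structure are assumptions, as the actual benchmark averages over repeated runs:

import scala.util.Try

  // Execute a closure n times and return the elapsed time in milliseconds
def timeMs(n: Int)(f: => Unit): Double = {
  val start = System.nanoTime
  (0 until n).foreach(_ => f)
  (System.nanoTime - start)*1e-6
}

val n = 2000000
val tOption = timeMs(n) { divOption(10.0, 3.0).getOrElse(-1.0) }
val tErrCode = timeMs(n) {
  val r = divErrorCode(10.0, 3.0)
  if( r.isNaN ) -1.0 else r
}
val tException = timeMs(n) { Try( divException(10.0, 3.0) ).getOrElse(-1.0) }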


As expected, the performance of each of the three error handling mechanisms degrades linearly with the number of iterations. Clearly, the option error handling has the best performance, while exception handling incurs the highest overhead.

Negative test 

We run the same test with the number of iterations varying from 200,000 to 1,800,000, but with an arithmetic error at each iteration.



The exception handling mechanism has by far the highest overhead; the Option monad and the returned error code mechanism have very similar performance.
Performance is only one of the elements to consider when selecting the most appropriate error handling mechanism. However, all things being equal, the overhead generated by repeatedly throwing exceptions (e.g. in a lengthy iteration or recursion) should be an incentive to consider alternative solutions.

Note: The evaluation of the error handling mechanism has been performed using Scala 2.9. Results may vary in future releases. 


Monday, August 12, 2013

Performance of Scala iterators

Target audience: Beginner

Objective 

The Scala programming language provides software developers with several options to iterate through the elements of a collection:
  • for and while loops, and the foreach( x => f(x) ) higher order method
  • map[Y](f: X => Y): Collection[Y], which creates a new collection by applying the function f to each element of the collection
  • foldLeft[Y](y: Y)(f: (Y, X) => Y): Y (resp. foldRight[Y](y: Y)(f: (X, Y) => Y): Y), which applies a binary operator to each element of the collection from left to right (resp. right to left)

This post attempts to quantify the overhead of the most commonly used iterative methods in Scala and demonstrate the effectiveness of the higher order methods map and foldLeft.

Scala loops for summation

The test runs are executed on a 'plain vanilla' dual core i3 2.1 GHz machine running Linux CentOS 6.0. The first test compares the performance of the different options for traversing an array of Float, whose size varies from 2,000,000 to 40,000,000 elements, and applying the operation sum += x to each of its elements. The options are:
  • foreach (line 6)
  • for loop (line 9)
  • while loop (lines 14-17)
  • foldLeft (line 19)
 1 val rGen = new scala.util.Random(System.currentTimeMillis)
 2 val data = Array.fill(size)(rGen.nextFloat)
 3 var sum = 0.0
 4 
 5   // Higher order method
 6 data.foreach(sum += _)
 7 
 8   // for loop
 9 for( x <- data) sum += x
10 
11   // while loop
12 var k = 0
13 val len = data.size
14 while( k < len) {
15   sum += data(k)
16   k += 1
17 }
18    // fold
19 sum = data.foldLeft(0.0)((x, z) => x + z)

The test is repeated 25 times in order to reduce the variance and the noise generated by the garbage collector; the first 5 iterations are discarded to avoid the overhead of JVM initialization. The mean execution time of each method is computed for array sizes ranging from 2,000,000 to 40,000,000 floating point values (type Float). The results of the test are plotted in the graph below; the unit of time on the Y axis is milliseconds.

The for, while and foreach expressions have very similar performance, while foldLeft is significantly faster (a ratio of roughly 1:6).
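The repeated-run protocol can be sketched as follows; this is a simplified harness in which only the run and warm-up counts mirror the description above, the rest being assumptions:

  // Execute a benchmark numRuns times, discard the first warmUp runs
  // (JVM initialization) and return the mean duration in milliseconds
def benchmark(numRuns: Int, warmUp: Int)(f: => Unit): Double = {
  val durations = (0 until numRuns).map(_ => {
    val start = System.nanoTime
    f
    (System.nanoTime - start)*1e-6
  })
  val kept = durations.drop(warmUp)
  kept.sum / kept.size
}

val meanFoldMs = benchmark(25, 5) { data.foldLeft(0.0)((x, z) => x + z) }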


Data transformation

The second test compares the performance of:
  • foreach: "fills up" iteratively a mutable array of type ArrayBuffer (line 3)
  • foreach: updates in place a pre-allocated array of the same size (lines 7 & 8)
  • map: transforms the original array into an array of square values (line 11)
 1    // foreach with mutable array buffer
 2 val newData = new mutable.ArrayBuffer[Float]
 3 data.foreach( (x: Float) => newData.append(x*x))
 4 val result = newData.toArray
 5 
 6   // foreach with in-place update of a pre-allocated array
 7 val pData = Array.fill(data.size)(0.0F)
 8 data.indices.foreach(i => pData.update(i, data(i)*data(i)))
 9 
10   // map
11 val nData = data.map((x: Float) => x*x)

Let's run the same test (with the same setup defined in the previous section).



The test shows that map, the method dedicated to transforming an array into another array through a natural transformation, is by far the most efficient.
Methods dedicated to a specific task, such as foldLeft for summation and map for data transformation, are far more effective than the 'plain vanilla' loop constructs. The tests are conducted with Scala 2.10.2.


Important notes:
The for construct has a very different meaning in Scala than in C or Java. It is actually a wrapper, or syntactic sugar, around the monadic chain of flatMap and map transformations, as follows:

  for {
      a <- f(x)  // flatMap
      b <- g(a)  // flatMap
      c <- h(b)  // map
  } yield { /* ... */ }
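A concrete sketch with hypothetical Option-valued functions f, g and h shows the for comprehension and its desugared flatMap/map chain producing the same result:

def f(x: Int): Option[Int] = Some(x + 1)
def g(a: Int): Option[Int] = Some(a * 2)
def h(b: Int): Option[Int] = Some(b - 3)

val viaFor = for {
  a <- f(1)
  b <- g(a)
  c <- h(b)
} yield c

  // Desugared equivalent of the for comprehension above
val viaFlatMap = f(1).flatMap(a => g(a).flatMap(b => h(b).map(c => c)))

assert(viaFor == viaFlatMap)   // both evaluate to Some(1)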
A more elaborate and time consuming benchmark would consist of running multiple tests using several boolean (<, !=, ...) and numeric (+, *, x => sin(x), ...) operators and computing the normalized mean and variance.
