Target audience: Advanced
Estimated reading time: 5'
This post is the second installment of our introduction to reinforcement learning using temporal difference learning. It is dedicated to the ubiquitous Q-learning algorithm.
Overview
The previous post, Reinforcement learning in Scala: Policies, introduced the concepts of reinforcement learning and temporal difference, and more specifically the Q-learning algorithm:
Reinforcement Learning I: States & Policy
The following components were implemented:
- States QLState and their associated actions QLAction
- States space QLSpace
- Policy QLPolicy as a container for the tuples {reward, Q-value, probability} of the Q-learning model
The last two components to complete the implementation of Q-learning are the model and the training algorithm.
Modeling and training
The first step is to define a model for reinforcement learning. A model is created during training and is composed of:
- The best policy to transition from any initial state to a goal state
- The coverage ratio, defined as the percentage of training episodes that reach the goal state (or one of the goal states)
class QLModel[T](val bestPolicy: QLPolicy[T], val coverage: Double)
The QLearning class takes three arguments:
- A set of configuration parameters config (a minimal sketch follows this list)
- The search/states space qlSpace
- The initial policy associated with the states (rewards and probabilities) qlPolicy
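QLConfig itself is not shown in this post. The sketch below is a hypothetical reconstruction inferred only from how the configuration is used in the training code (config.alpha, config.gamma, config.episodeLength, config.numEpisodes and config.minCoverage); the actual class may differ.

// Hypothetical sketch of the configuration container; field names are taken
// from their usage in the Q-learning code below.
case class QLConfig(
  alpha: Double,        // learning rate applied to the Q-value update
  gamma: Double,        // discount factor for future rewards
  episodeLength: Int,   // maximum number of state transitions per episode
  numEpisodes: Int,     // number of training episodes (epochs)
  minCoverage: Double   // minimum ratio of successful episodes required to accept a model
)

With these components, the outline of the QLearning class is: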
 1 | class QLearning[T](
 2 |     config: QLConfig,
 3 |     qlSpace: QLSpace[T],
 4 |     qlPolicy: QLPolicy[T]) {
 5 |
 6 |   // Model generated by the Q-learning algorithm
 7 |   val model: Option[QLModel[T]] = train
 8 |
 9 |   // Generate a model through multi-episode training (implementation below)
10 |   def train: Option[QLModel[T]] = ???
11 |   private def train(r: Random): Boolean = ???
12 |
13 |   // Predict a state as a destination of the current
14 |   // state, given a model (implementation below)
15 |   def predict: PartialFunction[QLState[T], QLState[T]] = ???
16 |
17 |   // Select the next state and action index (implementation below)
18 |   def nextState(st: (QLState[T], Int)): (QLState[T], Int) = ???
19 | }
The model of type Option[QLModel] (line 7) is created by the method train (line 10). Its value is None if the training fails.
The training method train executes config.numEpisodes cycles, or episodes, each consisting of a sequence of state transitions (line 5). The random generator r is used to initialize the search space.
 1 | def train: Option[QLModel[T]] = {
 2 |   val r = new Random(System.currentTimeMillis)
 3 |
 4 |   Try {
 5 |     val completions = Range(0, config.numEpisodes).filter(_ => train(r))
 6 |     val coverage = completions.size.toDouble/config.numEpisodes
 7 |
 8 |     if( coverage > config.minCoverage )
 9 |       new QLModel[T](qlPolicy, coverage)
10 |     else
11 |       QLModel.empty[T]
12 |   }.toOption
13 | }
The training process returns a valid model only if the minimum coverage config.minCoverage (the fraction of episodes for which a goal state is reached) is met (line 8).
The method train(r: scala.util.Random) uses tail recursion to transition from a random initial state to one of the goal states. The tail recursion is implemented by the search method (line 4), which applies the recursive temporal difference formula introduced in the previous post (Reinforcement Learning I: States & Policy) (lines 14-18).
For each state transition, the next state is selected as the one whose action yields the highest reward R under the policy qlPolicy (line 10). The Q-value of the current policy is then updated with qlPolicy.setQ before the process is repeated for the next state through recursion (line 21).
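For reference, the update applied at lines 14-18 is the classic Q-learning rule, where the learning rate \(\alpha\) is config.alpha, the discount factor \(\gamma\) is config.gamma, and deltaQ in the code corresponds to the bracketed term:

\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
\]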
 1 | def train(r: Random): Boolean = {
 2 |
 3 |   @scala.annotation.tailrec
 4 |   def search(st: (QLState[T], Int)): (QLState[T], Int) = {
 5 |     val states = qlSpace.nextStates(st._1)
 6 |
 7 |     if( states.isEmpty || st._2 >= config.episodeLength )
 8 |       (st._1, -1)
 9 |     else {
10 |       val state = states.maxBy(s => qlPolicy.R(st._1.id, s.id))
11 |       if( qlSpace.isGoal(state) )
12 |         (state, st._2)
13 |       else {
14 |         val reward = qlPolicy.R(st._1.id, state.id)
15 |         val q = qlPolicy.Q(st._1.id, state.id)
16 |         // Q-learning update formula
17 |         val deltaQ = reward + config.gamma*qlSpace.maxQ(state, qlPolicy) - q
18 |         val nq = q + config.alpha*deltaQ
19 |
20 |         qlPolicy.setQ(st._1.id, state.id, nq)
21 |         search((state, st._2 + 1))
22 |       }
23 |     }
24 |   }
25 |
26 |   r.setSeed(System.currentTimeMillis*Random.nextInt)
27 |   val finalState = search((qlSpace.init(r), 0))
28 |
29 |   if( finalState._2 != -1 )
30 |     qlSpace.isGoal(finalState._1)
31 |   else
32 |     false
33 | }
Note: There is no guarantee that one of the goal states is reached from a randomly chosen initial state. It is expected that some of the training epochs fail, which is why monitoring the coverage is critical. Alternatively, you may choose a deterministic approach to the initialization of each training epoch by picking any state besides the goal state(s) as the starting state.
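Since the coverage ratio is exposed by QLModel, a caller can also reject a weakly trained model after the fact. A small hypothetical sketch, assuming qLearning is an already constructed QLearning[T] instance and 0.8 an arbitrary acceptance threshold:

// Hypothetical usage: keep the model only if at least 80% of the training
// episodes reached a goal state.
val usableModel = qLearning.model.filter(_.coverage >= 0.8)

usableModel match {
  case Some(m) => println(s"Model accepted with coverage ${m.coverage}")
  case None    => println("Coverage too low: consider more episodes or different initial states")
}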
Prediction
Once trained, the model is used to predict the next state with the highest value (or probability) given an existing state. The prediction is implemented as a partial function.
 1 | def predict: PartialFunction[QLState[T], QLState[T]] = {
 2 |   case state: QLState[T] if model.isDefined =>
 3 |     if( state.isGoal ) state else nextState((state, 0))._1
 4 | }
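Because predict is a partial function, a caller can lift it into a total function that returns an Option for states it is not defined on. A small hypothetical usage sketch, assuming qLearning is a trained QLearning[T] instance and currentState an existing QLState[T]:

// lift turns the partial function into QLState[T] => Option[QLState[T]]
val maybeNext = qLearning.predict.lift(currentState)
maybeNext.foreach(next => println(s"Predicted next state: ${next.id}"))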
The method nextState does the heavy lifting. It retrieves the list of states reachable from the current state st through its set of actions (line 2). The most rewarding next state qState is then computed using the reward matrix R of the best policy of the QLearning model (lines 6-8).
 1 | def nextState(st: (QLState[T], Int)): (QLState[T], Int) = {
 2 |   val states = qlSpace.nextStates(st._1)
 3 |
 4 |   if( states.isEmpty || st._2 >= config.episodeLength )
 5 |     st
 6 |   else {
 7 |     val qState = states.maxBy(s =>
 8 |       model.map(_.bestPolicy.R(st._1.id, s.id)).getOrElse(-1.0)
 9 |     )
10 |
11 |     nextState((qState, st._2 + 1))
12 |   }
13 | }
References
- "Online Value Function Determination for Reinforcement Learning", J. Laird, N. Debersky, N. Tinkerhess
- "Scala for Machine Learning", Chapter 11, Reinforcement Learning / Q-Learning algorithm, P. Nicolas, Packt Publishing, 2014
- github.com/patnicolas