Estimated reading time: 4'
Just be lazy
A common requirement in machine learning is to load the configuration parameters associated with a model at run-time. The model may have been trained with data segregated by customers or categories. When deployed for prediction, it is critical to select and load the right set of parameters according to the characteristics of the request. For instance, a topic extraction model may have been trained on a scientific corpus, medical articles or computer science papers.
A simple approach is to pre-load all variants of a model when the underlying application is deployed in production. However, consuming unnecessary memory and CPU cycles for a model that may not be needed, at least not right away, is a waste of resources. In this post, we assume that the model parameters are stored on Amazon S3.
Lazy instantiation of objects allows us to reduce unnecessary memory consumption by invoking a constructor once, only when needed. This capability becomes critical for data with a large footprint, such as Apache Spark data sets.
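As a minimal illustration of lazy instantiation, the sketch below uses Scala's built-in lazy val. The names LazyDemo, ModelParams and constructed are hypothetical and only serve to count how many times the value is actually built; they are not part of the repository described in this post.

```scala
object LazyDemo {
  // Hypothetical counter: how many times the parameters were actually built
  var constructed = 0

  // Hypothetical stand-in for an expensive set of model parameters
  final case class ModelParams(name: String)

  // The right-hand side is evaluated on first access only, then cached
  lazy val medical: ModelParams = {
    constructed += 1
    ModelParams("medical")
  }
}
```

Reading LazyDemo.medical several times triggers a single construction; before the first read, nothing has been built.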
A simple, efficient repository
Let's consider the credentials needed to access multiple devices, each consisting of an id, a password and a hint, that have been previously uploaded to S3.
case class Credentials(
  device: String,
  id: String,
  password: String,
  hint: String
)
A hash table is the simplest incarnation of a dynamic repository of models. Therefore we implement a lazy hash table by sub-classing the mutable HashMap.
The first time a model is requested, it is loaded into memory from S3, then returned to the client code. For this purpose, we need to define the following arguments for the constructor of the lazy hash table:
- Dynamic loading mechanism from S3 - loader is responsible for loading the data from S3
- Key generator - toKey converts a string key to the key type of the hash map
import scala.collection.mutable.HashMap
final class LazyHashMap[T, U](
  loader: String => Option[U],
  toKey: String => T) extends HashMap[T, U] {
  // Lookup by raw string. This overloads HashMap.get, whose parameter has type T
  def get(item: String): Option[U] = synchronized {
    val key = toKey(item)
    if (super.contains(key)) // Is it already in memory?
      super.get(key)
    else
      loader(item).map( // otherwise load the item from S3
        l => {
          super.put(key, l)
          l
        }
      )
  }
  // Prevent client code from updating this map directly
  @throws(classOf[UnsupportedOperationException])
  override def put(key: T, value: U): Option[U] =
    throw new UnsupportedOperationException("lazy map is immutable")
}
Here is an example of the two arguments for the constructor of the lazy hash table, with values of type Dataset[Credentials]. The key identifies the data set on which the model has been trained.
val load: String => Option[Dataset[Credentials]] =
  (dataSource: String) => loadData(dataSource)
val key = (s: String) => s
val lazyHashMap = new LazyHashMap[String, Dataset[Credentials]](load, key)
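The load-once behavior can be checked without S3 or Spark. The following is a sketch under stated assumptions, not the implementation above: LazyCache is a hypothetical, composition-based variant that wraps a mutable map instead of sub-classing it, and the in-memory store and the loads counter are stand-ins for the S3 content and for the number of S3 reads.

```scala
import scala.collection.mutable

// Hypothetical variant: wraps a mutable map rather than extending HashMap
final class LazyCache[T, U](loader: String => Option[U], toKey: String => T) {
  private val cache = mutable.HashMap.empty[T, U]

  def get(item: String): Option[U] = synchronized {
    val key = toKey(item)
    cache.get(key).orElse {
      val loaded = loader(item)         // stands in for the read from S3
      loaded.foreach(cache.put(key, _)) // only successful loads are cached
      loaded
    }
  }
}

object LazyCacheDemo {
  // Counts invocations of the loader, i.e. simulated S3 reads
  var loads = 0
  // Stub data standing in for model parameters stored on S3
  private val store = Map("medical" -> Seq(0.5, 1.5), "science" -> Seq(2.0))
  val loader: String => Option[Seq[Double]] =
    (name: String) => { loads += 1; store.get(name) }
  val cache = new LazyCache[String, Seq[Double]](loader, (s: String) => s)
}
```

Requesting "medical" twice invokes the loader once. A request for an unknown name returns None and is not cached, so it would reach the loader again on the next request.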
Data loader
Let's write a loader for a Spark data set of type T stored on AWS S3 in a given bucket, bucketName, and folder, s3InputPath.
import java.io.{FileNotFoundException, IOException}
import org.apache.spark.SparkException
import org.apache.spark.sql.{Dataset, Encoder}
// bucketName, myAccessKey, mySecretKey and log are assumed to be defined in the enclosing scope
def s3ToDataset[T](
  s3InputPath: String
)(implicit encoder: Encoder[T]): Dataset[T] = {
  import sparkSession.implicits._
  // Empty data set, needed for the access keys and to infer the schema
  val loadDS = Seq[T]().toDS
  val accessConfig = loadDS.sparkSession
    .sparkContext
    .hadoopConfiguration
  // Credentials to read from S3
  accessConfig.set("fs.s3a.access.key", myAccessKey)
  accessConfig.set("fs.s3a.secret.key", mySecretKey)
  try {
    // Enforce the schema
    val inputSchema = loadDS.schema
    sparkSession.read
      .format("json")
      .schema(inputSchema)
      .load(path = s"s3a://$bucketName/${s3InputPath}")
      .as[T]
  }
  catch {
    // Log the failure, then rethrow so the declared return type still holds
    case e @ (_: FileNotFoundException | _: SparkException | _: IOException) =>
      log.error(e.getMessage)
      throw e
  }
}
- Access the Hadoop configuration to specify the credentials for S3
- Enforce the schema when reading the data (in JSON format) from S3. Alternatively, the schema could have been inferred.
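import scala.util.Try
import org.apache.spark.sql.{Dataset, SparkSession}
// Create a simple Spark session; conf is assumed to be a SparkConf defined elsewhere
implicit val sparkSession: SparkSession = SparkSession.builder
  .appName("ExecutionContext").config(conf)
  .getOrCreate()
def loadData(s3Path: String): Option[Dataset[Credentials]] = {
  import sparkSession.implicits._ // needed for encoding
  // A failed read yields None instead of propagating the exception
  Try(s3ToDataset[Credentials](s3Path)).toOption
}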
References
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3