Sunday, March 28, 2021

MLOps for Data Scientists

Target audience: Beginner
Estimated reading time: 4'

A few of my colleagues in data science are hesitant about embracing MLOps. Why should it matter to them? Actually, quite a lot!

This article presents a comprehensive overview of MLOps, especially from a data scientist's perspective. Essentially, MLOps aims to address common issues of reliability and clarity that frequently arise during the development and deployment of machine learning models.



AI productization

MLOps encompasses a suite of tools that facilitate the lifecycle of data-centric AI. This includes training models, performing error analysis to pinpoint data types where the algorithm underperforms, expanding the dataset through data augmentation, resolving discrepancies in data label definitions, and leveraging production data for ongoing model enhancement.

MLOps aims to streamline and automate the training and validation of machine learning models, enhancing their quality and ensuring they meet business and regulatory standards. It merges the roles of data engineering, data science, and dev-ops into a cohesive and predictable process across the following domains:
  • Deployment and automation
  • Reproducibility of models and predictions
  • Diagnostics
  • Governance and regulatory compliance (SOC 2, HIPAA)
  • Scalability and latency
  • Collaboration
  • Business use cases & metrics
  • Monitoring and management
  • Technical support

Predictable ML lifecycle

MLOps outlines the management of the entire machine learning lifecycle. This includes integrating model generation with software development processes (like Jira, GitHub), ensuring continuous testing and delivery, orchestrating and deploying models, monitoring their health, diagnostics and performance, enforcing governance, and aligning with business metrics. From a data science standpoint, MLOps involves a consistent and cyclical process of gathering and preprocessing data, training and assessing models, and deploying them in a production environment.

Data-centric AI

Andrew Ng pioneered the idea of data-centric AI, advocating for AI professionals to prioritize the quality of their training data rather than concentrating mainly on model or algorithm development. Unlike the conventional model-centric AI approach, where data is gathered with minimal focus on its quality to train and validate a model, data-centric AI emphasizes improving data quality. This approach enhances the likelihood of success for AI projects and machine learning models in practical applications.

MLOps, on the other hand, involves a continuous and iterative process encompassing data collection and pre-processing, model training and evaluation, and deployment in a production environment.
Fig 1. Overview of continuous development in data-centric AI - courtesy Andrew Ng


There are several differences between the traditional model-centric AI and data-centric AI approaches.

Model-centric | Data-centric
The goal is to collect as much data as possible and develop a model good enough to deal with the noise without overfitting. | The goal is to select a subset of the training data with the highest consistency and reliability so that multiple models perform well.
Hold the data fixed and iteratively improve the model and code. | Hold the model and code fixed and iteratively improve the data.


Repeatable processes
The objective is to implement established and reliable software development management techniques (such as Scrum, Kanban, etc.) and DevOps best practices in the training and validation of machine learning models. By operationalizing the training, tuning, and validation processes, the automation of data pipelines becomes more manageable and predictable.

The diagram below showcases how data acquisition, analysis, training, and validation tasks transition into operational data pipelines:
Fig 2. Productization in Model-centric AI

As shown in Figure 2, the deployment procedure in a model-centric AI framework offers limited scope for integrating model training and validation with fresh data. 

Fig 3. Productization in Data-centric AI

Conversely, in a data-centric AI approach, Figure 3, the model is put into action early in the development cycle. This early deployment facilitates ongoing integration and updates to the model(s), utilizing feedback and newly acquired data.

AI lifecycle management tools

While the development tools traditionally used by software engineers are largely applicable to MLOps, there has been an introduction of specialized tools for the ML lifecycle in recent years. Several open-source tools have emerged in the past three years to facilitate the adoption and implementation of MLOps across engineering teams.
  • DVC (Data Version Control) is tailored for version control in ML projects.
  • Polyaxon offers lifecycle automation for data scientists within a collaborative workspace.
  • MLflow oversees the complete ML lifecycle, from experimentation to deployment, and features a model registry for managing different model versions.
  • Kubeflow streamlines workflow automation and deployment in Kubernetes containers.
  • Metaflow focuses on automating the pipeline and deployment processes.
Additionally, AutoML frameworks are increasingly popular for swift ML development, offering a user experience akin to GUI development.

Canary, frictionless release

A strong testing and deployment strategy is essential for the success of any AI initiative. A canary release smooths the transition of a model from a development or staging environment to production. This method involves directing a percentage of user requests to a new version or a sandbox environment based on criteria set by the product manager (such as modality, customer type, metrics, etc.).

This strategy limits the impact of a failed deployment because recovery does not require a rollback. If issues arise, it's simply a matter of ceasing traffic to the new version.
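
The routing step of a canary release can be sketched as follows. This is a minimal illustration, not a production router: the names CanaryRouter, Stable and Canary are hypothetical, and the selection criterion is assumed to be a simple percentage applied to a hash of the request identifier.

```scala
// Hypothetical names, for illustration only
sealed trait Target
case object Stable extends Target
case object Canary extends Target

final class CanaryRouter(canaryPercent: Int) {
  require(canaryPercent >= 0 && canaryPercent <= 100, "percent must be in [0, 100]")

  // Map a request id to a bucket in [0, 100); the same id always gets the same bucket
  private def bucket(requestId: String): Int =
    java.lang.Math.floorMod(requestId.hashCode, 100)

  // Requests whose bucket falls below the threshold go to the new version
  def route(requestId: String): Target =
    if (bucket(requestId) < canaryPercent) Canary else Stable
}
```

With this sketch, ceasing traffic to the new version amounts to creating the router with a percentage of 0.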



Thank you for reading this article. For more information ...




---------------------------

Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3

Tuesday, January 19, 2021

Lazy Instantiation of Dataset from Amazon S3

Target audience: Intermediate
Estimated reading time: 4'

Have you ever wished for the capability to instantaneously instantiate an object or dataset stored on Amazon S3 as the need arises? With Scala and Spark's functional toolkit, this is more straightforward than it appears.

The methodology for lazy instantiation of datasets from S3 was initially crafted using Scala 2.12 and Apache Spark 2.4. Notably, this code isn't tied to a particular version of the language or framework and remains compatible with Apache Spark 3.0.


Just be lazy

A common requirement in machine learning is to load the configuration parameters associated with a model at run-time. The model may have been trained with data segregated by customer or category. When it is deployed for prediction, it is critical to select and load the right set of parameters according to the characteristics of the request.
For instance, a topic extraction model may have been trained with a scientific corpus, medical articles or computer science papers.

A simple approach is to pre-load all variants of a model when the underlying application is deployed in production. However, consuming unnecessary memory and CPU cycles for a model that may not be needed, at least not right away, is a waste of resources. In this post, we assume that the model parameters are stored on Amazon S3.
Lazy instantiation of objects allows us to reduce unnecessary memory consumption by invoking a constructor once, only when needed. This capability becomes critical for data with a large footprint such as Apache Spark data sets.
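
At the language level, Scala's lazy val gives the same "construct once, only when needed" behavior. The snippet below is a minimal sketch: ModelRegistry, loadCount and the parameter map are hypothetical stand-ins for a resource that would be fetched from S3.

```scala
object LazyDemo {
  // Counts how many times the (hypothetical) expensive loader runs
  var loadCount = 0

  final class ModelRegistry {
    // Evaluated on first access only, then cached for later accesses
    lazy val parameters: Map[String, Double] = {
      loadCount += 1
      Map("learningRate" -> 0.01) // stand-in for data fetched from S3
    }
  }
}
```

Creating a ModelRegistry does not trigger the loader; only the first read of parameters does.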

A simple, efficient repository

Let's consider credentials for accessing multiple devices. Each set of credentials consists of a device, an id, a password and a hint, and has previously been uploaded to S3.

case class Credentials(
   device: String,
   id: String,
   password: String,
   hint: String
)

A hash table is the simplest incarnation of a dynamic repository of models. Therefore we implement a lazy hash table by sub-classing the mutable HashMap.
The first time a model is requested, it is loaded into memory from S3 and then returned to the client code. For this purpose we need to define the following arguments for the constructor of the lazy hash table:

  • Dynamic loading mechanism from S3 - loader is responsible for loading the data from S3
  • Key generator - toKey converts a string key to the type of key of the hash map
 
import scala.collection.mutable.HashMap

final class LazyHashMap[T, U](
     loader: String => Option[U],
     toKey: String => T) extends HashMap[T, U] {

     // Lazy lookup by string identifier. It cannot override HashMap.get,
     // whose argument is of type T, so it is given its own name.
   def getOrLoad(item: String): Option[U] = synchronized {
       val key = toKey(item)

       if(super.contains(key)) // Is it already in memory?
           super.get(key)
       else
           loader(item).map(  // otherwise load the item from S3
              l => {
                super.put(key, l)
                l
              }
           )
   }

     // Prevent client code from updating this map directly
   @throws(classOf[UnsupportedOperationException])
   override def put(key: T, value: U): Option[U] =
         throw new UnsupportedOperationException("lazy map is immutable")
}

The keyword synchronized implements a critical section that protects the lookup and the load from dirty reads by concurrent threads.
Here is an example of the two arguments for the constructor of the lazy hash table for values of type Dataset[Credentials]. The key identifies the data set and the model that has been trained on it.

val load: String => Option[Dataset[Credentials]] =
     (dataSource: String) => loadData(dataSource)
val key = (s: String) => s

val lazyHashMap = new LazyHashMap[String, Dataset[Credentials]](load, key)
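
The same load-on-first-request idea can be tried without Spark or S3. The sketch below is an alternative that wraps a mutable HashMap by composition instead of sub-classing it; LazyCache and isLoaded are hypothetical names, and the loader is any function that may fail to find a value.

```scala
import scala.collection.mutable

// Composition-based variant: the underlying map is private, so client code
// can only read through get and never insert directly.
final class LazyCache[K, V](loader: K => Option[V]) {
  private[this] val cache = mutable.HashMap.empty[K, V]

  // Return the cached value, or load it once and cache it
  def get(key: K): Option[V] = synchronized {
    cache.get(key).orElse {
      val loaded = loader(key)
      loaded.foreach(cache.put(key, _))
      loaded
    }
  }

  // True if the value for this key is already in memory
  def isLoaded(key: K): Boolean = synchronized(cache.contains(key))
}
```

A key whose load fails is not cached, so a later request for it invokes the loader again.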


The last piece of business is the implementation of the function loadData, which loads and instantiates the dataset.

Data loader

Let's write a loader for a Spark data set of type T stored on AWS S3 in a given bucket, bucketName, and folder, s3InputPath.

import java.io.{FileNotFoundException, IOException}
import org.apache.spark.SparkException
import org.apache.spark.sql.{Dataset, Encoder}

  // Assumes sparkSession, bucketName, myAccessKey, mySecretKey and log
  // are defined in the enclosing scope
def s3ToDataset[T](
     s3InputPath: String
)(implicit encoder: Encoder[T]): Dataset[T] = {
   import sparkSession.implicits._

    // Needed for access keys and to infer the schema
   val loadDS = Seq[T]().toDS
   val accessConfig = loadDS.sparkSession
         .sparkContext
         .hadoopConfiguration

   // Credentials to read from S3
   accessConfig.set("fs.s3a.access.key", myAccessKey)
   accessConfig.set("fs.s3a.secret.key", mySecretKey)

   try {
       // Enforce the schema
      val inputSchema = loadDS.schema
      sparkSession.read
         .format("json")
         .schema(inputSchema)
         .load(path = s"s3a://$bucketName/${s3InputPath}")
         .as[T]
   }
   catch {
        // Log the failure, then propagate it to the caller
      case e @ (_: FileNotFoundException | _: SparkException | _: IOException) =>
         log.error(e.getMessage)
         throw e
   }
}

It is assumed that the Apache Spark session has already been created and that an encoder (e.g. Kryo) has already been defined. The encoder for the type T is implicitly defined, usually along with the Spark session.
The first step is to instantiate a 'dummy' empty dataset of type T. This instance, loadDS, is used to
  • Access the Hadoop configuration to specify the credentials for S3
  • Enforce the schema when reading the data (in JSON format) from S3. Alternatively, the schema could have been inferred.
Note: Data in the S3 bucket is accessed through the s3a:// scheme, the Hadoop connector that reads and writes S3 objects directly. It supersedes the older, block-based s3:// connector and is generally faster.

Finally let's implement the load function, loadData

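import scala.util.Try
import org.apache.spark.sql.{Dataset, SparkSession}

  // Create a simple Spark session (conf is assumed to be an existing SparkConf)
implicit val sparkSession: SparkSession = SparkSession.builder
     .appName("ExecutionContext").config(conf)
     .getOrCreate()

def loadData(s3Path: String): Option[Dataset[Credentials]] = {
    import sparkSession.implicits._ // needed for the encoder
      // None if the data set cannot be loaded from S3
    Try(s3ToDataset[Credentials](s3Path)).toOption
}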




Sunday, November 1, 2020

Evaluate Performance of Scala Tail Recursion

Target audience: Intermediate
Estimated reading time: 3'

Recursion refers to the technique where a function invokes itself, either directly or indirectly, and such a function is termed a recursive function. 
Some problems can be more effortlessly addressed using recursive algorithms. In this article, we will assess the performance of Scala's tail recursion in comparison to iterative approaches.


   

Overview

In Scala, tail recursion is a commonly used technique to apply a transformation to the elements of a collection. The purpose of this post is to evaluate the performance of tail recursion relative to iterative methods.
For the sake of readability, all non-essential code such as error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers and imports is omitted.

Test benchmark

Let's consider a "recursive" data transformation on an array using a sliding window. For the sake of simplicity, we create a simple polynomial transform on an array of values
   {X_0, ..., X_n, ..., X_p}
with a window w, defined as
   f(X_n) = (n-1) X_{n-1} + (n-2) X_{n-2} + ... + (n-w) X_{n-w}

Such algorithms are widely used in signal processing and technical analysis of financial markets (e.g. moving averages, filters).
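
As a worked example of the formula above, here is a minimal sketch that evaluates f(X_n) directly. The name SlidingWindow.transform is hypothetical, and skipping the terms whose index would fall below 0 is an assumption, since the formula does not say what happens near the start of the array.

```scala
object SlidingWindow {
  // f(X_n) = sum over k = 1..w of (n-k) * X_{n-k}
  // Terms with a negative index n-k are skipped (assumption)
  def transform(xs: Array[Int], n: Int, w: Int): Int =
    (1 to w).filter(k => n - k >= 0).map(k => (n - k) * xs(n - k)).sum
}
```

For the array {1, 2, 3, 4}, n = 3 and w = 2, the transform is 2·X_2 + 1·X_1 = 2·3 + 1·2 = 8.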

def polynomial(values: Array[Int]): Int = 
  (if(values.size < W_SIZE) 
     values 
  else 
     values.takeRight(W_SIZE)
  ).sum


The first implementation of the polynomial transform is a tail recursion on each element Xn of the array. The transform f computes f(values(cursor)) from the array values[0, ... , cursor-1], as described in the code snippet below.

class Evaluation(values: Array[Int]) {
  def recurse(f: Array[Int] => Int): Array[Int] = {

    @scala.annotation.tailrec
    def recurse(
      f: Array[Int] => Int,
      cursor: Int,
      results: Array[Int]): Boolean = {

      if( cursor >= values.size) // exit condition
        true
      else {
          // Window of at most W_SIZE values preceding the cursor
        val transformed = f(values.slice(cursor-W_SIZE, cursor))
        results.update(cursor, transformed)
        recurse(f, cursor+1, results)
      }
    }

    val results = new Array[Int](values.size)
    recurse(f, 0, results)
    results
  }
}

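The second implementation relies on the scanLeft method, which walks the array from left to right and emits the transformed value f(Xn) at each step.

  // The accumulator is not used by the transform. scanLeft emits the
  // seed value first, so the result has one more element than values.
def scan(f: Array[Int] => Int): Array[Int] =
   values.zipWithIndex.scanLeft(0)((_, vn) =>
         f(values.slice(vn._2-W_SIZE, vn._2))
  )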

Finally, we implement the polynomial transform on the sliding array window with a map method.

def map(f: Array[Int] => Int): Array[Int] =
   values.zipWithIndex.map(vn => f(values.slice(vn._2-W_SIZE, vn._2)))
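
Before timing the implementations, it is worth checking that they produce the same output. The sketch below is a self-contained comparison of a tail-recursive and a map-based version; WindowCheck, windowSum and the window size of 3 are hypothetical, and the window is assumed to hold the values that precede each element.

```scala
import scala.annotation.tailrec

object WindowCheck {
  val W_SIZE = 3 // hypothetical window size

  // Transform applied to each window: a plain sum
  def windowSum(window: Array[Int]): Int = window.sum

  // Tail-recursive traversal writing into a pre-allocated array
  def viaTailRec(values: Array[Int], f: Array[Int] => Int): Array[Int] = {
    val results = new Array[Int](values.length)

    @tailrec
    def loop(cursor: Int): Unit =
      if (cursor < values.length) {
        // slice clamps a negative start index to 0
        results(cursor) = f(values.slice(cursor - W_SIZE, cursor))
        loop(cursor + 1)
      }

    loop(0)
    results
  }

  // Same computation expressed with map over the indices
  def viaMap(values: Array[Int], f: Array[Int] => Int): Array[Int] =
    values.indices.map(i => f(values.slice(i - W_SIZE, i))).toArray
}
```

For the input {1, 2, 3, 4, 5}, both versions return {0, 1, 3, 6, 9}: the first element has an empty window, and the last one sums 2, 3 and 4.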


Performance evaluation

For the test, each of those 3 methods is executed 1000 times on a dual-core i7 with 8 GB of RAM running Mac OS X Mountain Lion 10.8. The first test consists of executing the 3 methods while varying the size of the array from 10 to 90. The test is repeated 5 times and the duration is measured in milliseconds.



The tail recursion is significantly faster than the two other methods. The scan methods (scan, scanLeft, scanRight) have significant overhead that cannot be "amortized" over a small array. It is worth noticing that the performance of map and scan is similar. The relative performance of those 3 methods is confirmed when testing with large arrays (from 1,000,000 to 9,000,000 items).




Friday, October 9, 2020

Law of Demeter in Java and Scala

Target audience: Beginner
Estimated reading time: 3'

In this article, we shed light on a set of principles frequently bypassed by software engineers: The Law of Demeter, also known as the principle of least knowledge. This design guideline, pivotal in crafting software, especially within the object-oriented paradigm, underscores the essence of minimalism. It posits that an object should make minimal assumptions about the structure or attributes of other entities. In essence, a module should be privy only to the information and resources imperative for its intended function.


Introduction

The Law of Demeter for methods requires that a method of an object may only invoke the methods of the following kinds of objects:
 1  The object itself: this
 2  Variables or objects whose scope is the class (attributes or variable members)
 3  The method parameters (or arguments)
 4  Variables or objects local to the method
 5  A global variable or object accessible by the object

The advantage of the Law of Demeter is that applications are easier to maintain and update because objects are less dependent on the member attributes of other objects. Such an advantage is important when using 3rd party libraries or frameworks. Design patterns such as Facade, Adapter or Proxy provide developers with similar benefits.

The main drawback of the Law of Demeter is the constant need to create wrappers to isolate the internal structure of other objects, which adds execution time overhead. Such wrappers, commonly used in large frameworks, rely on interfaces that delegate the actual implementation of functionality to concrete classes. Aspect-oriented programming attempts to get around this overhead, among other things.

The Law of Demeter was very popular in the early 1990s when C++ gained acceptance in the software engineering community.

Use case

The following Java and Scala code snippets illustrate programming idioms that comply with, and idioms that violate, the Law of Demeter. The Java class below, which implements a string concatenation, complies with the law regarding local variables, class attributes and methods.

public class StringConcatenation  {
  private String _name = "";

    // Validation rule assumed for illustration
  private boolean isValid(final String s) {
    return s != null && !s.isEmpty();
  }

  public String rightUsage(final String s) {
      // Rule 2: Use its own attribute: '_name'
    StringBuilder buf = new StringBuilder(_name);

      // Rule 1: Invoke its own method using 'this'
    if( this.isValid(s) ) {
      // Rule 3: Use the method parameter: 's'
      buf.append(s);

      // Rule 4: Call a local object: 'buf'
      buf.append("\n");
    }

    return buf.toString();
  }
}

The rightUsage method complies with the Law of Demeter because it refers only to objects, variables or methods with either class or local scope. Let's consider the following Scala trait, Dictionary, and class, ScientificDictionary, which are provided as part of a 3rd party library. The Translation class uses a specific dictionary (scientific, medical, ...) in a particular language (English, German, ...) to translate any document.

sealed trait Dictionary[Language] {
    def translate(s: String): String
}

  // Implementations belong to the 3rd party library and are elided here
case class ScientificDictionary[Language]() extends Dictionary[Language] {
    def translate(s: String): String = ???
}
case class MedicalDictionary[Language]() extends Dictionary[Language] {
    def translate(s: String): String = ???
}
case class SlangDictionary[Language]() extends Dictionary[Language] {
    def translate(s: String): String = ???
}

class Translation[Language](var _dictionary: Dictionary[Language])  {
   def translate(s: String): String = _dictionary.translate(s)
}

The method wrongUsage below violates the Law of Demeter because there is no guarantee that the 3rd party library provider will not alter or remove the reference to Dictionary from the Translation object. There is also no guarantee that the translate method of the dictionary will not be removed or deprecated in future releases of the library.

class StringConcatenation(_name: String) {
    def wrongUsage(translate: Translation[Spanish], s: String): String = 
        translate._dictionary.translate(s"${_name}$s")
}
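
For contrast, a compliant version would talk only to the Translation object it receives and let that object delegate to its dictionary. The sketch below is self-contained and simplified: the type parameter is dropped, and UpperCaseDictionary is a hypothetical stand-in for a third-party implementation.

```scala
// Minimal stand-ins for the third-party types (hypothetical implementations)
trait Dictionary { def translate(s: String): String }

object UpperCaseDictionary extends Dictionary {
  def translate(s: String): String = s.toUpperCase
}

final class Translation(dictionary: Dictionary) {
  // The only entry point exposed to client code
  def translate(s: String): String = dictionary.translate(s)
}

final class StringConcatenation(name: String) {
  // Calls a method of its own parameter and never reaches into
  // the dictionary held by the Translation object
  def rightUsage(translation: Translation, s: String): String =
    translation.translate(s"$name$s")
}
```

If the library later changes how Translation stores its dictionary, rightUsage is unaffected because it depends only on Translation.translate.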
 

Some code analysis tools can be configured to enforce one or more Demeter rules. At a minimum, these rules should be part of the toolbox of any technical lead responsible for code reviews.

Reference

Law of Demeter Wikipedia

