Sunday, March 28, 2021

MLOps for Data Scientists

Target audience: Beginner
Estimated reading time: 4'

A few of my colleagues in data science are hesitant about embracing MLOps. Why should it matter to them? Quite a lot, as it turns out.

This article presents a comprehensive overview of MLOps, especially from a data scientist's perspective. Essentially, MLOps aims to address common issues of reliability and clarity that frequently arise during the development and deployment of machine learning models.


Table of contents
  • AI productization
  • Predictable ML lifecycle
  • Data-centric AI
  • Repeatable processes
  • AI lifecycle management tools
  • Canary, frictionless release

AI productization

MLOps encompasses a suite of tools that facilitate the lifecycle of data-centric AI. This includes training models, performing error analysis to pinpoint data types where the algorithm underperforms, expanding the dataset through data augmentation, resolving discrepancies in data label definitions, and leveraging production data for ongoing model enhancement.
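To make the error-analysis step concrete, here is a minimal sketch (all names and the slicing scheme are hypothetical) that computes accuracy per data slice, so that underperforming slices become candidates for targeted data collection or augmentation:

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """Compute accuracy per data slice from (slice, label, prediction) records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for slice_name, label, pred in records:
        totals[slice_name] += 1
        hits[slice_name] += int(label == pred)
    return {s: hits[s] / totals[s] for s in totals}

# Toy evaluation records: the model does well on daylight images
# but poorly at night, pointing to where more data is needed.
records = [
    ("daylight", 1, 1), ("daylight", 0, 0), ("daylight", 1, 1),
    ("night", 1, 0), ("night", 0, 0),
]
scores = accuracy_by_slice(records)
weak = [s for s, acc in scores.items() if acc < 0.8]
```

Slices in `weak` are the ones worth expanding through data augmentation or additional labeling.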

MLOps aims to streamline and automate the training and validation of machine learning models, enhancing their quality and ensuring they meet business and regulatory standards. It merges the roles of data engineering, data science, and dev-ops into a cohesive and predictable process across the following domains:
  • Deployment and automation
  • Reproducibility of models and predictions
  • Diagnostics
  • Governance and regulatory compliance (SOC 2, HIPAA)
  • Scalability and latency
  • Collaboration
  • Business use cases & metrics
  • Monitoring and management
  • Technical support

Predictable ML lifecycle

MLOps outlines the management of the entire machine learning lifecycle. This includes integrating model generation with software development processes (e.g., Jira, GitHub), ensuring continuous testing and delivery, orchestrating and deploying models, as well as monitoring their health, diagnostics, performance governance, and alignment with business metrics. From a data science standpoint, MLOps involves a consistent and cyclical process of gathering and preprocessing data, training and assessing models, and deploying them in a production environment.

Data-centric AI

Andrew Ng pioneered the idea of data-centric AI, advocating for AI professionals to prioritize the quality of their training data rather than concentrating mainly on model or algorithm development. Unlike the conventional model-centric AI approach, where data is gathered with minimal focus on its quality to train and validate a model, data-centric AI emphasizes improving data quality. This approach enhances the likelihood of success for AI projects and machine learning models in practical applications.

This data-centric loop maps directly onto the MLOps cycle described above: data collection and pre-processing, model training and evaluation, and deployment in a production environment.
Fig 1. Overview of continuous development in data-centric AI - courtesy Andrew Ng


There are several differences between the traditional model-centric AI and data-centric AI approaches.

Model-centric AI:
  • Goal: collect all the data you can and develop a model robust enough to handle noise and avoid overfitting.
  • Hold the data fixed and iteratively improve the model and code.
Data-centric AI:
  • Goal: select a subset of the training data with the highest consistency and reliability so that multiple models perform well.
  • Hold the model and code fixed and iteratively improve the data.


Repeatable processes
The objective is to implement established and reliable software development management techniques (such as Scrum, Kanban, etc.) and DevOps best practices in the training and validation of machine learning models. By operationalizing the training, tuning, and validation processes, the automation of data pipelines becomes more manageable and predictable.
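One way to picture this operationalization is a pipeline of composable stages, each consuming the previous stage's output, so the whole sequence can be scheduled, retried, and monitored as one repeatable job. A minimal Python sketch with stand-in stages (the stage logic here is purely illustrative):

```python
def ingest(_):
    # Stand-in for pulling raw records from a data lake or feature store.
    return [(x, x % 2) for x in range(101)]

def preprocess(rows):
    # Stand-in cleaning step: drop invalid rows, normalize features.
    return [(float(x), y) for x, y in rows if x is not None]

def train(rows):
    # Stand-in "model": predicts the majority class seen in training.
    labels = [y for _, y in rows]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def run_pipeline(steps, payload=None):
    # Each stage feeds the next; a scheduler (Airflow, Kubeflow, etc.)
    # would run, retry, and log these stages the same way every time.
    for step in steps:
        payload = step(payload)
    return payload

model = run_pipeline([ingest, preprocess, train])
```

In practice each stage would be a versioned, independently testable unit, which is exactly what makes the pipeline predictable.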

The diagram below showcases how data acquisition, analysis, training, and validation tasks transition into operational data pipelines:
Fig 2. Productization in Model-centric AI

As shown in Figure 2, the deployment procedure in a model-centric AI framework offers limited scope for integrating model training and validation with fresh data. 

Fig 3. Productization in Data-centric AI

Conversely, in a data-centric AI approach, Figure 3, the model is put into action early in the development cycle. This early deployment facilitates ongoing integration and updates to the model(s), utilizing feedback and newly acquired data.

AI lifecycle management tools

While the development tools traditionally used by software engineers are largely applicable to MLOps, there has been an introduction of specialized tools for the ML lifecycle in recent years. Several open-source tools have emerged in the past three years to facilitate the adoption and implementation of MLOps across engineering teams.
  • DVC (Data Version Control) is tailored for version control in ML projects.
  • Polyaxon offers lifecycle automation for data scientists within a collaborative workspace.
  • MLFlow oversees the complete ML lifecycle, from experimentation to deployment, and features a model registry for managing different model versions.
  • Kubeflow streamlines workflow automation and deployment in Kubernetes containers.
  • Metaflow focuses on automating the pipeline and deployment processes.
Additionally, AutoML frameworks are increasingly popular for swift ML development, offering a user experience akin to GUI development.
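To illustrate the kind of bookkeeping these tools automate, the toy tracker below records parameters and metrics per run and retrieves the best run, much as MLFlow's tracking component does. This is purely illustrative and not any tool's actual API:

```python
import time

class RunTracker:
    """Toy experiment tracker illustrating what tools like MLFlow record."""
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        # Each run is stored with a timestamp, its hyperparameters,
        # and its evaluation metrics.
        self.runs.append({"time": time.time(), "params": params, "metrics": metrics})

    def best_run(self, metric):
        # Retrieve the run that maximizes a given metric.
        return max(self.runs, key=lambda r: r["metrics"][metric])

tracker = RunTracker()
tracker.log_run({"lr": 0.1}, {"val_acc": 0.81})
tracker.log_run({"lr": 0.01}, {"val_acc": 0.87})
best = tracker.best_run("val_acc")
```

Real trackers add artifact storage, lineage, and a model registry on top of this basic record-and-query pattern.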

Canary, frictionless release

A strong testing and deployment strategy is essential for the success of any AI initiative. A canary release smooths the transition of a model from a development or staging environment to production. This method involves directing a percentage of user requests to a new version or a sandbox environment based on criteria set by the product manager (such as modality, customer type, metrics, etc.).

This strategy reduces the risk of deployment failures because it removes the need for rollbacks: if issues arise, it is simply a matter of ceasing traffic to the new version.
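A canary router can be as simple as hashing a stable request attribute into a bucket, so each user consistently lands on the same version. A minimal sketch (the function and parameter names are hypothetical):

```python
import hashlib

def route(user_id: str, canary_percent: int) -> str:
    """Deterministically route a fixed share of users to the canary version."""
    # Hashing makes the assignment stable: the same user always
    # gets the same version for a given canary percentage.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"
```

Halting the canary is then a configuration change, setting `canary_percent` back to zero, rather than a redeployment.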



Thank you for reading this article. For more information ...




---------------------------

Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and is the author of "Scala for Machine Learning" (Packt Publishing, ISBN 978-1-78712-238-3).