Friday, June 10, 2022

Explore Data Centric A.I.

Target audience: Intermediate
Estimated reading time: 4'  

Data-Centric AI (DCAI) is a burgeoning domain focused on methods and models that prioritize the selection, monitoring, and enhancement of datasets for the training, validation, and testing of established machine learning models [ref 1]. Historically, the data science community has placed undue emphasis on refining models, often overlooking the paramount importance of data quality during both training and inference phases. As the adage goes, "garbage in, garbage out" [ref 2].




What you will learn: How to select, analyze, and re-balance training data sets, and how to create or correct labels, in order to build robust ML models.


The quality of input data and labels has long been managed as an afterthought, driven more by intuition than by a strict engineering process. However, researchers have recently introduced techniques that turn the improvement of data into an engineering discipline [ref 3].
This post lists some of the important challenges in understanding and managing the quality of feedback/annotations and input data once a trained and validated model is deployed in production.

Challenges

In this post we assume that the nature or distribution of the data used in training or evaluating a machine learning model may change over time. It is not uncommon for the quality of a trained and validated model to degrade during inference because the distribution of the production data shifts over time.

Let's consider a simple data pipeline for a multi-class classification model.
Illustration of basic labeled data flow


Data scientists and engineers go through the process of training, fine-tuning, and validating a model using data that has undergone selection, sampling, cleaning, preprocessing, and annotation, accomplished through various means including crowd sourcing and the expertise of domain specialists. 

When the model meets the established quality metrics and fits within resource limitations, it is then deployed into a production environment, where it encounters numerous challenges:
  • How does the model fare against real-world production data and feedback?
  • Has the distribution of input data in production shifted over a period of time?
  • How does feedback, if available, differ from the original annotations?
  • Has the data preparation and cleansing accounted for every category of outliers?
  • Were the assumptions regarding end-user skills and basic knowledge correct?
  • Was the training data set annotated with a formal consensus and free of bias?
  • Are the human priors, if any were injected into the model, still valid in production?
  • Does the current data pipeline comply with regulatory and privacy requirements?
Let's review some of the key issues that may arise once users exercise the model.

Data distribution shift

Detecting data inconsistency

The different types of data shift can be derived from Bayes' formula [ref 4].
\[p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}\]
Let's consider production or test data x, a label y, and a trained discriminative model that predicts the class y through p(y|x).
  • Covariate shift occurs when the input distribution p(x) changes between training and test/production, but the input-to-label mapping p(y|x) does not.
  • Concept shift occurs when p(y|x) changes between training and production, but the input distribution p(x) does not.
  • Prior probability shift (or label shift) occurs when p(y) changes between training and production, but p(x|y) does not [ref 4].
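As a rough illustration of prior probability shift, the sketch below compares the class frequencies p(y) observed in a training sample and in a production sample using the total variation distance. The function names and the 0.1 alert threshold are assumptions chosen for this example. The check only looks at p(y); it does not verify that p(x|y) is unchanged.

```python
from collections import Counter

def class_distribution(labels):
    """Empirical p(y) as a dict: label -> relative frequency."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)

# Toy labels: the class mix differs between training and production.
train_labels = ["cat"] * 70 + ["dog"] * 30
prod_labels = ["cat"] * 40 + ["dog"] * 60

shift = total_variation(class_distribution(train_labels),
                        class_distribution(prod_labels))
# Hypothetical alert threshold; it has to be tuned for each application.
label_shift_suspected = shift > 0.1
```

In production, the labels would typically come from user feedback or from a freshly annotated sample.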
If the data distribution in the test or production environments differs from that in the training and validation sets, follow these steps:
  1. Regularly create test sets by sampling from production data.
  2. Incorporate these new test sets into the original training dataset, and then partition it into a fresh training-test set.
  3. Calculate the error metrics (such as Accuracy, Precision, F1 score, etc.) on the training sets, the combined train-test sets, and the standalone test sets.

Train error        1%               1%                   10%              10%
Train-test error   9%               2%                   11%              11%
Test error         10%              10%                  12%              20%
Diagnostic         Overfitting,     Distribution shift   Under-fitting,   High bias,
                   high variance                         high bias        distribution shift

A notable gap between the train-test error and the test error (for example 2% versus 10%, or 11% versus 20%) indicates a distribution shift, or mismatch, between the data used for training the model and the data encountered in production. Active learning is a widely used method to tackle this issue.
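The reading of the three error rates can be sketched as a small rule-based function. The tolerances `target_err` and `gap` are hypothetical values picked so that the four scenarios listed above produce the same diagnostics; they are not standard thresholds.

```python
def diagnose(train_err, train_test_err, test_err, target_err=0.02, gap=0.05):
    """Return the likely issues given three error rates (as fractions).

    target_err (acceptable training error) and gap (tolerated difference
    between two error rates) are illustrative assumptions.
    """
    issues = []
    if train_err - target_err > gap:
        issues.append("high bias (under-fitting)")
    if train_test_err - train_err > gap:
        issues.append("high variance (overfitting)")
    if test_err - train_test_err > gap:
        issues.append("distribution shift")
    return issues or ["no obvious issue"]

scenarios = [(0.01, 0.09, 0.10), (0.01, 0.02, 0.10),
             (0.10, 0.11, 0.12), (0.10, 0.11, 0.20)]
diagnostics = [diagnose(*errors) for errors in scenarios]
```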

Active learning for data distribution shift

Inadequate model predictions could stem from the use of training labels that aren't applicable to the current data distribution. For example, a self-driving vehicle model trained predominantly in urban settings would require newly labeled data from rural and less dense environments.

The expense of acquiring new labels and retraining the existing model can be substantial. Active learning, also known as optimal experimental design in the field of statistics, is a semi-supervised approach that reduces the amount of labeled data needed to train a model by letting the learner choose which examples should be annotated [ref 5].

There are 3 architectures to generate/update labels:
  • Membership query synthesis: This scenario applies to small data sets. The learner synthesizes the samples to be annotated so that they reflect the shifted distribution.
  • Stream-based selective sampling: Real-time data is analyzed, one instance at a time, as a candidate for annotation.
  • Pool-based sampling: This common technique selects a data point/instance from a pool of unlabeled data, estimates the confidence factor, then annotates the data point with the highest information gain (or least confidence).
The objective of any sampling method is to identify unlabeled data that are:
  • Near a decision boundary (Uncertainty or exploitation sampling)
  • Unknown or rarely seen by the current model (diversity or exploration sampling)
There are several algorithms for selecting candidate data points for annotation from a pool of unlabeled data. Among them:
  • Variance reduction: Prioritize data points with the highest variance to converge toward a mean.
  • Contextual multi-armed bandit: Pick the data points that maximize the expected reward (e.g., UCB-1).
  • Random sampling: Select data points independently of their entropy.
  • Entropy maximization: Prioritize data points with the highest uncertainty.
  • Margin selection: Prioritize data points with the smallest margin between the two classes/labels with the highest probability (support vector machine).
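To make the uncertainty-based criteria more concrete, here is a minimal sketch of pool-based selection using least confidence, margin, and entropy scores. It assumes the current model exposes class probabilities for each unlabeled instance; `select_for_annotation` and the toy probabilities are illustrative names and values.

```python
import math

def least_confidence(probs):
    """1 - probability of the most likely class (higher = more uncertain)."""
    return 1.0 - max(probs)

def margin(probs):
    """Gap between the two most likely classes (lower = more uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def entropy(probs):
    """Shannon entropy of the predicted distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_for_annotation(pool_probs, k, score=entropy, largest=True):
    """Return the indices of the k pool instances to send to annotators."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: score(pool_probs[i]),
                    reverse=largest)
    return ranked[:k]

pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.40, 0.35, 0.25],  # probability mass spread over all classes
    [0.50, 0.49, 0.01],  # two classes almost tied
]
by_entropy = select_for_annotation(pool_probs, k=1)
by_margin = select_for_annotation(pool_probs, k=1, score=margin, largest=False)
```

On this toy pool the two criteria pick different instances: entropy favors the prediction spread over all three classes, while margin favors the near tie between the top two.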

Relabeling

Techniques to address incorrect labeling or annotation of data abound. Preventing, finding, and correcting labeling errors in a large dataset is not a trivial task, even for experts.
Here are some of the most interesting approaches to either produce valid labels or correct wrong labels.

Consensus labeling

Consensus labeling is the simplest technique to validate labels from a group of annotators or domain experts. With multiple annotators, we need to estimate the following:

  • Consensus label for each example: Collect the labels with consensus among annotators, whenever possible.
  • Quality score for each consensus label: Estimate the percentage of annotators who selected the consensus label for a given example.
  • Quality score for each annotator: Estimate the percentage of labels selected by a given annotator that agree with the consensus.
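The three estimates above can be sketched with a simple majority vote. The data layout (a dictionary of votes per example) and the annotator names are assumptions made for this example; ties are resolved in favor of the label encountered first, which is an arbitrary choice.

```python
from collections import Counter

def consensus(annotations):
    """annotations: dict example_id -> dict annotator -> label.

    Returns (consensus_labels, label_quality, annotator_quality).
    """
    labels, quality = {}, {}
    for example, votes in annotations.items():
        # Majority vote; on a tie the first label encountered wins.
        label, count = Counter(votes.values()).most_common(1)[0]
        labels[example] = label
        quality[example] = count / len(votes)

    agree, total = Counter(), Counter()
    for example, votes in annotations.items():
        for annotator, label in votes.items():
            total[annotator] += 1
            agree[annotator] += int(label == labels[example])
    annotator_quality = {a: agree[a] / total[a] for a in total}
    return labels, quality, annotator_quality

annotations = {
    "x1": {"ann_a": "spam", "ann_b": "spam", "ann_c": "ham"},
    "x2": {"ann_a": "ham", "ann_b": "ham", "ann_c": "ham"},
}
labels, label_quality, annotator_quality = consensus(annotations)
```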

Dawid-Skene aggregator

The Dawid-Skene aggregation model is a statistical model that uses confusion matrices to characterize the skill level of annotators. It employs an expectation-maximization (EM) algorithm to determine the probability of errors made by annotators, based on their annotations and the probabilities of the true (correct) labels [ref 6, 7]. Utilizing this method necessitates a solid understanding of statistical inference.
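For readers who prefer code to equations, below is a compact, simplified sketch of the EM loop behind a Dawid-Skene style aggregator. It is not a reference implementation: the initialization by vote fractions, the additive smoothing constant, and the fixed number of iterations are assumptions, and the plain products of probabilities would need to be computed in log space for items with many annotations.

```python
def dawid_skene(votes, classes, n_iter=20, smooth=1e-2):
    """votes: list of (item, annotator, label) triples.

    Returns dict item -> dict class -> posterior probability of the true label.
    """
    items = sorted({item for item, _, _ in votes})
    annotators = sorted({ann for _, ann, _ in votes})

    # Initialize the posteriors with the per-item vote fractions.
    post = {i: {c: 0.0 for c in classes} for i in items}
    for item, _, lab in votes:
        post[item][lab] += 1.0
    for i in items:
        z = sum(post[i].values())
        post[i] = {c: v / z for c, v in post[i].items()}

    for _ in range(n_iter):
        # M-step: class priors and one confusion matrix per annotator,
        # conf[annotator][true class][reported label].
        prior = {c: sum(post[i][c] for i in items) / len(items) for c in classes}
        conf = {a: {c: {lab: smooth for lab in classes} for c in classes}
                for a in annotators}
        for item, ann, lab in votes:
            for c in classes:
                conf[ann][c][lab] += post[item][c]
        for a in annotators:
            for c in classes:
                z = sum(conf[a][c].values())
                conf[a][c] = {lab: v / z for lab, v in conf[a][c].items()}

        # E-step: posterior probability of each true label per item.
        new_post = {i: dict(prior) for i in items}
        for item, ann, lab in votes:
            for c in classes:
                new_post[item][c] *= conf[ann][c][lab]
        for i in items:
            z = sum(new_post[i].values())
            post[i] = {c: v / z for c, v in new_post[i].items()}
    return post

# Toy data: annotators "a" and "b" always agree, "c" frequently disagrees.
votes = [
    (0, "a", "pos"), (0, "b", "pos"), (0, "c", "neg"),
    (1, "a", "neg"), (1, "b", "neg"), (1, "c", "neg"),
    (2, "a", "pos"), (2, "b", "pos"), (2, "c", "neg"),
    (3, "a", "neg"), (3, "b", "neg"), (3, "c", "pos"),
]
posteriors = dawid_skene(votes, classes=["neg", "pos"])
consensus_labels = {i: max(p, key=p.get) for i, p in posteriors.items()}
```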

Confident learning

Confident learning is a data-centric approach that focuses on label quality by identifying label errors in datasets. This technique enables data scientists to
  • Characterize label noise.
  • Find label errors.
  • Learn with noisy labels.
The objective is to estimate the joint distribution between the noisy, observed labels and the true latent labels [ref 8]. The method estimates the rates of wrongly assigned positive and negative labels. The noise in actual labels is estimated by learning from uncorrupted labels in ideal conditions.
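A simplified sketch of the counting step follows. It assumes predicted class probabilities are available for every example (ideally computed out-of-sample) and that every class has at least one example. Each class gets a confidence threshold equal to its average predicted probability among the examples carrying that label, and an example is counted under the most probable class that passes its threshold. The function name `flag_label_issues` is made up for this example and is not a library API.

```python
def flag_label_issues(given_labels, pred_probs):
    """Return (confident joint counts, indices of suspected label errors).

    given_labels: list of integer class ids (the noisy, observed labels).
    pred_probs: list of per-class probability lists, one per example.
    """
    n_classes = len(pred_probs[0])
    # Per-class threshold: mean predicted probability of class k over the
    # examples whose given label is k.
    thresholds = []
    for k in range(n_classes):
        own = [p[k] for y, p in zip(given_labels, pred_probs) if y == k]
        thresholds.append(sum(own) / len(own))

    # joint[i][j]: number of examples labeled i that are confidently class j.
    joint = [[0] * n_classes for _ in range(n_classes)]
    issues = []
    for idx, (y, p) in enumerate(zip(given_labels, pred_probs)):
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class
        guess = max(confident, key=lambda k: p[k])
        joint[y][guess] += 1
        if guess != y:
            issues.append(idx)
    return joint, issues

given = [0, 0, 0, 1, 1, 1]
probs = [[0.90, 0.10], [0.80, 0.20], [0.15, 0.85],
         [0.10, 0.90], [0.30, 0.70], [0.25, 0.75]]
joint, issues = flag_label_issues(given, probs)
```

Normalizing the `joint` counts gives an estimate of the joint distribution between observed and latent labels; the flagged indices are candidates for review or relabeling.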

Encoding human priors

Incorporating human priors, either domain knowledge or data scientist assumptions, into a model goes well beyond the Bayes prior (class probability) p(C). It may encompass the selection of a neural architecture (convolution, recurrence, ...), domain-related facts, and first-order logic.
The most commonly applied method to inject domain knowledge into a model is knowledge distillation (teacher-student networks in deep learning).


Class rebalancing

Training classification models often encounters a noted challenge: class imbalance. This happens when there's a disproportionate distribution of classes in the dataset, potentially causing bias in the model's outcome. This phenomenon is particularly evident in binary classification. For instance, a diagnostic model analyzing doctor visit summaries could lean heavily towards detecting prevalent diseases. 
Interestingly, many data professionals might overlook that class imbalances can also emerge in a live production setting. 
Let's explore methods to tackle this imbalance issue.

Data augmentation

This technique involves creating artificial data based on your existing data set. This could be as simple as creating copies with small alterations. For text data, common methods include synonym replacement, random insertion, random deletion, and sentence shuffling.

Feature vs data space augmentation:
  • Feature space augmentation modifies data in the embedding or representation space. 
  • Data space augmentation modifies raw text data.
Example of augmentation using BERT:
The objective is to duplicate a record (sequence of tokens) and randomly replace a token by ‘[UNK]’. The original and duplicate (or augmented) records associated with the same label are fine-tuned (or optionally pre-trained) using the same loss function and MLM model.
Here are some examples of easy augmentation steps:
  • Replace contractions with their expanded form and vice versa (e.g., don’t with do not).
  • Replace with synonyms.
  • Remove out of vocabulary terms from original note.
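A minimal sketch of two of these data-space steps is shown below: expanding contractions and replacing one random token with '[UNK]' in a duplicated record. Whitespace tokenization and the tiny `CONTRACTIONS` table are simplifying assumptions; a BERT pipeline would rely on its own tokenizer.

```python
import random

# Tiny illustrative table; a real one would be much larger.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "isn't": "is not"}

def expand_contractions(tokens):
    """Replace known contractions with their expanded form."""
    out = []
    for tok in tokens:
        out.extend(CONTRACTIONS.get(tok.lower(), tok).split())
    return out

def mask_random_token(tokens, rng, mask="[UNK]"):
    """Return a copy of tokens with one randomly chosen token replaced."""
    augmented = list(tokens)
    augmented[rng.randrange(len(augmented))] = mask
    return augmented

def augment(record, label, rng):
    """Yield the original record and two augmented copies, all with the same label."""
    tokens = record.split()
    yield tokens, label
    yield expand_contractions(tokens), label
    yield mask_random_token(tokens, rng), label

rng = random.Random(42)
samples = list(augment("patient can't sleep", "insomnia", rng))
```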

Under-sampling and Over-sampling

Under-sampling involves randomly discarding examples from the majority class, while oversampling involves duplicating examples from the minority class. However, under-sampling may lead to information loss, while oversampling can lead to overfitting. A more sophisticated oversampling method is the Synthetic Minority Over-sampling Technique (SMOTE), but it's mostly applied on numerical data and might not be directly applicable to text data [ref 9].
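Random under-sampling and over-sampling can be sketched in a few lines. The `resample` helper and its `mode` argument are illustrative; the example balances every class to the size of the smallest class ("under") or the largest class ("over").

```python
import random
from collections import Counter, defaultdict

def resample(records, labels, mode, rng):
    """Balance classes by random 'under' or 'over' sampling."""
    by_class = defaultdict(list)
    for rec, lab in zip(records, labels):
        by_class[lab].append(rec)
    sizes = [len(recs) for recs in by_class.values()]
    target = min(sizes) if mode == "under" else max(sizes)

    out = []
    for lab, recs in by_class.items():
        if len(recs) >= target:
            # Under-sampling: keep a random subset, without replacement.
            chosen = rng.sample(recs, target)
        else:
            # Over-sampling: keep everything and add random duplicates.
            chosen = recs + rng.choices(recs, k=target - len(recs))
        out.extend((rec, lab) for rec in chosen)
    rng.shuffle(out)
    return out

records = list(range(10))
labels = [0] * 8 + [1] * 2
under = resample(records, labels, "under", random.Random(0))
over = resample(records, labels, "over", random.Random(0))
under_counts = Counter(lab for _, lab in under)
over_counts = Counter(lab for _, lab in over)
```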

Class weighting

Class weighting in supervised learning is a technique used to address imbalances in the distribution of different classes within a training dataset. This imbalance can lead to biased models that perform well on the majority class but poorly on the minority class.
Here's a deeper dive into the concept:
  • Imbalance Issue: In many real-world datasets, some classes are underrepresented compared to others (e.g., targeted advertising, medical diagnostics).
  • Impact on Model: Machine learning models trained on such imbalanced datasets tend to be biased towards the majority class, often at the cost of misclassifying the minority class.
The primary goal of class weighting is to make the model pay more attention to the minority class during the training process by assigning a higher weight to the minority class and a lower weight to the majority class. These weights are used during the calculation of the loss function in the training process.
One noticeable drawback is the increase in complexity in model tuning.
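As an illustration, the sketch below computes class weights inversely proportional to class frequencies, using the common heuristic n_samples / (n_classes * class_count), and applies them in a weighted cross-entropy loss. The function names and the toy predictions are assumptions for this example.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}

def weighted_cross_entropy(labels, pred_probs, weights):
    """Weighted mean negative log-likelihood.

    pred_probs[i] maps each class to its predicted probability.
    """
    total = sum(weights[y] * -math.log(p[y]) for y, p in zip(labels, pred_probs))
    return total / sum(weights[y] for y in labels)

labels = ["neg"] * 9 + ["pos"]
weights = balanced_class_weights(labels)

# A model that always predicts "neg" with probability 0.9.
pred_probs = [{"neg": 0.9, "pos": 0.1}] * len(labels)
weighted_loss = weighted_cross_entropy(labels, pred_probs, weights)
uniform_loss = weighted_cross_entropy(labels, pred_probs, {"neg": 1.0, "pos": 1.0})
```

With these weights, the single misclassified minority example dominates the loss, which is the effect class weighting is meant to achieve.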

Ensemble methods

This approach involves training multiple models and having them vote on the output. It often improves results because different models tend to make different errors, which the ensemble can compensate for.

Ensemble learning combines multiple, diverse models to improve the overall performance of a predictive task, especially in the case of imbalanced classes. It is based on the principle that a group of weak learners, when properly combined, can outperform a single strong learner.
The most common techniques in ensemble learning are:
  • Bagging: This involves training multiple models on different subsets of the training data, sampled with replacement (bootstrap samples).
  • Boosting: This method trains learners sequentially, with each new model focusing on the errors made by the previous ones in an attempt to correct them. The final prediction is a weighted sum of the predictions made by each model.
  • Stacking: This involves training multiple models on the same data and then training a new model to aggregate their predictions.
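Bagging with a majority vote can be sketched as follows. The weak learner is a hypothetical one-dimensional nearest-centroid classifier, and bootstrap samples that miss one of the two classes are simply redrawn, which is a simplification made to keep the example short.

```python
import random
from collections import Counter

def train_centroid_stump(sample):
    """Hypothetical weak learner: assign x to the class with the closest mean."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    m0, m1 = sum(xs0) / len(xs0), sum(xs1) / len(xs1)
    return lambda x: int(abs(x - m1) < abs(x - m0))

def bagging(data, n_models, rng):
    """Train n_models weak learners on bootstrap samples of data."""
    models = []
    while len(models) < n_models:
        bootstrap = rng.choices(data, k=len(data))  # sampling with replacement
        if len({y for _, y in bootstrap}) < 2:
            continue  # simplification: redraw samples that miss a class
        models.append(train_centroid_stump(bootstrap))
    return models

def vote(models, x):
    """Majority vote over the predictions of all models."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

# Imbalanced toy data: (feature, label), with class 1 as the minority.
data = [(0.1, 0), (0.3, 0), (0.2, 0), (0.4, 0), (0.9, 1), (1.1, 1)]
models = bagging(data, n_models=11, rng=random.Random(0))
predictions = [vote(models, x) for x in (0.0, 1.5)]
```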

Cost-sensitive learning 

In cost-sensitive learning, misclassification costs are incorporated into the decision process. The costs are usually set inversely proportional to the class frequencies. It helps to focus more on minority classes.
In many real-world scenarios, the cost of misclassifying one class of data may be much higher than misclassifying another. Cost-sensitive learning involves modifying the learning algorithm so that it minimizes a weighted sum of errors associated with each class [ref 10]. 
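One simple way to bring costs into the decision process, sketched below, is to pick the prediction with the lowest expected cost rather than the most probable class. This is applied at prediction time and leaves the training algorithm untouched; the cost matrix values are hypothetical.

```python
def min_expected_cost_class(probs, cost):
    """Pick the prediction with the lowest expected misclassification cost.

    probs[k]: estimated probability that the true class is k.
    cost[k][j]: cost of predicting j when the true class is k.
    """
    n = len(probs)
    expected = [sum(probs[k] * cost[k][j] for k in range(n)) for j in range(n)]
    return min(range(n), key=lambda j: expected[j]), expected

# Hypothetical costs: missing the rare class 1 is 10 times worse
# than raising a false alarm.
cost = [[0, 1],
        [10, 0]]
prediction, expected = min_expected_cost_class([0.8, 0.2], cost)
```

Although class 0 is the most probable, the expected cost of predicting it (0.2 * 10 = 2.0) exceeds that of predicting class 1 (0.8 * 1 = 0.8), so the minority class is selected.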

One-vs-all (OvA) strategy

The one-vs-all strategy is a technique used in machine learning for multi-class classification problems dealing with datasets where one class is significantly underrepresented compared to others. 
The concept consists of breaking down a multi-class classification problem into multiple binary classification problems. For each class in the dataset, a separate binary classifier is trained. This classifier distinguishes between the class under consideration (positive case) and all other classes (negative case).
By treating the minority class as the 'positive' case in one of the binary classifiers, the techniques described in previous sections, such as class weighting or oversampling, can be applied more effectively [ref 11].
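The decomposition itself is straightforward, as the following sketch suggests. It builds one binary target vector per class and selects the class whose binary classifier returns the highest score; the class names and scores are made up for illustration, and the binary classifiers themselves are left out.

```python
def one_vs_all_targets(labels):
    """Map each class to a binary target vector (1 = that class, 0 = the rest)."""
    classes = sorted(set(labels))
    return {c: [int(y == c) for y in labels] for c in classes}

def predict_one_vs_all(scores):
    """scores: dict class -> score from that class's binary classifier."""
    return max(scores, key=scores.get)

labels = ["flu", "flu", "cold", "flu", "rare_disease", "cold"]
targets = one_vs_all_targets(labels)

# Hypothetical scores returned by the three binary classifiers for one input.
predicted = predict_one_vs_all({"flu": 0.4, "cold": 0.3, "rare_disease": 0.7})
```

Each binary problem, such as `targets["rare_disease"]`, can then be rebalanced independently with class weighting or oversampling.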

Transfer learning

Transfer learning is a technique where a model developed for one task is reused as the starting point for a model on a second task. Transfer learning typically involves using a model pre-trained on a large and diverse dataset [ref 12].

This is a sequential process used when:

  • The model was originally trained on a large data set and needs to be updated with a new, smaller dataset.
  • The training tasks use the same medium (text, image, audio …).
  • Low level features from the original model can be reused for the new model.
These models have learned rich feature representations that can be beneficial for a wide range of tasks, including those with class imbalance. Since the pre-trained model has already learned a considerable amount of information, the need for a large and balanced dataset for the new task is reduced. This is particularly helpful when the available dataset for the new task is imbalanced.

Data privacy

Data privacy may not be at the forefront of data scientists' concerns during the training and tuning of models. However, it becomes crucial for productization. For instance, HIPAA requires that medical records be fully and properly de-identified; any violation incurs significant financial penalties.

Data observability

Data observability is defined by 5 key attributes [ref 13]:
  • Freshness: How up to date is your data (latency, refresh rate…)?
  • Quality: Does the distribution at inference differ from training? Are there missing data or duplicates?
  • Volume: Change in traffic pattern or rate.
  • Schema: Change in data organization, new source of data.
  • Lineage: Metadata on upstream and downstream sources.


The 3 pillars of observability data:
  • Metrics: Numeric representations of data measured over time.
  • Logs: Records of events with their associated context.
  • Traces: Causally related events in a distributed environment.

     Data downtime = Number of incidents * (Time to detection + Time to resolution)
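As a quick worked example of this formula, assuming the detection and resolution times are averages per incident expressed in hours (the numbers are invented):

```python
def data_downtime(n_incidents, time_to_detection, time_to_resolution):
    """Data downtime = number of incidents * (detection + resolution time)."""
    return n_incidents * (time_to_detection + time_to_resolution)

# Hypothetical month: 4 incidents, 3 hours on average to detect each one
# and 5 hours on average to resolve it.
downtime_hours = data_downtime(4, 3.0, 5.0)
```

Under these assumptions the data was unreliable for 32 hours over the month, and shortening detection time is as valuable as shortening resolution time.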


Automated data pipeline

Finally, the variety and breadth of techniques used to address data quality issues are a critical impediment to deploying and orchestrating these solutions. Quite often, these techniques are evaluated and deployed in an ad-hoc fashion.

One effective solution is to create a configurable streaming pipeline of various data manipulation and correction techniques that execute concurrently.
Open-source frameworks such as Apache Spark and Kafka can be a starting point for building a data-centric AI platform for test and production data.

Illustration of data centric AI workflow

The design of the annotator interface is a critical element of a successful active learning or re-labeling strategy.

Thank you for reading this article. For more information ...

References




---------------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3