Monday, July 4, 2022

Manage Memory in Deep Java Library

Target audience: Advanced
Estimated reading time: 4'

This post introduces techniques to monitor memory usage and detect leaks in machine learning applications using the Deep Java Library (DJL) [ref 1]. This bag of tricks is far from exhaustive.



DJL is an open-source framework that supports distributed inference in Java for deep learning frameworks such as MXNet, TensorFlow, or PyTorch.
Training a deep learning model may require a very large number of floating-point computations, which are best handled by GPUs. However, the JVM memory model is incompatible with the column-based resident memory required by the GPU.

Vectorization libraries such as BLAS are implemented in C/C++ and support fast execution of linear algebra operations. The ubiquitous Python numerical library, NumPy [ref 2], commonly used in data science, is a wrapper around these low-level math functions. The ND interface used in DJL provides Java developers with similar functionality.


Note: The code snippets in this post are written in Scala but can easily be reworked in Java.

The basics

Memory types
DJL supports monitoring three memory components:
  • Resident Set Size (RSS): the portion of a process' memory held in RAM that cannot be swapped out.
  • Heap: the section of memory used for dynamically allocated objects.
  • Non-heap: the section encompassing static memory and stack allocations.

Tensor representation

Deep learning frameworks operate on tensors. In DJL, those tensors are implemented as NDArray objects, created dynamically from arrays of values (integer, float, ...). NDManager is a memory collector/manager native to the underlying C++ implementation of the various deep learning frameworks. Its purpose is to create and delete (close) NDArray instances. NDManager has a hierarchical (single-root tree) structure: child managers can be spawned from a parent [ref 3].


Let's consider the following simple example: computing the mean of a sequence of floating-point values.
 
 
import ai.djl.ndarray.NDManager
import scala.util.Random

// Set up the root memory manager
val ndManager = NDManager.newBaseManager()

val input = Array.fill(1000)(Random.nextFloat())
// Allocate resources outside the JVM
val ndInput = ndManager.create(input)
val ndMean = ndInput.mean()
val mean = ndMean.toFloatArray.head

// Release ND resources
ndManager.close()
 

The steps implemented in the code snippet are:
  1. Instantiate the root resource manager, ndManager
  2. Create an array of 1000 random floating-point values
  3. Convert it into an ND array, ndInput
  4. Compute the mean, ndMean
  5. Convert the result back to a Java/Scala data type
  6. Finally, close the root manager

The root NDManager can be broken down into child managers to allow a finer granularity in the allocation and release of resources. The following method, computeMean, instantiates a child manager, subNDManager, to compute the mean value. The child manager has to be explicitly closed (releasing the associated resources) before the function returns.
The memory associated with the local ND variables, ndInput and ndMean, is automatically released when they go out of scope.

 
import ai.djl.ndarray.NDManager

def computeMean(input: Array[Float], ndManager: NDManager): Float =
   if (input.nonEmpty) {
      val subNDManager = ndManager.newSubManager()
      val ndInput = subNDManager.create(input)
      val ndMean = ndInput.mean()
      val mean = ndMean.toFloatArray.head

      // Closing the sub manager releases the local resources,
      // ndInput and ndMean, attached to it
      subNDManager.close()
      mean
   }
   else
      0.0F
  


JMX to the rescue

The JVM provides developers with the ability to access operating system metrics, such as CPU or heap consumption, through the Java Management Extensions (JMX) interface [ref 4].
The DJL class MemoryTrainingListener leverages the JMX monitoring capability. It provides developers with a simple method, collectMemoryInfo, to collect metrics.

First, we need to instruct DJL to enable the collection of memory stats through a Java property:

  
System.setProperty("collect-memory", "true") 
 

Similarly to the VisualVM heap memory snapshot described in the next section, we can collect memory metrics (RSS, heap and non-heap) before and after each NDArray object is created or released.

  
def computeMean(
   input: Array[Float],
   ndManager: NDManager,
   metricName: String): Float = {

    val manager = ndManager.newSubManager()
    // Initialize a new metrics
    val metrics = new Metrics()

    // Initialize the collection of memory-related metrics
    MemoryTrainingListener.collectMemoryInfo(metrics)
    val initVal = metrics.latestMetric(metricName).getValue.longValue

    val ndInput = manager.create(input)
    val ndMean = ndInput.mean()

    collectMetric(metrics, initVal, metricName)
    val mean = ndMean.toFloatArray.head

    // Close the output array and collect metrics
    ndMean.close()
    collectMetric(metrics, initVal, metricName)

    // Close the input array and collect metrics
    ndInput.close()
    collectMetric(metrics, initVal, metricName)

    // Close the sub manager and collect metrics
    manager.close()
    collectMetric(metrics, initVal, metricName)
    mean
}

First, we instantiate a Metrics instance that is passed along to the various snapshots. Given the metrics and the current NDManager, we record a baseline heap memory size, initVal. We then collect the value of the metric (collectMetric) after each creation and release of NDArray instances in our mean computation example.

Here is a simple snapshot method, which computes the increase or decrease in heap memory relative to the baseline.
 
 
def collectMetric(
  metrics: Metrics, 
  initVal: Long, 
  metricName: String): Unit = {

    MemoryTrainingListener.collectMemoryInfo(metrics)  
    val newVal = metrics.latestMetric(metricName).getValue.longValue
    println(s"$metricName: ${(newVal - initVal)/1024.0} KB")
}


Memory leak detection

I have been using a combination of several investigative techniques to locate the source of a memory leak.

MemoryTrainingListener.debugDump
This method dumps basic memory and CPU stats into a local file for a given Metrics instance.

 
  MemoryTrainingListener.debugDump(metrics, outputFile)
  
 
Output
Heap.Bytes:72387328|#Host:10.5.67.192
Heap.Bytes:74484480|#Host:10.5.67.192
NonHeap.Bytes:39337256|#Host:10.5.67.192
NonHeap.Bytes:40466888|#Host:10.5.67.192
cpu.Percent:262.2|#Host:10.5.67.192
cpu.Percent:262.2|#Host:10.5.67.192
rss.Bytes:236146688|#Host:10.5.67.192
rss.Bytes:244297728|#Host:10.5.67.192

NDManager.cap

It is not uncommon for NDArray objects associated with a sub-manager not to be properly closed. One simple mitigation is to prevent the allocation of new objects in the parent manager by capping it.


 
// Set up the memory manager
val ndManager = NDManager.newBaseManager()

// Protect the parent/root manager from
// accidental allocation of NDArray objects
ndManager.cap()

// NDArray objects must now be created through a sub manager
val subManager = ndManager.newSubManager()
val ndInput = subManager.create(input)

  


Profilers
For reference, DJL introduces a set of experimental profilers to support the investigation of memory consumption bottlenecks [ref 5].

VisualVM
We selected VisualVM [ref 6] among the various JVM profiling solutions to highlight some key statistics when investigating a memory leak. VisualVM is a utility to be downloaded from the Oracle site; it is not bundled with the JDK.

A simple way to identify excessive memory consumption is to take regular snapshots or dumps of the objects allocated on the heap, as illustrated below.


VisualVM has an intuitive UI to drill down into sequences of composite objects. Besides quantifying memory consumption during inference, the detailed view also illustrates the hierarchical nature of the ND manager.






Environment: JDK 11, Scala 2.12.15, Deep Java Library 0.20.0



---------------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3




Friday, June 10, 2022

Explore Data Centric A.I.

Target audience: Intermediate
Estimated reading time: 4'  

Data-Centric AI (DCAI) is a burgeoning domain focused on methods and models that prioritize the selection, monitoring, and enhancement of datasets for the training, validation, and testing of established machine learning models [ref 1]. Historically, the data science community has placed undue emphasis on refining models, often overlooking the paramount importance of data quality during both training and inference phases. As the adage goes, "garbage in, garbage out" [ref 2].




What you will learn: How to select, analyze, re-balance training data sets and create or correct labels for building robust ML models.


The management of the quality of input data and labels has long been regarded as an afterthought, driven more by intuition than by a strict engineering process. However, researchers have recently introduced techniques that turn data improvement into an engineering discipline [ref 3].
This post lists some of the important challenges in understanding and managing the quality of feedback/annotations and input data once a trained and validated model is deployed in production.

Challenges

In this post, we assume that the nature or distribution of the data used to train or evaluate a machine learning model may change over time. It is not uncommon for the quality of a trained and validated model to degrade during inference because the distribution of the production data shifts over time.

Let's consider a simple data pipeline for a multi-classification model.
Illustration of basic labeled data flow


Data scientists and engineers go through the process of training, fine-tuning, and validating a model using data that has undergone selection, sampling, cleaning, preprocessing, and annotation, accomplished through various means including crowd sourcing and the expertise of domain specialists. 

When the model meets the established quality metrics and fits within resource limitations, it is then deployed into a production environment, where it encounters numerous challenges:
  • How does the model fare against real-world production data and feedback?
  • Has the distribution of input data in production shifted over time?
  • How does feedback, if available, differ from the original annotations?
  • Has the data preparation and cleansing accounted for every category of outliers?
  • Were the assumptions regarding end users' skills and basic knowledge correct?
  • Was the training data set annotated through a formal consensus and free of bias?
  • Are the human priors, if any were injected into the model, still valid in production?
  • Does the current data pipeline comply with regulatory and privacy requirements?
Let's review some of the key issues that may arise once users exercise the model.

Data distribution shift

Detecting data inconsistency

The different types of data shift can be derived from the Bayes formula [ref 4].
\[p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}\]
Let's consider production or test data x, a label y, and a trained discriminative model that predicts the class y with probability p(y|x).
  • Covariate shift occurs when the input distribution p(x) changes between training and test/production, but the input-to-label mapping p(y|x) does not.
  • Concept shift occurs when p(y|x) changes between training and production, but the input distribution p(x) does not.
  • Prior probability shift (or label shift) occurs when p(y) changes between training and production, but p(x|y) does not [ref 4].
If the data distribution in the test or production environments differs from that in the training and validation sets, follow these steps:
  1. Regularly create test sets by sampling from production data.
  2. Incorporate these new test sets into the original training dataset, and then partition it into a fresh training-test set.
  3. Calculate the error metrics (such as Accuracy, Precision, F1 score, etc.) on the training sets, the combined train-test sets, and the standalone test sets.

The typical diagnostics, given the three error rates, are:
  • Train error 1%, train-test error 9%, test error 10%: overfitting (high variance)
  • Train error 1%, train-test error 2%, test error 10%: distribution shift
  • Train error 10%, train-test error 11%, test error 12%: under-fitting (high bias)
  • Train error 10%, train-test error 11%, test error 20%: high bias and distribution shift

In this scenario, the notable difference in error rates between the train-test and the test data clearly indicates a distribution shift or mismatch between the data used for training the model and the data encountered in production. Active learning is a widely used method to tackle this issue.
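The diagnostics above can be sketched as a small helper. Here is a minimal illustration in Python; the base error rate and the significance gap are assumptions chosen for this example, not canonical values.

```python
def diagnose(train_err, train_test_err, test_err, base_err=0.0, gap=0.02):
    """Classify the error profile of a model from its three error rates.

    base_err is the acceptable (e.g., human-level) error rate and gap is
    the difference considered significant; both are illustrative choices.
    """
    issues = []
    if train_err - base_err > gap:          # train error far above acceptable
        issues.append("high bias (under-fitting)")
    if train_test_err - train_err > gap:    # generalization gap
        issues.append("high variance (over-fitting)")
    if test_err - train_test_err > gap:     # production data mismatch
        issues.append("distribution shift")
    return issues or ["no obvious issue"]
```

Applied to the four cases listed earlier, `diagnose(0.10, 0.11, 0.20)` reports both high bias and distribution shift.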

Active learning for data distribution shift

Inadequate model predictions could stem from the use of training labels that aren't applicable to the current data distribution. For example, a self-driving vehicle model trained predominantly in urban settings would require newly labeled data from rural and less dense environments.

The expense of acquiring new labels and retraining the existing model can be substantial. Active learning, also known as optimal experimental design in the field of statistics, is a semi-supervised approach that can lessen the amount of labeled data needed to train a model [ref 5].

There are three architectures to generate or update labels:
  • Membership query synthesis: applies to small data sets; it selects samples to be annotated from the shifted distribution.
  • Stream-based selective sampling: real-time data is analyzed as candidates for annotation.
  • Pool-based sampling: the most common technique; it selects a data point/instance from a pool of unlabeled data, estimates the confidence factor, then annotates the data point with the highest information gain (or least confidence).
The objective of any sampling method is to identify unlabeled data that are:
  • Near a decision boundary (Uncertainty or exploitation sampling)
  • Unknown or rarely seen by the current model (diversity or exploration sampling)
There are several algorithms for selecting candidate data points for annotation from a pool of unlabeled data. Among them:
  • Variance reduction: prioritize data points with the highest variance to converge toward a mean.
  • Contextual multi-armed bandit: pick the data points that maximize the reward expectation (UCB-1).
  • Random sampling: select data points independently of their entropy.
  • Entropy maximization: prioritize data points with the highest uncertainty.
  • Margin selection: prioritize data points with the smallest margin between the two classes/labels with the highest probabilities (support vector machine).
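Two of the selection criteria above, entropy maximization and margin selection, can be sketched in a few lines. This is an illustration in Python; `pool` and `predict_proba` (the current model's class-probability function) are hypothetical names standing in for your data and model.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def margin(probs):
    """Gap between the two highest class probabilities (small = uncertain)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def select_for_annotation(pool, predict_proba, k, strategy="entropy"):
    """Return the k most informative unlabeled points from the pool."""
    if strategy == "entropy":   # entropy maximization: highest entropy first
        key, reverse = (lambda x: entropy(predict_proba(x))), True
    else:                       # margin selection: smallest margin first
        key, reverse = (lambda x: margin(predict_proba(x))), False
    return sorted(pool, key=key, reverse=reverse)[:k]
```

Both strategies rank a point with a 50/50 predicted distribution ahead of one the model is already confident about.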

Relabeling

Techniques to address incorrectly labeled or annotated data abound. Preventing, finding, and correcting errors in the labels of a large dataset is not a trivial task, even for experts.
Here are some of the most interesting approaches to either produce valid labels or correct wrong labels.

Consensus labeling

Consensus labeling is the simplest technique to validate labels produced by a group of annotators or domain experts. With multiple annotators, we need to estimate the following:

  • Consensus label for each example: collect the label agreed upon by the annotators, whenever possible.
  • Quality score for each consensus label: the percentage of annotators who selected the given label for an input example.
  • Quality score for each annotator: the percentage of labels selected by a given annotator that agree with the consensus.
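The three estimates can be computed in one pass over the annotations. A minimal sketch in Python; the nested-dictionary layout of `annotations` is an assumption for illustration.

```python
from collections import Counter

def consensus(annotations):
    """annotations: {example_id: {annotator_id: label}}.

    Returns the consensus label per example, a quality score per consensus
    label (fraction of annotators agreeing), and a quality score per
    annotator (fraction of their labels matching the consensus).
    """
    labels, label_quality = {}, {}
    for ex, votes in annotations.items():
        top, count = Counter(votes.values()).most_common(1)[0]
        labels[ex] = top
        label_quality[ex] = count / len(votes)

    agree, total = Counter(), Counter()
    for ex, votes in annotations.items():
        for annotator, label in votes.items():
            total[annotator] += 1
            agree[annotator] += int(label == labels[ex])
    annotator_quality = {a: agree[a] / total[a] for a in total}
    return labels, label_quality, annotator_quality
```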

Dawid-Skene aggregator

The Dawid-Skene aggregation model is a statistical model that uses confusion matrices to characterize the skill level of annotators. It employs an expectation-maximization (EM) algorithm to determine the probability of errors made by annotators, based on their annotations and the probabilities of the true (correct) labels [ref 6, 7]. Utilizing this method necessitates a solid understanding of statistical inference.

Confident learning

Confident learning is a data-centric approach that focuses on label quality by identifying label errors in datasets. This technique enables data scientists to:
  • Characterize label noise.
  • Find label errors.
  • Learn with noisy labels.
The objective is to estimate the joint distribution between the noisy, observed labels and the true, latent labels [ref 8]. The method estimates the ratio of wrong positive and negative labels. The noise in the actual labels is estimated by learning from uncorrupted labels under ideal conditions.
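The core counting step can be sketched as follows. This is a heavily simplified illustration of the idea, not the full confident learning algorithm: the threshold for class j is the mean predicted probability of j over the examples noisily labeled j, and off-diagonal entries of the resulting joint counts point to likely label errors.

```python
def confident_joint(noisy_labels, pred_probs, classes):
    """Estimate joint counts between noisy labels and latent true labels.

    noisy_labels: observed label per example; pred_probs: {class: prob}
    per example from a trained model (layout chosen for illustration).
    """
    # Per-class confidence thresholds
    thresholds = {}
    for j in classes:
        probs = [p[j] for y, p in zip(noisy_labels, pred_probs) if y == j]
        thresholds[j] = sum(probs) / len(probs)

    # Count each example toward the class it is confidently predicted as
    joint = {i: {j: 0 for j in classes} for i in classes}
    for y, p in zip(noisy_labels, pred_probs):
        confident = [j for j in classes if p[j] >= thresholds[j]]
        if confident:
            joint[y][max(confident, key=lambda c: p[c])] += 1
    return joint
```

An example labeled "cat" that the model confidently predicts as "dog" lands in the off-diagonal cell `joint["cat"]["dog"]`, flagging a likely label error.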

Encoding human priors

Incorporating human priors, whether domain knowledge or data scientists' assumptions, into a model goes well beyond the Bayesian prior (class probability) p(C). It may encompass the selection of the neural architecture (convolution, recurrence, ...), domain-related facts, and first-order logic.
The most commonly applied method to inject domain knowledge into a model is knowledge distillation (teacher-student networks in deep learning).


Class rebalancing

Training classification models often encounters a noted challenge: class imbalance. This happens when there's a disproportionate distribution of classes in the dataset, potentially causing bias in the model's outcome. This phenomenon is particularly evident in binary classification. For instance, a diagnostic model analyzing doctor visit summaries could lean heavily towards detecting prevalent diseases. 
Interestingly, many data professionals might overlook that class imbalances can also emerge in a live production setting. 
Let's explore methods to tackle this imbalance issue.

Data augmentation

This technique involves creating artificial data based on your existing data set. This could be as simple as creating copies with small alterations. For text data, common methods include synonym replacement, random insertion, random deletion, and sentence shuffling.

Feature vs data space augmentation:
  • Feature space augmentation modifies data in the embedding or representation space. 
  • Data space augmentation modifies raw text data.
Example of augmentation using BERT:
The objective is to duplicate a record (sequence of tokens) and randomly replace a token by ‘[UNK]’. The original and duplicate (or augmented) records associated with the same label are fine-tuned (or optionally pre-trained) using the same loss function and MLM model.
Here are some examples of easy steps for augmentation:
  • Replace abbreviations (e.g., don’t with do not, and vice versa).
  • Replace words with synonyms.
  • Remove out-of-vocabulary terms from the original note.
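The [UNK]-masking and synonym-replacement steps described above can be sketched as a token-level transform. A minimal illustration in Python; the replacement probabilities and the synonym-dictionary layout are assumptions, not values from a specific paper.

```python
import random

def augment(tokens, synonyms, p_unk=0.15, p_syn=0.15, seed=42):
    """Duplicate a token sequence, randomly masking tokens with [UNK]
    or replacing them with a synonym. Probabilities are illustrative."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p_unk:                              # random masking
            out.append("[UNK]")
        elif r < p_unk + p_syn and tok in synonyms:  # synonym replacement
            out.append(rng.choice(synonyms[tok]))
        else:
            out.append(tok)
    return out
```

The augmented sequence keeps the original label, so each record yields one or more extra training examples at no annotation cost.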

Under-sampling and Over-sampling

Under-sampling involves randomly discarding examples from the majority class, while oversampling involves duplicating examples from the minority class. However, under-sampling may lead to information loss, while oversampling can lead to overfitting. A more sophisticated oversampling method is the Synthetic Minority Over-sampling Technique (SMOTE), but it's mostly applied on numerical data and might not be directly applicable to text data [ref 9].
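Random under- and over-sampling amount to resampling each class toward a common target count. A minimal sketch in Python (plain random resampling only; a SMOTE-style interpolation is deliberately left out):

```python
import random
from collections import defaultdict

def rebalance(examples, labels, mode="over", seed=7):
    """Equalize class counts by random over-sampling (duplicate minority
    examples) or under-sampling (discard majority examples)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)

    sizes = [len(xs) for xs in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)

    out = []
    for y, xs in by_class.items():
        if len(xs) >= target:                  # discard without replacement
            kept = rng.sample(xs, target)
        else:                                  # duplicate with replacement
            kept = xs + rng.choices(xs, k=target - len(xs))
        out.extend((x, y) for x in kept)
    return out
```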

Class weighting

Class weighting in supervised learning is a technique used to address imbalances in the distribution of classes within a training dataset. Such an imbalance can lead to biased models that perform well on the majority class but poorly on the minority class.
Here's a deeper dive into the concept:
  • Imbalance issue: in many real-world datasets, some classes are underrepresented compared to others (e.g., targeted advertising, medical diagnostics).
  • Impact on the model: machine learning models trained on such imbalanced datasets tend to be biased towards the majority class, often at the cost of misclassifying the minority class.
The primary goal of class weighting is to make the model pay more attention to the minority class during training, by assigning a higher weight to the minority class and a lower weight to the majority class. These weights are used in the computation of the loss function during training.
One noticeable drawback is the increased complexity of model tuning.
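A common way to set these weights is inversely proportional to the class frequencies. Here is a minimal sketch in Python; the normalization (weights averaging to 1) and the simplified loss are illustrative choices.

```python
import math
from collections import Counter

def class_weights(labels):
    """Weights inversely proportional to class frequency, normalized so
    that weights average to 1 across classes."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * count) for c, count in counts.items()}

def weighted_log_loss(y_true, p_true, weights):
    """Mean negative log-likelihood, each example scaled by its class weight.

    p_true holds the predicted probability of the true class per example.
    """
    losses = [-weights[y] * math.log(p) for y, p in zip(y_true, p_true)]
    return sum(losses) / len(losses)
```

With a 3:1 imbalance, the minority class receives three times the weight of the majority class, so a misclassified minority example contributes more to the loss.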

Ensemble methods

This approach involves training multiple models and having them vote on the output. This can often lead to improved results, as different models might make different errors, and the ensemble can often make better decisions.

Ensemble learning is a powerful technique in machine learning that involves combining multiple models to improve the overall performance of a predictive task, especially in the case of imbalanced classes. This approach is based on the principle that a group of weak learners, when properly combined, can outperform a single strong learner.
These methods use multiple, diverse weak learners to increase the accuracy of predictions.
The most common techniques in ensemble learning are:
  • Bagging: training multiple models on different subsets of the training data, sampled with replacement (bootstrap samples).
  • Boosting: training learners sequentially, with each new model focusing on the errors made by the previous ones in an attempt to correct them. The final prediction is a weighted sum of the predictions made by each model.
  • Stacking: training multiple models on the same data and then training a new model to aggregate their predictions.
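The two building blocks of bagging, bootstrap sampling and hard voting, fit in a few lines. A minimal sketch in Python; the models are represented as plain callables for illustration.

```python
import random
from collections import Counter

def bootstrap_samples(data, n_models, seed=11):
    """Bagging: draw n_models training sets sampled with replacement,
    each the same size as the original data."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in data] for _ in range(n_models)]

def majority_vote(models, x):
    """Hard-voting ensemble: the most common class prediction wins."""
    return Counter(m(x) for m in models).most_common(1)[0][0]
```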

Cost-sensitive learning 

In cost-sensitive learning, misclassification costs are incorporated into the decision process. The costs are usually set inversely proportional to the class frequencies. It helps to focus more on minority classes.
In many real-world scenarios, the cost of misclassifying one class of data may be much higher than misclassifying another. Cost-sensitive learning involves modifying the learning algorithm so that it minimizes a weighted sum of errors associated with each class [ref 10]. 
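At decision time, cost sensitivity means picking the class with the lowest expected cost rather than the highest probability. A minimal sketch in Python; the cost-matrix layout is an illustrative assumption.

```python
def min_expected_cost(probs, cost):
    """Pick the class minimizing expected misclassification cost.

    probs: {class: predicted probability};
    cost[true][pred]: cost of predicting pred when the true class is true.
    """
    def expected(pred):
        return sum(p * cost[true][pred] for true, p in probs.items())
    return min(cost, key=expected)
```

In the test below, a plain argmax on probability would predict "healthy", but the high cost of missing the rare "sick" class flips the decision.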

One-vs-all (OvA) strategy

The one-vs-all strategy is a technique used in machine learning for multi-class classification problems dealing with datasets where one class is significantly underrepresented compared to others. 
The concept consists of breaking down a multi-class classification problem into multiple binary classification problems. For each class in the dataset, a separate binary classifier is trained. This classifier distinguishes between the class under consideration (positive case) and all other classes (negative case).
By treating the minority class as the 'positive' case in one of the binary classifiers, the techniques described in the previous sections, such as class weighting or oversampling, can be applied more effectively [ref 11].
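The decision rule of one-vs-all is simply the class whose binary scorer is most confident. A minimal sketch in Python; the per-class scoring functions are a hypothetical interface standing in for trained binary classifiers.

```python
def ova_predict(scorers, x):
    """One-vs-all: one binary scorer per class; the highest score wins.

    scorers: {class: scoring function returning the probability that
    x belongs to that class (vs. all others)}.
    """
    return max(scorers, key=lambda c: scorers[c](x))
```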

Transfer learning

Transfer learning is a technique where a model developed for one task is reused as the starting point for a model on a second task. Transfer learning typically involves using a model pre-trained on a large and diverse dataset [ref 12].

This is a sequential process used when:

  • The model has been originally trained on a large dataset and needs to be updated with a different, smaller dataset.
  • The training tasks use the same medium (text, image, audio, ...).
  • Low-level features from the original model can be reused in the new model.
These models have learned rich feature representations that can be beneficial for a wide range of tasks, including those with class imbalance. Since the pre-trained model has already learned a considerable amount of information, the need for a large and balanced dataset for the new task is reduced. This is particularly helpful when the available dataset for the new task is imbalanced.

Data privacy

Data privacy may not be at the forefront of data scientists' concerns during the training and tuning of models. However, these issues become crucial for productization. For instance, HIPAA requires medical records to be fully and properly de-identified: any violation incurs significant financial penalties.

Data observability

Data observability is defined by five key attributes [ref 13]:
  • Freshness: how up to date is your data (latency, refresh rate, ...)?
  • Quality: does the distribution at inference differ from training? Are there missing data or duplicates?
  • Volume: has the traffic pattern or rate changed?
  • Schema: has the data organization changed? Are there new sources of data?
  • Lineage: metadata on upstream and downstream sources.


The three pillars of observability data are:
  • Metrics: numeric representations of data measured over time.
  • Logs: records of an event with associated context.
  • Traces: causally related events in a distributed environment.

     Data downtime = Number of incidents * (Time to detection + Time to resolution)
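The data downtime formula is a straightforward product; as a one-liner (units for detection and resolution times, e.g. hours, are up to the reader):

```python
def data_downtime(incidents, time_to_detection, time_to_resolution):
    """Data downtime per the formula above: N x (TTD + TTR)."""
    return incidents * (time_to_detection + time_to_resolution)
```

For example, 4 incidents with 2 hours to detect and 6 hours to resolve each yield 32 hours of data downtime.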


Automated data pipeline

Finally, the variety and breadth of the techniques used to address data quality issues is a critical impediment to deploying and orchestrating these solutions. Quite often, these techniques are evaluated and deployed in an ad-hoc fashion.

One effective solution is to create a configurable streaming pipeline of the various data manipulation and correction techniques that executes them concurrently.
Open-source frameworks such as Apache Spark and Kafka can be a starting point to build a data-centric AI platform for test and production data.

Illustration of data centric AI workflow

The design of the annotator interface is a critical element in a successful active learning or re-labeling strategy.

Thank you for reading this article. For more information ...

References



