Showing posts with label PyTorch. Show all posts

Thursday, May 2, 2024

Posts History

I cover a diverse range of topics, spanning programming languages, machine learning, data engineering tools, and DevOps. My articles include practical code examples to ensure their applicability in real-world scenarios.


Sunday, September 17, 2023

Compare Python, NumPy and PyTorch Performance

Target audience: Beginner
Estimated reading time: 4'

Recently, I embarked on a healthcare project that involved extracting diagnostic information from Electronic Health Records. While fine-tuning a BERT model, I noticed some atypical latency behaviors. This prompted me to conduct a performance comparison between Python lists, NumPy arrays, and PyTorch tensors.
The implementation relies on a timer decorator to collect latency values.



Notes
  • The implementation uses Python 3.11
  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.

Introduction

I assume that most readers are familiar with the various Python, NumPy and PyTorch containers used in this article. But just in case, here is a quick refresher:

Python lists: A list in Python is similar to an array in C, C++, Java or Scala except that its elements can have different types.

Python arrays: An array is a compact container whose items, contrary to those of a list, must all be of the same type. Most data structures make use of arrays to implement their algorithms [ref 1].

NumPy arrays: A NumPy array represents a multidimensional, homogeneous array of fixed-size items. It is implemented as a static buffer of contiguous values of identical type, whose indexing (shape) can be modified dynamically to generate matrices, tensors or higher-dimensional numerical structures [ref 2].

PyTorch tensors: Similarly to NumPy arrays, PyTorch tensors are multi-dimensional arrays containing elements of a single data type. Tensors share the same semantics and operators as NumPy arrays but also support automatic differentiation and GPU/CUDA math libraries [ref 3].
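The difference between the first two containers can be checked with the standard library alone. The snippet below is a minimal sketch (the variable names are illustrative and not part of the benchmark): a list accepts mixed types, whereas array.array enforces a single type code and stores the raw values compactly.

```python
import array

# A list can hold elements of different types
mixed = [1, 2.5, 'three']

# An array is typed: 'd' stands for double-precision floats
doubles = array.array('d', [1.0, 2.5, 3.0])

# Appending a value of the wrong type raises a TypeError
try:
    doubles.append('four')
    rejected = False
except TypeError:
    rejected = True

# Each array item occupies exactly itemsize bytes
item_size = doubles.itemsize
print(rejected, item_size, len(mixed))
```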

Timing with decorator

Decorators are very powerful tools in Python since they allow programmers to modify the behavior of a function, a method or even a class. A decorator wraps another function in order to extend its behavior without permanently modifying it [ref 4].

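import functools
import logging
import time


def timeit(func):
    ''' Decorator for timing execution of methods'''

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # perf_counter is a monotonic clock suited to measuring short durations
        start = time.perf_counter()
        result = func(*args, **kwargs)
        duration = '{:.3f}'.format(time.perf_counter() - start)
        # args[0] is self, args[1] the framework and args[3] the label
        logging.info(f'{args[1]}:{args[3]}\t{duration} secs.')
        # Return the value computed by the wrapped function
        return result

    return wrapper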
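To plot latencies later, the measured durations have to be stored somewhere. The following is a minimal sketch, not the article's implementation, of a timing decorator that records each duration in a dictionary; the names timed and latencies and the label argument are assumptions made for illustration.

```python
import functools
import time
from collections import defaultdict

# Hypothetical store: label -> list of durations in seconds
latencies = defaultdict(list)


def timed(label: str):
    """Decorator factory that records the latency of each call under `label`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            latencies[label].append(time.perf_counter() - start)
            # Return the wrapped function's result so callers are unaffected
            return result
        return wrapper
    return decorator


@timed('python list')
def average(x) -> float:
    return sum(x) / len(x)


mean = average([1.0, 2.0, 3.0])
```

Each call appends one duration under its label, so repeated runs over growing data sets yield one series per label.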


Benchmark implementation

The objective is to automate the comparison of the various frameworks and functions by creating a wrapper class, EvalFunction.
The evaluation class has two arguments:
  • func_name: the descriptive name of the function used to evaluate the data structures
  • func: the function (lambda) used to evaluate the data structures
import array as ar
import time
import numpy as np
from random import Random
from typing import List, AnyStr, Callable, Any, NoReturn
import math
import torch
from dataclasses import dataclass
import logging
from matplotlib import pyplot as plt

collector = {}


@dataclass
class EvalFunction:
    """
    Data class for evaluation of Python lists, Array, Numpy array and torch tensor
    :param func_name Description of the function to execute
    :param func Lambda to be executed
    """
    func_name: AnyStr
    func: Callable[[Any], float]

    def compare(self, input_list: List[float], fraction: float = 0.0) -> None:
        # Use the full list unless a valid fraction in (0, 1] is provided
        input_max: int = math.floor(len(input_list)*fraction) \
            if 0.0 < fraction <= 1.0 \
            else len(input_list)
        input_data = input_list[:input_max]

        # Execute lambda through Python list
        self.__execute('python', input_data, 'list:      ')

        # Execute lambda through Python array
        input_array = ar.array('d', input_data)
        self.__execute('python', input_array, 'array:      ')

        # Execute lambda through numpy array (built from the selected subset)
        np_input = np.array(input_data, dtype=np.float32)
        self.__execute('python', np_input, 'lambda: ')

        # Execute native numpy methods
        self.__execute('numpy', np_input, 'native:   ')

        # Execute PyTorch method on CPU
        tensor = torch.tensor(np_input, dtype=torch.float32, device='cpu')
        self.__execute('pytorch', tensor, '(CPU):    ')

        # Execute PyTorch method on GPU, if one is available
        if torch.cuda.is_available():
            tensor = torch.tensor(np_input, dtype=torch.float32, device='cuda:0')
            self.__execute('pytorch', tensor, '(CUDA)')

The implementation of the supporting private method, __execute, is described in the Appendix.

Evaluation

We've chosen a collection of mathematical transformations that vary in complexity and computational demand to evaluate the different frameworks. Each transformation computes the mean of a function applied to the values x_i:
\[x_{i}=1+rand{[0, 1]}\]
\[average(x)=\frac{1}{n}\sum_{i=1}^{n}x_{i}\]
\[sine(x) = \frac{1}{n}\sum_{i=1}^{n}sin\left ( x_{i} \right )\]
\[sin.exp(x) = \frac{1}{n}\sum_{i=1}^{n}sin\left ( x_{i} \right ) e^{-x_{i}}\]
\[sin.exp.log(x) = \frac{1}{n}\sum_{i=1}^{n}\left ( sin\left ( x_{i} \right ) e^{-x_{i}} + log(1 + x_{i})\right )\]
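The functions listed in this article cover the first three transformations. As a sketch only, the fourth one could be written in the same style; the name sin_exp_log is an assumption, and the exponent follows the exp(-t) convention used by sin_exp in the code.

```python
import math


def sin_exp_log(x) -> float:
    """Mean of sin(t)*exp(-t) + log(1 + t) over the values of x."""
    return sum(math.sin(t) * math.exp(-t) + math.log(1.0 + t) for t in x) / len(x)


# Values in [1, 2), like the benchmark data x_i = 1 + rand
sample = [1.0, 1.25, 1.5]
result = sin_exp_log(sample)
```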

# Functions to evaluate data structures
def average(x) -> float:
    return sum(x)/len(x)
def sine(x) -> float:
    return sum([math.sin(t) for t in x])/len(x)
def sin_exp(x) -> float:
    return sum([math.sin(t)*math.exp(-t) for t in x])/len(x)


# Random value generator
rand = Random(42)
num_values = 500_000_000
# One independent draw per element (list multiplication would repeat a single value)
my_list: List[float] = [1.0 + rand.uniform(0.0, 0.1) for _ in range(num_values)]

# Fraction of the original data set of 500 million data points
fractions = [0.2, 0.4, 0.6, 0.8, 1.0]

# Evaluate the latency for sub data sets of size len(my_list)*fraction
eval_sin_exp = EvalFunction('sin_exp', sin_exp)
for fraction in fractions:
    eval_sin_exp.compare(my_list, fraction)

# x-axis values as size = len(my_list)*fraction
data_sizes = [math.floor(num_values*fraction) for fraction in fractions]

# Invoke the plotting method
plotter = Plotter(data_sizes, collector)
plotter.plot('Sin*exp 500M')

We conducted the test on an AWS EC2 instance of type p3.2xlarge, equipped with 8 virtual cores, 64GB of memory, and an Nvidia V100 GPU. A basic method for plotting the results is provided in the appendix.

Study 1
We compared the computation time required to determine the {x} -> average{x}  of 500 million real numbers within a Python list, array, NumPy array, and PyTorch tensor.


We compared the computation time required to apply the {x} -> sin{x}.exp{-x} function to 500 million real numbers within a Python list, array, NumPy array, and PyTorch tensor.


Conclusion
  • The performance difference between executing on the GPU versus the CPU becomes more pronounced as the dataset size grows.
  • Predictably, the runtime for both the 'average' and 'sin_exp' functions scales linearly with the size of the dataset when using Python lists or arrays.
  • When executed on the CPU, PyTorch tensors show a 20% performance improvement over NumPy arrays.
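The linear scaling on Python lists can be checked on a much smaller scale, without any GPU. The sketch below is an illustration under assumed sizes (200,000 values rather than the 500 million of the study); measure_scaling is a hypothetical helper, not part of the benchmark code.

```python
import math
import time
from random import Random


def sin_exp(x) -> float:
    return sum(math.sin(t) * math.exp(-t) for t in x) / len(x)


def measure_scaling(func, values, fractions):
    """Return {subset size: latency in seconds} for growing prefixes of values."""
    timings = {}
    for fraction in fractions:
        size = math.floor(len(values) * fraction)
        subset = values[:size]
        start = time.perf_counter()
        func(subset)
        timings[size] = time.perf_counter() - start
    return timings


rand = Random(42)
values = [1.0 + rand.uniform(0.0, 0.1) for _ in range(200_000)]
timings = measure_scaling(sin_exp, values, [0.2, 0.4, 0.6, 0.8, 1.0])
for size, secs in timings.items():
    print(f'{size}\t{secs:.4f} secs.')
```

A single measurement per size is noisy; repeating each run and keeping the median would give a smoother curve.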

Study 2
Let's compare the relative performance of the CPU and the GPU during the processing of a large PyTorch tensor.


Conclusion
The size of the dataset has a very limited impact on the time needed to process a PyTorch tensor on the GPU, while the execution time increases linearly on the CPU.

Thank you for reading this article. For more information ...

References


Appendix

The __execute method takes three arguments. The first two are used in the structural pattern match, and the third is only read by the timer decorator:
  • The framework, used to identify the implementation to run (plain Python, NumPy or PyTorch)
  • The input data to be processed
  • The label logged along with the latency
The private method __numpy_func applies each of the functions (average, sine, ...) to a NumPy array, np_array, generated from the original list.
The method __pytorch_func applies each function to a torch tensor derived from np_array.


@timeit
def __execute(self, framework: AnyStr, input: Any, label: AnyStr) -> float:
    # label is not used here: the timeit decorator reads it from the arguments
    match framework:
        case 'python':
            return self.func(input)
        case 'numpy':
            return self.__numpy_func(input)
        case 'pytorch':
            return self.__pytorch_func(input)
        case _:
            return -1.0


def __numpy_func(self, np_array: np.ndarray) -> float:
    match self.func_name:
        case 'average':
            return np.average(np_array).item()
        case 'sine':
            return np.average(np.sin(np_array)).item()
        case 'sin_exp':
            return np.average(np.sin(np_array)*np.exp(-np_array)).item()
        case _:
            return -1.0


def __pytorch_func(self, tensor: torch.Tensor) -> float:
    # item() copies the scalar result back to the host as a Python float
    match self.func_name:
        case 'average':
            return torch.mean(tensor).item()
        case 'sine':
            return torch.mean(torch.sin(tensor)).item()
        case 'sin_exp':
            return torch.mean(torch.sin(tensor) * torch.exp(-tensor)).item()
        case _:
            return -1.0


A simple class, Plotter, wraps the creation and display of plots using matplotlib.

class Plotter(object):
    markers = ['r--', '^-', '+--', '--', '*-']

    def __init__(self, dataset_sizes: List[int], results_map):
        self.sizes = dataset_sizes
        self.results_map = results_map

    def plot(self, title: AnyStr) -> None:
        np_sizes = np.array(self.sizes)
        # One curve per entry of the results map, cycling through the markers
        for index, values in enumerate(self.results_map.values()):
            np_values = np.array(values)
            plt.plot(np_sizes, np_values, Plotter.markers[index % len(Plotter.markers)])

        plt.title(title, fontsize=16, fontstyle='italic')
        plt.xlabel('Dataset size', fontsize=13, fontstyle='normal')
        plt.ylabel('time secs', fontsize=13, fontstyle='normal')
        plt.legend(['Python List', 'Python Array', 'Numpy native', 'PyTorch CPU', 'PyTorch GPU'])
        plt.show()


---------------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3

Monday, November 21, 2022

Accelerate Deep Learning with Neural Blocks

Target audience: Advanced
Estimated reading time: 6'

As a machine learning engineer, I've found it challenging to encounter the same level of reusability and design patterns in this field that are common in conventional software development. Implementations of deep learning models often depend heavily on repetitive, boilerplate code.

In this post, we explore the idea of reusable neural blocks, a straightforward and practical approach for packaging and reusing components of neural networks. Specifically, we'll delve into creating neural blocks for a variational auto-encoder using PyTorch, as well as for a Bidirectional Encoder Representations from Transformers (BERT) encoder, utilizing the Deep Java Library.


Notes
  • Source code is available on GitHub: Neural Architecture
  • Environments: Python 3.10, PyTorch 2.1.1
  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.


Reusable neural blocks

Complex deep learning models have a large stack of neural transformations such as convolutions, fully connected layers, activations, regularization modules, loss functions or embedding layers.

Creating these models using basic components from existing deep learning libraries is a daunting task. A neural block aggregates multiple components of a neural network into a logical, clearly defined function or task. A block is a transformation in the data flow used in training and inference.


Neural blocks in PyTorch

Modular convolutional neural network

A convolutional neural network can be broken down into neural blocks that organize PyTorch modules such as hidden layers, input and output channels, batch normalization, regularization, pooling mode and activation function into a single computation unit.
First, let's consider a conventional convolutional neural network followed by a fully connected (feed-forward) network. The PyTorch modules associated with any given layer are assembled as a neural block class.

The PyTorch modules of the convolutional neural block are:
  • Conv2d: Convolutional layer with input, output channels, kernel, stride and padding
  • Dropout: Drop-out regularization layer
  • BatchNorm2d: Batch normalization module
  • MaxPool2d: Pooling layer
  • ReLU, Sigmoid, ...: Activation functions
Here is a schematic representation of a convolutional neural network as a stack of neural blocks.

The constructor for the neural block class initializes all its parameters and its modules in the proper order. For the sake of simplicity, regularization elements such as drop-out (bagging of sub-networks) are omitted.

from torch import nn


class ConvNeuralBlock(nn.Module):
  def __init__(self,
      in_channels: int,
      out_channels: int,
      kernel_size: int,
      stride: int,
      padding: int,
      batch_norm: bool,
      max_pooling_kernel: int,
      activation: nn.Module,
      bias: bool,
      is_spectral: bool = False):

    super(ConvNeuralBlock, self).__init__()

    # Assertions are omitted
    # 1- initialize the input and output channels
    self.in_channels = in_channels
    self.out_channels = out_channels
    self.is_spectral = is_spectral
    modules = []

    # 2- create a 2 dimension convolution layer
    conv_module = nn.Conv2d(
        self.in_channels,
        self.out_channels,
        kernel_size=kernel_size,
        stride=stride,
        padding=padding,
        bias=bias)

    # 6- if this is a spectral norm block
    if self.is_spectral:
      conv_module = nn.utils.spectral_norm(conv_module)
    # The convolution is always the first module of the block
    modules.append(conv_module)

    # 3- Batch normalization
    if batch_norm:
      modules.append(nn.BatchNorm2d(self.out_channels))

    # 4- Activation function
    if activation is not None:
      modules.append(activation)

    # 5- Pooling module
    if max_pooling_kernel > 0:
      modules.append(nn.MaxPool2d(max_pooling_kernel))

    # Note: this attribute shadows the nn.Module.modules() method
    self.modules = tuple(modules)

The code snippet describes the various stages of building a convolutional block. The first step (1) is to initialize the number of input and output channels, then create the 2-dimension convolution (2), a batch normalization module (3), an activation function (4) and finally a max pooling module (5). The spectral norm regularization (6) is optional.
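When blocks are stacked, the spatial size of the feature map after each block determines the input size of the fully connected layers. As a sketch, the helper below applies the standard output-size formulas for a convolution (with dilation 1) followed by max pooling whose stride equals its kernel size, which is the default of nn.MaxPool2d. The name conv_block_output_size is hypothetical and not part of the article's code.

```python
def conv_block_output_size(
        in_size: int,
        kernel_size: int,
        stride: int,
        padding: int,
        max_pooling_kernel: int) -> int:
    """Spatial size (height or width) after one convolutional block."""
    # Convolution: floor((in + 2*padding - kernel) / stride) + 1
    conv_out = (in_size + 2 * padding - kernel_size) // stride + 1
    if max_pooling_kernel > 0:
        # Max pooling with stride == kernel size and no padding
        return (conv_out - max_pooling_kernel) // max_pooling_kernel + 1
    return conv_out


# A 28x28 image through two blocks: kernel 3, stride 1, padding 1, pooling 2
size = 28
for _ in range(2):
    size = conv_block_output_size(
        size, kernel_size=3, stride=1, padding=1, max_pooling_kernel=2)
```

The flattened input of the first fully connected layer is then out_channels * size * size for a square image.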

Next, we package the various convolutional and feed-forward neural blocks into a full-fledged convolutional model, in the following build method.

class ConvModel(NeuralModel):
  def __init__(self,
       model_id: str,
       # 1- Number of input and output units
       input_size: int,
       output_size: int,
       # 2- PyTorch convolutional modules
       conv_model: nn.Sequential,
       dff_model_input_size: int = -1,
       # 3- PyTorch fully connected
       dff_model: nn.Sequential = None):

    super(ConvModel, self).__init__(model_id)
    self.input_size = input_size
    self.output_size = output_size
    self.conv_model = conv_model
    self.dff_model_input_size = dff_model_input_size
    self.dff_model = dff_model

  @classmethod
  def build(cls,
      model_id: str,
      conv_neural_blocks: list,
      dff_neural_blocks: list) -> NeuralModel:

    # 4- Initialize the input and output size
    # for the convolutional layer
    input_size = conv_neural_blocks[0].in_channels
    output_size = conv_neural_blocks[-1].out_channels

    # 5- Generate the model from the sequence
    # of conv. neural blocks
    conv_modules = [conv_module for conv_block in conv_neural_blocks
          for conv_module in conv_block.modules]
    conv_model = nn.Sequential(*conv_modules)

    # 6- If fully connected (feed-forward) blocks are included in the model ..
    if dff_neural_blocks is not None:
      dff_modules = [dff_module for dff_block in dff_neural_blocks
         for dff_module in dff_block.modules]

      dff_model_input_size = dff_neural_blocks[0].output_size
      dff_model = nn.Sequential(*dff_modules)
    else:
      dff_model_input_size = -1
      dff_model = None

    # Arguments follow the order of the constructor parameters
    return cls(
       model_id,
       input_size,
       output_size,
       conv_model,
       dff_model_input_size,
       dff_model)

The default constructor (1) initializes the number of input/output channels, the PyTorch modules for the convolutional layers (2) and the fully connected layers (3).
The class method, build, instantiates the convolutional model from several convolutional neural blocks and one feed-forward neural block. It initializes the size of the input and output layers from the first and last neural blocks (4), then generates the PyTorch convolutional modules (5) and the fully connected layers' modules (6) from the neural blocks.
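The flattening step in build does not depend on PyTorch itself. As a framework-free analogy (plain functions stand in for nn.Module instances, and Block and build_pipeline are hypothetical names), the sketch below shows how the components of several blocks are merged into a single sequential pipeline.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, List, Tuple


@dataclass
class Block:
    """A named group of transformations applied in order."""
    name: str
    modules: Tuple[Callable[[float], float], ...]


def build_pipeline(blocks: List[Block]) -> Callable[[float], float]:
    # Flatten the modules of all blocks, preserving their order
    flat = [module for block in blocks for module in block.modules]
    # Chain the flattened modules, as nn.Sequential does for PyTorch modules
    return lambda x: reduce(lambda acc, module: module(acc), flat, x)


scale = Block('scale', (lambda x: 2.0 * x, lambda x: x + 1.0))
clip = Block('clip', (lambda x: max(x, 0.0),))
pipeline = build_pipeline([scale, clip])
output = pipeline(-3.0)
```

The block boundaries disappear after flattening; they only serve to organize the definition of the model.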


Modular variational auto-encoder

A de-convolutional neural network, DeConvModel, is created from the convolutional model, ConvModel, through reflection (see Automating the configuration of a GAN in PyTorch for more details). The mean, variance and sampling PyTorch modules are packaged into a variational neural block, VAENeuralBlock.
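The sampling module of a variational block usually relies on the reparameterization trick, z = mean + sigma * epsilon, with epsilon drawn from a standard normal distribution. Whether VAENeuralBlock does exactly this is an assumption; the sketch below uses plain Python lists rather than tensors, and sample_latent is a hypothetical name.

```python
import math
from random import Random
from typing import List


def sample_latent(mean: List[float], log_var: List[float], rng: Random) -> List[float]:
    """Draw z = mean + sigma * epsilon, with sigma = exp(0.5 * log_var)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


rng = Random(42)
mean = [0.5, -1.0]
# A very negative log-variance gives sigma close to 0, so z stays close to the mean
z = sample_latent(mean, [-60.0, -60.0], rng)
```

Working with the log-variance keeps sigma positive without constraining the output of the variance module.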


Finally, the variational auto-encoder, VAE is assembled by stacking the convolutional, variational and de-convolutional neural blocks.



Neural blocks in Deep Java Library

Deep Java Library (DJL) is an Apache-licensed open-source Java framework that supports the most commonly used deep learning frameworks: MXNet, PyTorch and TensorFlow. DJL's ability to leverage any hardware configuration (CPU, GPU) and to integrate with big data frameworks makes it an ideal solution for a highly performant distributed inference engine. DJL can optionally be used for training.


Everyone who has been involved with GPT-3 or GPT-4 decoder (ChatGPT) is aware of the complexity and interaction of neural components in transformers.


Let's apply DJL to build a BERT transformer encoder using neural blocks, knowing that:

  • A BERT encoder is a stack of multiple transformer modules
  • A pre-training block contains a BERT block, a Masked Language Model (MLM) module and a Next Sentence Predictor (NSP) module, with their associated loss functions
  • A BERT block is composed of an embedding block.

The following Scala code snippet illustrates the composition of a BERT pre-training block using the transformer encoder block, thisTransformerBlock, the Masked Language Model component, thisMlmBlock and the Next Sentence Prediction module, thisNspBlock.

class CustomPretrainingBlock protected (
    vocabularySize: Long,   // assumed constructor argument, used by the encoder below
    mlmActivation: String
) extends AbstractBaseBlock {

  lazy val activationFunc: java.util.function.Function[NDArray, NDArray] =
       ActivationConfig.getNDActivationFunc(mlmActivation)

  // Transformer encoder block
  lazy val thisTransformerBlock: BertBlock = BertBlock.builder().base()
     .setTokenDictionarySize(Math.toIntExact(vocabularySize))
     .build

  // MLM block
  lazy val thisMlmBlock: BertMaskedLanguageModelBlock =
       new BertMaskedLanguageModelBlock(thisTransformerBlock, activationFunc)

  // NSP block
  lazy val thisNspBlock: BertNextSentenceBlock = new BertNextSentenceBlock

  // 1- Initialize the shape of tensors for the encoder, MLM and NSP blocks
  override def initializeChildBlocks(
      ndManager: NDManager,
      dataType: DataType,
      shapes: Shape*): Unit

  // 2- Forward execution (i.e. PyTorch forward / __call__)
  override protected def forwardInternal(
      parameterStore: ParameterStore,
      inputNDList: NDList,
      training : Boolean,
      params: PairList[String, java.lang.Object]): NDList
}

DJL provides developers with two important methods:

  • initializeChildBlocks (1) initializes the shape of the tensors for the inner/child blocks
  • forwardInternal (2) implements the forward execution of the neural network for the transformer and downstream classifier.

def forwardInternal(
    parameterStore: ParameterStore,
    inputNDList: NDList,
    training : Boolean,
    params: PairList[String, java.lang.Object]): NDList = {
 
    // Dimension batch_size x max_sentence_size
  val tokenIds = inputNDList.get(0)
  val typeIds = inputNDList.get(1)
  val inputMasks = inputNDList.get(2)

    // Dimension batch_size x num_masked_token
  val maskedIndices = inputNDList.get(3)

  val ndChildManager = NDManager.subManagerOf(tokenIds)
  ndChildManager.tempAttachAll(inputNDList)

      // Step 1: Process the transformer block for Bert
  val bertBlockNDInput = new NDList(tokenIds, typeIds, inputMasks)
  val ndBertResult = thisTransformerBlock.forward(
    parameterStore, 
    bertBlockNDInput, 
    training)

      // Step 2 Process the Next Sentence Predictor block
      // Embedding sequence dimensions are 
      // batch_size x max_sentence_size x embedding_size
  val embeddedSequence = ndBertResult.get(0)
  val pooledOutput = ndBertResult.get(1)

      // Need to un-squeeze for batch size =1,   
      // (embedding_vector) => (1, embedding_vector)
  val unSqueezePooledOutput =
     if(pooledOutput.getShape().dimension() == 1) {
       val expanded = pooledOutput.expandDims(0) 
       ndChildManager.tempAttachAll(expanded)
       expanded
     }
     else
      pooledOutput

      // We compute the NSP probabilities in case there are more than 
      // a single sentence
  val logNSPProbabilities: NDArray =
       thisNspBlock.forward(
     parameterStore, 
     new NDList(unSqueezePooledOutput),
     training
  ).singletonOrThrow

        // Step 3: Process the Masked Language Model block
        // Embedding table dimension are vocabulary_size x Embeddings size
  val embeddingTable = thisTransformerBlock
            .getTokenEmbedding
            .getValue(parameterStore, embeddedSequence.getDevice(), training)

        // Dimension:  (batch_size x maskSize) x Vocabulary_size
  val logMLMProbabilities: NDArray = thisMlmBlock.forward(
      parameterStore,
      new NDList(embeddedSequence, maskedIndices, embeddingTable),
      training
  ).singletonOrThrow

        // Finally build the output
  val ndOutput = new NDList(logNSPProbabilities, logMLMProbabilities)
  ndChildManager.ret(ndOutput)
}
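The two outputs of forwardInternal are log-probabilities, from which the pre-training loss can be derived. As a sketch in Python rather than Scala/DJL, and assuming the usual BERT objective (the sum of the mean negative log-likelihood of the masked tokens and of the next-sentence labels), the computation looks as follows; the function names are hypothetical.

```python
import math
from typing import List


def mean_nll(log_probs: List[List[float]], labels: List[int]) -> float:
    """Mean negative log-likelihood of the labels given rows of log-probabilities."""
    return -sum(row[label] for row, label in zip(log_probs, labels)) / len(labels)


def pretraining_loss(
        log_nsp_probs: List[List[float]],
        nsp_labels: List[int],
        log_mlm_probs: List[List[float]],
        mlm_labels: List[int]) -> float:
    # Total loss = NSP loss + MLM loss
    return mean_nll(log_nsp_probs, nsp_labels) + mean_nll(log_mlm_probs, mlm_labels)


# One sentence pair (2 NSP classes) and two masked tokens over a vocabulary of 4
log_nsp = [[math.log(0.5), math.log(0.5)]]
log_mlm = [[math.log(0.25)] * 4, [math.log(0.25)] * 4]
loss = pretraining_loss(log_nsp, [1], log_mlm, [0, 3])
```

With uniform predictions, the loss reduces to log(2) + log(4), a useful baseline when checking that training makes progress.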
 


Thank you for reading this article. For more information ...

References


