Friday, October 21, 2022

Accelerate Deep Learning with Neural Blocks

Target audience: Advanced

Estimated reading time: 6'

As a machine learning engineer, I've found it challenging to encounter the same level of reusability and design patterns in this field that are common in conventional software development. Implementations of deep learning models often depend heavily on repetitive, boilerplate code.

In this post, we explore the idea of reusable neural blocks, a straightforward and practical approach for packaging and reusing components of neural networks. Specifically, we'll delve into creating neural blocks for a variational auto-encoder using PyTorch, as well as for a Bidirectional Embeddings Representation from Transformers (BERT) encoder, utilizing the Deep Java Library.

Reusable neural blocks

Neural blocks in PyTorch

Modular convolutional neural networks

Modular variational auto-encoders

Neural blocks in Deep Java Library

References

Notes:

Source code is available on GitHub Github Neural Architecture
Environments: Python 3.10, PyTorch 2.1.1
To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.

Reusable neural blocks

Complex deep learning models have large stack of neural transformation such as convolution, fully connected network layers, activations, regularization modules, loss functions or embedding layers.

Creating these models using basic components from existing deep learning library is a daunting task. A neural block aggregates multiple components of a neural network into a logical, clearly defined function or task. A block is a transformation in the data flow used in training and inference.

Neural blocks in PyTorch

Modular convolutional neural network

A convolutional neural network can be broken down into neural blocks that organize PyTorch modules such as hidden layers, input and output channels, batch normalization, regularization, pooling mode and activation function into a single computation unit.

First, let's consider a conventional convolutional neural network with a fully connected (restricted Boltzmann machine) network. The PyTorch modules associated with any given layer are assembled as a neural block class.

A PyTorch modules of the convolutional neural block are:

Conv2d: Convolutional layer with input, output channels, kernel, stride and padding
Dropout: Drop-out regularization layer
BatchNorm2d: Batch normalization module
MaxPool2d Pooling layer
ReLu, Sigmoid, ... Activation functions

Here is a schematic representation of a convolutional neural network as a stack of neural blocks.

The constructor for the neural block class initializes all its parameters and its modules in the proper order. For the sake of simplicity, regularization elements such as drop-out (bagging of sub-network) is omitted.

class ConvNeuralBlock(nn.Module):
  def __init__(self,
      in_channels: int,
      out_channels: int,
      kernel_size: int,
      stride: int,
      padding: int,
      batch_norm: bool,
      max_pooling_kernel: int,
      activation: nn.Module,
      bias: bool,
      is_spectral: bool = False):
    
   super(ConvNeuralBlock, self).__init__()
        
   # Assertions are omitted
   # 1- initialize the input and output channels
   self.in_channels = in_channels
   self.out_channels = out_channels
   self.is_spectral = is_spectral
   modules = []
   
   # 2- create a 2 dimension convolution layer
   conv_module = nn.Conv2d(   
       self.in_channels,
       self.out_channels,
       kernel_size=kernel_size,
       stride=stride,
       padding=padding,
       bias=bias)

   # 6- if this is a spectral norm block
   if self.is_spectral:        
     conv_module = nn.utils.spectral_norm(conv_module)
     modules.append(conv_module)
        
   # 3- Batch normalization
   if batch_norm:               
     modules.append(nn.BatchNorm2d(self.out_channels))
     
   # 4- Activation function
   if activation is not None: 
     modules.append(activation)
        
   # 5- Pooling module
   if max_pooling_kernel > 0:   
     modules.append(nn.MaxPool2d(max_pooling_kernel))
   
   self.modules = tuple(modules)

The code snippet describes the various stages of building a convolutional block. The first step (1) is to initialize the number of input and output channels, then create the 2-dimension convolution (2), a batch normalization module (3) an activation function (4) and finally a Max pooling module (5). The spectral norm regularization (6) is optional.

Next, we package the various convolutional and feedback forward neural blocks
into a full fledge convolutional model, in the following build method.

class ConvModel(NeuralModel):
  def __init__(self,                    
       model_id: str,
       # 1 Number of input and output unites
       input_size: int,
       output_size: int,
       # 2- PyTorch convolutional modules
       conv_model: nn.Sequential,
       dff_model_input_size: int = -1,
       # 3- PyTorch fully connected
       dff_model: nn.Sequential = None):
        
   super(ConvModel, self).__init__(model_id)
   self.input_size = input_size
   self.output_size = output_size
   self.conv_model = conv_model
   self.dff_model_input_size = dff_model_input_size
   self.dff_model = dff_model
   
  @classmethod
  def build(cls,
      model_id: str,
      conv_neural_blocks: list,  
      dff_neural_blocks: list) -> NeuralModel:
            
   # 4- Initialize the input and output size 
   # for the convolutional layer
   input_size = conv_neural_blocks[0].in_channels
   output_size = conv_neural_blocks[len(conv_neural_blocks) - 1].out_channels

   # 5- Generate the model from the sequence 
   # of conv. neural blocks
   conv_modules = [conv_module for conv_block in conv_neural_blocks
         for conv_module in conv_block.modules]
   conv_model = nn.Sequential(*conv_modules)

   # 6- If a fully connected RBM is included in the model ..
   if dff_neural_blocks is not None and not is_vae:
     dff_modules = [dff_module for dff_block in dff_neural_blocks
        for dff_module in dff_block.modules]
         
     dff_model_input_size = dff_neural_blocks[0].output_size
     dff_model = nn.Sequential(*tuple(dff_modules))
   else:
     dff_model_input_size = -1
     dff_model = None
      
  return cls(
     model_id, 
     conv_dimension, 
     input_size, 
     output_size, 
     conv_model,
     dff_model_input_size, 
     dff_model)

The default constructor (1) initializes the number of input/output channels, the PyTorch modules for the convolutional layers (2) and the fully connected layers (3).
The class method, build, instantiates the convolutional model from several convolutional neural blocks and one feed forward neural block. It initializes the size of input and output layers from the first and last neural blocks (4), generate the PyTorch convolutional modules (5) and fully-connected layers' modules (6) from the neural blocks.

Modular variational auto-encoder

A de-convolutional neural network, DeConvModel is created from the convolutional model, ConvModel through reflection (see Automating the configuration of a GAN in PyTorch for more details). A mean, variance and sampling PyTorch modules are packaged into a variational neural block, VAENeuralBlock.

Finally, the variational auto-encoder, VAE is assembled by stacking the convolutional, variational and de-convolutional neural blocks.

Neural blocks in Deep Java Library

Deep Java Library (DJL) is an Apache open-source Java framework that supports the most commonly used deep learning frameworks; MXNet, PyTorch and TensorFlow. DJL ability to leverage any hardware configuration (CPU, GPU) and integrated with big data frameworks makes it and ideal solution for a highly performant distributed inference engine. DJL can be optionally used for training.

Everyone who has been involved with GPT-3 or GPT-4 decoder (ChatGPT) is aware of the complexity and interaction of neural components in transformers.

Let's apply DJL to build a BERT transformer encoder using neural blocks, knowing

A BERT encoder is a stack of multiple transformer modules
Pre-training block which contains BERT block, Masked Language Model (MLM) module and Next Sentence Predictor (NSP) with their associated loss functions
A BERT block is composed of embedding block.

The following Scala code snippet illustrates the composition of a BERT pre-training block using the transformer encoder block, thisTransformerBlock, the Masked Language Model component, thisMlmBlock and the Next Sentence Prediction module, thisNspBlock.

class CustomPretrainingBlock protected (
    mlmActivation: String
) extends AbstractBaseBlock {

  lazy val activationFunc: java.util.function.Function[NDArray, NDArray] =    
       ActivationConfig.getNDActivationFunc(activationType)

  // Transformer encoder block  // 1- Initialize the shape of tensors for the encoder, MLM and NSP blocks

  lazy val thisTransformerBlock: BertBlock = BertBlock.builder().base()
     .setTokenDictionarySize(Math.toIntExact(vocabularySize))
     .build

  // MLM block

  lazy val thisMlmBlock: BertMaskedLanguageModelBlock = 
       new BertMaskedLanguageModelBlock(bertBlock, activationFunc)

  // NSP block
  lazy val thisNspBlock: BertNextSentenceBlock =
       new BertNextSentenceBlock

  // 1- Initialize the shape of tensors for the encoder, MLM and NSP blocks

  override def initializeChildBlocks(
      ndManager: NDManager, 
      dataType: DataType, 
      shapes: Shape*): Unit

  
  // 2- Forward execution (i.e. PyTorch forward / __call__

  override protected def forwardInternal(
      parameterStore: ParameterStore,
      inputNDList: NDList,
      training : Boolean,
      params: PairList[String, java.lang.Object]): NDList 
}

DJL provides developers with two important methods

initializeChildBlock (1) initializes the shape of the tensors for the inner/child blocks
forwardInternal (2) implement the forward execution of neural network for the transformer and downstream classifier.

def forwardInternal(
    parameterStore: ParameterStore,
    inputNDList: NDList,
    training : Boolean,
    params: PairList[String, java.lang.Object]): NDList = {
 
    // Dimension batch_size x max_sentence_size
  val tokenIds = inputNDList.get(0)
  val typeIds = inputNDList.get(1)
  val inputMasks = inputNDList.get(2)


    // Dimension batch_size x num_masked_token
  val maskedIndices = inputNDList.get(3)

  val ndChildManager = NDManager.subManagerOf(tokenIds)
  ndChildManager.tempAttachAll(inputNDList)

      // Step 1: Process the transformer block for Bert
  val bertBlockNDInput = new NDList(tokenIds, typeIds, inputMasks)
  val ndBertResult = thisTransformerBlock.forward(
    parameterStore, 
    bertBlockNDInput, 
    training)

      // Step 2 Process the Next Sentence Predictor block
      // Embedding sequence dimensions are

      // batch_size x max_sentence_size x embedding_size
  val embeddedSequence = ndBertResult.get(0)
  val pooledOutput = ndBertResult.get(1)

      // Need to un-squeeze for batch size =1,

      // (embedding_vector) => (1, embedding_vector)
  val unSqueezePooledOutput =
     if(pooledOutput.getShape().dimension() == 1) {
       val expanded = pooledOutput.expandDims(0) 
       ndChildManager.tempAttachAll(expanded)
       expanded
     }
     else
      pooledOutput

      // We compute the NSP probabilities in case there are more than

      // a single sentence
  val logNSPProbabilities: NDArray =
       thisNspBlock.forward(
     parameterStore, 
     new NDList(unSqueezePooledOutput),
     training
  ).singletonOrThrow

        // Step 3: Process the Masked Language Model block
        // Embedding table dimension are vocabulary_size x Embeddings size
  val embeddingTable = thisTransformerBlock
            .getTokenEmbedding
            .getValue(parameterStore, embeddedSequence.getDevice(), training)

        // Dimension:  (batch_size x maskSize) x Vocabulary_size
  val logMLMProbabilities: NDArray = thisMlmBlock.forward(
      parameterStore,
      new NDList(embeddedSequence, maskedIndices, embeddingTable),
      training
  ).singletonOrThrow

        // Finally build the output
  val ndOutput = new NDList(logNSPProbabilities, logMLMProbabilities)
  ndChildManager.ret(ndOutput)

Thank you for reading this article. For more information ...

References

PyTorch
Deep Java Library
Automating the configuration of a GAN in PyTorch
github.com/patnicolas
Environments: Python 3.8, PyTorch 1.8.1, Scala 2.12.15, Java 11, Deep Java Library 0.20.0

---------------------------

Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning.
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3

Sunday, September 18, 2022

Generative AI with Kafka & Spark

Target audience: Advanced

Estimated reading time: 6'

Posts history

While Python is predominantly recognized as the go-to programming language for data science, leveraging Java-based frameworks can offer substantial advantages, especially for rapid distributed inference.

In this article, we will outline the structure of a swift, distributed inference system for Bidirectional Encoder Representations from Transformers (BERT) models [ref 1], harnessing the capabilities of Apache Spark, Kafka, and the Deep Java Library (DJL).

Combining the best of both worlds

Java/Scala for inference

Distributed inference pipeline

Architecture

Neural blocks

References

Appendix

What you will learn: How to leverage Apache Kafka, Spark & Deep Java Library for faster inference on transformer models.

Notes:

This article doesn't delve into the specifics of Apache Spark, Kafka, Deep Java Library, or BERT individually. Instead, it focuses on how these components are integrated to create an efficient solution for inference tasks.
Development environments: JDK 11, Scala 2.12.15, Apache Spark 3.3.1, Apache Kafka 2.8.0, Deep Java Library 0.20.0
Comments and ancillary code are omitted for the sake of clarity.
Source code available at https://github.com/patnicolas/bertspark

Combining the best of both worlds

Python is a popular environment for developing and training deep learning models like TensorFlow, PyTorch, and MXNet. As an interpreted language, it offers data scientists the flexibility of notebooks for interactive development, evaluation, and refinement of neural models. Python boasts an extensive library encompassing natural language processing, machine learning models, statistical algorithms, and data management tools.

However, this dynamic development environment faces two significant challenges in runtime inference:

Python's limited capacity for task parallelization, whether through concurrent threads or distributing tasks across a network.
Commercial applications often depend on web services running on Java Virtual Machine (JVM) and make extensive use of Apache's open-source libraries.

This raises the question: Can we use Python to define, train, and evaluate deep learning models, and then employ JVM-based languages for real-time inference?

The solution hinges on the fact that deep learning frameworks like PyTorch and TensorFlow are fundamentally binary executables written in C++. The binary versions of these deep learning libraries can be accessed by both Python and Java through their respective interfaces.

Java/Scala for inference

As mentioned earlier, Python frameworks are frequently employed for training deep learning models. However, when it comes to deployment into production (specifically for inference purposes), these models need to be merged with current applications that are written in Java or Scala. This integration often involves utilizing data processing frameworks like Flink, Presto, or Spark.

This particular study concentrates on the integration of a BERT model into an existing Spark application. Apache Spark is renowned for its ability to rapidly process large datasets concurrently across multiple distributed services, as noted [ref 1]. Meanwhile, the Deep Java Library serves as a Java library that implements the most popular deep learning models, providing access via a Java native interface.

Training (Python) and Inference (Java/Scala) stack

Apache Spark and Amazon's Deep Java Library (DJL) tackle the two main challenges associated with deploying machine learning models in production that were developed using Python.

Typically, the process involves creating models in a Python environment like Jupyter, an IDE, or Anaconda, and then saving the model parameters. DJL then takes over by loading these saved parameters and initializing the inference model, which is then ready to handle runtime requests.

Distributed inference pipeline

The goal is to utilize Apache Spark for distributed computation and Kafka for asynchronous, or non-blocking, data queuing.
By integrating these two technologies, we can enhance the scalability of predictions by parallelizing the execution of deep learning models. The critical components of this distributed inference pipeline include:

Apache Spark: This tool segments runtime requests for predictions into batches. These batches are then processed concurrently across remote worker nodes.
Apache Kafka: This acts as an asynchronous messaging queue, effectively separating the client application from the inference pipeline, ensuring smooth data flow without bottlenecks.
Deep Java Library (DJL): It connects with the binary executables of the deep learning models.
Kubernetes: This system containerizes the instances of the inference pipelines, facilitating scalable and automated deployment. Notably, Spark version 3.2 and later versions offer direct integration with Kubernetes.
Deep Learning Frameworks: This includes well-known frameworks like TensorFlow, MXNet, and PyTorch, which are part of the overall architecture.

Through this combination, we achieve a robust and scalable system for managing and executing deep learning model inferences efficiently.

Generic data flow for Inference of deep learning models with DJL

The two main benefits of such pipeline are simplicity (all tasks/processes run on JVM) and low latency.

Note: Spark and DJL can also be used in the training phase to distribute the training of a mini batch.

Apache Kafka

Apache Kafka is an open-source distributed event streaming platform for high volume data pipelines, streaming analytics, data integration, and mission-critical applications. Kafka supports event streaming ensures a continuous flow of data through a pipeline or sequence of transformation such as Extract, Transform and Load [ref 2].

First ,we construct the handler class, KafkaPrediction that

consumes requests from Kafka topic consumeTopic
invokes the prediction model and transformation, predictionPipeline
produces prediction into Kafka topic produceTopic

The actual request is wrapped into the consumed message, RequestMessage. Same for the prediction produced back to the Kafka queue.

class KafkaPrediction(
 consumeTopic: String,
 produceTopic: String,
 predictionPipeline: Seq[Request] => Seq[Prediction])  {
     
      // 1 - Constructs the transform of Kafka messages for prediction
  val transform = (requestMsg: Seq[RequestMessage]) => {
      // 2- Invoke the execution of the pipeline
      val predictions = predictionPipeline(requestMsg.map(_.requestPayload))
      predictions.map(ResponseMessage(_))
  } 
    
    // 3- Build the Kafka consumer for prediction request
  val consumer = new KafkaConsumer[RequestMessage](
    RequestSerDe.deserializingClass,
    consumeTopic
  )
    // 4- Build the Kafka producer for prediction response
  val producer = new KafkaProducer[ResponseMessage](
     ResponseSerDe.serializingClass, 
     produceTopic
  )
  .....
}

We first need to create a wrapper function, transform to generate a prediction. The function converts a request message of type RequestMessage into a prediction of type ResponseMessage.
The wrapper, transform invoke the prediction pipeline predictionPipeline after converting the messages of type RequestMessage consumed from Kafka into actual request (Request). The predictions are converted into message of type ResponseMessage produced to Kafka
The consumer is fully defined by the de-serialization of data consumed from Kafka and its associated topic
The producer serialized the response back to Kafka service.

def executeBatch(
  consumeTopic: String, 
  produceTopic: String, 
  maxNumResponses: Int): Unit = { 
 
   // 1 - Initialize the prediction pipeline
 val kafkaHandler = new KafkaPrediction(
    consumeTopic, 
    produceTopic, 
    predictionPipeline
  )

  while(running)  {
      // 2 - Pool the request topic (has its own specific Kafka exception handler)
   val consumerRecords = kafkaHandler.consumer.receive
 
   if(consumerRecords.nonEmpty) {
        // 3 - Generate and apply transform to the batch
     val input: Seq[RequestMessage] = consumerRecords.map(_._2)
     val responses = kafkaHandler.predict(input) 
 
     if(responses.nonEmpty) {
         // 4 - Produce to the output topic
         val respMessages = responses.map(
             response =>(response.payload.id, response)
         ) 
 
         // 5- Produce the batch of response messages to Kafka
        kafkaHandler.producer.send(respMessages)
             
        // 6 - Get confirmation from Kafka has indeed processed the response
        kafkaHandler.consumer.asyncCommit
     }
     else
        logger.error("No response is produced to Kafka")
   }
   kafkaHandler.close
}

First we instantiate the Kafka message handler class, KafkaPrediction we created earlier
At regular interval, we pull a batch of new requests from Kafka
If the batch is not empty, we invoke the handler, predict to the prediction models
Once done, we encapsulate the predictions into the ResponseMessage instances
The messages are produced into the producer topic in the Kafka queue
Finally, Kafka acknowledges the correct reception of the responses, asynchronously.

Next, we leverage Spark to distribute the batch of requests across multiple computation nodes (workers)

Apache Spark

Apache Spark, an open-source distributed processing system, is adept at handling large-scale data sets. It leverages in-memory caching and refined query execution strategies for real-time analytics [ref 3].

In our specific use case, we employ Spark to distribute a batch of requests, which are sourced from Kafka, across a network. This setup enables the simultaneous execution of multiple BERT models. Such an architecture not only prevents a single point of failure, ensuring fault tolerance, but also permits the use of generic, cost-efficient hardware.

Leveraging Spark data set and partitioning is surprisingly simple.

def predict(
   requests: Seq[Request]
)(implicit sparkSession: SparkSession): Seq[Prediction] = {
  import sparkSession.implicits._

    // 1 - Convert request into a Spark data set
  val requestDataset = requests.toDS()


    // 2 - Execute the prediction by invoking the DJL model
  val responseDataset: Dataset[Prediction] = requestDataset(predict(_))


    // 3 - Convert Spark data set response 
  responseDataset.collect() 
}

Once the spark session (context) is initiated, the batch of requests is converted into a data set, requestDataset
Spark applies the prediction model (DJL) on each request on the partitioned data
Finally, the predictions are collected from the Spark worker nodes before been returned to the Kafka handler

Note: The Spark context is assumed to be created and passed as implicit parameter to the prediction method.

Deep Java Library

This component is crucial as it connects the flow of incoming and outgoing data with the deep learning models. The Deep Java Library (DJL) is an open-source Java framework that accommodates popular deep learning frameworks like MXNet, PyTorch, and TensorFlow.

DJL's capability to adapt to any hardware setup (be it CPU or GPU) and its integration with big data frameworks position it as an ideal choice for a high-performance distributed inference engine [ref 4]. The library is particularly well-suited for constructing transformer encoders like BERT or GPT, as well as decoders such as GPT and ChatGPT.

In this setup, the input tensors are processed by the deep learning models on a GPU. Importantly, the data is allocated in the native memory space, which is external to the JVM and its garbage collector. The DJL library supports native tensor types such as NDArray and lists of tensors like NDList, along with a straightforward memory management tool, NDManager.

The classifier operates on the Spark worker node. The following code snippet, though a simplified version, illustrates the steps involved in invoking a BERT-based classifier using the DJL framework.

class BERTClassifier(
   minTermFrequency: Int, 
   path: Path)(implicit sparkSession: SparkSession) {

  // 1 - Manage tensor allocation as NDArray
  val ndManager = NDManager.newManager()
 
  // 2 - Define the configuration of the classifier
  val classifyCriteria: Criteria[NDList, NDList] = Criteria.builder()
     .optApplication(Application.UNDEFINED)
     .setTypes(classOf[NDList], classOf[NDList])
     .optOptions(options)
     .optModelUrls(s"file://${path.toAbsolutePath}")
     .optBlock(classificationBlock)
     .optEngine(Engine.getDefaultEngineName())
     .optProgress(new ProgressBar())
     .build()
 
  // 3- Load the model from a local file
  val thisModel = classifyCriteria.loadModel()


  // 4 - Instantiate a new predictor
  val predictor = thisModel.newPredictor()

  // 5 - Execute this request on this worker node
  def predict(requests: Request): Prediction = {
    predictor.predict(ndManager, requests)
  }

  // 6- Close resources
  def close(): Unit = {
    model.close()
    predictor.close()
    ndManager.close()
  }
}

Set the manager for tensor in native memory
Configure the classifier with its related neural block (classificationBlock)
Load the model (MXNet, PyTorch or TensorFlow) from local file
Instantiate a predictor from the model
Submit the request to the DL model and return a prediction
Close all the resources allocated in the native memory at the end of the run

Note: DJL can be optionally used for training.

Use case: BERT

In order to illustrate the application of Spark and DJL to BERT we consider a model to predict a topic given a document.

Architecture

Our model has 3 components:

Text processor (Tokenizer, Document segmentation,...)
Pre-trained BERT
Fully-connected neural network classifier (supervised)

A transformer model consists of two main components: an encoder and a decoder. The encoder's role is to convert sentences and paragraphs into an internal format, typically a numerical matrix, that captures the context of the input. Conversely, the decoder interprets and reverses this process. When combined, the encoder and decoder enable the transformer to execute sequence-to-sequence tasks like translation. Interestingly, isolating the encoder part of the transformer provides insights into the context, enabling various intriguing applications.

BERT particularly capitalizes on the attention mechanism to gain a more nuanced understanding of language context. BERT is composed of several layers of encoder blocks. In this model, the input text is divided into tokens, akin to the traditional transformer model, and each token is subsequently converted into a vector at BERT's output.

BERT has been applied to various problems including the automation of medical coding [ref 5]

Neural blocks

The practice of arranging components of neural networks, such as layers and activation functions, into modular, reusable blocks is a common strategy to simplify and deconstruct complex models [ref 6] .

In DJL, a block is a composable function that forms a neural network. It can represent single operation, parts of a neural network, and even the whole neural network. What makes blocks special is that they contain a number of parameters that are used in their function and are trained during deep learning. As these parameters are trained, the functions represented by the blocks get more and more accurate.

The core purpose of a block is to perform an operation on the inputs, and return an output. It is defined in the forward method. The forward function could be defined explicitly in terms of parameters or implicitly and could be a combination of the functions of the child blocks.

The following code snippet illustrates the composition of blocks for a transformer encoder using Deep Java Library blocks. The 3 main components are

Transformer, self-attention block with token, position and sentence order embeddings
Masked Language Model (MLM) block
Next Sentence Prediction (NSP) block

class CustomPretrainingBlock (
  bertModelType: String
  activationType: String,
  vocabularySize: Long) extends BaseNetBlock {
 
  // First block: BERT transformer
  val bertBlock = getBertConfig(bertModelType)
        .setTokenDictionarySize(Math.toIntExact(vocabularySize))
        .build
  val activationFunc: java.util.function.Function[NDArray, NDArray] = 
         ActivationConfig.getNDActivationFunc(activationType)

    // Second block: Masked Language Model
  val bertMLMBlock = new BertMaskedLanguageModelBlock(bertBlock, activationFunc)

   // Third: block: Next Sentence Predictor
  val bertNSPBlock = new BertNextSentenceBlock
  val pretrainingBlocks = new BERTPretrainingBlocks(
      ("transformer", bertBlock),
      ("mlm", bertMLMBlock),
      ("nsp", bertNSPBlock)
   )

  override protected def forwardInternal(
    parameterStore: ParameterStore,
    inputNDList: NDList,
    training : Boolean,
    params: PairList[String, java.lang.Object]): NDList

BERT has several models with various number of encoder blocks, attention heads, embedding sizes and dimensions.

def getBertConfig(bertModelType: String): BertBlock.Builder = bertModelType match {
  case `nanoBertLbl` => 
      // 4 encoders, 4 attention heads, embedding size: 256, dimension 256x4
    BertBlock.builder().nano()

  case `microBertLbl`=>
      // 12 encoders,8 attention heads, embedding size: 512, dimension 512x4
    BertBlock.builder().micro()

  case `baseBertLbl` =>
      // 12 encoders,12 attention heads, embedding size: 768, dimension 768x4
    BertBlock.builder().base()

  case `largeBertLbl` =>
      // 24 encoders,16 attention heads, embedding size: 1024, dimension 1024x4
    BertBlock.builder().large()

  case _ =>
}

The appendix provides a detailed implementation guide for executing the 'forward' method used in pre-training, written in Scala, for reference purposes.

Thank you for reading this article. For more information ...

References

[1] BiDirectional Encoder Representations from Transformer

[2] Apache Kafka
[3] Apache Spark

[4] Deep Java Library

[5] Automated Medical Coding with BERT

[6] Accelerate Deep Learning with Neural Blocks

Appendix

You can implement your own variant of BERT by overriding the method forwardInternal.

override protected def forwardInternal(
  parameterStore: ParameterStore,
  inputNDList: NDList,
  training : Boolean,
  params: PairList[String, java.lang.Object]): NDList = {

    // Dimension batch_size x max_sentence_size
  val tokenIds = inputNDList.get(0)
  val typeIds = inputNDList.get(1)
  val inputMasks = inputNDList.get(2)

    // Dimension batch_size x num_masked_token
  val maskedIndices = inputNDList.get(3)

  try {
    val ndChildManager = NDManager.subManagerOf(tokenIds)
    ndChildManager.tempAttachAll(inputNDList)

      // Step 1: Process the transformer block for Bert
    val bertBlockNDInput = new NDList(tokenIds, typeIds, inputMasks)
    val ndBertResult = transformerBlock.forward(parameterStore, bertBlockNDInput, training)

      // Step 2 Process the Next Sentence Predictor block
      // Embedding sequence dimensions are batch_size x max_sentence_size x embedding_size
    val embeddedSequence = ndBertResult.get(0)
    val pooledOutput = ndBertResult.get(1)

      // Need to un-squeeze for batch size =1,   (embedding_vector) => (1, embedding_vector)
    val unSqueezePooledOutput =
      if(pooledOutput.getShape.dimension() == 1) {
         val expanded = pooledOutput.expandDims(0)
         ndChildManager.tempAttachAll(expanded)
         expanded
      }
      else
         pooledOutput

      // We compute the NSP probabilities in case there are more than one single sentences
    val logNSPProbabilities: NDArray =
       bertNSPBlock.forward(parameterStore, new NDList(unSqueezePooledOutput), training)
                 .singletonOrThrow

        // Step 3: Process the Masked Language Model block
        // Embedding table dimension are vocabulary_size x Embeddings size
    val embeddingTable = transformerBlock
            .getTokenEmbedding
            .getValue(parameterStore, embeddedSequence.getDevice, training)

        // Dimension:  (batch_size x maskSize) x Vocabulary_size
    val logMLMProbabilities: NDArray = bertMLMBlock
        .forward(
           parameterStore,
           new NDList(embeddedSequence, maskedIndices, embeddingTable),
           training)
        .singletonOrThrow

        // Finally build the output
    val ndOutput = new NDList(logNSPProbabilities, logMLMProbabilities)
      ndChildManager.ret(ndOutput)
  }
  catch { ... }
}

---------------------------

Friday, October 21, 2022

Accelerate Deep Learning with Neural Blocks

Reusable neural blocks

Neural blocks in PyTorch

Modular convolutional neural network

Modular variational auto-encoder

Neural blocks in Deep Java Library

References

Sunday, September 18, 2022

Generative AI with Kafka & Spark

Combining the best of both worlds

Java/Scala for inference

Distributed inference pipeline

Apache Kafka

Apache Spark

Deep Java Library

Use case: BERT

Architecture

Neural blocks

References

Appendix

Contact Form

Equation Editor

Popular Posts